Hello,

I'm facing a problem with custom RDD transformations.

I would like to transform an RDD[(K, V)] into a Map[K, RDD[V]], meaning a
map with one RDD per key.

This would be useful, for example, to run MLlib clustering on the V values
grouped by K.

I know I could do it by calling filter() on my RDD once per key, but I'm
afraid this would not be efficient (the entire RDD would be read each
time, right?). I could also use mapPartitions on my RDD before filtering,
but the code ends up being huge...
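
To make the filter() idea concrete, here is a minimal sketch of what I have
in mind. The name splitByFilter is hypothetical, and it assumes that the
distinct keys fit in driver memory and that the cached RDD fits in the
cluster. Caching should avoid recomputing the source for every key, but
each filter still scans the whole cached data set once.

```scala
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

// Hypothetical helper: builds one filtered RDD per distinct key.
def splitByFilter[K: ClassTag, V: ClassTag](rdd: RDD[(K, V)]): Map[K, RDD[V]] = {
  // Cache the input so the source is not recomputed for each key.
  val cached = rdd.cache()

  // Collect the distinct keys to the driver (assumes they are few).
  val keys: Array[K] = cached.map(_._1).distinct().collect()

  // One lazy filter per key; each one still scans the cached RDD.
  keys.map { k =>
    k -> cached.filter { case (key, _) => key == k }.map(_._2)
  }.toMap
}
```

With N keys this does N passes over the cached data, which is exactly the
cost I would like to avoid.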

So I tried to create a CustomRDD that implements a splitByKey(rdd: RDD[(K,
V)]): Map[K, RDD[V]] method, which would iterate over the RDD only once,
but I have not managed to get it working.

Could you please tell me first whether this is really feasible, and if so,
give me some pointers?

Thank you,
Regards,
Sebastien
