Thanks for the answer! Could I ask one more question? Will we have both the original RDD and the grouped RDD in memory at the same time?
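To make the question concrete, a minimal sketch (the cache() calls are only my assumption about how residence in memory would be controlled):

// Transformations are lazy: defining groupedRdd does not by itself
// put either RDD into executor memory.
val groupedRdd = rdd.map((_, 1)).reduceByKey(_ + _)

// As I understand it, both would be resident at the same time only
// if both are explicitly marked for caching before an action runs:
rdd.cache()
groupedRdd.cache()
groupedRdd.count()  // materializes groupedRdd; rdd's partitions are
                    // stored as they are computed along the lineage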
2015-07-19 21:04 GMT+03:00 Sandy Ryza <sandy.r...@cloudera.com>:

> Edit: the first line should read:
>
> val groupedRdd = rdd.map((_, 1)).reduceByKey(_ + _)
>
> On Sun, Jul 19, 2015 at 11:02 AM, Sandy Ryza <sandy.r...@cloudera.com>
> wrote:
>
>> This functionality already basically exists in Spark. To create the
>> "grouped RDD", one can run:
>>
>> val groupedRdd = rdd.reduceByKey(_ + _)
>>
>> To get it back into the original form:
>>
>> groupedRdd.flatMap(x => List.fill(x._2)(x._1))
>>
>> -Sandy
>>
>> On Sun, Jul 19, 2015 at 10:40 AM, Сергей Лихоман <sergliho...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> I am looking for a suitable issue for a Master's Degree project
>>> (something like scalability problems and improvements for Spark
>>> Streaming), and it seems like introducing a grouped RDD (for example:
>>> don't store "Spark", "Spark", "Spark"; instead store ("Spark", 3))
>>> could:
>>>
>>> 1. Reduce the memory needed for an RDD (roughly, memory used would be
>>> proportional to the % of unique messages).
>>> 2. Improve performance (no need to apply a function several times to
>>> the same message).
>>>
>>> Can I create a ticket and introduce an API for grouped RDDs? Does it
>>> make sense? I would also very much appreciate any criticism and ideas.
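Putting Sandy's two snippets together, a self-contained sketch of the
round trip (the local-mode setup and the sample data are mine, purely
for illustration):

import org.apache.spark.{SparkConf, SparkContext}

object GroupedRddRoundTrip {
  def main(args: Array[String]): Unit = {
    // Local-mode context, just for trying the snippets out.
    val sc = new SparkContext(
      new SparkConf().setAppName("grouped-rdd").setMaster("local[*]"))

    // An RDD with duplicates.
    val rdd = sc.parallelize(Seq("Spark", "Spark", "Spark", "Storm"))

    // Grouped form: one (element, count) pair per distinct element.
    val groupedRdd = rdd.map((_, 1)).reduceByKey(_ + _)
    // => ("Spark", 3), ("Storm", 1)

    // Back to the original (unordered) form.
    val restored = groupedRdd.flatMap { case (elem, count) =>
      List.fill(count)(elem)
    }

    restored.collect().foreach(println)  // order is not guaranteed
    sc.stop()
  }
}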