Edit: the first line should read: val groupedRdd = rdd.map((_, 1)).reduceByKey(_ + _) (and, since the pairs are then (message, count), the reconstruction becomes groupedRdd.flatMap(x => List.fill(x._2)(x._1)))
On Sun, Jul 19, 2015 at 11:02 AM, Sandy Ryza <sandy.r...@cloudera.com> wrote:
> This functionality already basically exists in Spark. To create the
> "grouped RDD", one can run:
>
> val groupedRdd = rdd.reduceByKey(_ + _)
>
> To get it back into the original form:
>
> groupedRdd.flatMap(x => List.fill(x._1)(x._2))
>
> -Sandy
>
> On Sun, Jul 19, 2015 at 10:40 AM, Сергей Лихоман <sergliho...@gmail.com> wrote:
>
>> Hi,
>>
>> I am looking for a suitable issue for a Master's degree project (it sounds
>> like scalability problems and improvements for Spark Streaming), and it
>> seems that introducing a grouped RDD (for example: don't store
>> "Spark", "Spark", "Spark"; instead store ("Spark", 3)) can:
>>
>> 1. Reduce the memory needed for the RDD (roughly, memory used will be
>> proportional to the % of unique messages).
>> 2. Improve performance (no need to apply a function several times to the
>> same message).
>>
>> Can I create a ticket and introduce an API for grouped RDDs? Does it make
>> sense? I would also very much appreciate criticism and ideas.
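
The round trip Sandy describes can be sketched end-to-end with plain Scala collections standing in for the RDD operations (a local approximation only: `groupBy` plus a sum plays the role of `reduceByKey`, and `data` is an illustrative input, not anything from the thread):

```scala
object GroupedRddSketch {
  def main(args: Array[String]): Unit = {
    // Original contents, with duplicates stored explicitly.
    val data = List("Spark", "Spark", "Spark", "Flink")

    // Local equivalent of rdd.map((_, 1)).reduceByKey(_ + _):
    // count the occurrences of each distinct element.
    val grouped: Map[String, Int] =
      data.map((_, 1)).groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) }

    // Local equivalent of groupedRdd.flatMap(x => List.fill(x._2)(x._1)):
    // expand each (element, count) pair back into count copies of the element.
    val restored: List[String] =
      grouped.toList.flatMap { case (elem, count) => List.fill(count)(elem) }

    println(grouped("Spark"))                   // 3
    println(restored.sorted == data.sorted)     // true
  }
}
```

Note the argument order in the expansion step: with (message, count) pairs, `List.fill` takes the count (`x._2`) and repeats the message (`x._1`).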