Sorry, maybe I am saying something completely wrong... We have a stream, and we discretize it to create an RDD. The RDD in this case will just be an array of Any. Then we apply a transformation to create the new grouped RDD, and GC should reclaim the original RDD from memory (if we don't persist it). Will we have a GC step in val groupedRdd = rdd.map((_, 1)).reduceByKey(_ + _)? My suggestion is to remove the creation and reclaiming of the unneeded RDD and create the already-grouped one directly.
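For what it's worth, a minimal spark-shell sketch of the pipeline I have in mind (assuming sc is the shell's SparkContext; the sample data and variable names are only illustrative):

// original RDD: one element per message, duplicates kept
val rdd = sc.parallelize(Seq("Spark", "Spark", "Spark", "Flink"))

// grouped ("compact") RDD: ("Spark", 3), ("Flink", 1)
// nothing is retained in memory unless cache()/persist() is called;
// the original rdd is only a lineage step feeding the shuffle
val groupedRdd = rdd.map((_, 1)).reduceByKey(_ + _)
groupedRdd.cache()  // keep only the compact form, if desired

// expand back to the original (ungrouped) form when needed
val restored = groupedRdd.flatMap { case (msg, count) => List.fill(count)(msg) }
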
2015-07-19 21:26 GMT+03:00 Sandy Ryza <[email protected]>:

> The user gets to choose what they want to reside in memory. If they call
> rdd.cache() on the original RDD, it will be in memory. If they call
> rdd.cache() on the compact RDD, it will be in memory. If cache() is called
> on both, they'll both be in memory.
>
> -Sandy
>
> On Sun, Jul 19, 2015 at 11:09 AM, Сергей Лихоман <[email protected]>
> wrote:
>
>> Thanks for the answer! Could you please answer one more question? Will
>> we have the original RDD and the grouped RDD in memory at the same time?
>>
>> 2015-07-19 21:04 GMT+03:00 Sandy Ryza <[email protected]>:
>>
>>> Edit: the first line should read:
>>>
>>> val groupedRdd = rdd.map((_, 1)).reduceByKey(_ + _)
>>>
>>> On Sun, Jul 19, 2015 at 11:02 AM, Sandy Ryza <[email protected]>
>>> wrote:
>>>
>>>> This functionality already basically exists in Spark. To create the
>>>> "grouped RDD", one can run:
>>>>
>>>> val groupedRdd = rdd.reduceByKey(_ + _)
>>>>
>>>> To get it back into the original form:
>>>>
>>>> groupedRdd.flatMap(x => List.fill(x._2)(x._1))
>>>>
>>>> -Sandy
>>>>
>>>> On Sun, Jul 19, 2015 at 10:40 AM, Сергей Лихоман <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I am looking for a suitable issue for a Master's Degree project (it
>>>>> sounds like scalability problems and improvements for Spark Streaming),
>>>>> and it seems like the introduction of a grouped RDD (for example: don't
>>>>> store "Spark", "Spark", "Spark", but instead store ("Spark", 3)) could:
>>>>>
>>>>> 1. Reduce the memory needed for an RDD (roughly, the memory used will
>>>>> be proportional to the % of unique messages).
>>>>> 2. Improve performance (no need to apply a function several times to
>>>>> the same message).
>>>>>
>>>>> Can I create a ticket and introduce an API for grouped RDDs? Does it
>>>>> make sense? I would also very much appreciate criticism and ideas.
