In the Spark model, constructing an RDD does not mean storing all its contents in memory. Rather, an RDD is a description of a dataset that enables iterating over its contents, record by record (in parallel). The only time the full contents of an RDD are stored in memory is when a user explicitly calls "cache" or "persist" on it.
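
For illustration, here is a minimal sketch of that laziness (assuming an existing SparkContext named sc; the input path is a made-up example). Nothing is computed or held in memory until an action runs, and cache() is what asks Spark to keep the computed partitions around:

    // Building the lineage materializes nothing: each line is only a recipe.
    val lines  = sc.textFile("hdfs:///logs/app.log") // no I/O happens here
    val errors = lines.filter(_.contains("ERROR"))   // still just a description

    errors.cache()         // request that computed partitions be kept in memory
    val n = errors.count() // first action: reads the file and populates the cache
    val m = errors.count() // second action: served from the in-memory copy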
-Sandy

On Sun, Jul 19, 2015 at 11:41 AM, Сергей Лихоман <sergliho...@gmail.com> wrote:

> Sorry, maybe I am saying something completely wrong... We have a stream,
> and we digitize it to create an RDD; the RDD in this case will be just an
> array of Any. Then we apply a transformation to create a new, grouped RDD,
> and GC should remove the original RDD from memory (if we don't persist it).
> Will we have a GC step in
> val groupedRdd = rdd.map((_, 1)).reduceByKey(_ + _) ?
> My suggestion is to remove the creation and reclaiming of the unneeded RDD
> and create the already-grouped one directly.
>
> 2015-07-19 21:26 GMT+03:00 Sandy Ryza <sandy.r...@cloudera.com>:
>
>> The user gets to choose what they want to reside in memory. If they call
>> rdd.cache() on the original RDD, it will be in memory. If they call
>> rdd.cache() on the compact RDD, it will be in memory. If cache() is called
>> on both, they'll both be in memory.
>>
>> -Sandy
>>
>> On Sun, Jul 19, 2015 at 11:09 AM, Сергей Лихоман <sergliho...@gmail.com>
>> wrote:
>>
>>> Thanks for the answer! Could you please answer one more question? Will
>>> we have the original RDD and the grouped RDD in memory at the same time?
>>>
>>> 2015-07-19 21:04 GMT+03:00 Sandy Ryza <sandy.r...@cloudera.com>:
>>>
>>>> Edit: the first line should read:
>>>>
>>>> val groupedRdd = rdd.map((_, 1)).reduceByKey(_ + _)
>>>>
>>>> On Sun, Jul 19, 2015 at 11:02 AM, Sandy Ryza <sandy.r...@cloudera.com>
>>>> wrote:
>>>>
>>>>> This functionality already basically exists in Spark. To create the
>>>>> "grouped RDD", one can run:
>>>>>
>>>>> val groupedRdd = rdd.reduceByKey(_ + _)
>>>>>
>>>>> To get it back into the original form (the element is the key and the
>>>>> count is the value):
>>>>>
>>>>> groupedRdd.flatMap(x => List.fill(x._2)(x._1))
>>>>>
>>>>> -Sandy
>>>>>
>>>>> On Sun, Jul 19, 2015 at 10:40 AM, Сергей Лихоман <
>>>>> sergliho...@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I am looking for a suitable issue for a Master's degree project (it
>>>>>> should involve scalability problems and improvements for Spark
>>>>>> Streaming), and it seems like the introduction of a grouped RDD (for
>>>>>> example: don't store "Spark", "Spark", "Spark"; instead store
>>>>>> ("Spark", 3)) can:
>>>>>>
>>>>>> 1. Reduce the memory needed for the RDD (roughly, used memory will be
>>>>>> proportional to the percentage of unique messages).
>>>>>> 2. Improve performance (no need to apply a function several times to
>>>>>> the same message).
>>>>>>
>>>>>> Can I create a ticket and introduce an API for grouped RDDs? Does it
>>>>>> make sense? I would also greatly appreciate criticism and ideas.
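
Putting the thread's two snippets together, a minimal sketch of the full round trip (the sc.parallelize input is a hypothetical stand-in for the digitized stream):

    // Hypothetical sample data standing in for the digitized stream.
    val rdd = sc.parallelize(Seq("Spark", "Spark", "Spark", "Flink"))

    // "Grouped" form: ("Spark", 3), ("Flink", 1)
    val groupedRdd = rdd.map((_, 1)).reduceByKey(_ + _)

    // Back to the original form: replicate each key count-many times.
    val ungrouped = groupedRdd.flatMap { case (word, count) => List.fill(count)(word) }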