In the Spark model, constructing an RDD does not mean storing all its contents in memory. Rather, an RDD is a description of a dataset that enables iterating over its contents, record by record (in parallel). The only time the full contents of an RDD are stored in memory is when a user explicitly calls "cache" or "persist" on it.
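
For illustration, here is a minimal sketch of that laziness (assuming an existing SparkContext named sc; the input path is a made-up example). Nothing is computed or held in memory until an action runs, and cache() is what asks Spark to keep the computed partitions around:

    // Building the lineage materializes nothing: each line is only a recipe.
    val lines  = sc.textFile("hdfs:///logs/app.log") // no I/O happens here
    val errors = lines.filter(_.contains("ERROR"))   // still just a description

    errors.cache()         // request that computed partitions be kept in memory
    val n = errors.count() // first action: reads the file and populates the cache
    val m = errors.count() // second action: served from the in-memory copy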
-Sandy

On Sun, Jul 19, 2015 at 11:41 AM, Сергей Лихоман <sergliho...@gmail.com> wrote:

> Sorry, maybe I am saying something completely wrong... We have a stream,
> and we digitize it to create an RDD; the RDD in this case will be just an
> array of Any. Then we apply a transformation to create a new, grouped RDD,
> and GC should remove the original RDD from memory (if we don't persist it).
> Will we have a GC step in
> val groupedRdd = rdd.map((_, 1)).reduceByKey(_ + _) ?
> My suggestion is to remove the creation and reclaiming of the unneeded RDD
> and create the already-grouped one directly.
>
> 2015-07-19 21:26 GMT+03:00 Sandy Ryza <sandy.r...@cloudera.com>:
>
>> The user gets to choose what they want to reside in memory. If they call
>> rdd.cache() on the original RDD, it will be in memory. If they call
>> rdd.cache() on the compact RDD, it will be in memory. If cache() is called
>> on both, they'll both be in memory.
>>
>> -Sandy
>>
>> On Sun, Jul 19, 2015 at 11:09 AM, Сергей Лихоман <sergliho...@gmail.com>
>> wrote:
>>
>>> Thanks for the answer! Could you please answer one more question? Will
>>> we have the original RDD and the grouped RDD in memory at the same time?
>>>
>>> 2015-07-19 21:04 GMT+03:00 Sandy Ryza <sandy.r...@cloudera.com>:
>>>
>>>> Edit: the first line should read:
>>>>
>>>> val groupedRdd = rdd.map((_, 1)).reduceByKey(_ + _)
>>>>
>>>> On Sun, Jul 19, 2015 at 11:02 AM, Sandy Ryza <sandy.r...@cloudera.com>
>>>> wrote:
>>>>
>>>>> This functionality already basically exists in Spark. To create the
>>>>> "grouped RDD", one can run:
>>>>>
>>>>> val groupedRdd = rdd.reduceByKey(_ + _)
>>>>>
>>>>> To get it back into the original form (the element is the key and the
>>>>> count is the value):
>>>>>
>>>>> groupedRdd.flatMap(x => List.fill(x._2)(x._1))
>>>>>
>>>>> -Sandy
>>>>>
>>>>> On Sun, Jul 19, 2015 at 10:40 AM, Сергей Лихоман <
>>>>> sergliho...@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I am looking for a suitable issue for a Master's degree project (it
>>>>>> should involve scalability problems and improvements for Spark
>>>>>> Streaming), and it seems like the introduction of a grouped RDD (for
>>>>>> example: don't store "Spark", "Spark", "Spark"; instead store
>>>>>> ("Spark", 3)) can:
>>>>>>
>>>>>> 1. Reduce the memory needed for the RDD (roughly, used memory will be
>>>>>> proportional to the percentage of unique messages).
>>>>>> 2. Improve performance (no need to apply a function several times to
>>>>>> the same message).
>>>>>>
>>>>>> Can I create a ticket and introduce an API for grouped RDDs? Does it
>>>>>> make sense? I would also greatly appreciate criticism and ideas.
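
Putting the thread's two snippets together, a minimal sketch of the full round trip (the sc.parallelize input is a hypothetical stand-in for the digitized stream):

    // Hypothetical sample data standing in for the digitized stream.
    val rdd = sc.parallelize(Seq("Spark", "Spark", "Spark", "Flink"))

    // "Grouped" form: ("Spark", 3), ("Flink", 1)
    val groupedRdd = rdd.map((_, 1)).reduceByKey(_ + _)

    // Back to the original form: replicate each key count-many times.
    val ungrouped = groupedRdd.flatMap { case (word, count) => List.fill(count)(word) }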