Hi,

My two cents: it could be interesting if all RDD and pair RDD
operations were lifted to work on grouped RDDs. For example, as
suggested, a map on a grouped RDD would be more efficient if the original RDD
had lots of duplicate entries, but for RDDs with few repetitions I suspect
you would in fact lose efficiency. The same applies to filter, sortBy, count,
max, ... but for reduce and some other operations I guess there is no
gain. Also note that order is lost when converting to a grouped RDD, so the
semantics are not exactly the same, though they would be good enough for
many applications. I would also look for suitable use cases where RDDs with
many repetitions arise naturally and where the transformations with a
performance gain, like map, are used often, and then run some experiments
comparing a computation on a grouped RDD against the same computation
without grouping, for different input sizes.
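To make the idea concrete, here is a minimal sketch using plain Scala collections rather than the Spark API (the names `group`, `liftedMap`, and `ungroup` are hypothetical, chosen just for illustration). It shows how a map lifted to the grouped representation applies the function once per distinct value and merges multiplicities, and why order is not preserved:

```scala
// Hypothetical sketch of a "grouped" representation: each distinct value
// is stored once together with its multiplicity, e.g. ("Spark", 3)
// instead of three copies of "Spark".
object GroupedSketch {
  // Build the grouped form from a plain sequence (analogous to
  // rdd.map(x => (x, 1)).reduceByKey(_ + _) in Spark).
  def group[A](xs: Seq[A]): Map[A, Int] =
    xs.groupBy(identity).map { case (v, occ) => (v, occ.size) }

  // A map lifted to the grouped form: f is applied once per *distinct*
  // value, so heavy duplication means fewer invocations of f. Results
  // that collide are merged by summing their counts.
  def liftedMap[A, B](g: Map[A, Int])(f: A => B): Map[B, Int] =
    g.foldLeft(Map.empty[B, Int]) { case (acc, (v, n)) =>
      val b = f(v)
      acc.updated(b, acc.getOrElse(b, 0) + n)
    }

  // Restore the flat form; the original ordering is lost, which is the
  // semantic difference mentioned above.
  def ungroup[A](g: Map[A, Int]): Seq[A] =
    g.toSeq.flatMap { case (v, n) => Seq.fill(n)(v) }
}
```

With an input like `Seq("Spark", "Spark", "Spark", "Flink")`, `group` yields `Map("Spark" -> 3, "Flink" -> 1)`, and `liftedMap` over a costly function evaluates it only twice instead of four times.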


On Sunday, July 19, 2015, Sandy Ryza <sandy.r...@cloudera.com>
wrote:

> This functionality already basically exists in Spark.  To create the
> "grouped RDD", one can run:
>
>   val groupedRdd = rdd.map(x => (x, 1)).reduceByKey(_ + _)
>
> To get it back into the original form:
>
>   groupedRdd.flatMap(x => List.fill(x._2)(x._1))
>
> -Sandy
>
> On Sun, Jul 19, 2015 at 10:40 AM, Сергей Лихоман <sergliho...@gmail.com>
> wrote:
>
>> Hi,
>>
>> I am looking for a suitable topic for a Master's degree project (something
>> like scalability problems and improvements for Spark Streaming), and it
>> seems that introducing grouped RDDs (for example: instead of storing
>> "Spark", "Spark", "Spark", store ("Spark", 3)) could:
>>
>> 1. Reduce the memory needed for an RDD (roughly, memory used becomes
>> proportional to the number of unique messages).
>> 2. Improve performance (no need to apply a function several times to the
>> same message).
>>
>> Can I create a ticket and introduce an API for grouped RDDs? Does it make
>> sense? I would also appreciate any criticism and ideas.
>>
>
>
