This functionality already basically exists in Spark.  To create the
"grouped RDD", one can run:

  val groupedRdd = rdd.map(x => (x, 1)).reduceByKey(_ + _)

To get it back into the original form:

  groupedRdd.flatMap { case (msg, n) => List.fill(n)(msg) }
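
As a plain-Scala sketch of the same round trip (no Spark required), using
standard collections in place of RDD operations — the `messages` sequence here
is a stand-in for the RDD's contents:

```scala
object GroupedRoundTrip {
  def main(args: Array[String]): Unit = {
    // Stand-in for an RDD of messages, many of them duplicates.
    val messages = Seq("Spark", "Spark", "Spark", "Flink")

    // "Group": count occurrences of each message, analogous to
    // rdd.map(x => (x, 1)).reduceByKey(_ + _).
    val grouped = messages.groupBy(identity).map {
      case (msg, occurrences) => (msg, occurrences.size)
    }

    // "Ungroup": expand each (message, count) pair back out, analogous to
    // groupedRdd.flatMap { case (msg, n) => List.fill(n)(msg) }.
    val restored = grouped.toList.flatMap { case (msg, n) => List.fill(n)(msg) }

    // Up to ordering, the expanded form matches the original data.
    println(restored.sorted == messages.sorted)  // true
  }
}
```

The grouped map holds one entry per unique message, which is where the memory
saving in the proposal comes from when duplicates are common.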

-Sandy

On Sun, Jul 19, 2015 at 10:40 AM, Сергей Лихоман <sergliho...@gmail.com>
wrote:

> Hi,
>
> I am looking for suitable issue for Master Degree project(it sounds like
> scalability problems and improvements for spark streaming) and seems like
> introduction of grouped RDD(for example: don't store
> "Spark", "Spark", "Spark", instead store ("Spark", 3)) can:
>
> 1. Reduce memory needed for RDD (roughly, used memory will be:  % of uniq
> messages)
> 2. Improve performance(no need to apply function several times for the
> same message).
>
> Can I create ticket and introduce API for grouped RDDs? Is it make sense?
> Also I will be very appreciated for critic and ideas
>