Hi, I'm not an authority in the Spark community, but what I would do is add the project to Spark Packages (http://spark-packages.org/). In fact, I think this case is similar to IndexedRDD, which is also on Spark Packages: http://spark-packages.org/package/amplab/spark-indexedrdd
2015-07-19 21:49 GMT+02:00 Сергей Лихоман <sergliho...@gmail.com>:

> Hi Juan,
>
> It's exactly what I meant. If we have a high load with many repetitions,
> it can significantly reduce RDD size and improve performance. In real use
> cases, applications frequently need to enrich data from a cache or an
> external system, so we would save time on each repetition.
> I will also do some experiments. About few repetitions: in which use
> cases would we lose efficiency? I will test that as well.
> What do I need to do to contribute this? Just create a ticket in Jira?
>
>
> 2015-07-19 21:56 GMT+03:00 Juan Rodríguez Hortalá <
> juan.rodriguez.hort...@gmail.com>:
>
>> Hi,
>>
>> My two cents is that this could be interesting if all RDD and pair RDD
>> operations were lifted to work on grouped RDDs. For example, as
>> suggested, a map on grouped RDDs would be more efficient if the original
>> RDD had lots of duplicate entries, but for RDDs with few repetitions I
>> guess you would in fact lose efficiency. The same applies to filter,
>> sortBy, count, max, ... but for example I guess there is no gain for
>> reduce and other operations. Also note that element order is lost when
>> converting to a grouped RDD, so the semantics are not exactly the same,
>> but they would be good enough for many applications. I would also look
>> for suitable use cases where RDDs with many repetitions arise naturally
>> and the transformations with a performance gain, like map, are used
>> often, and I would run some experiments comparing a computation on a
>> grouped RDD with the same computation without grouping, for different
>> input sizes.
>>
>>
>> On Sunday, July 19, 2015, Sandy Ryza <sandy.r...@cloudera.com>
>> wrote:
>>
>>> This functionality already basically exists in Spark. To create the
>>> "grouped RDD", one can run:
>>>
>>> val groupedRdd = rdd.map(x => (x, 1)).reduceByKey(_ + _)
>>>
>>> To get it back into the original form:
>>>
>>> groupedRdd.flatMap { case (value, count) => List.fill(count)(value) }
>>>
>>> -Sandy
>>>
>>> On Sun, Jul 19, 2015 at 10:40 AM, Сергей Лихоман <sergliho...@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I am looking for a suitable topic for a Master's degree project
>>>> (something like scalability problems and improvements for Spark
>>>> Streaming), and it seems like the introduction of a grouped RDD (for
>>>> example: don't store "Spark", "Spark", "Spark"; instead store
>>>> ("Spark", 3)) can:
>>>>
>>>> 1. Reduce the memory needed for the RDD (roughly, the memory used will
>>>> be proportional to the share of unique messages).
>>>> 2. Improve performance (no need to apply a function several times to
>>>> the same message).
>>>>
>>>> Can I create a ticket and introduce an API for grouped RDDs? Does it
>>>> make sense? I would also very much appreciate criticism and ideas.
>>>>
>>>
>>>
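For reference, here is a minimal sketch of the grouped-RDD idea discussed above, built only from existing Spark operations. The names grouped, mapGrouped, and ungrouped are illustrative assumptions, not an existing Spark API:

    import scala.reflect.ClassTag
    import org.apache.spark.rdd.RDD

    // Collapse duplicates into (value, count) pairs -- the "grouped" form.
    def grouped[T: ClassTag](rdd: RDD[T]): RDD[(T, Int)] =
      rdd.map(x => (x, 1)).reduceByKey(_ + _)

    // Apply f once per distinct value instead of once per occurrence;
    // the counts are carried along unchanged.
    def mapGrouped[T, U: ClassTag](rdd: RDD[(T, Int)])(f: T => U): RDD[(U, Int)] =
      rdd.map { case (value, count) => (f(value), count) }

    // Expand back to the original multiset (element order is not preserved).
    def ungrouped[T: ClassTag](rdd: RDD[(T, Int)]): RDD[T] =
      rdd.flatMap { case (value, count) => List.fill(count)(value) }

With an expensive per-record function f (for example, enrichment from an external system, as mentioned above), mapGrouped(grouped(rdd))(f) applies f once per distinct value, so the saving grows with the duplication rate; for mostly-unique data, the extra shuffle in grouped makes it a net loss, which matches the trade-off discussed in the thread.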