Hi Nilesh,

Matei's change from (Key, Seq[Value]) to (Key, Iterable[Value]) was made to enable that optimization in a future release without breaking the API. Currently, though, all values for a single key are still held in memory on a single machine.
The way I've gotten around this is by adding another component to the key, going from (Key) to (Key, randomValue % 10), for example. This lets you further shard an individual key and avoid holding as much data in memory at once. The workaround is an ugly hack, but if it works then it works.

Hope that helps!
Andrew

On Sun, May 25, 2014 at 6:55 PM, Nilesh <nil...@nileshc.com> wrote:
> I would like to clarify something. Matei mentioned that in Spark 1.0
> groupBy returns a (Key, Iterable[Value]) instead of (Key, Seq[Value]).
> Does this also automatically assure us that the whole Iterable[Value] is
> not in fact stored in memory? That is to say, with 1.0, will it be
> possible to do groupByKey().values.map(x => while(x.hasNext) ... )
> assuming x : Iterable[Value] is larger than the RAM on a single machine?
> Or will this be possible later, in subsequent versions?
>
> Could you please propose a workaround for this for the meantime? I'm out
> of ideas.
>
> Thanks,
> Nilesh
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/all-values-for-a-key-must-fit-in-memory-tp6342p6791.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
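P.S. In case it helps to see the salting workaround concretely: here is a minimal sketch of the idea in plain Python (no Spark), since the shuffle behavior is the same in spirit. The function name `group_by_salted_key` and the shard count are my own choices for illustration, not anything from the Spark API.

```python
import random
from collections import defaultdict

def group_by_salted_key(pairs, num_shards=10):
    """Group (key, value) pairs by a salted key (key, shard_id).

    Appending a random shard id splits one hot key across
    num_shards smaller groups, so no single group has to hold
    all of the key's values at once.
    """
    groups = defaultdict(list)
    for key, value in pairs:
        salted = (key, random.randrange(num_shards))
        groups[salted].append(value)
    return groups

# One "hot" key with many values gets spread across up to 10 shards.
pairs = [("hot", i) for i in range(1000)]
groups = group_by_salted_key(pairs)
assert sum(len(vs) for vs in groups.values()) == 1000
assert all(k == "hot" and 0 <= shard < 10 for (k, shard) in groups)
```

In Spark you would do the equivalent with a map that rewrites the key before groupByKey, then combine the per-shard results afterwards if your downstream computation allows it.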