Hi Nilesh,

Matei's change from (Key, Seq[Value]) to (Key, Iterable[Value]) was made to
enable that optimization in a future release without breaking the API.
Currently, though, all values for a single key are still held in memory on a
single machine.

The way I've gotten around this is by adding a salt to my key, turning (Key)
into (Key, randomValue % 10), for example.  That lets you shard an individual
key further, so no single group has to hold as much data in memory at once.
It's an ugly hack of a workaround, but if it works, it works.
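In case a sketch helps, here's the salting idea in plain Scala collections (no Spark, so it's self-contained); `SaltDemo`, `salt`, and `numShards` are illustrative names of my own, and I've used the value's hashCode rather than a random number so the example is deterministic:

```scala
object SaltDemo {
  // Split one hot key into up to numShards sub-keys so that no single
  // group holds all of that key's values at once.
  val numShards = 10

  // Deterministic variant of (Key, randomValue % 10): shard by the
  // value's hashCode instead of a random number.
  def salt[K, V](pairs: Seq[(K, V)]): Map[(K, Int), Seq[V]] =
    pairs
      .groupBy { case (k, v) => (k, math.abs(v.hashCode) % numShards) }
      .map { case (saltedKey, kvs) => (saltedKey, kvs.map(_._2)) }

  def main(args: Array[String]): Unit = {
    val data = (1 to 100).map(i => ("hotKey", i))
    val sharded = salt(data)
    // Every value is still present, just spread over smaller groups.
    println(sharded.values.map(_.size).sum)  // 100
    println(sharded.size)                    // at most numShards groups
  }
}
```

In Spark itself the same move would be a map to ((key, salt), value) before the groupByKey, followed by a second pass if you need per-key results combined; the per-shard groups are what keep memory bounded.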

Hope that helps!
Andrew


On Sun, May 25, 2014 at 6:55 PM, Nilesh <nil...@nileshc.com> wrote:

> I would like to clarify something. Matei mentioned that in Spark 1.0
> groupBy
> returns a (Key, Iterable[Value]) instead of a (Key, Seq[Value]). Does this
> also automatically assure us that the whole Iterable[Value] is not in fact
> stored in memory? That is to say, with 1.0, will it be possible to do
> groupByKey().values.map(x => while(x.hasNext) ... ) assuming x :
> Iterable[Value] is larger than the RAM on a single machine? Or will this be
> possible later, in subsequent versions?
>
> Could you please propose a workaround for this for the meantime? I'm out of
> ideas.
>
> Thanks,
> Nilesh
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/all-values-for-a-key-must-fit-in-memory-tp6342p6791.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>