Re: all values for a key must fit in memory

Sandy Ryza Sun, 20 Apr 2014 17:56:32 -0700

The issue isn't that the Iterator[P] can't be disk-backed.  It's that, with
a groupBy, each P is a (Key, Values) tuple, and the entire tuple is read
into memory at once.  The ShuffledRDD is agnostic to what goes inside P.
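
To make the distinction concrete, here is a minimal sketch, not Spark code.
The names StreamingGroupBy, groupSorted and groupEager are hypothetical, and
it assumes the fetched records arrive sorted by key, as the output of a
merge sort would.  groupEager mirrors the situation described above, where
each element is a fully materialized (Key, Values) tuple.  groupSorted hands
out the values of a key as an iterator over the underlying stream, so they
never all need to be in memory at once.

```scala
// Hypothetical helper for illustration only; not part of Spark.
object StreamingGroupBy {

  // Groups a key-sorted stream lazily: the values of each key are read
  // from the underlying iterator on demand and are never buffered.
  def groupSorted[K, V](sorted: Iterator[(K, V)]): Iterator[(K, Iterator[V])] =
    new Iterator[(K, Iterator[V])] {
      private val in = sorted.buffered
      private var current: Option[K] = None // key of the group last handed out

      // Drop whatever the caller left unread in the previous group.
      private def skipRest(): Unit = current.foreach { k =>
        while (in.hasNext && in.head._1 == k) in.next()
      }

      def hasNext: Boolean = { skipRest(); in.hasNext }

      def next(): (K, Iterator[V]) = {
        skipRest()
        val key = in.head._1 // throws NoSuchElementException if exhausted
        current = Some(key)
        val values = new Iterator[V] {
          def hasNext: Boolean = in.hasNext && in.head._1 == key
          def next(): V = {
            if (!hasNext) throw new NoSuchElementException("end of group")
            in.next()._2
          }
        }
        (key, values)
      }
    }

  // Eager variant: every group is copied into a Vector before it is
  // returned, so all values for a key must fit in memory.
  def groupEager[K, V](sorted: Iterator[(K, V)]): Iterator[(K, Seq[V])] =
    groupSorted(sorted).map { case (k, vs) => (k, vs.toVector) }
}
```

For example, summing per key with
StreamingGroupBy.groupSorted(Iterator("a" -> 1, "a" -> 2, "b" -> 3))
  .map { case (k, vs) => (k, vs.sum) }.toList
yields List(("a", 3), ("b", 3)) while holding only one record at a time.
The trade-off is that each values iterator is single-pass and only valid
until the caller advances to the next key.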


On Sun, Apr 20, 2014 at 11:36 AM, Mridul Muralidharan <mri...@gmail.com> wrote:

> An iterator does not imply data has to be memory resident.
> Think merge sort output as an iterator (disk backed).
>
> Tom is actually planning to work on something similar with me,
> hopefully this month or next.
>
> Regards,
> Mridul
>
>
> On Sun, Apr 20, 2014 at 11:46 PM, Sandy Ryza <sandy.r...@cloudera.com>
> wrote:
> > Hey all,
> >
> > After a shuffle / groupByKey, Hadoop MapReduce allows the values for a key
> > to not all fit in memory.  The current ShuffleFetcher.fetch API, which
> > doesn't distinguish between keys and values, only returning an Iterator[P],
> > seems incompatible with this.
> >
> > Any thoughts on how we could achieve parity here?
> >
> > -Sandy
>
