As Matei mentioned, the values are now exposed as an Iterable, which can be disk-backed.
Does that not address the concern?

@Patrick - we do have cases where the length of the sequence is large
and the size per value is also non-trivial, so we do need this :-)
Note that join is a trivial example of where this is required (in our
current implementation).
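For anyone following along, the disk-backed-iterator idea can be sketched roughly like this. This is a hypothetical, Spark-agnostic illustration in Python (not our actual shuffle code): merge-sort output exposed as a lazy iterator whose sorted runs live on disk, so holding an iterator never implies holding the data in memory.

```python
# Hypothetical sketch (not Spark internals): an iterator need not keep its
# data memory-resident. Each sorted "run" is spilled to a temp file, and
# heapq.merge streams records back one at a time.
import heapq
import tempfile

def write_run(values):
    """Sort a batch of integers and spill it to a temp file; return the path."""
    f = tempfile.NamedTemporaryFile(mode="w", delete=False, suffix=".run")
    for v in sorted(values):
        f.write(f"{v}\n")
    f.close()
    return f.name

def read_run(path):
    """Lazily stream one sorted run back from disk, one record at a time."""
    with open(path) as f:
        for line in f:
            yield int(line)

# Two spilled runs; only one record per run is in memory at any moment.
runs = [write_run([5, 1, 9]), write_run([4, 2, 8])]
merged = heapq.merge(*(read_run(p) for p in runs))  # still just an iterator
print(list(merged))  # -> [1, 2, 4, 5, 8, 9]
```

The same shape applies to a shuffle: the consumer only ever sees an iterator, while the backing storage (memory, disk, or a merge of both) is an implementation detail behind it.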

Regards,
Mridul

On Mon, Apr 21, 2014 at 6:25 AM, Sandy Ryza <sandy.r...@cloudera.com> wrote:
> The issue isn't that the Iterator[P] can't be disk-backed.  It's that, with
> a groupBy, each P is a (Key, Values) tuple, and the entire tuple is read
> into memory at once.  The ShuffledRDD is agnostic to what goes inside P.
>
> On Sun, Apr 20, 2014 at 11:36 AM, Mridul Muralidharan <mri...@gmail.com>wrote:
>
>> An iterator does not imply the data has to be memory-resident.
>> Think of merge-sort output as an iterator (disk-backed).
>>
>> Tom and I are actually planning to work on something similar,
>> hopefully this month or next.
>>
>> Regards,
>> Mridul
>>
>>
>> On Sun, Apr 20, 2014 at 11:46 PM, Sandy Ryza <sandy.r...@cloudera.com>
>> wrote:
>> > Hey all,
>> >
>> > After a shuffle / groupByKey, Hadoop MapReduce allows the values for a
>> key
>> > to not all fit in memory.  The current ShuffleFetcher.fetch API, which
>> > doesn't distinguish between keys and values, only returning an
>> Iterator[P],
>> > seems incompatible with this.
>> >
>> > Any thoughts on how we could achieve parity here?
>> >
>> > -Sandy
>>
