As Matei mentioned, the values are now an Iterable, which can be disk-backed. Does that not address the concern?
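[For readers following along: the point is that an Iterable's iterator can stream values off disk rather than holding them in memory. Below is a minimal illustrative sketch of such a disk-backed Iterable; none of these names (DiskBackedIterable, spillFile) come from Spark's actual implementation, and it assumes the values are Java-serializable.]

    import java.io.{EOFException, File, FileInputStream, ObjectInputStream}

    // Hypothetical sketch: an Iterable whose iterator deserializes values
    // lazily from a spill file, so the full value sequence for a key never
    // has to be memory-resident at once.
    class DiskBackedIterable[V](spillFile: File) extends Iterable[V] {
      // Each call to iterator() re-opens the file and streams from the start.
      override def iterator: Iterator[V] = new Iterator[V] {
        private val in = new ObjectInputStream(new FileInputStream(spillFile))
        private var nextValue: Option[V] = readNext()

        // Read one value ahead; close the stream at end-of-file.
        private def readNext(): Option[V] =
          try Some(in.readObject().asInstanceOf[V])
          catch { case _: EOFException => in.close(); None }

        def hasNext: Boolean = nextValue.isDefined
        def next(): V = {
          val v = nextValue.getOrElse(throw new NoSuchElementException)
          nextValue = readNext()
          v
        }
      }
    }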
@Patrick - we do have cases where the length of the sequence is large and the size per value is also non-trivial, so we do need this :-) Note that join is a trivial example where this is required (in our current implementation).

Regards,
Mridul

On Mon, Apr 21, 2014 at 6:25 AM, Sandy Ryza <sandy.r...@cloudera.com> wrote:
> The issue isn't that the Iterator[P] can't be disk-backed. It's that, with
> a groupBy, each P is a (Key, Values) tuple, and the entire tuple is read
> into memory at once. The ShuffledRDD is agnostic to what goes inside P.
>
> On Sun, Apr 20, 2014 at 11:36 AM, Mridul Muralidharan <mri...@gmail.com> wrote:
>> An iterator does not imply data has to be memory resident.
>> Think of merge sort output as an iterator (disk backed).
>>
>> Tom and I are actually planning to work on something similar, hopefully
>> this month or next.
>>
>> Regards,
>> Mridul
>>
>> On Sun, Apr 20, 2014 at 11:46 PM, Sandy Ryza <sandy.r...@cloudera.com> wrote:
>>> Hey all,
>>>
>>> After a shuffle / groupByKey, Hadoop MapReduce allows the values for a
>>> key to not all fit in memory. The current ShuffleFetcher.fetch API, which
>>> doesn't distinguish between keys and values, only returning an
>>> Iterator[P], seems incompatible with this.
>>>
>>> Any thoughts on how we could achieve parity here?
>>>
>>> -Sandy
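[To make the two positions in this thread concrete, here is a hedged sketch of the API shape being discussed. The real ShuffleFetcher.fetch in Spark at the time took additional parameters (e.g. a TaskContext and a Serializer); this is a simplification, and the type aliases are illustrative only, not Spark names.]

    // The fetcher hands back one opaque Iterator[P]: it can stream the P's
    // themselves off disk, but it cannot stream *inside* a single P.
    trait ShuffleFetcher {
      def fetch[P](shuffleId: Int, reduceId: Int): Iterator[P]
    }

    object ShuffleShapes {
      // Sandy's concern: for a groupBy, P is the whole (key, values) pair,
      // so with an eager collection every value for a key is materialized
      // at once, even though the fetcher can spill between pairs:
      type GroupedP[K, V] = (K, Seq[V])

      // Matei's/Mridul's answer: if the values side is an Iterable that is
      // itself disk-backed (as sketched earlier in the thread), each pair
      // the iterator yields stays small:
      type LazyGroupedP[K, V] = (K, Iterable[V])
    }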