The issue isn't that the Iterator[P] can't be disk-backed. It's that, with a groupBy, each P is a (Key, Values) tuple, and the entire tuple is read into memory at once. The ShuffledRDD is agnostic to what goes inside P.
On Sun, Apr 20, 2014 at 11:36 AM, Mridul Muralidharan <mri...@gmail.com>wrote: > An iterator does not imply data has to be memory resident. > Think merge sort output as an iterator (disk backed). > > Tom is actually planning to work on something similar with me on this > hopefully this or next month. > > Regards, > Mridul > > > On Sun, Apr 20, 2014 at 11:46 PM, Sandy Ryza <sandy.r...@cloudera.com> > wrote: > > Hey all, > > > > After a shuffle / groupByKey, Hadoop MapReduce allows the values for a > key > > to not all fit in memory. The current ShuffleFetcher.fetch API, which > > doesn't distinguish between keys and values, only returning an > Iterator[P], > > seems incompatible with this. > > > > Any thoughts on how we could achieve parity here? > > > > -Sandy >