Thanks for the recommendation. I can definitely run this job with the proposed setting. In addition, I created a patch in https://issues.apache.org/jira/browse/TEZ-3076 that reduces the memory needed by MapOutput and InMemoryReader, which allows this job to run without enabling tez.runtime.shuffle.memory-to-memory.enable. I'll update the JIRA with the overall reduction per MapOutput entry and per InMemoryReader.
Please have a look.

Jon

On Wed, Jan 20, 2016 at 5:18 PM, Gopal Vijayaraghavan <[email protected]> wrote:

>
> > around 1,000,000 spills were fetched committing around 100MB to the
> >memory budget (500,000 in memory). However, actual memory used for 500,000
> >segments (50-350 bytes) is 480MB (expected 100-200MB)
>
> This is effectively the problem the mem2merger solves - but is not enabled
> by default.
>
> I noticed that this build up of >100 segment in-memory is generally a bad
> thing and merging it back into 1 segment in-memory was a significant boost
> to perf when producing the iterators for the reducers.
>
> can you re-run the scenario with in-mem merge enabled with an
> io.sort.factor = 100 ?
>
> Cheers,
> Gopal
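
For reference, a rough back-of-the-envelope check of the figures quoted in this thread (500,000 in-memory segments of 50-350 bytes, 480MB actually used, 100-200MB expected). This is only a sketch: it assumes "MB" means 2^20 bytes, and the variable names are mine for illustration, not anything from the Tez code.

```python
# Assumption: "MB" in the thread means 2**20 bytes.
MB = 1024 * 1024

segments_in_memory = 500_000          # segments held in memory
actual_bytes = 480 * MB               # memory actually used
expected_low_bytes = 100 * MB         # lower end of the expected usage
expected_high_bytes = 200 * MB        # upper end of the expected usage
payload_low, payload_high = 50, 350   # segment payload size range, in bytes


def per_segment(total_bytes, count):
    """Average number of bytes attributable to one segment."""
    return total_bytes / count


actual_per_segment = per_segment(actual_bytes, segments_in_memory)
expected_per_segment = (
    per_segment(expected_low_bytes, segments_in_memory),
    per_segment(expected_high_bytes, segments_in_memory),
)

# Per-segment overhead is whatever is left after subtracting the payload.
overhead_range = (
    actual_per_segment - payload_high,
    actual_per_segment - payload_low,
)

print(f"actual per segment:   {actual_per_segment:.0f} bytes")
print(f"expected per segment: {expected_per_segment[0]:.0f}-{expected_per_segment[1]:.0f} bytes")
print(f"implied overhead:     {overhead_range[0]:.0f}-{overhead_range[1]:.0f} bytes per segment")
```

Under these assumptions the per-segment footprint comes out at roughly 1KB, so for segments this small the bookkeeping outweighs the payload itself.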
