I added the option that doesn't require the caller to specify the mergeCombiner closure a while ago when I wanted to disable mapSideCombine. In virtually all use cases I know of, it is fine & easy to specify a mergeCombiner, so I'm all for this given it simplifies the codebase.
On Tue, Dec 31, 2013 at 5:05 PM, Patrick Wendell <pwend...@gmail.com> wrote: > Hey All, > > There is a small API change that we are considering for the external > sort patch. Previously we allowed mergeCombiner to be null when map > side aggregation was not enabled. This is because it wasn't necessary > in that case since mappers didn't ship pre-aggregated values to > reducers. > > Because the external sort capability also relies on the mergeCombiner > function to merge partially-aggregated on-disk segments, we now need > it all the time, even if map side aggregation is enabled. This is a > fairly esoteric thing that I'm not sure anyone other than Shark ever > used, but I want to check in case anyone had feelings about this. > > The relevant code is here: > > > https://github.com/apache/incubator-spark/pull/303/files#diff-f70e97c099b5eac05c75288cb215e080R72 > > - Patrick >