I view this behavior as a bug in our code. The default behavior should be reasonable. When it is not, that isn't evidence that the user needs flags to fix the behavior ... it is evidence that we should fix the default behavior.
(I hate buying products where the default for -Ddo-something-stupid is true.)

On Fri, Jun 11, 2010 at 12:29 PM, Jeff Eastman <[email protected]> wrote:

>> Do we have evidence the other way, that users regularly need to
>> control this to achieve best performance? I personally actually never
>> set it and let Hadoop base it on the file splits and blocks and such,
>> which is a pretty good heuristic.
>
> Anecdotal: when I ran PFPGrowth on the accidents.dat database on a
> 4-data-node cluster, it used only a single reducer. I haven't yet tried
> that with -D, but I think others have. Before I added numReducers
> propagation to seq2sparse, it launched only a single reducer for the
> back-end steps, and doing Reuters and LDA took 3x longer than necessary
> on my cluster. DistributedRowMatrix requires -D to achieve parallelism,
> per the top of this thread. I suspect there are others. I've observed
> that Hadoop does a pretty good job with mappers, based on file splits
> etc., but not so well with reducers, which is why we have --numReducers
> in the first place.
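For concreteness, here is a minimal sketch (not Mahout's actual code) of the kind of default being argued for above: if the caller has not requested a reducer count, size the job to the cluster's reduce capacity instead of falling back to Hadoop's default of a single reducer. It assumes the 0.20-era org.apache.hadoop.mapred API; the pickNumReducers helper and the 0.9 slot fraction are illustrative choices, not anything that exists in Mahout.

    import java.io.IOException;
    import org.apache.hadoop.mapred.ClusterStatus;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public final class ReducerDefaults {

      // Honor an explicit request (e.g. from a --numReducers option);
      // otherwise derive a count from the cluster instead of using
      // Hadoop's default of one reducer.
      static int pickNumReducers(JobConf conf, int requested) throws IOException {
        if (requested > 0) {
          return requested; // the user asked explicitly; don't second-guess
        }
        ClusterStatus cluster = new JobClient(conf).getClusterStatus();
        // Target ~90% of the cluster's reduce slots so the job can run in a
        // single reduce wave; the 0.9 fraction is an arbitrary illustration.
        return Math.max(1, (int) (0.9 * cluster.getMaxReduceTasks()));
      }

      public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(ReducerDefaults.class);
        int requested = args.length > 0 ? Integer.parseInt(args[0]) : -1;
        conf.setNumReduceTasks(pickNumReducers(conf, requested));
        // ... set mapper/reducer classes and I/O paths, then JobClient.runJob(conf)
      }
    }

A user who wants a specific count can still override it in the usual ways, e.g. --numReducers or -D mapred.reduce.tasks=N via the generic options; the point is only that the fallback should be sensible rather than 1.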
