I agree this is a usability defect at least. If setting the number of
reducers is a common activity that users need to perform when running
Mahout applications, then we ought to have a standard way of specifying
this in our APIs without exposing the full set of Hadoop options,
especially to our non-power-users. This is already the case for some
applications, but others require Hadoop-level -D arguments to achieve
reasonable out-of-the-box parallelism even when running our examples. I
think the usability defect is that some of our algorithms won't scale
without it, and that we don't have a standard way to specify this in our
APIs.
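For concreteness, this is roughly where the single-reducer behavior comes
from; a minimal sketch against the 0.20-style Hadoop API (the job name is
made up here, and mapred.reduce.tasks is the stock Hadoop property that
the -D workaround sets):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.mapreduce.Job;

  public class SingleReducerSketch {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      Job job = new Job(conf, "some back-end step");
      // Unless the driver calls job.setNumReduceTasks(n) itself, Hadoop
      // falls back to mapred.reduce.tasks, whose stock default is 1 --
      // hence the single-reducer runs described below unless the user
      // passes -D on the command line.
      System.out.println("reducers = " + job.getNumReduceTasks());
    }
  }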
If exposing --numReducers (and, in at least one case, --numMappers too)
just duplicates the Hadoop-level arguments, then perhaps instead adding
--desiredParallelism or even --howManyNodesImRunningThisOn to all Mahout
applications would give them the ability to tune the lower-level Hadoop
arguments more intelligently than they can today.
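To make that concrete, here is a sketch of the sort of translation I have
in mind; the option name and the one-reducer-per-unit-of-parallelism
mapping are purely illustrative, not a proposed implementation:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.mapreduce.Job;

  public class DesiredParallelismSketch {
    public static void main(String[] args) throws Exception {
      // Imagine this value arrived via a --desiredParallelism option.
      int desiredParallelism = Integer.parseInt(args[0]);
      Configuration conf = new Configuration();
      Job step = new Job(conf, "back-end step");
      // Naive first cut: one reducer per unit of requested parallelism.
      // A real driver could weigh input size and the shape of each step.
      step.setNumReduceTasks(desiredParallelism);
      System.out.println("reducers = " + step.getNumReduceTasks());
    }
  }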
But those are only possible solutions; can we agree for now on a
statement of the problem?
On 6/11/10 1:17 PM, Ted Dunning wrote:
I view this behavior as a bug in our code. The default behavior should be
reasonable. When it is not, that isn't evidence that the user needs flags
to fix the behavior ... it is evidence that we should fix the default
behavior.
(I hate buying products where the default for -Ddo-something-stupid is
true)
On Fri, Jun 11, 2010 at 12:29 PM, Jeff Eastman
<[email protected]> wrote:
Do we have evidence the other way, that users regularly need to
control this to achieve the best performance? I personally never set it
and let Hadoop base it on the file splits and blocks and such, which is
a pretty good heuristic.
Anecdotal: When I ran PFPGrowth on the accidents.dat database on a
4-data-node cluster, it only used a single reducer. I haven't yet tried
that with -D, but I think others have. Before I added numReducers
propagation to seq2sparse, it only launched a single reducer for the
back-end steps; running Reuters and LDA took 3x longer than necessary on
my cluster. DistributedRowMatrix requires -D to achieve parallelism, per
the top of this thread. I suspect there are others. I've observed that
Hadoop does a pretty good job with mappers based upon file splits etc.,
but not so well with reducers, which is why we have --numReducers in the
first place.