I agree this is a usability defect at least. If setting the number of
reducers is a common activity that users need to perform when running
Mahout applications, then we ought to have a standard way of specifying
this in our APIs without exposing the full set of Hadoop options,
especially to our non-power-users. This is already the case for some
applications, but others require Hadoop-level -D arguments to achieve
reasonable out-of-the-box parallelism even when running our examples. I
think the usability defect is that some of our algorithms won't scale
without it, and that we don't have a standard way to specify this in our
APIs.
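For concreteness, this is roughly where the single-reducer behavior comes
from; a minimal sketch against the 0.20-style Hadoop API (the job name is
made up here, and mapred.reduce.tasks is the stock Hadoop property that
the -D workaround sets):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.mapreduce.Job;

  public class SingleReducerSketch {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      Job job = new Job(conf, "some back-end step");
      // Unless the driver calls job.setNumReduceTasks(n) itself, Hadoop
      // falls back to mapred.reduce.tasks, whose stock default is 1 --
      // hence the single-reducer runs described below unless the user
      // passes -D on the command line.
      System.out.println("reducers = " + job.getNumReduceTasks());
    }
  }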
If exposing --numReducers (and, in at least one case, --numMappers too)
just duplicates the Hadoop-level arguments, then perhaps instead adding
--desiredParallelism or even --howManyNodesImRunningThisOn to all Mahout
applications would give them the ability to tune the lower-level Hadoop
arguments more intelligently than they can today.
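To make that concrete, here is a sketch of the sort of translation I have
in mind; the option name and the one-reducer-per-unit-of-parallelism
mapping are purely illustrative, not a proposed implementation:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.mapreduce.Job;

  public class DesiredParallelismSketch {
    public static void main(String[] args) throws Exception {
      // Imagine this value arrived via a --desiredParallelism option.
      int desiredParallelism = Integer.parseInt(args[0]);
      Configuration conf = new Configuration();
      Job step = new Job(conf, "back-end step");
      // Naive first cut: one reducer per unit of requested parallelism.
      // A real driver could weigh input size and the shape of each step.
      step.setNumReduceTasks(desiredParallelism);
      System.out.println("reducers = " + step.getNumReduceTasks());
    }
  }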
But those are only possible solutions; can we agree for now on a
statement of the problem?
On 6/11/10 1:17 PM, Ted Dunning wrote:
I view this behavior as a bug in our code. The default behavior should be
reasonable. When it is not, that isn't evidence that the user needs flags
to fix the behavior ... it is evidence that we should fix the default
behavior.
(I hate buying products where the default for -Ddo-something-stupid is
true)
On Fri, Jun 11, 2010 at 12:29 PM, Jeff Eastman
<[email protected]> wrote:
Do we have evidence the other way, that users regularly need to
control this to achieve the best performance? I personally never set it
and let Hadoop base it on the file splits and blocks and such, which is
a pretty good heuristic.
Anecdotal: When I ran PFPGrowth on the accidents.dat database on a
4-data-node cluster, it only used a single reducer. I haven't yet tried
that with -D, but I think others have. Before I added numReducers
propagation to seq2sparse, it only launched a single reducer for the
back-end steps; running Reuters and LDA took 3x longer than necessary on
my cluster. DistributedRowMatrix requires -D to achieve parallelism, per
the top of this thread. I suspect there are others. I've observed that
Hadoop does a pretty good job with mappers based upon file splits etc.,
but not so well with reducers, which is why we have --numReducers in the
first place.