I completely agree that many of the Hadoop options are inappropriate as standard Mahout arguments. The challenge I see from a usability perspective is that the -D option introduces two different levels of abstraction into our user APIs. It's like exposing the full engine and transmission APIs on an automobile's dashboard next to the cruise control buttons. I would argue that the Mahout APIs (our standard command line arguments) ought to be complete enough for 'neophyte users' and 'regular users', and that only 'power users' should be using the -D abstractions, accepting in the process any idiosyncrasies that may result, since we cannot guarantee how those settings will interact.

Since the degree of parallelism obtained is often a function of the number of mappers/reducers specified, and since the degree of parallelism is something our 'regular users' would reasonably need to control, perhaps replacing the --numReducers option with --desiredParallelism (or something similar), with reasonable defaults for our neophytes, would be better. Then the implementation could take the user's wishes into account and internally manage the numbers of map and reduce tasks where it makes sense to do so.
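To make the idea concrete, here is a minimal sketch of how a single --desiredParallelism value might be translated into task counts. All names here are hypothetical, not existing Mahout API; the key point is that each algorithm can cap or zero out the reducer count while mappers remain only a hint (Hadoop ultimately derives mapper count from input splits).

```java
// Hypothetical sketch only -- not actual Mahout code.
public class ParallelismPlanner {

    static final int DEFAULT_PARALLELISM = 4;  // illustrative neophyte default

    /**
     * Reducer count derived from the user's desired parallelism, clamped by
     * what the algorithm can actually use (0 for map-only algorithms).
     */
    static int numReducers(int desiredParallelism, int maxUsefulReducers) {
        if (maxUsefulReducers == 0) {
            return 0;  // algorithm runs map-only; ignore the user's wish
        }
        return Math.min(Math.max(1, desiredParallelism), maxUsefulReducers);
    }

    /** Mapper count is only a hint; Hadoop sizes mappers from input splits. */
    static int mapperHint(int desiredParallelism) {
        return Math.max(1, desiredParallelism);
    }
}
```

The point of the clamp is that an algorithm which, say, tolerates at most one reducer can silently honor that constraint instead of failing when a user asks for more.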

Said a little differently, the Configuration values set in the Drivers clearly need to come from our standard command arguments. So do some of the Job values, though more indirectly, as you note with --input and --output handling being managed internally within each job step. I think this also applies to the --numMappers and --numReducers settings, and that managing them internally via an application-level --desiredParallelism argument would be an improvement that keeps our API abstraction layers distinct.

On 6/11/10 10:13 AM, Sean Owen wrote:
It's the same question as --input and -Dmapred.input.dir. The latter
is the standard Hadoop parameter, which we have to support if only
because this is something the user may be configuring in the XML
configs, but also because it'll be familiar to Hadoop users I assume.
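For readers less familiar with the Hadoop side of this: a -D argument is a generic key=value pair that lands in the job Configuration alongside values loaded from the XML config files. The following is a simplified, self-contained illustration of that parsing, not the real GenericOptionsParser code.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified sketch of Hadoop's "-Dkey=value" handling -- illustrative only.
public class GenericOptionSketch {

    /** Collects -D key=value pairs into a flat configuration map. */
    static Map<String, String> parse(String[] args) {
        Map<String, String> conf = new HashMap<>();
        for (int i = 0; i < args.length; i++) {
            String pair = null;
            if ("-D".equals(args[i]) && i + 1 < args.length) {
                pair = args[++i];                 // "-D key=value" form
            } else if (args[i].startsWith("-D")) {
                pair = args[i].substring(2);      // "-Dkey=value" form
            }
            if (pair != null) {
                String[] kv = pair.split("=", 2);
                if (kv.length == 2) {
                    conf.put(kv[0], kv[1]);
                }
            }
        }
        return conf;
    }
}
```

So -Dmapred.input.dir=/data/in and a custom --input flag can end up setting the very same Configuration key; the question in this thread is purely which spelling the command-line UI should expose.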


Jobs can read and change these settings to implement additional
restrictions, sure. For example, the user-supplied input and output
dir are only used to control the first M/R input in a chain of M/Rs
run by a job, and the output of its final M/R. In between, it's
overriding this value on individual M/Rs as needed of course, to
direct intermediate output elsewhere.
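The path handling described above can be sketched as follows: the user-supplied input and output only bound the ends of a chain of M/R steps, while the intermediate directories are chosen internally. This is an illustrative sketch under assumed names, not Mahout's actual driver code.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of chained-job path management -- not Mahout code.
public class JobChainPaths {

    /**
     * Returns an {input, output} path pair for each of the n steps in a
     * chain. Only the first input and last output come from the user;
     * intermediate directories are managed internally.
     */
    static List<String[]> plan(String userInput, String userOutput, int steps) {
        List<String[]> plan = new ArrayList<>();
        String in = userInput;
        for (int i = 0; i < steps; i++) {
            String out = (i == steps - 1)
                ? userOutput                              // final M/R: user's output
                : userOutput + "/intermediate-" + i;      // hidden intermediate dir
            plan.add(new String[] { in, out });
            in = out;  // next step reads the previous step's output
        }
        return plan;
    }
}
```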


So the question is not whether we need our own way to control Hadoop
parameters at times -- we very much do, and this already happens and
works internally. The question is merely one of command-line "UI",
duplicating Hadoop flags with our own.

I personally am inclined to not do this, as it's just more code, more
possibilities to support and debug, more difference from the norm.
However in the case of input and output I think we all agreed that
such a basic flag might as well have its own custom version that works
in the same way as the Hadoop one.

I'd argue we wouldn't want to do the same thing for the number of mappers
and reducers. From there, why not duplicate the ten other flags I can
think of? Compressing map output, compressing reducer output, IO sort
buffer size, and so on.


On Fri, Jun 11, 2010 at 6:01 PM, Jeff Eastman
<[email protected]>  wrote:
Over to dev list:

Sean, we currently have some jobs that accept the numbers of mappers and
reducers as optional command arguments and others that require the -D
arguments to control the same, as you have written. It seems our usability
would improve if we adopted a consistent policy across all Mahout
components. If so, would you argue that all jobs use -D arguments for this
control? What about situations where our default is not whatever Hadoop does
by default? Would this result in noticeable behavior changes? Also, some
algorithms don't work with arbitrary numbers of reducers and some don't use
reducers at all. What would you suggest?

