I completely agree that many of the Hadoop options are inappropriate as standard Mahout arguments. The challenge I see from a usability perspective is that the -D option introduces two different levels of abstraction into our user APIs. It's like exposing the full engine and transmission APIs on an automobile's dashboard next to the cruise control buttons. I would argue that the Mahout APIs (our standard command line arguments) ought to be complete enough for 'neophyte users' and 'regular users', and that only 'power users' should be using the -D abstractions, accepting in the process any idiosyncrasies that may result, since we cannot guarantee how those settings will interact.

Since the degree of parallelism obtained is often a function of the number of mappers/reducers specified, and since the degree of parallelism is something our 'regular users' would reasonably need to control, perhaps replacing the --numReducers option with --desiredParallelism (or something similar), with reasonable defaults for our neophytes, would be better. Then the implementation could take the user's wishes into account and internally manage the numbers of map and reduce tasks where it makes sense to do so.
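To make the idea concrete, here is a minimal sketch of how a single --desiredParallelism value might be translated into task counts. All names here are hypothetical, not existing Mahout API; the key point is that each algorithm can cap or zero out the reducer count while mappers remain only a hint (Hadoop ultimately derives mapper count from input splits).

```java
// Hypothetical sketch only -- not actual Mahout code.
public class ParallelismPlanner {

    static final int DEFAULT_PARALLELISM = 4;  // illustrative neophyte default

    /**
     * Reducer count derived from the user's desired parallelism, clamped by
     * what the algorithm can actually use (0 for map-only algorithms).
     */
    static int numReducers(int desiredParallelism, int maxUsefulReducers) {
        if (maxUsefulReducers == 0) {
            return 0;  // algorithm runs map-only; ignore the user's wish
        }
        return Math.min(Math.max(1, desiredParallelism), maxUsefulReducers);
    }

    /** Mapper count is only a hint; Hadoop sizes mappers from input splits. */
    static int mapperHint(int desiredParallelism) {
        return Math.max(1, desiredParallelism);
    }
}
```

The point of the clamp is that an algorithm which, say, tolerates at most one reducer can silently honor that constraint instead of failing when a user asks for more.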

Said a little differently, the Configuration values set in the Drivers clearly need to come from our standard command arguments. So do some of the Job values, though more indirectly, as you note with --input and --output handling being managed internally within each job step. I think this also applies to the --numMappers and --numReducers settings, and that managing them internally via an application-level --desiredParallelism argument would be an improvement that keeps our API abstraction layers distinct.

On 6/11/10 10:13 AM, Sean Owen wrote:
It's the same question as --input and -Dmapred.input.dir. The latter
is the standard Hadoop parameter, which we have to support if only
because this is something the user may be configuring in the XML
configs, but also because it'll be familiar to Hadoop users I assume.
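For readers less familiar with the Hadoop side of this: a -D argument is a generic key=value pair that lands in the job Configuration alongside values loaded from the XML config files. The following is a simplified, self-contained illustration of that parsing, not the real GenericOptionsParser code.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified sketch of Hadoop's "-Dkey=value" handling -- illustrative only.
public class GenericOptionSketch {

    /** Collects -D key=value pairs into a flat configuration map. */
    static Map<String, String> parse(String[] args) {
        Map<String, String> conf = new HashMap<>();
        for (int i = 0; i < args.length; i++) {
            String pair = null;
            if ("-D".equals(args[i]) && i + 1 < args.length) {
                pair = args[++i];                 // "-D key=value" form
            } else if (args[i].startsWith("-D")) {
                pair = args[i].substring(2);      // "-Dkey=value" form
            }
            if (pair != null) {
                String[] kv = pair.split("=", 2);
                if (kv.length == 2) {
                    conf.put(kv[0], kv[1]);
                }
            }
        }
        return conf;
    }
}
```

So -Dmapred.input.dir=/data/in and a custom --input flag can end up setting the very same Configuration key; the question in this thread is purely which spelling the command-line UI should expose.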


Jobs can read and change these settings to implement additional
restrictions, sure. For example, the user-supplied input and output
dir are only used to control the first M/R input in a chain of M/Rs
run by a job, and the output of its final M/R. In between, it's
overriding this value on individual M/Rs as needed of course, to
direct intermediate output elsewhere.
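The path handling described above can be sketched as follows: the user-supplied input and output only bound the ends of a chain of M/R steps, while the intermediate directories are chosen internally. This is an illustrative sketch under assumed names, not Mahout's actual driver code.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of chained-job path management -- not Mahout code.
public class JobChainPaths {

    /**
     * Returns an {input, output} path pair for each of the n steps in a
     * chain. Only the first input and last output come from the user;
     * intermediate directories are managed internally.
     */
    static List<String[]> plan(String userInput, String userOutput, int steps) {
        List<String[]> plan = new ArrayList<>();
        String in = userInput;
        for (int i = 0; i < steps; i++) {
            String out = (i == steps - 1)
                ? userOutput                              // final M/R: user's output
                : userOutput + "/intermediate-" + i;      // hidden intermediate dir
            plan.add(new String[] { in, out });
            in = out;  // next step reads the previous step's output
        }
        return plan;
    }
}
```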


So the question is not whether we need our own way to control Hadoop
parameters at times -- we very much do, and this already happens and
works internally. The question is merely one of command-line "UI",
duplicating Hadoop flags with our own.

I personally am inclined to not do this, as it's just more code, more
possibilities to support and debug, more difference from the norm.
However in the case of input and output I think we all agreed that
such a basic flag might as well have its own custom version that works
in the same way as the Hadoop one.

I'd argue we wouldn't want to do the same thing for the number of mappers
and reducers. From there, why not duplicate the ten other flags I can
think of? Compressing map output, compressing reducer output, IO sort
buffer size, and so on.


On Fri, Jun 11, 2010 at 6:01 PM, Jeff Eastman
<[email protected]>  wrote:
Over to dev list:

Sean, we currently have some jobs that accept the numbers of mappers and
reducers as optional command arguments and others that require the -D
arguments to control the same, as you have written. It seems our usability
would improve if we adopted a consistent policy across all Mahout
components. If so, would you argue that all jobs use -D arguments for this
control? What about situations where our default is not whatever Hadoop does
by default? Would this result in noticeable behavior changes? Also, some
algorithms don't work with arbitrary numbers of reducers and some don't use
reducers at all. What would you suggest?

