On Fri, Jun 11, 2010 at 7:33 PM, Jeff Eastman
<[email protected]> wrote:
> complete enough for 'neophyte users' and 'regular users' and that only
> 'power users' should be using the -D abstractions (and with that accepting
> any idiosyncrasies that may result since we cannot guarantee how they may
> interact).

That's a reasonable rule. All you really need to specify is input and
output, and Hadoop's defaults should work reasonably from there. So I
view this as an argument to create --input and --output, and that's
done.
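
To make that concrete, here is a minimal sketch (not actual Mahout code; SimpleDriver is
a made-up name and the job body is just a pass-through) of a driver that exposes only
--input and --output, leaves map/reduce counts and everything else to Hadoop's defaults,
and still lets power users override settings with -D because ToolRunner parses the
generic options first:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class SimpleDriver extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    // Expect only "--input <path> --output <path>"; anything else (number of
    // maps, reduces, compression, ...) stays at Hadoop's defaults unless a
    // power user overrides it with -D on the command line.
    String input = null;
    String output = null;
    for (int i = 0; i < args.length - 1; i++) {
      if ("--input".equals(args[i])) {
        input = args[i + 1];
      } else if ("--output".equals(args[i])) {
        output = args[i + 1];
      }
    }
    if (input == null || output == null) {
      System.err.println("Usage: SimpleDriver --input <path> --output <path>");
      return 1;
    }

    Job job = new Job(getConf(), "simple-driver");
    job.setJarByClass(SimpleDriver.class);
    FileInputFormat.addInputPath(job, new Path(input));
    FileOutputFormat.setOutputPath(job, new Path(output));
    // No mapper/reducer set: the identity defaults make this a pass-through,
    // which is enough to show the flag handling.
    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    // ToolRunner strips generic -D options into the Configuration before run().
    System.exit(ToolRunner.run(new Configuration(), new SimpleDriver(), args));
  }
}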


> Since the degree of parallelism obtained is often a function of the number
> of mappers/reducers specified, and since the degree of parallelism is
> something our 'regular users' would reasonably need to control, perhaps
> replacing the --numReducers options with --desiredParallelism (or something)
> and having reasonable defaults on that for our neophytes would be better.
> Then the implementation could take the user's desires into account and
> internally manage the numbers of map and reduce tasks where it makes sense
> to do so.

On this flag in particular --

It's an appealing idea, but how do the details work? For example, on the
recommender jobs there are at least 4 MapReduce phases, each of which
has a fairly different optimal parallelism setting. The big, final phase
should be parallelized as much as possible; the early phases would only
be slowed down by too many mappers.
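
Just to illustrate the mismatch (a toy sketch, not the real recommender driver;
the class name and phase structure are invented), chaining two jobs where the
sensible reducer count differs per phase looks roughly like this, and a single
user-supplied parallelism number wouldn't fit both:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TwoPhaseSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path input = new Path(args[0]);
    Path temp = new Path(args[1]);
    Path output = new Path(args[2]);

    // Phase 1: a small aggregation step; a couple of reducers is plenty,
    // and over-parallelizing it just adds task-startup overhead.
    Job phase1 = new Job(conf, "phase-1");
    FileInputFormat.addInputPath(phase1, input);
    FileOutputFormat.setOutputPath(phase1, temp);
    phase1.setNumReduceTasks(2);    // deliberately small
    phase1.waitForCompletion(true);

    // Phase 2: the big final computation; here you want as many reducers
    // as the cluster can usefully run, not whatever phase 1 needed.
    Job phase2 = new Job(conf, "phase-2");
    FileInputFormat.addInputPath(phase2, temp);
    FileOutputFormat.setOutputPath(phase2, output);
    phase2.setNumReduceTasks(40);   // sized to the cluster
    phase2.waitForCompletion(true);
  }
}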

What would a neophyte user do with this flag? Presumably the neophyte
just wants it set to "optimal" or "as much as is reasonable", and that's
basically what Hadoop already does, better than the user can determine.

Encouraging the non-power-user to set the number of mappers and reducers
also invites them to hurt performance.

Do we have evidence the other way, that users regularly need to
control this to achieve the best performance? Personally, I never set it
and just let Hadoop base it on the file splits, block sizes, and so on,
which is a pretty good heuristic.
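
For what it's worth, here's a rough illustration (again not Mahout code; SplitCount
is a made-up name) of why the default is usually fine: the map-task count simply
falls out of the input splits Hadoop computes from the files and their blocks, so
big inputs parallelize automatically without any user-facing flag:

import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class SplitCount {
  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "split-count");
    FileInputFormat.addInputPath(job, new Path(args[0]));
    // One map task per split; splits roughly follow HDFS block boundaries.
    List<InputSplit> splits = new TextInputFormat().getSplits(job);
    System.out.println("Hadoop would launch " + splits.size() + " map tasks");
  }
}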
