On 6/11/10 11:42 AM, Sean Owen wrote:
> On Fri, Jun 11, 2010 at 7:33 PM, Jeff Eastman
> <[email protected]> wrote:
>> complete enough for 'neophyte users' and 'regular users' and that only
>> 'power users' should be using the -D abstractions (and with that
>> accepting any idiosyncrasies that may result since we cannot guarantee
>> how they may interact).
> That's a reasonable rule. All you really need to specify is input and
> output, and Hadoop's defaults should work reasonably from there. So I
> view this as an argument to create --input and --output, and that's
> done.
+1 The fact that --input and -Dmapred.input.dir both have the word 'input' in them is a coincidence: they are not interchangeable, and our arguments have a broader function, especially when multiple job steps are invoked.
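
To make that concrete, here is a rough sketch of the kind of multi-step driver I mean (the class and the option handling are illustrative only, not actual Mahout code). The --input/--output values are interpreted by the driver and reused to wire several jobs together, whereas -Dmapred.input.dir only sets one property on one job:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class TwoStepDriver extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    // Naive parse of our own --input/--output; a real driver uses a proper parser.
    Path input = new Path(valueOf(args, "--input"));
    Path output = new Path(valueOf(args, "--output"));
    Path intermediate = new Path(output, "step1");

    // Step 1: reads the user's --input, writes an intermediate directory.
    Job step1 = new Job(getConf(), "step 1");
    FileInputFormat.addInputPath(step1, input);
    FileOutputFormat.setOutputPath(step1, intermediate);
    if (!step1.waitForCompletion(true)) {
      return 1;
    }

    // Step 2: reads step 1's output, writes under the user's --output.
    // A single -Dmapred.input.dir value cannot describe both steps.
    Job step2 = new Job(getConf(), "step 2");
    FileInputFormat.addInputPath(step2, intermediate);
    FileOutputFormat.setOutputPath(step2, new Path(output, "final"));
    return step2.waitForCompletion(true) ? 0 : 1;
  }

  private static String valueOf(String[] args, String name) {
    for (int i = 0; i < args.length - 1; i++) {
      if (name.equals(args[i])) {
        return args[i + 1];
      }
    }
    throw new IllegalArgumentException(name + " is required");
  }

  public static void main(String[] args) throws Exception {
    // ToolRunner strips the generic -D options into the Configuration first.
    System.exit(ToolRunner.run(new Configuration(), new TwoStepDriver(), args));
  }
}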
>> Since the degree of parallelism obtained is often a function of the
>> number of mappers/reducers specified, and since the degree of
>> parallelism is something our 'regular users' would reasonably need to
>> control, perhaps replacing the --numReducers options with
>> --desiredParallelism (or something) and having reasonable defaults on
>> that for our neophytes would be better. Then the implementation could
>> take the user's desires into account and internally manage the numbers
>> of map and reduce tasks where it makes sense to do so.
> On this flag in particular --
>
> It's an appealing idea, but how do the details work? For example, on
> the recommender jobs there are at least 4 MapReduce steps, each of
> which has a fairly different best parallelism setting. The big, last
> phase should be parallelized as much as possible; early phases would
> just be slowed down by using too many mappers.
Good question, and I expect each algorithm would have to do some thinking about how best to honor the user's desired concurrency level. In many cases it would just influence the number of reducers, but file input split size has also been suggested as a control variable. And it might not be possible to honor it at all in some multi-stage jobs such as recommenders. How do you optimize that?
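
Purely as a sketch of what "taking the hint into account" per phase might look like (the scaling choices and names below are invented for illustration; nothing like this exists yet):

import org.apache.hadoop.mapreduce.Job;

final class ParallelismHints {

  private ParallelismHints() { }

  // Early, cheap phases: cap the reducer count so they aren't slowed down
  // by task startup overhead. The cap of 4 is an arbitrary placeholder.
  static void applyToEarlyPhase(Job job, int desiredParallelism) {
    job.setNumReduceTasks(Math.max(1, Math.min(desiredParallelism, 4)));
  }

  // The big final phase: use the full hint. Input split size could also be
  // lowered here to get more mappers, e.g. via mapred.max.split.size.
  static void applyToFinalPhase(Job job, int desiredParallelism) {
    job.setNumReduceTasks(Math.max(1, desiredParallelism));
  }
}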
> What would the neophyte user using this flag do with it? Presumably
> the neophyte just wants it set to "optimal" or "as much as is
> reasonable" and that's basically what Hadoop is already doing, better
> than the user can determine.
I think that neophytes just want something reasonable and would not be messing with this flag. But some of our jobs do not do something reasonable yet on multi-node clusters (see below).
> Encouraging the non-power-user to set the number of mappers and
> reducers also has the potential to invite them to hurt performance.
Yes, I don't want them messing with those parameters at all. It probably does not make sense to specify --desiredParallelism=100 on a 4-node cluster, but a clever implementation might be able to spawn maybe 8 reducers given that hint.
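
Roughly what I mean by "clever", assuming I'm remembering the old mapred ClusterStatus API correctly (treat the calls below as unverified): clamp the hint against the reduce slots the cluster actually has.

import java.io.IOException;

import org.apache.hadoop.mapred.ClusterStatus;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

final class ReducerHint {

  private ReducerHint() { }

  // e.g. a hint of 100 on a 4-node cluster with 2 reduce slots per node -> 8.
  static int clamp(JobConf conf, int desiredParallelism) throws IOException {
    ClusterStatus status = new JobClient(conf).getClusterStatus();
    int reduceSlots = Math.max(1, status.getMaxReduceTasks());
    return Math.max(1, Math.min(desiredParallelism, reduceSlots));
  }
}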
> Do we have evidence the other way, that users regularly need to
> control this to achieve best performance? I personally never set it
> and let Hadoop base it on the file splits and blocks and such, which
> is a pretty good heuristic.

Anecdotal: when I ran PFPGrowth on the accidents.dat database on a 4-data-node cluster, it only used a single reducer. I haven't yet tried that with -D, but I think others have. Before I added numReducers propagation to seq2sparse, it only launched a single reducer for the back-end steps when doing Reuters, and LDA took 3x longer than necessary on my cluster. DistributedRowMatrix requires -D to achieve the parallelism discussed at the top of this thread. I suspect there are others. I've observed that Hadoop does a pretty good job with mappers, based upon file splits and the like, but not so well with reducers, which is why we have --numReducers in the first place.
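
For reference, the numReducers propagation fix amounts to doing something like this for every back-end job a driver launches (illustrative code, not the actual seq2sparse driver):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

final class NumReducersPropagation {

  private NumReducersPropagation() { }

  static Job newBackEndJob(Configuration conf, String name, int numReducers)
      throws IOException {
    Job job = new Job(conf, name);
    // Without this, each step falls back to mapred.reduce.tasks, which
    // defaults to 1 unless the user also remembered to pass
    // -Dmapred.reduce.tasks=N on the command line.
    job.setNumReduceTasks(numReducers);
    return job;
  }
}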
