On 6/11/10 11:42 AM, Sean Owen wrote:
> On Fri, Jun 11, 2010 at 7:33 PM, Jeff Eastman
> <[email protected]> wrote:
>> complete enough for 'neophyte users' and 'regular users' and that only
>> 'power users' should be using the -D abstractions (and with that
>> accepting any idiosyncrasies that may result since we cannot guarantee
>> how they may interact).
> That's a reasonable rule. All you really need to specify is input and
> output, and Hadoop's defaults should work reasonably from there. So I
> view this as an argument to create --input and --output, and that's
> done.
+1 The fact that --input and -Dmapred.input.dir both have the word 'input' in them is a coincidence: they are not interchangeable, and our arguments have a broader function, especially when multiple job steps are invoked.
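
To make that concrete, here is a rough sketch of the kind of multi-step driver I mean (the class and the option handling are illustrative only, not actual Mahout code). The --input/--output values are interpreted by the driver and reused to wire several jobs together, whereas -Dmapred.input.dir only sets one property on one job:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class TwoStepDriver extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    // Naive parse of our own --input/--output; a real driver uses a proper parser.
    Path input = new Path(valueOf(args, "--input"));
    Path output = new Path(valueOf(args, "--output"));
    Path intermediate = new Path(output, "step1");

    // Step 1: reads the user's --input, writes an intermediate directory.
    Job step1 = new Job(getConf(), "step 1");
    FileInputFormat.addInputPath(step1, input);
    FileOutputFormat.setOutputPath(step1, intermediate);
    if (!step1.waitForCompletion(true)) {
      return 1;
    }

    // Step 2: reads step 1's output, writes under the user's --output.
    // A single -Dmapred.input.dir value cannot describe both steps.
    Job step2 = new Job(getConf(), "step 2");
    FileInputFormat.addInputPath(step2, intermediate);
    FileOutputFormat.setOutputPath(step2, new Path(output, "final"));
    return step2.waitForCompletion(true) ? 0 : 1;
  }

  private static String valueOf(String[] args, String name) {
    for (int i = 0; i < args.length - 1; i++) {
      if (name.equals(args[i])) {
        return args[i + 1];
      }
    }
    throw new IllegalArgumentException(name + " is required");
  }

  public static void main(String[] args) throws Exception {
    // ToolRunner strips the generic -D options into the Configuration first.
    System.exit(ToolRunner.run(new Configuration(), new TwoStepDriver(), args));
  }
}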
>> Since the degree of parallelism obtained is often a function of the
>> number of mappers/reducers specified, and since the degree of
>> parallelism is something our 'regular users' would reasonably need to
>> control, perhaps replacing the --numReducers options with
>> --desiredParallelism (or something) and having reasonable defaults on
>> that for our neophytes would be better. Then the implementation could
>> take the user's desires into account and internally manage the numbers
>> of map and reduce tasks where it makes sense to do so.
> On this flag in particular --
>
> It's an appealing idea, but how do the details work? For example, on
> the recommender jobs there are at least 4 MapReduce steps, each of
> which has a fairly different best parallelism setting. The big, last
> phase should be parallelized as much as possible; early phases would
> just be slowed down by using too many mappers.
Good question, and I expect each algorithm would have to do some thinking about how best to honor the user's desired concurrency level. In many cases it would just influence the number of reducers, but file input split size has also been suggested as a control variable. And it might not be possible to honor it at all in some multi-stage jobs such as recommenders. How do you optimize that?
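
Purely as a sketch of what "taking the hint into account" per phase might look like (the scaling choices and names below are invented for illustration; nothing like this exists yet):

import org.apache.hadoop.mapreduce.Job;

final class ParallelismHints {

  private ParallelismHints() { }

  // Early, cheap phases: cap the reducer count so they aren't slowed down
  // by task startup overhead. The cap of 4 is an arbitrary placeholder.
  static void applyToEarlyPhase(Job job, int desiredParallelism) {
    job.setNumReduceTasks(Math.max(1, Math.min(desiredParallelism, 4)));
  }

  // The big final phase: use the full hint. Input split size could also be
  // lowered here to get more mappers, e.g. via mapred.max.split.size.
  static void applyToFinalPhase(Job job, int desiredParallelism) {
    job.setNumReduceTasks(Math.max(1, desiredParallelism));
  }
}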
> What would the neophyte user using this flag do with it? Presumably
> the neophyte just wants it set to "optimal" or "as much as is
> reasonable" and that's basically what Hadoop is already doing, better
> than the user can determine.
I think that neophytes just want something reasonable and would not be messing with this flag. But some of our jobs do not do something reasonable yet on multi-node clusters (see below).
> Encouraging the non-power-user to set the number of mappers and
> reducers also has the potential to invite them to hurt performance.
Yes, I don't want them messing with those parameters at all. It probably does not make sense to specify --desiredParallelism=100 on a 4-node cluster, but a clever implementation might be able to spawn maybe 8 reducers given that hint.
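
Roughly what I mean by "clever", assuming I'm remembering the old mapred ClusterStatus API correctly (treat the calls below as unverified): clamp the hint against the reduce slots the cluster actually has.

import java.io.IOException;

import org.apache.hadoop.mapred.ClusterStatus;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

final class ReducerHint {

  private ReducerHint() { }

  // e.g. a hint of 100 on a 4-node cluster with 2 reduce slots per node -> 8.
  static int clamp(JobConf conf, int desiredParallelism) throws IOException {
    ClusterStatus status = new JobClient(conf).getClusterStatus();
    int reduceSlots = Math.max(1, status.getMaxReduceTasks());
    return Math.max(1, Math.min(desiredParallelism, reduceSlots));
  }
}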
> Do we have evidence the other way, that users regularly need to
> control this to achieve best performance? I personally never set it
> and let Hadoop base it on the file splits and blocks and such, which
> is a pretty good heuristic.

Anecdotal: when I ran PFPGrowth on the accidents.dat database on a 4-data-node cluster, it only used a single reducer. I haven't yet tried that with -D, but I think others have. Before I added numReducers propagation to seq2sparse, it only launched a single reducer for the back-end steps when doing Reuters, and LDA took 3x longer than necessary on my cluster. DistributedRowMatrix requires -D to achieve the parallelism discussed at the top of this thread. I suspect there are others. I've observed that Hadoop does a pretty good job with mappers, based upon file splits and the like, but not so well with reducers, which is why we have --numReducers in the first place.
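
For reference, the numReducers propagation fix amounts to doing something like this for every back-end job a driver launches (illustrative code, not the actual seq2sparse driver):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

final class NumReducersPropagation {

  private NumReducersPropagation() { }

  static Job newBackEndJob(Configuration conf, String name, int numReducers)
      throws IOException {
    Job job = new Job(conf, name);
    // Without this, each step falls back to mapred.reduce.tasks, which
    // defaults to 1 unless the user also remembered to pass
    // -Dmapred.reduce.tasks=N on the command line.
    job.setNumReduceTasks(numReducers);
    return job;
  }
}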
