Re: [core-user][reduce seems to run only on one machine]

Jean-Pierre OCALAN Fri, 21 Mar 2008 11:03:10 -0700

Thank you all for the good answers, I appreciate it.


Jean-Pierre.

On Mar 21, 2008, at 12:47 PM, Ted Dunning wrote:

The default number of reducers is 4. It is unlikely that a user who doesn't
know how to set the number of reducers has changed that value.

This phenomenon of apparently having only a single reducer often happens if
you have a very skewed distribution of keys for the reduce phase.

Imagine that you are counting words, but the text is almost entirely made up
of a single word. If you don't have a combiner, then all instances of that
word will go to a single reducer. The other reducers will finish instantly,
so you may not even notice that they ran.
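
The skew described above can be sketched in a few lines of plain Python.
This is only a simulation, not Hadoop code: the crc32-based partition()
function is a stand-in for a hash partitioner, the word list is made up, and
the reducer count of 4 is simply the value quoted in this message.

```python
from collections import Counter
import zlib

NUM_REDUCERS = 4  # the value quoted in this message, used here as an assumption


def partition(key, num_reducers=NUM_REDUCERS):
    # Deterministic stand-in for a hash partitioner (not Hadoop's implementation).
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def reducer_load(records, num_reducers=NUM_REDUCERS):
    """Count how many (key, value) records each reducer would receive."""
    load = [0] * num_reducers
    for key, _ in records:
        load[partition(key, num_reducers)] += 1
    return load


def combine(records):
    """Word-count combiner: sum the values per key within one mapper's output."""
    totals = Counter()
    for key, value in records:
        totals[key] += value
    return list(totals.items())


# Hypothetical mapper output: text that is almost entirely the word "the".
words = ["the"] * 9996 + ["cat", "sat", "on", "mat"]
map_output = [(w, 1) for w in words]

print("records per reducer, no combiner:  ", reducer_load(map_output))
print("records per reducer, with combiner:", reducer_load(combine(map_output)))
```

Without the combiner, one reducer receives at least 9996 of the 10000
records. With it, only five records (one per distinct word) leave the mapper,
so the skew in the key distribution no longer matters much.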

You have several options:
A) fix your program. Most of the times that I have seen this, it was due to
a mistake on my part where I was outputting the wrong value as the reduce
key.  I shouldn't admit this, but it is true.
B) fix your program. Many times, you can use a combiner to do a lot of the
work of the reducer as the data is emitted from the mapper. This works if
the reducer's work can be done in pieces: the combiners summarize the pieces
(very much in parallel) and the actual reducer only combines the summaries
(hopefully very quick, even if not in parallel).
C) fix your problem. Sometimes, your problem as stated is not particularly
appropriate for parallel execution, but there is a nearly equivalent
statement that is just fine. For example, suppose that your input contains
a large list of keys and numbers and you want to compute the median of these
numbers for each key. There is no sufficient statistic for computing the
(exact) median, so you can't really use a combiner, and if one key dominates,
your program will run slowly.
On the other hand, if you restate the problem to compute the *approximate*
median, then you could use a combiner. For instance, suppose the combiner
finds the median of the values it is given and emits the median and the
count of samples it processed. Then the reducer can compute the median of
the partial medians, respecting the counts in the process. This result is
*not* the median and your program will be *wrong*. But not by much. If you
can accept the error (and it will probably be quite small), then you win.
If you can't accept this error, then option (D) is for you.
D) give up. Your problem may not be appropriate for map-reduce execution or
you may be unable to afford the effort required to restate your problem in a
way that will work. This may not be a bad option ... your program may not
even need parallelism. Be sure to not overstate the situation. Don't say
"my problem is impossible to parallelize" since some clever Jack is likely
to come along 10 minutes later and make you look the fool. Instead say "My
problem appeared to be difficult to parallelize so I punted".
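
The approximate median from option (C) might look roughly like the sketch
below. It is plain Python rather than Hadoop code, the function names and
sample data are made up for illustration, and "respecting the counts" is
interpreted here as a weighted median of the partial medians, which is one
reasonable reading and not necessarily the only one.

```python
import statistics


def combiner(values):
    """Summarize one mapper's values for a key as (median, count)."""
    return (statistics.median(values), len(values))


def reducer(partials):
    """Weighted median of the partial medians, each weighted by its count.

    Returns None if there are no partials.
    """
    ordered = sorted(partials)
    total = sum(count for _, count in ordered)
    seen = 0
    for median, count in ordered:
        seen += count
        # Stop at the first partial median covering half of all samples.
        if 2 * seen >= total:
            return median
    return None


# Hypothetical values for a single key, split across three mappers.
splits = [
    [1, 2, 3, 4, 100],
    [5, 6, 7],
    [8, 9, 10, 11, 12, 13],
]

partials = [combiner(s) for s in splits]
approx = reducer(partials)
exact = statistics.median([v for s in splits for v in s])

print("partials:", partials)
print("approximate median:", approx)  # 6
print("exact median:", exact)         # 7.5
```

The reducer only ever sees one (median, count) pair per mapper, so a
dominant key no longer sends every raw value to one machine. The answer is
approximate, as the message says: here it is 6 where the exact median is 7.5.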



On 3/21/08 8:05 AM, "Amar Kamat" <[EMAIL PROTECTED]> wrote:
On Fri, 21 Mar 2008, Jean-Pierre OCALAN wrote:

Hi,
I'm currently working on a project that involves massive log parsing. I have
one master and 6 slaves.
By looking at each slave's logs I've noticed that the REDUCE operation just
runs on one machine.
So does that mean that reduce just runs on one machine? And if that is true,
how can I specify that I want the reduce to run on the other machines too?

You can set it in your job using JobConf.setNumReduceTasks(numReducers).
How are you submitting your job? Is it from examples or your own code?
Amar

Thanks for any help,

Jean-Pierre.
