Thank you all for the good answers, I appreciate it.

Jean-Pierre.

On Mar 21, 2008, at 12:47 PM, Ted Dunning wrote:


The default number of reducers is 4. It is unlikely that a user who doesn't
know about how to set the number of reducers has changed that value.

This phenomenon of apparently having only a single reducer often happens if
you have a very skewed distribution of keys for the reduce phase.

Imagine that you are counting words, but the text is almost entirely made up of a single word. If you don't have a combiner, then all instances of that word will go to a single reducer. The other reducers will finish instantly
so you may not even notice that they ran.
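
To make the skew effect concrete, here is a minimal sketch in plain Python (not Hadoop code). It assumes a simple hash-modulo partitioner and a made-up input in which one word dominates; the function names and the CRC32 hash are illustrative choices, not what Hadoop itself uses.

```python
import zlib

def partition(key: str, num_reducers: int) -> int:
    # Assumption: hash-modulo partitioning. CRC32 is used only because it is
    # stable across runs (Python's built-in hash() is salted per process).
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def reducer_load(keys, num_reducers=4):
    # Count how many records each reducer would receive.
    load = [0] * num_reducers
    for key in keys:
        load[partition(key, num_reducers)] += 1
    return load

# Hypothetical skewed input: one word accounts for almost all records.
words = ["the"] * 100_000 + ["cat", "dog", "fish", "bird"] * 10
loads = reducer_load(words)
print(loads)
```

Whichever reducer owns the dominant key receives at least 100,000 of the 100,040 records, so the other three have almost nothing to do.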

You have several options:

A) fix your program. Most of the time when I have seen this, it was due to a mistake on my part where I was outputting the wrong value as the reduce
key.  I shouldn't admit this, but it is true.

B) fix your program. Many times, you can use a combiner to do a lot of the work of the reducer as the data is emitted from the mapper. This works if the reduction can be done in pieces, so that the
summarization is done by the combiners (very much in parallel) and the
actual reducer only combines the summaries (hopefully very quickly, even if not
in parallel).
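
As a rough illustration of option (B), the sketch below simulates word counting in plain Python, with and without a combiner. The function names and the three input splits are invented for the example; a real job would implement the combiner as a Hadoop class rather than like this.

```python
from collections import Counter

def map_words(split):
    # Mapper: emit (word, 1) for every word in the input split.
    return [(word, 1) for word in split.split()]

def combine(pairs):
    # Combiner: sum the counts per key locally, on one mapper's output.
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return list(totals.items())

def reduce_counts(pairs):
    # Reducer: sum whatever arrives, raw ones or partial sums alike.
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

# Hypothetical input splits, one per mapper, dominated by the word "a".
splits = ["a a a a b", "a a a c", "a a a a a"]
raw = [pair for s in splits for pair in map_words(s)]
combined = [pair for s in splits for pair in combine(map_words(s))]

print(len(raw), len(combined))
print(reduce_counts(raw) == reduce_counts(combined))
```

The final counts are identical either way, but the reducer sees 5 records instead of 14. This only works because addition can be applied to partial sums in any grouping.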

C) fix your problem. Sometimes, your problem as stated is not particularly
appropriate for parallel execution, but there is a nearly equivalent
statement that is just fine. For example, suppose that your input contains a large list of keys and numbers and you want to compute the median of these numbers for each key. There is no sufficient statistic for computing the (exact) median so you can't really use a combiner and if one key dominates,
your program will run slowly.

On the other hand, if you restate the problem to compute the *approximate* median, then you could use a combiner. For instance, suppose the combiner finds the median of the values it is given and emits the median and the count of samples it processed. Then the reducer can compute the median of the partial medians, respecting the counts in the process. This result is *not* the median and your program will be *wrong*. But not by much. If you can accept the error (and it will probably be quite small), then you win.
If you can't accept this error, then option (D) is for you.
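
One possible reading of "respecting the counts" is a weighted median of the partial medians. The plain-Python sketch below takes that reading as an assumption; the function names and the sample chunks are made up for illustration.

```python
import statistics

def combine_median(values):
    # Combiner: emit the local median plus the number of samples it covers.
    return (statistics.median(values), len(values))

def reduce_weighted_median(partials):
    # Reducer: weighted median of the partial medians. Walk them in sorted
    # order and stop once at least half of all samples are accounted for.
    total = sum(count for _, count in partials)
    running = 0
    for median, count in sorted(partials):
        running += count
        if 2 * running >= total:
            return median
    raise ValueError("no partial medians given")

# Hypothetical values for one key, as seen by three separate combiners.
chunks = [[1, 2, 3, 4, 100], [5, 6, 7], [8, 9, 10, 11]]
partials = [combine_median(chunk) for chunk in chunks]
approx = reduce_weighted_median(partials)
exact = statistics.median(v for chunk in chunks for v in chunk)

print(approx, exact)
```

On this data the approximation gives 6 where the exact median is 6.5: wrong, as promised, but close.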

D) give up. Your problem may not be appropriate for map-reduce execution or you may be unable to afford the effort required to restate your problem in a way that will work. This may not be a bad option ... your program may not even need parallelism. Be sure to not overstate the situation. Don't say "my problem is impossible to parallelize" since some clever Jack is likely to come along 10 minutes later and make you look the fool. Instead say "My
problem appeared to be difficult to parallelize so I punted".



On 3/21/08 8:05 AM, "Amar Kamat" <[EMAIL PROTECTED]> wrote:

On Fri, 21 Mar 2008, Jean-Pierre OCALAN wrote:

Hi,

I'm currently working on a project that involves massive log parsing. I have
one master and 6 slaves.
By looking at each slave's logs I've noticed that the REDUCE operation just runs
on one machine.
So does that mean that reduce just runs on one machine? And if that is true, how can I specify that I want the reduce to run on the other machines as well?

You can set it in your job using JobConf.setNumReduceTasks(numReducers). How are you submitting your job? Is it from the examples or your own code?
Amar

Thanks for any help,

Jean-Pierre.








