Thank you guys for all those good answers, I appreciate it.
Jean-Pierre.
On Mar 21, 2008, at 12:47 PM, Ted Dunning wrote:
The default number of reducers is 4. It is unlikely that a user who
doesn't know how to set the number of reducers has changed that value.
This phenomenon of apparently having only a single reducer often happens
if you have a very skewed distribution of keys for the reduce phase.
Imagine that you are counting words, but the text is almost entirely made
up of a single word. If you don't have a combiner, then all instances of
that word will go to a single reducer. The other reducers will finish
instantly, so you may not even notice that they ran.
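To make the skew concrete, here is a minimal sketch in plain Python (not Hadoop code). It assumes keys are assigned to reducers by hashing the key modulo the number of reducers; the `partition` and `reducer_loads` helpers and the sample word list are hypothetical, made up for illustration.

```python
import zlib

NUM_REDUCERS = 4  # the figure quoted in the message above


def partition(key, num_reducers=NUM_REDUCERS):
    """Pick a reducer for a key by hashing it (assumed hash partitioning)."""
    # crc32 is used only because it is deterministic across runs.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def reducer_loads(keys, num_reducers=NUM_REDUCERS):
    """Count how many records each reducer would receive."""
    loads = {r: 0 for r in range(num_reducers)}
    for key in keys:
        loads[partition(key, num_reducers)] += 1
    return loads


# Hypothetical skewed input: one word makes up almost all of the text.
words = ["the"] * 9990 + ["cat", "sat", "on", "a", "mat"] * 2
loads = reducer_loads(words)
for r in sorted(loads):
    print(f"reducer {r}: {loads[r]} records")
```

Whichever reducer the dominant word hashes to receives at least 9990 of the 10000 records, so the other three have almost nothing to do.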
You have several options:
A) fix your program. Most of the time that I have seen this, it was due
to a mistake on my part where I was outputting the wrong value as the
reduce key. I shouldn't admit this, but it is true.
B) fix your program. Many times, you can use a combiner to do a lot of
the work of the reducer as the data is emitted from the mapper. This
works if the reducer's work can be done in pieces, so that the
summarization is done by the combiners (very much in parallel) and the
actual reducer only combines the summaries (hopefully very quick, even if
not in parallel).
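As a rough illustration of option (B), the following plain-Python sketch (again not Hadoop code) simulates word counting with and without a combiner. The function names and the three input splits are hypothetical; the point is only that pre-summing per mapper shrinks what the single busy reducer has to process.

```python
from collections import Counter


def map_words(split):
    """Mapper: emit a (word, 1) pair for every word in one input split."""
    return [(word, 1) for word in split]


def combine(pairs):
    """Combiner: pre-sum the counts emitted by a single mapper."""
    partial = Counter()
    for word, count in pairs:
        partial[word] += count
    return list(partial.items())


def reduce_counts(pairs):
    """Reducer: sum all (word, count) pairs into final totals."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


# Hypothetical splits, one per mapper, dominated by a single word.
splits = [["the"] * 1000 + ["cat"], ["the"] * 2000 + ["mat"], ["the"] * 3000]

without_combiner = [p for s in splits for p in map_words(s)]
with_combiner = [p for s in splits for p in combine(map_words(s))]

# Both routes give the same totals; only the shuffled volume differs.
assert reduce_counts(without_combiner) == reduce_counts(with_combiner)
print(len(without_combiner), "pairs reach the reducers without a combiner")
print(len(with_combiner), "pairs reach the reducers with a combiner")
```

This works for counting because addition can be applied to partial sums in any grouping; a reducer whose work cannot be split that way cannot reuse its logic as a combiner.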
C) fix your problem. Sometimes, your problem as stated is not
particularly appropriate for parallel execution, but there is a nearly
equivalent statement that is just fine. For example, suppose that your
input contains a large list of keys and numbers and you want to compute
the median of these numbers for each key. There is no sufficient
statistic for computing the (exact) median, so you can't really use a
combiner, and if one key dominates, your program will run slowly.
On the other hand, if you restate the problem to compute the
*approximate* median, then you could use a combiner. For instance,
suppose the combiner finds the median of the values it is given and emits
the median and the count of samples it processed. Then the reducer can
compute the median of the partial medians, respecting the counts in the
process. This result is *not* the median and your program will be
*wrong*. But not by much. If you can accept the error (and it will
probably be quite small), then you win.
If you can't accept this error, then option (D) is for you.
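The approximate-median idea in option (C) could be sketched as follows, in plain Python rather than Hadoop code. Reading "respecting the counts" as a weighted median is an assumption on my part, and the function names and sample chunks are hypothetical.

```python
import statistics


def combine_median(values):
    """Combiner: summarize one chunk as (median, number of samples)."""
    return (statistics.median(values), len(values))


def reduce_weighted_median(summaries):
    """Reducer: weighted median of partial medians, weighted by count."""
    ordered = sorted(summaries)
    total = sum(count for _, count in ordered)
    running = 0
    for median, count in ordered:
        running += count
        # Stop at the first partial median covering half of all samples.
        if 2 * running >= total:
            return median
    raise ValueError("no summaries to reduce")


# Hypothetical values for a single key, split across three mappers.
chunks = [[1, 2, 3, 4, 100], [5, 6, 7], [8, 9, 10, 11, 12, 13, 14]]

summaries = [combine_median(chunk) for chunk in chunks]
approx = reduce_weighted_median(summaries)
exact = statistics.median([v for chunk in chunks for v in chunk])
print("approximate median:", approx)  # 6
print("exact median:", exact)         # 8
```

On this tiny, deliberately unbalanced input the answer is off by two. How close the approximation gets in practice depends on how the values are spread across the mappers.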
D) give up. Your problem may not be appropriate for map-reduce execution
or you may be unable to afford the effort required to restate your
problem in a way that will work. This may not be a bad option ... your
program may not even need parallelism. Be sure to not overstate the
situation. Don't say "my problem is impossible to parallelize" since some
clever Jack is likely to come along 10 minutes later and make you look
the fool. Instead say "My problem appeared to be difficult to parallelize
so I punted".
On 3/21/08 8:05 AM, "Amar Kamat" <[EMAIL PROTECTED]> wrote:
On Fri, 21 Mar 2008, Jean-Pierre OCALAN wrote:
Hi,
I'm currently working on a project that involves massive log parsing. I
have one master and 6 slaves.
By looking at each slave's logs I've noticed that the REDUCE operation
runs on just one machine.
So does that mean that reduce only ever runs on one machine? And if that
is true, how can I specify that I want the reduce to run on the other
machines as well?
You can set it in your job using JobConf.setNumReduceTasks(numReducers).
How are you submitting your job? Is it from examples or your own code?
Amar
Thanks for any help,
Jean-Pierre.