On Sep 17, 2007, at 8:08 PM, Ted Dunning wrote:



This isn't a matter of side effects.

OK, I was just pointing it out because it might not be obvious to most users. By "side effect" I wasn't referring to the quality of the software, which is remarkable, but to the fact that the Combine step is an additional, intermediate step relative to the original MapReduce programming model.

Sorry if the words "side effect" led to the wrong interpretation; that was my fault. I should simply have said "different results" instead.


The issue is that the combiner only sees output from a single map task. That means that the counts will be (statistically speaking) smaller than for
the final reduce.
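
To make the arithmetic concrete, here is a small illustration with made-up per-map partial counts (the 140 total matches the "zval" count in the output quoted further down, but the 90/50 split across two map tasks is purely hypothetical): if the sum > 100 threshold runs in the combiner, each partial count is filtered on its own, so a word can vanish even though its true total passes the threshold.

    // Illustrative arithmetic only, not Hadoop code. Hypothetical per-map
    // partial counts for a single word (true total 140, split 90 + 50).
    public class CombinerThresholdDemo {
        public static void main(String[] args) {
            int[] perMapCounts = {90, 50};
            int threshold = 100;

            // Threshold applied in the combiner: each partial count is
            // filtered on its own, so 90 and 50 are both dropped.
            int combinerFiltered = 0;
            for (int partial : perMapCounts) {
                if (partial > threshold) {
                    combinerFiltered += partial;
                }
            }

            // Threshold applied only in the reducer: partials are summed first.
            int total = 0;
            for (int partial : perMapCounts) {
                total += partial;
            }
            int reducerFiltered = (total > threshold) ? total : 0;

            System.out.println("combiner threshold -> " + combinerFiltered); // 0
            System.out.println("reducer threshold  -> " + reducerFiltered);  // 140
        }
    }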


I realized that myself, as I mentioned in my previous email.

Cheers,
Luca




On 9/17/07 9:34 AM, "Luca Telloli" <[EMAIL PROTECTED]> wrote:

Hello Devaraj,
thanks for your detailed answer. I did indeed try again, running the same tasks on the same inputs described in my first email, and you're right: the output is the same. This time I used the original wordcount provided by the distribution.

I realized that my mistake was to modify the Reduce class to post-process the output like this:

      // at the end of reduce(), after summing the values for a key:
      if (sum > 100) {
        output.collect(key, new IntWritable(sum));
      }
    }

without keeping the original Combiner class (in the WordCount example, the Reducer and the Combiner are the same class, named Reduce). I guess that, because the Combiner only sees the local, partial output of a single map task rather than the full set of values for a key, applying the threshold there can generate unwanted side effects like the one I experienced.

Then I tried to post-process the values at reduce time, using the original Reduce as the Combiner and a new Reduce class (with a threshold condition like the one above) as the Reducer. This worked correctly. I assume that applying operations like this at reduce time is reliable and does not generate side effects.
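
For reference, a minimal sketch of that wiring against the old org.apache.hadoop.mapred API (ThresholdReduce is an illustrative name, the 100 threshold is the one from the snippet above, and exact Reducer signatures vary slightly between Hadoop versions):

    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    // Hypothetical reducer: sums all partial counts for a key first,
    // then applies the threshold on the full total.
    public class ThresholdReduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output,
                           Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            if (sum > 100) {
                output.collect(key, new IntWritable(sum));
            }
        }
    }

    // In the job setup, the original (unthresholded) Reduce stays as the
    // combiner and the thresholded class becomes the reducer:
    //   conf.setCombinerClass(Reduce.class);
    //   conf.setReducerClass(ThresholdReduce.class);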

Cheers,
Luca

On Sun, 2007-09-16 at 16:48 -0700, Devaraj Das wrote:
Hi Luca,
You really piqued my curiosity, so I went and tried it myself. I had a bunch of files adding up to 591 MB in one dir, and an equivalent single file in a different dir in HDFS. I ran two MR jobs with #reducers = 2. The outputs were exactly the same.
The split sizes will not affect the outcome in the wordcount case. The number of maps is a function of the HDFS block size, the number of maps the user specified, and the length/number of the input files. The RecordReader, org.apache.hadoop.mapred.LineRecordReader, has logic for handling cases where files could be split anywhere (newlines can straddle an HDFS block boundary).
If you look at
org.apache.hadoop.mapred.FileInputFormat.getSplits, you can see how all this
info is used.
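
For anyone following along, here is a simplified paraphrase of that split-size choice (not the verbatim Hadoop source), plugged in with the numbers from this thread; the 64 MB block size is only assumed, since it is the default:

    public class SplitSizeSketch {
        // Simplified paraphrase of how the old FileInputFormat.getSplits picks
        // a split size: the per-job goal size (total input / requested maps)
        // is capped by the HDFS block size and floored by the minimum split
        // size, which is why mapred.map.tasks is only a hint.
        static long computeSplitSize(long goalSize, long minSize, long blockSize) {
            return Math.max(minSize, Math.min(goalSize, blockSize));
        }

        public static void main(String[] args) {
            long mb = 1L << 20;
            long blockSize = 64 * mb;     // assumed: the default HDFS block size
            long minSize = 1;             // default mapred.min.split.size
            long totalSize = 236 * mb;    // the single concatenated input file
            int requestedMaps = 10;       // the mapred.map.tasks hint

            long goalSize = totalSize / requestedMaps;
            long splitSize = computeSplitSize(goalSize, minSize, blockSize);

            // Each file is carved into chunks of roughly this size (plus a
            // final remainder split), and splits never cross file boundaries,
            // which is how three files can produce 11 maps while the same
            // bytes in one file produce 10.
            System.out.println("splitSize ~ " + (splitSize / mb) + " MB");
        }
    }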
Hadoop doesn't honor mapred.map.tasks beyond treating it as a hint, but it accepts the user-specified mapred.reduce.tasks and doesn't change it. You cannot force mapred.map.tasks, but you can force mapred.reduce.tasks.
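
If it helps, the same two knobs set programmatically (a sketch only; the XML properties in the configuration file behave the same way):

    import org.apache.hadoop.mapred.JobConf;

    public class TaskCountHints {
        public static void configure(JobConf conf) {
            conf.setNumMapTasks(10);    // mapred.map.tasks: only a hint, getSplits() decides
            conf.setNumReduceTasks(2);  // mapred.reduce.tasks: honored exactly as given
        }
    }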
Thanks,
Devaraj.


-----Original Message-----
From: Luca Telloli [mailto:[EMAIL PROTECTED]
Sent: Friday, September 14, 2007 7:17 AM
To: [email protected]
Subject: Strange Hadoop behavior - Different results on
equivalent input

Hello everyone,
I'm new to Hadoop and to this mailing list so: Hello. =)

I'm experiencing a problem that I can't understand; I'm
performing a wordcount task (from the examples in the source)
on a single Hadoop node configured as a pseudo-distributed
environment. My input is a set of documents I scraped from
/usr/share/doc.

I have two inputs:
- the first one is a set of three files of 189, 45 and 1.9
MB, named input-compact
- the second one is the same content concatenated with cat into a
single 236 MB file, named input-single, so I'm talking about
"equivalent" input

Logs report 11 map tasks for one job and 10 for the other,
both having a total of 2 reduce tasks. I expect the outcome
to be the same, but it's not, as the tail of my outputs shows:

$ tail /tmp/output-*
==> /tmp/output-compact <==
yet.</td>       164
you     23719
You     4603
your    7097
Zend    111
zero,   101
zero    1637
zero-based      114
zval    140
zval*   191

==> /tmp/output-single <==
Y       289
(Yasuhiro       105
yet.</td>       164
you     23719
You     4622
your    7121
zero,   101
zero    1646
zero-based      114
zval*   191

- Does the way Hadoop splits its input into blocks on HDFS
influence the outcome of the computation?

- Even so, how can the results be so different? I mean, the
word zval, which has 140 occurrences in the first run, doesn't
even appear in the second one!

- Third question: I've noticed that, when files are
small, Hadoop tends to create as many maps as there are
files. My initial input was scattered across 13k small
files and was not well suited to the task, as I realized
quite soon when almost 13k maps ran the same job.
At that point I specified a few parameters in my
configuration file, such as mapred.map.tasks = 10 and
mapred.reduce.tasks = 2. I wonder how Hadoop decides on the
number of maps; the documentation says that mapred.map.tasks is
a _value per job_, but I wonder whether it is instead some
function of <#tasks, #input files> or other parameters.

- Finally, is there a way to completely force these
parameters (the numbers of map and reduce tasks)?

Apologies if any of these questions sound dumb; I'm
really new to the software and eager to learn more.

Thanks,
Luca





