On Sep 17, 2007, at 8:08 PM, Ted Dunning wrote:
This isn't a matter of side effects.
Ok, I was just pointing it out because it might not be obvious to most users. By "side effect" I didn't mean anything about the quality of the software, which is remarkable, but only that the Combine step is an additional, intermediate step on top of the original MapReduce programming model.
Sorry if the words "side effect" generated the wrong interpretation; that was my fault. I should have just said "different results" instead.
The issue is that the combiner only sees output from a single map task. That means that the counts will be (statistically speaking) smaller than for the final reduce.
I had realized that myself, as mentioned in my previous email.
Cheers,
Luca
On 9/17/07 9:34 AM, "Luca Telloli" <[EMAIL PROTECTED]> wrote:
Hello Devaraj,
thanks for your detailed answer. I did indeed try running the same tasks another time, using the same inputs as described in my first email, and you're right, the output is the same. This time I used the original wordcount provided with the distribution.
I realized that my mistake was to modify the Reduce class for post-processing of the output in this way:

    if (sum > 100) {
        output.collect(key, new IntWritable(sum));
    }

without keeping the original Combiner class (in the WordCount example, Reducer and Combiner are the same class, named Reduce). I guess that, because the Combiner works on the local data of each single map rather than on the complete output, this can generate unwanted side effects like the one I experienced.
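(To make that concrete with the numbers below: in the single-file run the occurrences of zval were apparently spread across the ~10 maps so that no single map's partial sum exceeded 100; a combiner carrying the sum > 100 test therefore emitted nothing for zval from any map, and the word vanished from the final output, whereas in the multi-file run one map evidently accumulated more than 100 occurrences, so a count survived.)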
Then I tried to run the post-processing on the Reduce values, using the original Reduce as the Combiner and a new Reduce (with a threshold condition such as the one above) as the Reducer. This worked correctly. I assume that doing this kind of operation at Reduce time is reliable and does not generate side effects.
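A minimal sketch of that job wiring, assuming the class names from the shipped WordCount example (MapClass, Reduce); ThresholdReduce is a purely illustrative name for the new reducer carrying the "if (sum > 100)" check:

    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount-threshold");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(MapClass.class);          // mapper from the example, unchanged
    conf.setCombinerClass(Reduce.class);          // original summing Reduce kept as the combiner
    conf.setReducerClass(ThresholdReduce.class);  // new reducer that applies the sum > 100 filter
    // ... input/output paths and formats set as in the original example ...
    JobClient.runJob(conf);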
Cheers,
Luca
On Sun, 2007-09-16 at 16:48 -0700, Devaraj Das wrote:
Hi Luca,
You really raised my curiosity, so I went and tried it myself. I had a bunch of files adding up to 591 MB in one dir, and an equivalent single file in a different dir in HDFS. I ran two MR jobs with #reducers = 2. The outputs were exactly the same.
The split sizes will not affect the outcome in the wordcount case. The #maps is a function of the HDFS block size, the #maps the user specified, and the length/number of files. The RecordReader, org.apache.hadoop.mapred.LineRecordReader, has logic for handling cases where files could be split anywhere (newlines could straddle an HDFS block boundary). If you look at org.apache.hadoop.mapred.FileInputFormat.getSplits, you can see how all this info is used.
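(For reference, a rough paraphrase of what getSplits boils down to; this is from memory and details may differ between versions:

    // numSplits is the mapred.map.tasks hint passed in by the framework
    long goalSize = totalSize / (numSplits == 0 ? 1 : numSplits);
    long splitSize = Math.max(minSplitSize, Math.min(goalSize, blockSize));
    // each file is then chopped into pieces of about splitSize bytes

so the block size, the requested #maps and the total input length all feed into the split size.)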
Hadoop doesn't honor mapred.map.tasks beyond considering it a hint, but it accepts the user-specified mapred.reduce.tasks and doesn't manipulate that. You cannot force mapred.map.tasks, but you can specify mapred.reduce.tasks.
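In code, with the JobConf API, that would be something like:

    conf.setNumMapTasks(10);    // only a hint; the framework may choose a different number of maps
    conf.setNumReduceTasks(2);  // honored exactly as specified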
Thanks,
Devaraj.
-----Original Message-----
From: Luca Telloli [mailto:[EMAIL PROTECTED]]
Sent: Friday, September 14, 2007 7:17 AM
To: [email protected]
Subject: Strange Hadoop behavior - Different results on equivalent input
Hello everyone,
I'm new to Hadoop and to this mailing list so: Hello. =)
I'm experiencing a problem that I can't understand. I'm performing a wordcount task (from the examples in the source) on a single Hadoop node configured as a pseudo-distributed environment. My input is a set of documents I grabbed from /usr/share/doc.
I have two inputs:
- the first one is a set of three files of 189, 45 and 1.9 MB, named input-compact
- the second one is the same content as above, concatenated with cat into a single 236 MB file named input-single, so I'm talking about "equivalent" input
Logs report 11 map tasks for one job and 10 for the other, both with a total of 2 reduce tasks. I expected the outcome to be the same, but it's not, as you can see from the tail of my outputs:
$ tail /tmp/output-*
==> /tmp/output-compact <==
yet.</td> 164
you 23719
You 4603
your 7097
Zend 111
zero, 101
zero 1637
zero-based 114
zval 140
zval* 191
==> /tmp/output-single <==
Y 289
(Yasuhiro 105
yet.</td> 164
you 23719
You 4622
your 7121
zero, 101
zero 1646
zero-based 114
zval* 191
- Does the way Hadoop splits its input into blocks on HDFS influence the outcome of the computation?
- Even so, how can the results be so different? I mean, the word zval, with 140 occurrences in the first run, doesn't even appear in the second one!
- Third question: I've seen that, when files are small, Hadoop tends to create as many maps as there are files. My initial input was scattered across 13k small files and was not good for the task, as I realized quite soon when almost 13k maps ran the same task. At that time, I specified a few parameters in my configuration file, like mapred.map.tasks = 10 and mapred.reduce.tasks = 2. I wonder how Hadoop decides on the number of maps; the help says that mapred.map.tasks is a _value per job_, but I wonder whether it is instead some function of <#tasks, #input files> or other parameters.
- Finally, is there a way to completely force these parameters (the number of maps and reduces)?
Apologies if any of these questions sounds dumb; I'm really new to the software and willing to learn more.
Thanks,
Luca