Hello everyone, I'm new to Hadoop and to this mailing list, so: hello. =) I'm experiencing a problem that I can't understand. I'm running a wordcount task (from the examples in the source) on a single Hadoop node configured as a pseudo-distributed environment. My input is a set of documents I scraped from /usr/share/doc.
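For reference, as far as I can tell the example boils down to the sketch below. This is my own minimal reconstruction against the old org.apache.hadoop.mapred API, not the literal example code, and the input/output paths are placeholders:

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCountSketch {

  // Map: emit <word, 1> for every token in the input line.
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        output.collect(word, ONE);
      }
    }
  }

  // Reduce: sum the counts collected for each word.
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(WordCountSketch.class);
    conf.setJobName("wordcount-sketch");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class); // local pre-aggregation of counts
    conf.setReducerClass(Reduce.class);

    FileInputFormat.setInputPaths(conf, new Path("input-compact"));
    FileOutputFormat.setOutputPath(conf, new Path("output-compact"));

    JobClient.runJob(conf);
  }
}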
I have two inputs:
- the first one is a set of three files of 189, 45 and 1.9 MB, named input-compact;
- the second one is the same data concatenated with cat into a single 236 MB file, named input-single.

So I'm talking about "equivalent" inputs. The logs report 11 map tasks for one job and 10 for the other, both with a total of 2 reduce tasks. I expected the outcome to be the same, but it isn't, as the tails of my outputs show:

$ tail /tmp/output-*
==> /tmp/output-compact <==
yet.</td>   164
you         23719
You         4603
your        7097
Zend        111
zero,       101
zero        1637
zero-based  114
zval        140
zval*       191

==> /tmp/output-single <==
Y           289
(Yasuhiro   105
yet.</td>   164
you         23719
You         4622
your        7121
zero,       101
zero        1646
zero-based  114
zval*       191

My questions:

- Does the way Hadoop splits its input into blocks on HDFS influence the outcome of the computation?
- Even so, how can the results differ this much? I mean, the word zval, with 140 occurrences in the first run, doesn't even appear in the second one!
- Third question: I've noticed that when files are small, Hadoop tends to create as many maps as there are files. My initial input was scattered across 13k small files and was not good for the task, as I realized quite soon when almost 13k maps ran the same job. At that point I set a few parameters in my configuration file, like mapred.map.tasks = 10 and mapred.reduce.tasks = 2. I wonder how Hadoop decides on the number of maps: the help says mapred.map.tasks is a _value per job_, but I suspect it is actually some function of the number of tasks and input files, or of other parameters.
- Finally, is there a way to force these parameters (the numbers of maps and reduces) outright? (My current attempt is in the P.S. below.)

Apologies if any of these questions sound dumb; I'm really new to the software and willing to learn more.

Thanks,
Luca
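P.S. To make the last question concrete, this is how I'm currently trying to pin the task counts programmatically. My (possibly wrong) reading of the old mapred API is that the reduce count is binding, while the map count is only a hint that the InputFormat may override:

import org.apache.hadoop.mapred.JobConf;

public class TaskCountSketch {
  public static void main(String[] args) {
    JobConf conf = new JobConf();
    // mapred.map.tasks: apparently only a hint; the InputFormat's
    // getSplits() computes the actual number of maps from the input
    // files and the HDFS block size.
    conf.setNumMapTasks(10);
    // mapred.reduce.tasks: this one seems to be honored as given.
    conf.setNumReduceTasks(2);
  }
}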
