On Sep 17, 2007, at 8:08 PM, Ted Dunning wrote:
This isn't a matter of side effects.
Ok, I was just pointing it out because it might not be obvious to most users. By "side effect" I didn't mean anything about the quality of the software, which is remarkable, but only that the Combine step is an additional, intermediate step on top of the original MapReduce programming model.
Sorry if the words "side effect" generated the wrong interpretation; that was my fault. I should have just said "different results" instead.
The issue is that the combiner only sees output from a single map task. That means that the counts will be (statistically speaking) smaller than for the final reduce.
I had realized that myself, as mentioned in my previous email.
Cheers,
Luca
On 9/17/07 9:34 AM, "Luca Telloli" <[EMAIL PROTECTED]> wrote:
Hello Devaraj,
thanks for your detailed answer. I did indeed try running the same tasks another time, using the same inputs as described in my first email, and you're right, the output is the same. This time I used the original wordcount provided with the distribution.
I realized that my mistake was to modify the Reduce class for post-processing of the output in this way:

    if (sum > 100) {
        output.collect(key, new IntWritable(sum));
    }

without keeping the original Combiner class (in the WordCount example, Reducer and Combiner are the same class, named Reduce). I guess that, because the Combiner works on the local data of each single map rather than on the complete output, this can generate unwanted side effects like the one I experienced.
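(To make that concrete with the numbers below: in the single-file run the occurrences of zval were apparently spread across the ~10 maps so that no single map's partial sum exceeded 100; a combiner carrying the sum > 100 test therefore emitted nothing for zval from any map, and the word vanished from the final output, whereas in the multi-file run one map evidently accumulated more than 100 occurrences, so a count survived.)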
Then I tried to run the post-processing on the Reduce values, using the original Reduce as the Combiner and a new Reduce (with a threshold condition such as the one above) as the Reducer. This worked correctly. I assume that doing this kind of operation at Reduce time is reliable and does not generate side effects.
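A minimal sketch of that job wiring, assuming the class names from the shipped WordCount example (MapClass, Reduce); ThresholdReduce is a purely illustrative name for the new reducer carrying the "if (sum > 100)" check:

    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount-threshold");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(MapClass.class);          // mapper from the example, unchanged
    conf.setCombinerClass(Reduce.class);          // original summing Reduce kept as the combiner
    conf.setReducerClass(ThresholdReduce.class);  // new reducer that applies the sum > 100 filter
    // ... input/output paths and formats set as in the original example ...
    JobClient.runJob(conf);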
Cheers,
Luca
On Sun, 2007-09-16 at 16:48 -0700, Devaraj Das wrote:
Hi Luca,
You really raised my curiosity, so I went and tried it myself. I had a bunch of files adding up to 591 MB in one dir, and an equivalent single file in a different dir in HDFS. I ran two MR jobs with #reducers = 2. The outputs were exactly the same.
The split sizes will not affect the outcome in the wordcount case. The #maps is a function of the HDFS block size, the #maps the user specified, and the length/number of files. The RecordReader, org.apache.hadoop.mapred.LineRecordReader, has logic for handling cases where files could be split anywhere (newlines could straddle an HDFS block boundary). If you look at org.apache.hadoop.mapred.FileInputFormat.getSplits, you can see how all this info is used.
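(For reference, a rough paraphrase of what getSplits boils down to; this is from memory and details may differ between versions:

    // numSplits is the mapred.map.tasks hint passed in by the framework
    long goalSize = totalSize / (numSplits == 0 ? 1 : numSplits);
    long splitSize = Math.max(minSplitSize, Math.min(goalSize, blockSize));
    // each file is then chopped into pieces of about splitSize bytes

so the block size, the requested #maps and the total input length all feed into the split size.)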
Hadoop doesn't honor mapred.map.tasks beyond considering it a hint, but it accepts the user-specified mapred.reduce.tasks and doesn't manipulate that. You cannot force mapred.map.tasks, but you can specify mapred.reduce.tasks.
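In code, with the JobConf API, that would be something like:

    conf.setNumMapTasks(10);    // only a hint; the framework may choose a different number of maps
    conf.setNumReduceTasks(2);  // honored exactly as specified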
Thanks,
Devaraj.
-----Original Message-----
From: Luca Telloli [mailto:[EMAIL PROTECTED]]
Sent: Friday, September 14, 2007 7:17 AM
To: [email protected]
Subject: Strange Hadoop behavior - Different results on equivalent input
Hello everyone,
I'm new to Hadoop and to this mailing list so: Hello. =)
I'm experiencing a problem that I can't understand. I'm performing a wordcount task (from the examples in the source) on a single Hadoop node configured as a pseudo-distributed environment. My input is a set of documents I grabbed from /usr/share/doc.
I have two inputs:
- the first one is a set of three files of 189, 45 and 1.9 MB, named input-compact
- the second one is the same content as above, concatenated with cat into a single 236 MB file named input-single, so I'm talking about "equivalent" input
Logs report 11 map tasks for one job and 10 for the other, both with a total of 2 reduce tasks. I expected the outcome to be the same, but it's not, as you can see from the tail of my outputs:
$ tail /tmp/output-*
==> /tmp/output-compact <==
yet.</td> 164
you 23719
You 4603
your 7097
Zend 111
zero, 101
zero 1637
zero-based 114
zval 140
zval* 191
==> /tmp/output-single <==
Y 289
(Yasuhiro 105
yet.</td> 164
you 23719
You 4622
your 7121
zero, 101
zero 1646
zero-based 114
zval* 191
- Does the way Hadoop splits its input into blocks on HDFS influence the outcome of the computation?
- Even so, how can the results be so different? I mean, the word zval, with 140 occurrences in the first run, doesn't even appear in the second one!
- Third question: I've seen that, when files are small, Hadoop tends to create as many maps as there are files. My initial input was scattered across 13k small files and was not good for the task, as I realized quite soon when almost 13k maps ran the same task. At that time, I specified a few parameters in my configuration file, like mapred.map.tasks = 10 and mapred.reduce.tasks = 2. I wonder how Hadoop decides on the number of maps; the help says that mapred.map.tasks is a _value per job_, but I wonder whether it is instead some function of <#tasks, #input files> or other parameters.
- Finally, is there a way to completely force these parameters (the number of maps and reduces)?
Apologies if any of these questions sounds dumb; I'm really new to the software and willing to learn more.
Thanks,
Luca