Hello Devaraj,
thanks for your detailed answer. I did indeed rerun the same jobs on
the same inputs described in my first email and you're right, the
output is the same. This time I used the original WordCount provided
with the distribution.
I realized that my mistake was to modify the Reduce class to
post-process the output in this way:
    if (sum > 100)
        output.collect(key, new IntWritable(sum));
}
without keeping the original class as the Combiner (in the WordCount
example, Reducer and Combiner are the same class, named Reduce). I
guess that, because the Combiner works on each map's local, partial
data in memory rather than on the final merged values, applying the
threshold there can generate unwanted side effects like the one I
experienced.
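In other words, I had left the example's job setup untouched, which
registers the same (modified) class for both roles. A rough sketch of
that wiring, using the old org.apache.hadoop.mapred JobConf API and
the class names from the example I started from:

    // WordCount example wiring: the same Reduce class is used twice,
    // so my sum > 100 filter also ran on per-map partial counts
    conf.setMapperClass(MapClass.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);
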
Then I tried the post-processing again, this time using the original
Reduce as the Combiner and a new Reduce class (with the threshold
condition above) as the Reducer, and this worked correctly. I assume
that doing this kind of filtering at reduce time is reliable and does
not generate side effects.
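For reference, the working setup looks roughly like this;
ThresholdReduce is just my name for the new Reducer (the example's
Reduce plus the sum > 100 check):

    // combine with the plain summing Reduce; filter only on final totals
    conf.setMapperClass(MapClass.class);
    conf.setCombinerClass(Reduce.class);          // original Reduce: just sums
    conf.setReducerClass(ThresholdReduce.class);  // applies the threshold
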
Cheers,
Luca
On Sun, 2007-09-16 at 16:48 -0700, Devaraj Das wrote:
> Hi Luca,
> You really raised my curiosity and I went and tried it myself. I had a
> bunch of files adding up to 591 MB in a dir, and an equivalent single file
> in a different dir in the hdfs. Ran two MR jobs with #reducers = 2. The
> outputs were exactly the same.
> The split sizes will not affect the outcome in the wordcount case. The number
> of maps is a function of the hdfs block size, the number of maps the user
> specified, and the length/number of the input files. The RecordReader,
> org.apache.hadoop.mapred.LineRecordReader has logic for handling cases where
> files could be split anywhere (newlines could straddle hdfs block boundary).
> If you look at
> org.apache.hadoop.mapred.FileInputFormat.getSplits, you can see how all this
> info is used.
> Hadoop doesn't honor mapred.map.tasks beyond considering it a hint. But it
> accepts the user specified mapred.reduce.tasks and doesn't manipulate that.
> You cannot force mapred.map.tasks but can specify mapred.reduce.tasks.
> Thanks,
> Devaraj.
>
>
> > -----Original Message-----
> > From: Luca Telloli [mailto:[EMAIL PROTECTED]
> > Sent: Friday, September 14, 2007 7:17 AM
> > To: [email protected]
> > Subject: Strange Hadoop behavior - Different results on
> > equivalent input
> >
> > Hello everyone,
> > I'm new to Hadoop and to this mailing list so: Hello. =)
> >
> > I'm experiencing a problem that I can't understand; I'm
> > performing a wordcount task (from the examples in the source)
> > on a single Hadoop node configured as a pseudo-distributed
> > environment. My input is a set of documents I grabbed from
> > /usr/share/doc.
> >
> > I have two inputs:
> > - the first one is a set of three files of 189, 45 and 1.9
> > MB, named input-compact
> > - the second one is the same as above, put into a single 236MB
> > file with cat, named input-single, so I'm talking about
> > "equivalent" input
> >
> > Logs report 11 map tasks for one job and 10 for the other,
> > both having a total of 2 reduce tasks. I expect the outcome
> > to be the same, but it's not, as you can see from the tail of
> > my outputs:
> >
> > $ tail /tmp/output-*
> > ==> /tmp/output-compact <==
> > yet.</td> 164
> > you 23719
> > You 4603
> > your 7097
> > Zend 111
> > zero, 101
> > zero 1637
> > zero-based 114
> > zval 140
> > zval* 191
> >
> > ==> /tmp/output-single <==
> > Y 289
> > (Yasuhiro 105
> > yet.</td> 164
> > you 23719
> > You 4622
> > your 7121
> > zero, 101
> > zero 1646
> > zero-based 114
> > zval* 191
> >
> > - Does the way Hadoop splits its input into blocks on HDFS
> > influence the outcome of the computation?
> >
> > - Even so: how can the result be so different? I mean, the
> > word zval, having 140 occurrences in the first run, doesn't
> > even appear in the second one!
> >
> > - Third question: I've noticed that, when files are small,
> > Hadoop tends to create as many maps as there are files. My
> > initial input was scattered across 13k small files and was not
> > a good fit for the task, as I realized quite soon when almost
> > 13k map tasks ran for the same job.
> > At that time, I specified a few parameters in my
> > initialization file, like mapred.map.tasks = 10 and
> > mapred.reduce.tasks = 2. I wonder how Hadoop decides on the
> > number of maps; the help says that mapred.map.tasks is
> > a _value per job_, but I wonder whether it is instead some
> > function of <#tasks, #input files> or other parameters.
> >
> > - Finally, is there a way to completely force these
> > parameters (the number of maps and reduces)?
> >
> > Apologies if any of these questions might sound dumb, I'm
> > really new to the software and willing to learn more.
> >
> > Thanks,
> > Luca
> >
> >
>