Hello Devaraj, 
        thanks for your detailed answer. I did indeed try running the same
tasks again, using the same inputs described in my first email, and
you're right: the output is the same. This time I used the original
wordcount provided by the distribution. 

I realized that my mistake was to modify the Reduce class to post-process
the output in this way:

      // emit only words that occur more than 100 times
      if (sum > 100)
        output.collect(key, new IntWritable(sum));
    }

without keeping the original Combiner class (in the WordCount example,
Reducer and Combiner are the same class, named Reduce). I guess that,
because the Combiner runs on each map's partial output rather than on
the final merged counts, the threshold gets applied to partial sums,
which can generate unwanted side effects like the one I experienced:
if a word occurs, say, 60 times in one map's output and 80 in
another's, a thresholding Combiner drops both partial counts, even
though the total of 140 is above the threshold. 

Then, I tried running the post-processing on the Reduce values instead,
using the original Reduce as the Combiner and a new Reduce class (with
the threshold condition shown above) as the Reducer; a sketch of this
setup is below. This worked correctly. I assume that doing this kind of
filtering at reduce time is reliable and does not generate side
effects. 
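
For reference, here is roughly what the working setup looks like. This is
only a minimal sketch against the old org.apache.hadoop.mapred API that
the WordCount example uses; the class name ThresholdReduce is my own, and
the exact method signatures may differ a bit between Hadoop versions:

    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    // Final Reducer: sums the (possibly pre-combined) counts and applies
    // the threshold only to the fully merged total for each word.
    public class ThresholdReduce extends MapReduceBase implements Reducer {
      public void reduce(WritableComparable key, Iterator values,
                         OutputCollector output, Reporter reporter)
          throws IOException {
        int sum = 0;
        while (values.hasNext()) {
          sum += ((IntWritable) values.next()).get();
        }
        // the threshold is applied here, on the global count only
        if (sum > 100) {
          output.collect(key, new IntWritable(sum));
        }
      }
    }

    // In the driver (conf being the JobConf), the stock summing Reduce
    // stays as the Combiner and the filtering happens only at reduce time:
    conf.setCombinerClass(Reduce.class);           // original WordCount Reduce
    conf.setReducerClass(ThresholdReduce.class);   // thresholding Reducer

With this split the Combiner only ever does plain summing, so Hadoop is
free to run it (or skip it) on any subset of the map output without
changing the final counts.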

Cheers,
Luca 

On Sun, 2007-09-16 at 16:48 -0700, Devaraj Das wrote: 
> Hi Luca,
> You really raised my curiosity and I went and tried it myself. I had a
> bunch of files adding up to 591 MB in a dir, and an equivalent single file
> in a different dir in the hdfs. Ran two MR jobs with #reducers = 2. The
> outputs were exactly the same.
> The split sizes will not affect the outcome in the wordcount case. The #maps
> is a function of the hdfs block size, the #maps the user specified, and the
> length/number of files. The RecordReader,
> org.apache.hadoop.mapred.LineRecordReader, has logic for handling cases where
> files could be split anywhere (newlines could straddle an hdfs block boundary).
> If you look at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits, you can see how all this
> info is used. 
> Hadoop doesn't honor mapred.map.tasks beyond considering it a hint. But it
> accepts the user specified mapred.reduce.tasks and doesn't manipulate that.
> You cannot force mapred.map.tasks but can specify mapred.reduce.tasks.
> Thanks,
> Devaraj.
> 
> 
> > -----Original Message-----
> > From: Luca Telloli [mailto:[EMAIL PROTECTED] 
> > Sent: Friday, September 14, 2007 7:17 AM
> > To: [email protected]
> > Subject: Strange Hadoop behavior - Different results on 
> > equivalent input
> > 
> > Hello everyone,
> > I'm new to Hadoop and to this mailing list so: Hello. =)
> > 
> > I'm experiencing a problem that I can't understand; I'm 
> > performing a wordcount task (from the examples in the source) 
> > on a single Hadoop node configured as a pseudo-distributed 
> > environment. My input is a set of documents I scraped from 
> > /usr/share/doc. 
> > 
> > I have two inputs: 
> > - the first one is a set of three files of 189, 45 and 1.9 
> > MB, named input-compact
> > - the second one is the same data put into a single 236MB 
> > file with cat, named input-single, so I'm talking about 
> > "equivalent" input 
> > 
> > Logs report 11 map tasks for one job and 10 for the other, 
> > both having a total of 2 reduce tasks. I expect the outcome 
> > to be the same, but it's not, as you can see from the tail of 
> > my outputs: 
> > 
> > $ tail /tmp/output-*
> > ==> /tmp/output-compact <==
> > yet.</td>       164
> > you     23719
> > You     4603
> > your    7097
> > Zend    111
> > zero,   101
> > zero    1637
> > zero-based      114
> > zval    140
> > zval*   191
> > 
> > ==> /tmp/output-single <==
> > Y       289
> > (Yasuhiro       105
> > yet.</td>       164
> > you     23719
> > You     4622
> > your    7121
> > zero,   101
> > zero    1646
> > zero-based      114
> > zval*   191
> > 
> > - Does the way Hadoop splits its input into blocks on HDFS 
> > influence the possible outcome of the computation? 
> > 
> > - Even so: how can the result be so different? I mean, the 
> > word zval, having 140 occurrences in the first run, doesn't 
> > even appear in the second one! 
> > 
> > - Third question: I've noticed that, when files are 
> > small, hadoop tends to make as many maps as the number of 
> > files. My initial input was scattered across 13k different 
> > small files and was not good for the task, as I realized 
> > quite soon, having almost 13k maps running the same task.
> > At that time, I specified a few parameters in my 
> > initialization file, like mapred.map.tasks = 10 and 
> > mapred.reduce.tasks = 2. I wonder how hadoop decides on the 
> > number of maps; the help says that mapred.map.tasks is 
> > a _value per job_, but I wonder if it is instead some 
> > function of <#tasks, #input files> or other parameters. 
> > 
> > - Finally, is there a way to completely force these 
> > parameters (the numbers of maps and reduces)? 
> > 
> > Apologies if any of these questions sound dumb; I'm 
> > really new to the software and willing to learn more. 
> > 
> > Thanks,
> > Luca 
> > 
> > 
> 
