This isn't a matter of side effects.
The issue is that the combiner only sees the output of a single map task, so the counts it works with are (statistically speaking) smaller than the totals the final reduce sees. A threshold such as sum > 100 applied in the combiner therefore drops any word whose partial count within one map task is 100 or less, even if its overall total is much larger, and which words get dropped depends on how the input happens to be split.
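For reference, the safe split between the two roles looks roughly like this with the old org.apache.hadoop.mapred API. This is only a sketch modeled on the bundled WordCount example: the class names ThresholdWordCount, Sum and SumWithThreshold are made up here, and the pre-generics signatures shown may differ slightly depending on your Hadoop version.

// Illustrative sketch only, not the stock example; names are invented.
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class ThresholdWordCount {

  // Combiner: plain per-map summing, no filtering. It only ever sees the
  // partial counts produced by one map task, so it must not drop anything.
  public static class Sum extends MapReduceBase implements Reducer {
    public void reduce(WritableComparable key, Iterator values,
                       OutputCollector output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += ((IntWritable) values.next()).get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  // Reducer: same summing plus the post-processing threshold. The threshold
  // is safe here because the reducer sees every count for a given key.
  public static class SumWithThreshold extends MapReduceBase implements Reducer {
    public void reduce(WritableComparable key, Iterator values,
                       OutputCollector output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += ((IntWritable) values.next()).get();
      }
      if (sum > 100) {
        output.collect(key, new IntWritable(sum));
      }
    }
  }

  // In the driver (mapper setup omitted, as in the stock example):
  //   conf.setCombinerClass(Sum.class);
  //   conf.setReducerClass(SumWithThreshold.class);
}

With this split the combiner is a pure, order-independent aggregation, so the framework can apply it zero or more times without changing the final result.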
On 9/17/07 9:34 AM, "Luca Telloli" <[EMAIL PROTECTED]> wrote:

> Hello Devaraj,
> thanks for your detailed answer. I did indeed try another time, running
> the same tasks and using the same inputs as described in my first email,
> and you're right, the output is the same. This time I used the original
> wordcount provided by the distribution.
>
> I realized that my mistake was to modify the Reduce class for
> post-processing of the output in this way:
>
>       if (sum > 100)
>         output.collect(key, new IntWritable(sum));
>     }
>
> without keeping the original Combiner class (in the WordCount example,
> Reducer and Combiner are the same class, named Reduce). I guess that,
> because the Combiner works with local data in memory instead of disk
> files, this can generate unwanted side effects like the one I experienced.
>
> Then I tried to run a post-processing operation on the Reduce values,
> using the original Reduce as the Combiner and a new Reduce (with a
> threshold condition such as the above) as the Reducer. This worked
> correctly. I assume that doing similar operations at Reduce time is
> reliable and does not generate side effects.
>
> Cheers,
> Luca
>
> On Sun, 2007-09-16 at 16:48 -0700, Devaraj Das wrote:
>> Hi Luca,
>> You really raised my curiosity, so I went and tried it myself. I had a
>> bunch of files adding up to 591 MB in a dir, and an equivalent single file
>> in a different dir in HDFS. I ran two MR jobs with #reducers = 2. The
>> outputs were exactly the same.
>> The split sizes will not affect the outcome in the wordcount case. The
>> number of maps is a function of the HDFS block size, the number of maps
>> the user specified, and the length/number of the input files. The
>> RecordReader, org.apache.hadoop.mapred.LineRecordReader, has logic for
>> handling cases where files could be split anywhere (newlines could
>> straddle an HDFS block boundary). If you look at
>> org.apache.hadoop.mapred.FileInputFormat.getSplits, you can see how all
>> this info is used.
>> Hadoop doesn't honor mapred.map.tasks beyond considering it a hint, but
>> it accepts the user-specified mapred.reduce.tasks and doesn't manipulate
>> that. You cannot force mapred.map.tasks, but you can specify
>> mapred.reduce.tasks.
>> Thanks,
>> Devaraj.
>>
>>
>>> -----Original Message-----
>>> From: Luca Telloli [mailto:[EMAIL PROTECTED]
>>> Sent: Friday, September 14, 2007 7:17 AM
>>> To: [email protected]
>>> Subject: Strange Hadoop behavior - Different results on
>>> equivalent input
>>>
>>> Hello everyone,
>>> I'm new to Hadoop and to this mailing list so: Hello. =)
>>>
>>> I'm experiencing a problem that I can't understand. I'm
>>> performing a wordcount task (from the examples in the source)
>>> on a single Hadoop node configured as a pseudo-distributed
>>> environment. My input is a set of documents I scraped from
>>> /usr/share/doc.
>>>
>>> I have two inputs:
>>> - the first one is a set of three files of 189, 45 and 1.9
>>> MB, named input-compact
>>> - the second one is the same as above, concatenated with cat
>>> into a single 236 MB file, named input-single, so I'm talking
>>> about "equivalent" input
>>>
>>> Logs report 11 map tasks for one job and 10 for the other,
>>> both having a total of 2 reduce tasks.
>>> I expect the outcome to be the same, but it's not, as the
>>> tail of my outputs shows:
>>>
>>> $ tail /tmp/output-*
>>> ==> /tmp/output-compact <==
>>> yet.</td>	164
>>> you	23719
>>> You	4603
>>> your	7097
>>> Zend	111
>>> zero,	101
>>> zero	1637
>>> zero-based	114
>>> zval	140
>>> zval*	191
>>>
>>> ==> /tmp/output-single <==
>>> Y	289
>>> (Yasuhiro	105
>>> yet.</td>	164
>>> you	23719
>>> You	4622
>>> your	7121
>>> zero,	101
>>> zero	1646
>>> zero-based	114
>>> zval*	191
>>>
>>> - Does the way Hadoop splits its input into blocks on HDFS
>>> influence the possible outcome of the computation?
>>>
>>> - Even so: how can the results be so different? I mean, the
>>> word zval, having 140 occurrences in the first run, doesn't
>>> even appear in the second one!
>>>
>>> - Third question: I've noticed that, when files are small,
>>> Hadoop tends to make as many maps as there are files. My
>>> initial input was scattered across 13k different small files
>>> and was not good for the task, as I realized quite soon,
>>> having almost 13k maps running the same task. At that time,
>>> I specified a few parameters in my initialization file, like
>>> mapred.map.tasks = 10 and mapred.reduce.tasks = 2. I wonder
>>> how Hadoop decides on the number of maps; the documentation
>>> says that mapred.map.tasks is a _value per job_, but I wonder
>>> whether it is instead some function of <#tasks, #input files>
>>> or other parameters.
>>>
>>> - Finally, is there a way to completely force these
>>> parameters (numbers of maps and reduces)?
>>>
>>> Apologies if any of these questions sound dumb; I'm really
>>> new to the software and eager to learn more.
>>>
>>> Thanks,
>>> Luca
>>>
>>>
>>
>
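To put Devaraj's point about the task-count parameters in code: in the driver you can only hint at the number of maps, while the number of reduces is taken as given. A rough sketch with the old JobConf API follows; the split-size arithmetic in the comment is only my approximation of what FileInputFormat.getSplits does in the versions I've looked at, so treat it as illustrative rather than definitive.

    JobConf conf = new JobConf(WordCount.class);

    // Only a hint: the framework derives the real number of splits from the
    // input size, the HDFS block size and this value, roughly
    //   splitSize ~= max(minSplitSize, min(totalSize / numMapTasks, blockSize))
    // and a small file is never split across more than one map.
    conf.setNumMapTasks(10);

    // Honored as specified: the job runs exactly this many reduce tasks
    // (and produces this many output files).
    conf.setNumReduceTasks(2);

This is also why the 13k-small-files input behaved the way it did: with at least one map per non-empty file, many tiny files mean many map tasks no matter what mapred.map.tasks says.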
