Hi Luca,

You really raised my curiosity, so I went and tried it myself. I had a bunch of files adding up to 591 MB in one directory, and an equivalent single file in a different directory in HDFS. I ran two MR jobs with #reducers = 2, and the outputs were exactly the same.

The split sizes will not affect the outcome in the wordcount case. The number of maps is a function of the HDFS block size, the number of maps the user specified, and the length/number of the input files. The RecordReader, org.apache.hadoop.mapred.LineRecordReader, has logic for handling cases where a file could be split anywhere (newlines could straddle an HDFS block boundary). If you look at org.apache.hadoop.mapred.FileInputFormat.getSplits, you can see how all this information is used.

Hadoop doesn't honor mapred.map.tasks beyond considering it a hint, but it accepts the user-specified mapred.reduce.tasks and doesn't manipulate that. So you cannot force mapred.map.tasks, but you can specify mapred.reduce.tasks.

Thanks,
Devaraj.
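PS: in case it helps, here is a minimal driver sketch that sets both knobs. It's written against the old org.apache.hadoop.mapred API and uses the stock TokenCountMapper/LongSumReducer library classes to do the counting; the exact helpers for setting input/output paths vary a little between releases, so treat it as an illustration rather than a drop-in example.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.LongSumReducer;
import org.apache.hadoop.mapred.lib.TokenCountMapper;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCountDriver.class);
    conf.setJobName("wordcount");

    // Only a hint: the framework may still choose a different number of
    // map tasks based on the HDFS block size and the size/number of files.
    conf.setNumMapTasks(10);

    // Honored as given: the job runs with exactly 2 reduce tasks, so you
    // get exactly 2 output files (part-00000 and part-00001).
    conf.setNumReduceTasks(2);

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(LongWritable.class);

    // Library mapper/reducer that together implement a simple word count.
    conf.setMapperClass(TokenCountMapper.class);
    conf.setReducerClass(LongSumReducer.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}

Roughly speaking, getSplits computes a goal size of totalSize / requestedMaps and then uses something like max(minSplitSize, min(goalSize, blockSize)) as the split size, which is why the map-task count ends up being only a hint while the reduce-task count is taken literally.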
> -----Original Message-----
> From: Luca Telloli [mailto:[EMAIL PROTECTED]
> Sent: Friday, September 14, 2007 7:17 AM
> To: [email protected]
> Subject: Strange Hadoop behavior - Different results on equivalent input
>
> Hello everyone,
> I'm new to Hadoop and to this mailing list so: Hello. =)
>
> I'm experiencing a problem that I can't understand: I'm performing a
> wordcount task (from the examples in the source) on a single Hadoop node
> configured as a pseudo-distributed environment. My input is a set of
> documents I scraped from /usr/share/doc.
>
> I have two inputs:
> - the first one is a set of three files of 189, 45 and 1.9 MB, named
>   input-compact
> - the second one is the same content concatenated into a single 236 MB
>   file with cat, named input-single, so I'm talking about "equivalent"
>   input
>
> Logs report 11 map tasks for one job and 10 for the other, both having a
> total of 2 reduce tasks. I expect the outcome to be the same, but it's
> not, as follows from the tail of my outputs:
>
> $ tail /tmp/output-*
> ==> /tmp/output-compact <==
> yet.</td> 164
> you 23719
> You 4603
> your 7097
> Zend 111
> zero, 101
> zero 1637
> zero-based 114
> zval 140
> zval* 191
>
> ==> /tmp/output-single <==
> Y 289
> (Yasuhiro 105
> yet.</td> 164
> you 23719
> You 4622
> your 7121
> zero, 101
> zero 1646
> zero-based 114
> zval* 191
>
> - Does the way Hadoop splits its input into blocks on HDFS influence the
>   possible outcome of the computation?
>
> - Even so: how can the results be so different? I mean, the word zval,
>   having 140 occurrences in the first run, doesn't even appear in the
>   second one!
>
> - Third question: I've noticed that, when files are small, Hadoop tends
>   to create as many maps as there are files. My initial input was
>   scattered across 13k small files and was not good for the task, as I
>   realized quite soon, having almost 13k maps running the same task. At
>   that time, I specified a few parameters in my initialization file, like
>   mapred.map.tasks = 10 and mapred.reduce.tasks = 2. I wonder how Hadoop
>   decides on the number of maps; the help says that mapred.map.tasks is a
>   _value per job_, but I wonder if it is instead some function of
>   <#tasks, #input files> or other parameters.
>
> - Finally, is there a way to completely force these parameters (number of
>   maps and reduces)?
>
> Apologies if any of these questions sound dumb, I'm really new to the
> software and willing to learn more.
>
> Thanks,
> Luca
