Hi Luca,

You really raised my curiosity, so I went and tried it myself. I had a bunch of files adding up to 591 MB in one directory, and an equivalent single file in a different directory in HDFS. I ran two MR jobs with #reducers = 2, and the outputs were exactly the same.

The split sizes will not affect the outcome in the wordcount case. The number of maps is a function of the HDFS block size, the number of maps the user specified, and the length/number of the input files. The RecordReader, org.apache.hadoop.mapred.LineRecordReader, has logic for handling cases where a file could be split anywhere (newlines could straddle an HDFS block boundary). If you look at org.apache.hadoop.mapred.FileInputFormat.getSplits, you can see how all this information is used.

Hadoop doesn't honor mapred.map.tasks beyond considering it a hint, but it accepts the user-specified mapred.reduce.tasks and doesn't manipulate that. So you cannot force mapred.map.tasks, but you can specify mapred.reduce.tasks.

Thanks,
Devaraj.
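PS: in case it helps, here is a minimal driver sketch that sets both knobs. It's written against the old org.apache.hadoop.mapred API and uses the stock TokenCountMapper/LongSumReducer library classes to do the counting; the exact helpers for setting input/output paths vary a little between releases, so treat it as an illustration rather than a drop-in example.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.LongSumReducer;
import org.apache.hadoop.mapred.lib.TokenCountMapper;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCountDriver.class);
    conf.setJobName("wordcount");

    // Only a hint: the framework may still choose a different number of
    // map tasks based on the HDFS block size and the size/number of files.
    conf.setNumMapTasks(10);

    // Honored as given: the job runs with exactly 2 reduce tasks, so you
    // get exactly 2 output files (part-00000 and part-00001).
    conf.setNumReduceTasks(2);

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(LongWritable.class);

    // Library mapper/reducer that together implement a simple word count.
    conf.setMapperClass(TokenCountMapper.class);
    conf.setReducerClass(LongSumReducer.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}

Roughly speaking, getSplits computes a goal size of totalSize / requestedMaps and then uses something like max(minSplitSize, min(goalSize, blockSize)) as the split size, which is why the map-task count ends up being only a hint while the reduce-task count is taken literally.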
> -----Original Message-----
> From: Luca Telloli [mailto:[EMAIL PROTECTED]
> Sent: Friday, September 14, 2007 7:17 AM
> To: [email protected]
> Subject: Strange Hadoop behavior - Different results on equivalent input
>
> Hello everyone,
> I'm new to Hadoop and to this mailing list so: Hello. =)
>
> I'm experiencing a problem that I can't understand: I'm performing a
> wordcount task (from the examples in the source) on a single Hadoop node
> configured as a pseudo-distributed environment. My input is a set of
> documents I scraped from /usr/share/doc.
>
> I have two inputs:
> - the first one is a set of three files of 189, 45 and 1.9 MB, named
>   input-compact
> - the second one is the same content concatenated into a single 236 MB
>   file with cat, named input-single, so I'm talking about "equivalent"
>   input
>
> Logs report 11 map tasks for one job and 10 for the other, both having a
> total of 2 reduce tasks. I expect the outcome to be the same, but it's
> not, as follows from the tail of my outputs:
>
> $ tail /tmp/output-*
> ==> /tmp/output-compact <==
> yet.</td> 164
> you 23719
> You 4603
> your 7097
> Zend 111
> zero, 101
> zero 1637
> zero-based 114
> zval 140
> zval* 191
>
> ==> /tmp/output-single <==
> Y 289
> (Yasuhiro 105
> yet.</td> 164
> you 23719
> You 4622
> your 7121
> zero, 101
> zero 1646
> zero-based 114
> zval* 191
>
> - Does the way Hadoop splits its input into blocks on HDFS influence the
>   possible outcome of the computation?
>
> - Even so: how can the results be so different? I mean, the word zval,
>   having 140 occurrences in the first run, doesn't even appear in the
>   second one!
>
> - Third question: I've noticed that, when files are small, Hadoop tends
>   to create as many maps as there are files. My initial input was
>   scattered across 13k small files and was not good for the task, as I
>   realized quite soon, having almost 13k maps running the same task. At
>   that time, I specified a few parameters in my initialization file, like
>   mapred.map.tasks = 10 and mapred.reduce.tasks = 2. I wonder how Hadoop
>   decides on the number of maps; the help says that mapred.map.tasks is a
>   _value per job_, but I wonder if it is instead some function of
>   <#tasks, #input files> or other parameters.
>
> - Finally, is there a way to completely force these parameters (number of
>   maps and reduces)?
>
> Apologies if any of these questions sound dumb, I'm really new to the
> software and willing to learn more.
>
> Thanks,
> Luca
