Hello everyone, I'm new to Hadoop and to this mailing list, so: hello. =) I'm experiencing a problem that I can't understand. I'm running a wordcount task (from the examples in the source) on a single Hadoop node configured as a pseudo-distributed environment. My input is a set of documents I scraped from /usr/share/doc.
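For reference, as far as I can tell the example boils down to the sketch below. This is my own minimal reconstruction against the old org.apache.hadoop.mapred API, not the literal example code, and the input/output paths are placeholders:

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCountSketch {

  // Map: emit <word, 1> for every token in the input line.
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        output.collect(word, ONE);
      }
    }
  }

  // Reduce: sum the counts collected for each word.
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(WordCountSketch.class);
    conf.setJobName("wordcount-sketch");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class); // local pre-aggregation of counts
    conf.setReducerClass(Reduce.class);

    FileInputFormat.setInputPaths(conf, new Path("input-compact"));
    FileOutputFormat.setOutputPath(conf, new Path("output-compact"));

    JobClient.runJob(conf);
  }
}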
I have two inputs:
- the first one is a set of three files of 189, 45 and 1.9 MB, named input-compact;
- the second one is the same data concatenated with cat into a single 236 MB file, named input-single.

So I'm talking about "equivalent" inputs. The logs report 11 map tasks for one job and 10 for the other, both with a total of 2 reduce tasks. I expected the outcome to be the same, but it isn't, as the tails of my outputs show:

$ tail /tmp/output-*
==> /tmp/output-compact <==
yet.</td>   164
you         23719
You         4603
your        7097
Zend        111
zero,       101
zero        1637
zero-based  114
zval        140
zval*       191

==> /tmp/output-single <==
Y           289
(Yasuhiro   105
yet.</td>   164
you         23719
You         4622
your        7121
zero,       101
zero        1646
zero-based  114
zval*       191

My questions:

- Does the way Hadoop splits its input into blocks on HDFS influence the outcome of the computation?
- Even so, how can the results differ this much? I mean, the word zval, with 140 occurrences in the first run, doesn't even appear in the second one!
- Third question: I've noticed that when files are small, Hadoop tends to create as many maps as there are files. My initial input was scattered across 13k small files and was not good for the task, as I realized quite soon when almost 13k maps ran the same job. At that point I set a few parameters in my configuration file, like mapred.map.tasks = 10 and mapred.reduce.tasks = 2. I wonder how Hadoop decides on the number of maps: the help says mapred.map.tasks is a _value per job_, but I suspect it is actually some function of the number of tasks and input files, or of other parameters.
- Finally, is there a way to force these parameters (the numbers of maps and reduces) outright? (My current attempt is in the P.S. below.)

Apologies if any of these questions sound dumb; I'm really new to the software and willing to learn more.

Thanks,
Luca
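P.S. To make the last question concrete, this is how I'm currently trying to pin the task counts programmatically. My (possibly wrong) reading of the old mapred API is that the reduce count is binding, while the map count is only a hint that the InputFormat may override:

import org.apache.hadoop.mapred.JobConf;

public class TaskCountSketch {
  public static void main(String[] args) {
    JobConf conf = new JobConf();
    // mapred.map.tasks: apparently only a hint; the InputFormat's
    // getSplits() computes the actual number of maps from the input
    // files and the HDFS block size.
    conf.setNumMapTasks(10);
    // mapred.reduce.tasks: this one seems to be honored as given.
    conf.setNumReduceTasks(2);
  }
}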
