I have a general question about how the number of mapper tasks is calculated. As far as I know, the number is primarily based on the number of input splits: if I have 5 splits and 10 tasktrackers running in the cluster, I will get 5 mapper tasks in my MR job, right?
But what I found is that sometimes, when the input is huge (5 GB), I still have 5 splits (which is on purpose), yet I get more than 40 mapper tasks running. How does this happen? And if I compress the huge input to a smaller size, the number of mappers goes back to 5. Is something tricky happening here related to the DFS block locations of the input? BTW, our InputFormat is a special kind of FileInputFormat that does not split each file; instead, we copy each file to DFS, and the file's DFS location becomes the input key to the mapper task. -- Anfernee
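For reference, the per-file split count that stock FileInputFormat computes is roughly ceil(fileSize / splitSize), falling back to one split per file when isSplitable() returns false. Here is a minimal sketch of that arithmetic; the class name, the 64 MB split size, and the 5 GB file size are illustrative assumptions, not values from the post:

```java
// Illustrative sketch of FileInputFormat's split-count arithmetic.
// Class name and constants are assumptions for demonstration only.
public class SplitMath {
    // One split per file when the format is not splittable;
    // otherwise roughly ceil(fileSize / splitSize) splits per file.
    static long numSplits(long fileSize, long splitSize, boolean splittable) {
        if (!splittable || fileSize == 0) return 1;
        return (fileSize + splitSize - 1) / splitSize; // ceiling division
    }

    public static void main(String[] args) {
        long splitSize = 64L << 20; // assuming a 64 MB block/split size
        long fiveGb = 5L << 30;     // a single 5 GB input file
        // If the file is treated as splittable, a 5 GB file alone
        // yields 80 splits (hence many mappers):
        System.out.println(numSplits(fiveGb, splitSize, true));
        // If isSplitable() returns false, it yields exactly 1:
        System.out.println(numSplits(fiveGb, splitSize, false));
    }
}
```

This is consistent with the symptom described: if the custom isSplitable() override is not actually being picked up for the uncompressed input, a 5 GB file would be split on block boundaries and produce dozens of mappers, while a non-splittable compressed codec would force one mapper per file.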