Hi, I have a 20-node Hadoop cluster processing large log files. I've seen it said that there's never any reason to make the input split size larger than a single HDFS block (64 MB), because doing so gives up data locality for no benefit.
But when I kick off a job against the whole dataset with that default split size, I get about 180,000 map tasks, most lasting about 9-15 seconds each. Typically I get through about half of them before the JobTracker freezes with OOM errors. I realize I could just raise HADOOP_HEAPSIZE on the JobTracker node. But it also seems like we ought to have fewer map tasks, each lasting more like 1 to 1.5 minutes, to reduce the JobTracker's overhead of managing so many tasks, as well as the overhead on the cluster nodes of starting and cleaning up after so many child JVMs. Isn't that a compelling reason for upping the input split size? Or am I missing something? Thanks
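To put numbers on my reasoning, here's the back-of-envelope scaling I have in mind. The 180,000-task count and 64 MB block size are from my actual run; the larger split sizes and the ~12 s average task time are just hypothetical values for illustration:

```java
// Rough sketch: how task count and per-task runtime scale with split size.
// Observed: ~180,000 map tasks at the 64 MB default, 9-15 s each.
// The 256/512 MB split sizes below are hypothetical alternatives.
public class SplitMath {
    public static void main(String[] args) {
        final long blockMB = 64;
        final long observedTasks = 180_000;
        final long datasetMB = observedTasks * blockMB; // ~11.5 TB of logs

        for (long splitMB : new long[] {64, 256, 512}) {
            long tasks = datasetMB / splitMB;
            // Assume runtime scales linearly with data per task,
            // starting from a ~12 s midpoint at 64 MB.
            double approxTaskSeconds = 12.0 * splitMB / blockMB;
            System.out.println(splitMB + " MB splits -> " + tasks
                    + " map tasks, roughly " + approxTaskSeconds + " s each");
        }
    }
}
```

So 512 MB splits would mean ~22,500 tasks of ~1.5 minutes each, which is the regime I was describing. (If I understand the old mapred API correctly, the knob for this would be something like mapred.min.split.size, or packing files with CombineFileInputFormat, though I'd appreciate confirmation on that.)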
