Hello Brendan W.,

On Tue, Mar 29, 2011 at 9:01 PM, Brendan W. <[email protected]> wrote:
> Hi,
>
> I have a 20-node hadoop cluster, processing large log files. I've seen it
> said that there's never any reason to make the inputSplitSize larger than a
> single HDFS block (64M), because you give up data locality for no benefit if
> you do.
This is true. You generally wouldn't want your InputSplits to be larger than
the block size of the input files on HDFS.

> But when I kick off a job against the whole dataset with that default
> splitSize, I get about 180,000 map tasks, most lasting about 9-15 seconds
> each. Typically I can get through about half of them, then the jobTracker
> freezes with OOM errors.

That your job produces 180,000 input splits is a good indicator that you
have either:

a) Too many files (the small-files problem? [1]), or
b) Too low a block size for your input files [2]

[1] - http://www.cloudera.com/blog/2009/02/the-small-files-problem/
[2] - For files that are gigabytes in size, a 64 MB block size rarely makes
sense. Increasing the block size of such files (it is a per-file property,
after all) directly reduces your number of map tasks.

--
Harsh J
http://harshj.com
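To illustrate the arithmetic behind point (b): with the default
one-split-per-block behavior of FileInputFormat, the map task count is just
the input size divided by the block size (rounded up per file). A minimal
sketch, using illustrative numbers (the ~11 TB total is back-derived from
the reported 180,000 splits at 64 MB, not a figure from the thread):

```python
def num_map_tasks(total_bytes, block_size_bytes):
    """Rough estimate of map tasks when each InputSplit covers one HDFS block
    (the FileInputFormat default). Ignores per-file rounding for simplicity."""
    # Ceiling division: a final partial block still gets its own split.
    return -(-total_bytes // block_size_bytes)

MB = 1024 * 1024
total = 180_000 * 64 * MB  # ~11 TB of input, implied by 180,000 x 64 MB splits

print(num_map_tasks(total, 64 * MB))   # 180000 tasks at the default 64 MB
print(num_map_tasks(total, 512 * MB))  # 22500 tasks at a 512 MB block size
```

So raising the block size from 64 MB to 512 MB on the same data cuts the
task count eightfold, which also shrinks the JobTracker's in-memory
bookkeeping proportionally.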
