Hello Brendan W.,

On Tue, Mar 29, 2011 at 9:01 PM, Brendan W. <[email protected]> wrote:
> Hi,
>
> I have a 20-node hadoop cluster, processing large log files.  I've seen it
> said that there's never any reason to make the inputSplitSize larger than a
> single HDFS block (64M), because you give up data locality for no benefit if
> you do.

This is true. You don't want your InputSplits to be larger than the
input file's block size on HDFS, since a split that spans multiple
blocks gives up data locality for no benefit.
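
For reference, FileInputFormat sizes splits as roughly the following
(a sketch of the logic; the min/max split-size property names differ
between the old mapred API and the newer mapreduce one):

    // How FileInputFormat picks a split size, roughly.
    // minSize/maxSize come from the configured min/max split-size
    // properties (defaults: 1 and Long.MAX_VALUE respectively).
    long splitSize = Math.max(minSize, Math.min(maxSize, blockSize));

With the defaults this works out to the file's block size, which is
exactly what keeps each map task reading a single, local block.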

> But when I kick off a job against the whole dataset with that default
> splitSize, I get about 180,000 map tasks, most lasting about 9-15 seconds
> each.  Typically I can get through about half of them, then the jobTracker
> freezes with OOM errors.

That your input splits number 180,000 is a good indicator that you have either:
a) Too many files (a small-files problem? [1]), or
b) Too low a block size on the input files [2]

[1] - http://www.cloudera.com/blog/2009/02/the-small-files-problem/
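
For (a), one remedy from that post is to pack the small files into a
single SequenceFile, keyed by filename, so your maps read a few large
files instead of thousands of tiny ones. A rough sketch (the /logs
paths are made up for illustration; error handling omitted):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.*;
    import org.apache.hadoop.io.*;

    // Pack many small files into one SequenceFile (filename -> contents).
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path("/logs/packed.seq"),
        Text.class, BytesWritable.class);
    for (FileStatus stat : fs.listStatus(new Path("/logs/small/"))) {
      byte[] buf = new byte[(int) stat.getLen()];
      FSDataInputStream in = fs.open(stat.getPath());
      in.readFully(0, buf);             // slurp the whole small file
      in.close();
      writer.append(new Text(stat.getPath().getName()),
                    new BytesWritable(buf));
    }
    writer.close();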
[2] - For files whose sizes run into GBs, it does not make sense to
keep 64 MB blocks. Increasing the block size for such files (it is a
per-file property, after all) directly reduces your number of tasks.
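
For (b), the block size is fixed at creation time, so you set it when
writing the file, either through the FileSystem API or with a -D
override while copying data in. A sketch (paths and the 256 MB figure
are illustrative; same imports as the sketch above):

    // Write a file with 256 MB blocks instead of the 64 MB default:
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    FSDataOutputStream out = fs.create(
        new Path("/logs/big.log"),                // hypothetical path
        true,                                     // overwrite
        conf.getInt("io.file.buffer.size", 4096), // buffer size
        (short) 3,                                // replication factor
        256L * 1024 * 1024);                      // per-file block size
    // ... write your data, then out.close();

Or from the shell while uploading:

    hadoop fs -D dfs.block.size=268435456 -put access.log /logs/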

-- 
Harsh J
http://harshj.com
