Thanks Harsh. It's definitely (2) below, i.e., giant files.

But what would be the benefit of actually changing the DFS block size (to
say N * 64 MB), as opposed to just increasing the inputSplitSize to N
64-MB blocks for my job?  Both would reduce my number of mappers by a
factor of N, right?  Is there any benefit to one over the other?
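
For concreteness, this is what I mean by raising the split size -- a
minimal, untested sketch assuming the new org.apache.hadoop.mapreduce
API; the class name and the N = 4 (256 MB) figure are just examples:

    // Sketch: raise the minimum split size so each map task covers
    // ~4 blocks (4 * 64 MB = 256 MB) instead of one, cutting the
    // number of map tasks by roughly 4x.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class LogJob {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "log-processing");  // illustrative name
        FileInputFormat.setMinInputSplitSize(job, 4L * 64 * 1024 * 1024);
        // ... set mapper, input/output paths, then
        // job.waitForCompletion(true);
      }
    }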

On Tue, Mar 29, 2011 at 12:39 PM, Harsh J <[email protected]> wrote:

> Hello Brendan W.,
>
> On Tue, Mar 29, 2011 at 9:01 PM, Brendan W. <[email protected]> wrote:
> > Hi,
> >
> > I have a 20-node hadoop cluster, processing large log files.  I've seen
> > it said that there's never any reason to make the inputSplitSize larger
> > than a single HDFS block (64 MB), because you give up data locality for
> > no benefit if you do.
>
> This is true. You generally don't want your InputSplits to be larger
> than the input file's block size on HDFS.
>
> > But when I kick off a job against the whole dataset with that default
> > splitSize, I get about 180,000 map tasks, most lasting about 9-15
> > seconds each.  Typically I can get through about half of them, then the
> > JobTracker freezes with OOM errors.
>
> That you have 180,000 input splits is a good indicator that you have:
> a) Too many files (a small-files problem? [1])
> b) Too small a block size for the input files [2]
>
> [1] - http://www.cloudera.com/blog/2009/02/the-small-files-problem/
> [2] - For files that are gigabytes in size, a 64 MB block size does not
> make much sense. Increasing the block size for such files (it is a
> per-file property, after all) directly reduces your number of tasks.
>
> --
> Harsh J
> http://harshj.com
>
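
To make [2] above concrete: the block size is fixed when a file is
written, so changing it means re-writing the data -- for example by
re-uploading with "hadoop fs -D dfs.block.size=268435456 -put ..." (the
0.20-era property name), or programmatically. A rough, untested sketch
of the programmatic route; the paths and the 256 MB value are
illustrative, not from this thread:

    // Re-copy a file inside HDFS, passing an explicit 256 MB block
    // size to FileSystem.create().
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class RewriteWithBiggerBlocks {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path src = new Path("/logs/access.log");       // illustrative
        Path dst = new Path("/logs/access-256m.log");  // illustrative
        long blockSize = 256L * 1024 * 1024;           // 4 * 64 MB
        // create(path, overwrite, bufferSize, replication, blockSize)
        IOUtils.copyBytes(
            fs.open(src),
            fs.create(dst, true, 4096, fs.getDefaultReplication(), blockSize),
            conf, true);
      }
    }

One trade-off versus only raising the split size: with bigger blocks,
each map task still reads a single (mostly local) block, whereas a
multi-block split can force a mapper to pull some of its blocks over
the network.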
