Thanks, Harsh... it's definitely (2) below, i.e., giant files. But what would be the benefit of actually changing the DFS block size (to, say, N*64 Mbytes), as opposed to just increasing the inputSplitSize to N 64-Mbyte blocks for my job? Both will reduce my number of mappers by a factor of N, right? Is there any benefit to one over the other?
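To make the "factor of N" arithmetic concrete, here is a toy Python model of how FileInputFormat sizes splits (it uses max(minSplitSize, min(maxSplitSize, blockSize)) per file); the file sizes are hypothetical, and this only models the task count, not the locality difference (with a large min-split over 64 MB blocks, most blocks in each split are read remotely):

```python
import math

def split_size(block_size, min_split=1, max_split=float("inf")):
    # Hadoop's FileInputFormat chooses: max(minSize, min(maxSize, blockSize))
    return max(min_split, min(max_split, block_size))

def num_map_tasks(file_sizes, block_size, min_split=1):
    # One map task per split; splits are computed per file.
    size = split_size(block_size, min_split)
    return sum(math.ceil(f / size) for f in file_sizes)

MB = 1024 * 1024
files = [10 * 1024 * MB] * 4                 # four hypothetical 10 GB files

base         = num_map_tasks(files, 64 * MB)                       # 640 maps
bigger_block = num_map_tasks(files, 256 * MB)                      # 160 maps
bigger_split = num_map_tasks(files, 64 * MB, min_split=256 * MB)   # 160 maps
```

Both levers cut the mapper count by the same factor of 4 here; the difference is that with the larger block size each map reads one local block, while with the larger split each map spans four 64 MB blocks, of which typically only one is local.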
On Tue, Mar 29, 2011 at 12:39 PM, Harsh J <[email protected]> wrote:
> Hello Brendan W.,
>
> On Tue, Mar 29, 2011 at 9:01 PM, Brendan W. <[email protected]> wrote:
> > Hi,
> >
> > I have a 20-node hadoop cluster, processing large log files. I've seen it
> > said that there's never any reason to make the inputSplitSize larger than a
> > single HDFS block (64M), because you give up data locality for no benefit if
> > you do.
>
> This is true. You wouldn't always want your InputSplits to have chunk
> sizes bigger than the input file's block-size on the HDFS.
>
> > But when I kick off a job against the whole dataset with that default
> > splitSize, I get about 180,000 map tasks, most lasting about 9-15 seconds
> > each. Typically I can get through about half of them, then the jobTracker
> > freezes with OOM errors.
>
> That your input splits are 180000 in number is a good indicator that you have:
> a) Too many files (a small files problem? [1])
> b) Too low a block size for the input files [2]
>
> [1] - http://www.cloudera.com/blog/2009/02/the-small-files-problem/
> [2] - For file sizes in GBs, it does not make sense to have 64 MB
> block sizes. Increasing block sizes for such files (it is a per-file
> property after all) directly reduces your number of tasks.
>
> --
> Harsh J
> http://harshj.com
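For reference, since block size is a per-file property fixed at write time, acting on Harsh's suggestion means re-writing the files; a sketch of both options (paths, jar, and class names are hypothetical; on Hadoop of this era the properties are dfs.block.size and mapred.min.split.size, renamed in later releases, and -D on a job requires it to use GenericOptionsParser/Tool):

```sh
# Option A: re-write an input file into HDFS with a 256 MB block size.
# Block size cannot be changed in place; it applies to data written from now on.
hadoop fs -D dfs.block.size=268435456 -put access.log /logs/access.log

# Option B: leave the files alone and raise the minimum split size for one job,
# accepting that each multi-block split loses some data locality.
hadoop jar myjob.jar MyJob -D mapred.min.split.size=268435456 /logs /out
```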
