Virajith,

FILE_BYTES_READ also counts every read of spilled records made while the framework sort-merges the various map outputs between the map and reduce phases. Since the same spilled data can be re-read across multiple merge passes, this counter ends up much larger than the amount of data actually shuffled to the reducer.
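If it helps to compare these numbers yourself, here is a rough sketch (against the 0.20 API) of pulling the counters for a finished job. The job ID is passed in as an argument, and the group/counter name strings are the ones the 0.20 JobTracker UI displays, so please verify them against your cluster before relying on them.

    import org.apache.hadoop.mapred.Counters;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.JobID;
    import org.apache.hadoop.mapred.RunningJob;

    public class PrintSortCounters {
      public static void main(String[] args) throws Exception {
        JobClient client = new JobClient(new JobConf());
        // args[0] is the job ID, e.g. the one the JobTracker printed for your sort run.
        RunningJob job = client.getJob(JobID.forName(args[0]));
        Counters counters = job.getCounters();

        // Local-disk bytes read: includes re-reads of spilled map output
        // during the merge passes, so it can exceed the shuffled data size.
        long fileBytesRead =
            counters.findCounter("FileSystemCounters", "FILE_BYTES_READ").getCounter();
        // Bytes actually fetched by reducers during the shuffle.
        long shuffleBytes =
            counters.findCounter("org.apache.hadoop.mapred.Task$Counter",
                                 "REDUCE_SHUFFLE_BYTES").getCounter();
        // Final output written to HDFS.
        long hdfsBytesWritten =
            counters.findCounter("FileSystemCounters", "HDFS_BYTES_WRITTEN").getCounter();

        System.out.println("FILE_BYTES_READ      = " + fileBytesRead);
        System.out.println("REDUCE_SHUFFLE_BYTES = " + shuffleBytes);
        System.out.println("HDFS_BYTES_WRITTEN   = " + hdfsBytesWritten);
      }
    }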
On Wed, Jun 29, 2011 at 6:30 PM, Virajith Jalaparti <virajit...@gmail.com> wrote:
> I would like to clarify my earlier question: I found that each reducer
> reports FILE_BYTES_READ as around 78GB, HDFS_BYTES_WRITTEN as 25GB, and
> REDUCE_SHUFFLE_BYTES as 25GB. So why is FILE_BYTES_READ 78GB and not just
> 25GB?
>
> Thanks,
> Virajith
>
> On Wed, Jun 29, 2011 at 10:29 AM, Virajith Jalaparti <virajit...@gmail.com>
> wrote:
>>
>> Hi,
>>
>> I was running the Sort example in Hadoop 0.20.2
>> (hadoop-0.20.2-examples.jar) over an input data size of 100GB (generated
>> using randomwriter) with 800 mappers (using a 128MB HDFS block size) and
>> 4 reducers on a 3-machine cluster with 2 slave nodes. While the input and
>> output were 100GB, I found that the intermediate data sent to each reducer
>> was around 78GB, making the total intermediate data around 310GB. I don't
>> really understand why there is an increase in data size, given that the
>> sort example just uses the identity mapper and identity reducer.
>> Could someone please help me out with this?
>>
>> Thanks!!

--
Harsh J