Hi, I was running the Sort example in Hadoop 0.20.2 (hadoop-0.20.2-examples.jar) over 100GB of input data (generated using randomwriter), with 800 mappers (using a 128MB HDFS block size) and 4 reducers, on a 3-machine cluster with 2 slave nodes.

While the input and output were both 100GB, I found that the intermediate data sent to each reducer was around 78GB, making the total intermediate data around 310GB. I don't really understand why there is an increase in data size, given that the Sort example just uses the identity mapper and identity reducer. Could someone please help me out with this?
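For reference, here is a minimal sketch of the kind of job configuration I understand the Sort example to use (old mapred API, identity mapper/reducer, SequenceFile input from randomwriter); the actual Sort.java has more options, and the class name and paths below are just placeholders. This is why I expected the data volume to stay the same end to end:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.SequenceFileInputFormat;
    import org.apache.hadoop.mapred.SequenceFileOutputFormat;
    import org.apache.hadoop.mapred.lib.IdentityMapper;
    import org.apache.hadoop.mapred.lib.IdentityReducer;

    public class IdentitySortSketch {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(IdentitySortSketch.class);
        conf.setJobName("identity-sort-sketch");

        // Identity mapper/reducer: records pass through unchanged;
        // the framework's shuffle/sort does the actual sorting by key.
        conf.setMapperClass(IdentityMapper.class);
        conf.setReducerClass(IdentityReducer.class);

        // 4 reducers, as in the run described above.
        conf.setNumReduceTasks(4);

        // randomwriter writes SequenceFiles of BytesWritable key/value pairs.
        conf.setInputFormat(SequenceFileInputFormat.class);
        conf.setOutputFormat(SequenceFileOutputFormat.class);
        conf.setOutputKeyClass(BytesWritable.class);
        conf.setOutputValueClass(BytesWritable.class);

        // args[0]/args[1]: placeholder input and output paths.
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
      }
    }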
Thanks!!