Could you attach the complete reducer logs for the two runs? I am guessing that more map outputs may have been shuffled to disk in Job 1. How many maps did the jobs have? And what were their map output sizes?
Cheers,
Jothi

On 8/28/09 1:21 AM, "Rares Vernica" <[email protected]> wrote:
> Hello,
>
> I need some help in understanding the performance of the merge phase.
> I am surprised that the merge phase for a task processing a smaller
> amount of data takes more time than the merge phase for a task
> processing a larger amount of data.
>
> I have two jobs. Both are executed on the same 1-node cluster (4 Map
> and 4 Reduce in parallel). Job 1 is doing word count while Job 2 is
> building an inverted list index (from word to records). They both take
> the same input data. As you can imagine, the amount of data transferred
> between the Job 1 Map and the Job 1 Reduce is smaller than the amount
> of data transferred between the Job 2 Map and the Job 2 Reduce. Still,
> the merge phase of Job 1 takes 2 seconds while the merge phase of Job 2
> takes less than 1 second (it is reported as taking 0 seconds). Another
> difference is the fact that the keys for Job 1 are Text, while the
> keys for Job 2 are IntWritable (words are converted to integers).
>
> Here are some relevant statistics.
>
> Job 1
> ---
> Reduce input groups    450,693
> Map output bytes     3,152,970,981
> Map input bytes      3,169,519,473
>
> Job 2
> ---
> Reduce input groups    450,633
> Map output bytes     8,349,926,960
> Map input bytes      3,169,519,473
>
> Here are some excerpts from the logs which might be relevant:
>
> Job 1
> ---
> 2009-08-27 12:10:04,171 INFO org.apache.hadoop.mapred.ReduceTask: Interleaved on-disk merge complete: 0 files left.
> 2009-08-27 12:10:04,172 INFO org.apache.hadoop.mapred.ReduceTask: In-memory merge complete: 4 files left.
> 2009-08-27 12:10:04,197 INFO org.apache.hadoop.mapred.Merger: Merging 4 sorted segments
> 2009-08-27 12:10:04,197 INFO org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 4 segments left of total size: 13139424 bytes
> 2009-08-27 12:10:04,203 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new compressor
> 2009-08-27 12:10:06,695 INFO org.apache.hadoop.mapred.ReduceTask: Merged 4 segments, 13139424 bytes to disk to satisfy reduce memory limit
> 2009-08-27 12:10:06,696 INFO org.apache.hadoop.mapred.ReduceTask: Merging 1 files, 2952848 bytes from disk
> 2009-08-27 12:10:06,696 INFO org.apache.hadoop.mapred.ReduceTask: Merging 0 segments, 0 bytes from memory into reduce
> 2009-08-27 12:10:06,696 INFO org.apache.hadoop.mapred.Merger: Merging 1 sorted segments
> 2009-08-27 12:10:06,699 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
> 2009-08-27 12:10:06,700 INFO org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 2952844 bytes
>
> Job 2
> ---
> 2009-08-27 11:01:42,851 INFO org.apache.hadoop.mapred.ReduceTask: Interleaved on-disk merge complete: 4 files left.
> 2009-08-27 11:01:42,851 INFO org.apache.hadoop.mapred.ReduceTask: In-memory merge complete: 0 files left.
> 2009-08-27 11:01:42,862 INFO org.apache.hadoop.mapred.ReduceTask: Merging 4 files, 508528572 bytes from disk
> 2009-08-27 11:01:42,862 INFO org.apache.hadoop.mapred.ReduceTask: Merging 0 segments, 0 bytes from memory into reduce
> 2009-08-27 11:01:42,866 INFO org.apache.hadoop.mapred.Merger: Merging 4 sorted segments
> 2009-08-27 11:01:42,873 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
> 2009-08-27 11:01:42,874 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
> 2009-08-27 11:01:42,876 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
> 2009-08-27 11:01:42,876 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
> 2009-08-27 11:01:42,877 INFO org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 4 segments left of total size: 508528556 bytes
>
> From the performance, I feel that the Job 1 sort is done on disk while
> the Job 2 sort is done in memory. This is strange because the Job 1 Map
> output is smaller than the Job 2 Map output.
>
> Thanks a lot!
> Rares Vernica
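A note for anyone reading this thread later: the "Down to the last merge-pass, with N segments" lines come from Hadoop's Merger, which performs a streaming k-way merge over already-sorted map-output segments (whether those segments live in memory or on disk). When the number of segments is at or below the merge factor, a single pass suffices, as in both logs above. This is only a rough illustration in Python, not Hadoop's actual Java code; the segment data below is made up:

```python
import heapq

def kway_merge(segments):
    """Single-pass k-way merge of already-sorted (key, value) segments,
    loosely analogous to one merge pass of Hadoop's Merger.
    heapq.merge streams from each segment, always yielding the record
    with the smallest current key, so only one record per segment is
    held in memory at a time."""
    return heapq.merge(*segments, key=lambda kv: kv[0])

# Four sorted "map output" segments (hypothetical data), as in the
# 4-segment merges shown in the logs above.
segments = [
    [(1, "a"), (4, "d")],
    [(2, "b"), (5, "e")],
    [(3, "c"), (6, "f")],
    [(1, "g"), (7, "h")],
]
merged = list(kway_merge(segments))
# Keys come out globally sorted: 1, 1, 2, 3, 4, 5, 6, 7
```

Because the merge is streaming, its cost is dominated by how the segment bytes are read and written, not by the total key count alone, which is one reason a merge that must spill to disk (Job 1's "Merged 4 segments ... to disk to satisfy reduce memory limit") can report more wall-clock time than a larger merge fed straight into the reduce.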
