[ https://issues.apache.org/jira/browse/HADOOP-4730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12650847#action_12650847 ]
Christian Kunz commented on HADOOP-4730:
----------------------------------------

I was monitoring the long tail of a job (a single reducer) and noticed that it spent a lot of time in the merge phase, running the merges one after another in a single thread. The relevant part of the log:

2008-11-25 16:27:52,222 INFO org.apache.hadoop.mapred.ReduceTask: Initiating final on-disk merge with 394 files
2008-11-25 16:27:52,343 INFO org.apache.hadoop.mapred.Merger: Merging 394 sorted segments
2008-11-25 16:27:57,982 INFO org.apache.hadoop.mapred.Merger: Merging 97 intermediate segments out of a total of 394
2008-11-25 17:10:23,569 INFO org.apache.hadoop.mapred.Merger: Merging 100 intermediate segments out of a total of 298
2008-11-25 17:59:22,272 INFO org.apache.hadoop.mapred.Merger: Merging 100 intermediate segments out of a total of 199
2008-11-25 18:48:48,813 INFO org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 100 segments left of total size: 113074719385 bytes
2008-11-25 18:48:50,521 INFO org.apache.hadoop.mapred.pipes.PipesReducer: starting application

Between 16:28 and 18:48, three intermediate merges were executed back to back, each taking 40-50 minutes. Had they run in parallel, the phase would have taken roughly as long as the slowest of the three, saving about 1.5 hours.

> multi-threaded merge phase
> --------------------------
>
>                 Key: HADOOP-4730
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4730
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.18.1
>            Reporter: Christian Kunz
>
> Doing merges in multiple threads (when enough cores are available -- a
> monitoring issue), the time spent in merging could be cut by a factor equal
> to the number of threads.
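
To make the proposal concrete, here is a minimal Java sketch of the idea, not Hadoop's actual Merger code. It assumes the segments are small in-memory sorted lists rather than on-disk files. The class ParallelMergeSketch and its methods mergeSegments and mergeInParallel are hypothetical names made up for illustration. Unlike the log above, where the first pass takes only 97 segments, the sketch simply cuts the input into groups of at most `factor` segments, merges the groups concurrently, and finishes with a single final pass.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical sketch, not Hadoop's Merger: segments are in-memory sorted lists. */
final class ParallelMergeSketch {

    /** Sequential k-way merge of sorted segments using a min-heap of cursors. */
    static List<Integer> mergeSegments(List<List<Integer>> segments) {
        // Each heap entry is {value, segment index, offset within that segment}.
        PriorityQueue<int[]> heap =
                new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
        for (int s = 0; s < segments.size(); s++) {
            if (!segments.get(s).isEmpty()) {
                heap.add(new int[] {segments.get(s).get(0), s, 0});
            }
        }
        List<Integer> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] head = heap.poll();
            merged.add(head[0]);
            List<Integer> source = segments.get(head[1]);
            int next = head[2] + 1;
            if (next < source.size()) {
                heap.add(new int[] {source.get(next), head[1], next});
            }
        }
        return merged;
    }

    /**
     * Splits the segments into groups of at most `factor`, merges the groups
     * concurrently on `threads` workers, then merges the intermediate results
     * in one final single-threaded pass.
     */
    static List<Integer> mergeInParallel(List<List<Integer>> segments, int factor, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<Integer>>> pending = new ArrayList<>();
            for (int start = 0; start < segments.size(); start += factor) {
                List<List<Integer>> group =
                        segments.subList(start, Math.min(start + factor, segments.size()));
                // The intermediate merges are independent, so each can run on its own thread.
                pending.add(pool.submit(() -> mergeSegments(group)));
            }
            List<List<Integer>> intermediate = new ArrayList<>();
            for (Future<List<Integer>> result : pending) {
                intermediate.add(result.get());
            }
            return mergeSegments(intermediate);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("merge interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("intermediate merge failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

With three independent intermediate merges and three worker threads, the elapsed time of the intermediate passes would approach that of the slowest single merge, which is where the estimate of about 1.5 hours saved comes from. For on-disk segments the real gain would also depend on available cores and disk bandwidth, which this sketch does not model.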