[ 
https://issues.apache.org/jira/browse/HADOOP-4730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12650847#action_12650847
 ] 

Christian Kunz commented on HADOOP-4730:
----------------------------------------

I was monitoring a long tail (single reducer) of a job, and noticed that it was 
spending a lot of time in the merge phase doing merges in single-threaded 
fashion. I attach the log:

2008-11-25 16:27:52,222 INFO org.apache.hadoop.mapred.ReduceTask: Initiating final on-disk merge with 394 files
2008-11-25 16:27:52,343 INFO org.apache.hadoop.mapred.Merger: Merging 394 sorted segments
2008-11-25 16:27:57,982 INFO org.apache.hadoop.mapred.Merger: Merging 97 intermediate segments out of a total of 394
2008-11-25 17:10:23,569 INFO org.apache.hadoop.mapred.Merger: Merging 100 intermediate segments out of a total of 298
2008-11-25 17:59:22,272 INFO org.apache.hadoop.mapred.Merger: Merging 100 intermediate segments out of a total of 199
2008-11-25 18:48:48,813 INFO org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 100 segments left of total size: 113074719385 bytes
2008-11-25 18:48:50,521 INFO org.apache.hadoop.mapred.pipes.PipesReducer: starting application

Between 16:28 and 18:48, three intermediate merges ran one after another, each 
taking 40-50 minutes. Running them in parallel could have saved about 1.5 hours.
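
To check that estimate, here is a small sketch that recomputes the pass 
durations from the timestamps in the log. It assumes each intermediate pass 
lasts from its own log line until the next Merger log line, and that the three 
passes are independent enough to overlap fully. The class and method names are 
made up for illustration.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

class MergeTimes {
    // Timestamps copied from the log: the start of each of the three
    // intermediate merge passes, followed by the start of the last merge pass.
    static final String[] STAMPS = {
        "2008-11-25 16:27:57,982",
        "2008-11-25 17:10:23,569",
        "2008-11-25 17:59:22,272",
        "2008-11-25 18:48:48,813",
    };

    static final DateTimeFormatter FMT =
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

    // Duration of pass i, assumed to be the gap between consecutive log lines.
    static Duration pass(int i) {
        return Duration.between(LocalDateTime.parse(STAMPS[i], FMT),
                                LocalDateTime.parse(STAMPS[i + 1], FMT));
    }

    // Wall-clock time when the passes run one after another (as observed).
    static Duration serialTotal() {
        Duration total = Duration.ZERO;
        for (int i = 0; i < STAMPS.length - 1; i++) {
            total = total.plus(pass(i));
        }
        return total;
    }

    // Wall-clock time if all passes overlapped perfectly (an idealization).
    static Duration longestPass() {
        Duration longest = Duration.ZERO;
        for (int i = 0; i < STAMPS.length - 1; i++) {
            if (pass(i).compareTo(longest) > 0) {
                longest = pass(i);
            }
        }
        return longest;
    }

    public static void main(String[] args) {
        for (int i = 0; i < STAMPS.length - 1; i++) {
            System.out.println("pass " + (i + 1) + ": " + pass(i).toMinutes() + " min");
        }
        Duration saved = serialTotal().minus(longestPass());
        System.out.println("serial:  " + serialTotal().toMinutes() + " min");
        System.out.println("longest: " + longestPass().toMinutes() + " min");
        System.out.println("saved:   " + saved.toMinutes() + " min");
    }
}
```

With the timestamps above, the passes come to roughly 42, 48 and 49 minutes 
(about 140 minutes in total), so perfect overlap would save about 91 minutes. 
That matches the 1.5 hour figure, provided disk and CPU can sustain three 
merges at once.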

> multi-threaded merge phase
> --------------------------
>
>                 Key: HADOOP-4730
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4730
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.18.1
>            Reporter: Christian Kunz
>
> If merges were done in multiple threads (when enough cores are available -- a 
> monitoring issue), the time spent in merging could be cut by a factor equal 
> to the number of threads.
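
A minimal sketch of the idea, not Hadoop's Merger code: sorted in-memory lists 
stand in for on-disk segments, groups of at most `factor` segments are merged 
concurrently on a thread pool, and a final single-threaded pass merges the 
intermediate results. All class, method and parameter names here are 
hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelMergeSketch {
    // K-way merge of sorted lists using a min-heap of {value, segment, offset}.
    static List<Integer> mergeSorted(List<List<Integer>> segments) {
        PriorityQueue<int[]> heap =
            new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
        for (int s = 0; s < segments.size(); s++) {
            if (!segments.get(s).isEmpty()) {
                heap.add(new int[] {segments.get(s).get(0), s, 0});
            }
        }
        List<Integer> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] top = heap.poll();
            out.add(top[0]);
            List<Integer> seg = segments.get(top[1]);
            int next = top[2] + 1;
            if (next < seg.size()) {
                heap.add(new int[] {seg.get(next), top[1], next});
            }
        }
        return out;
    }

    // Merge groups of at most `factor` segments on `threads` worker threads,
    // then merge the intermediate results in one final pass.
    static List<Integer> parallelMerge(List<List<Integer>> segments,
                                       int factor, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<Integer>>> pending = new ArrayList<>();
            for (int i = 0; i < segments.size(); i += factor) {
                List<List<Integer>> group =
                    segments.subList(i, Math.min(i + factor, segments.size()));
                pending.add(pool.submit(() -> mergeSorted(group)));
            }
            List<List<Integer>> intermediate = new ArrayList<>();
            for (Future<List<Integer>> f : pending) {
                intermediate.add(f.get()); // blocks until that group is merged
            }
            return mergeSorted(intermediate);
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("merge task failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<List<Integer>> segments = Arrays.asList(
            Arrays.asList(1, 5, 9), Arrays.asList(2, 6),
            Arrays.asList(3, 7, 8), Arrays.asList(4));
        // Two groups of two segments are merged concurrently, then combined.
        System.out.println(parallelMerge(segments, 2, 2));
        // prints [1, 2, 3, 4, 5, 6, 7, 8, 9]
    }
}
```

The speedup only approaches the thread count when the merge is CPU-bound; with 
real on-disk segments, concurrent passes also compete for disk bandwidth.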

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
