[ https://issues.apache.org/jira/browse/HADOOP-2920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12574700#action_12574700 ]

Runping Qi commented on HADOOP-2920:
------------------------------------


When there are io.sort.factor + n + 1 files, merging the smallest n + 2 
files should be the right approach: that single merge leaves exactly 
io.sort.factor files for the final pass.


> Optimize the last merge of the map output files
> -----------------------------------------------
>
>                 Key: HADOOP-2920
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2920
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Devaraj Das
>
> In ReduceTask, today we merge io.sort.factor files at a time and write the 
> result back to disk. The last merge can probably be done better. For 
> example, if there are io.sort.factor + 10 files at the end, today we will 
> merge 100 files into one and then return an iterator over the remaining 11 
> files. This can be improved (in terms of disk I/O) by merging the smallest 
> 11 files and then returning an iterator over the 100 remaining files. 
> Another option is to do no single-level merge at all when we have 
> io.sort.factor + n files remaining (where n << io.sort.factor) and just 
> return the iterator directly. Thoughts?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.