[
https://issues.apache.org/jira/browse/HADOOP-2920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12574700#action_12574700
]
Runping Qi commented on HADOOP-2920:
------------------------------------
When there are io.sort.factor + n + 1 files, merging the smallest n + 2
files should be the right approach, since that leaves exactly io.sort.factor
files for the final merge pass.
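The arithmetic above can be sketched as a small helper: given the segment
sizes on disk and the merge factor, pick the smallest (extra + 1) segments so
that after merging them into one file, exactly io.sort.factor segments remain.
This is an illustrative sketch, not the actual ReduceTask code; the class and
method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class MergeSelection {
    /**
     * Pick the smallest segments to merge so that exactly `factor`
     * segments remain for the final merge iterator.
     *
     * With sizes.size() == factor + n + 1, this returns the n + 2
     * smallest sizes: merging them yields one file, leaving
     * (factor + n + 1) - (n + 2) + 1 == factor segments.
     */
    static List<Long> selectSmallestToMerge(List<Long> sizes, int factor) {
        int extra = sizes.size() - factor;   // segments beyond the factor
        if (extra <= 0) {
            // Already at or under the factor: no intermediate disk merge needed.
            return Collections.emptyList();
        }
        List<Long> sorted = new ArrayList<>(sizes);
        Collections.sort(sorted);
        // Merging (extra + 1) files produces 1 file, leaving exactly `factor`.
        return new ArrayList<>(sorted.subList(0, extra + 1));
    }

    public static void main(String[] args) {
        // Example: factor = 10 with 13 files on disk; merge the 4 smallest,
        // leaving 10 for the final pass.
        List<Long> sizes = new ArrayList<>();
        for (long s = 1; s <= 13; s++) sizes.add(s * 100);
        System.out.println(selectSmallestToMerge(sizes, 10).size()); // prints 4
    }
}
```

Merging the smallest files (rather than the first io.sort.factor) minimizes
the bytes rewritten to disk, which is the improvement the issue proposes.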
> Optimize the last merge of the map output files
> -----------------------------------------------
>
> Key: HADOOP-2920
> URL: https://issues.apache.org/jira/browse/HADOOP-2920
> Project: Hadoop Core
> Issue Type: Improvement
> Components: mapred
> Reporter: Devaraj Das
>
> In ReduceTask, today we do merges of io.sort.factor number of files every
> time we merge, and write the result back to disk. The last merge can probably
> be done better. For example, with io.sort.factor = 100, if there are 110
> files at the end, today we will merge 100 files into one and then return an
> iterator over the remaining 11 files. This can be improved (in terms of disk
> I/O) by merging the smallest 11 files and then returning an iterator over the
> 100 remaining files. Another option is to not do any single-level merge when
> we have io.sort.factor + n files remaining (where n << io.sort.factor) but
> just return the iterator directly. Thoughts?
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.