Optimize the last merge of the map output files
-----------------------------------------------
Key: HADOOP-2920
URL: https://issues.apache.org/jira/browse/HADOOP-2920
Project: Hadoop Core
Issue Type: Improvement
Components: mapred
Reporter: Devaraj Das
In ReduceTask, every merge today combines io.sort.factor files and writes the
result back to disk. The last merge can probably be done better. For example,
assuming io.sort.factor is 100 and there are io.sort.factor + 10 (that is, 110)
files at the end, today we merge 100 files into one and then return an iterator
over the remaining 11 files (the 10 leftover files plus the merged one). This
can be improved (in terms of disk I/O) by merging the smallest 11 files instead
and then returning an iterator over the 100 remaining files. Another option is
to skip the single-level merge altogether when io.sort.factor + n files remain
(where n << io.sort.factor) and just return the iterator directly.
Thoughts?
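
To make the disk I/O difference concrete, here is a rough sketch of the
arithmetic behind the first option. It is not ReduceTask code: the class
LastMergeSketch and its methods are hypothetical, file sizes are in arbitrary
units, and "today's" behaviour is modelled simply as merging the first
io.sort.factor files regardless of size, which is an assumption.

```java
import java.util.Arrays;

// Hypothetical illustration only; none of these names exist in Hadoop.
public class LastMergeSketch {

    // Number of files to merge in the last on-disk pass so that exactly
    // `factor` segments remain (the merged output counts as one segment).
    static int filesToMerge(int numFiles, int factor) {
        if (numFiles <= factor) {
            return 0; // few enough segments already, no on-disk merge needed
        }
        int k = numFiles - factor + 1;
        if (k > factor) {
            // This sketch only covers the case where one pass is enough.
            throw new IllegalArgumentException("more than one merge pass needed");
        }
        return k;
    }

    // Assumed model of today's behaviour: merge the first `factor` files
    // in the array and write the result back to disk.
    static long bytesRewrittenToday(long[] sizes, int factor) {
        if (sizes.length <= factor) {
            return 0;
        }
        long total = 0;
        for (int i = 0; i < factor; i++) {
            total += sizes[i];
        }
        return total;
    }

    // Proposed behaviour: merge only the smallest files, just enough of
    // them to leave `factor` segments for the final iterator.
    static long bytesRewrittenProposed(long[] sizes, int factor) {
        int k = filesToMerge(sizes.length, factor);
        long[] sorted = sizes.clone();
        Arrays.sort(sorted); // ascending, so the smallest files come first
        long total = 0;
        for (int i = 0; i < k; i++) {
            total += sorted[i];
        }
        return total;
    }

    public static void main(String[] args) {
        int factor = 100;                      // assumed io.sort.factor
        long[] sizes = new long[factor + 10];  // 110 files remain
        Arrays.fill(sizes, 1L);                // one unit each, for simplicity

        System.out.println("files merged (proposed): "
                + filesToMerge(sizes.length, factor));
        System.out.println("units rewritten today:    "
                + bytesRewrittenToday(sizes, factor));
        System.out.println("units rewritten proposed: "
                + bytesRewrittenProposed(sizes, factor));
    }
}
```

With 110 equally sized files this reports 11 files merged, 100 units rewritten
today and 11 under the proposal. The second option (no on-disk merge at all)
would rewrite nothing, at the cost of the final iterator reading from all
io.sort.factor + n files at once.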