[ https://issues.apache.org/jira/browse/HADOOP-910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12466134 ]
Runping Qi commented on HADOOP-910: ----------------------------------- Even when single thread can not use all disk capacity, multi threading helps only if the network throughput is higher than the single thread can handle. > Reduces can do merges for the on-disk map output files in parallel with their > copying > ------------------------------------------------------------------------------------- > > Key: HADOOP-910 > URL: https://issues.apache.org/jira/browse/HADOOP-910 > Project: Hadoop > Issue Type: Improvement > Components: mapred > Reporter: Devaraj Das > > Proposal to extend the parallel in-memory-merge/copying, that is being done > as part of HADOOP-830, to the on-disk files. > Today, the Reduces dump the map output files to disk and the final merge > happens only after all the map outputs have been collected. It might make > sense to parallelize this part. That is, whenever a Reduce has collected > io.sort.factor number of segments on disk, it initiates a merge of those and > creates one big segment. If the rate of copying is faster than the merge, we > can probably have multiple threads doing parallel merges of independent sets > of io.sort.factor number of segments. If the rate of copying is not as fast > as merge, we stand to gain a lot - at the end of copying of all the map > outputs, we will be left with a small number of segments for the final merge > (which hopefully will feed the reduce directly (via the RawKeyValueIterator) > without having to hit the disk for writing additional output segments). > If the disk bandwidth is higher than the network bandwidth, we have a good > story, I guess, to do such a thing. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira