Reduces can do merges for the on-disk map output files in parallel with their 
copying
-------------------------------------------------------------------------------------

                 Key: HADOOP-910
                 URL: https://issues.apache.org/jira/browse/HADOOP-910
             Project: Hadoop
          Issue Type: Improvement
          Components: mapred
            Reporter: Devaraj Das


Proposal to extend the parallel in-memory-merge/copying, that is being done as 
part of HADOOP-830, to the on-disk files.

Today, the Reduces dump the map output files to disk and the final merge 
happens only after all the map outputs have been collected. It might make sense 
to parallelize this part. That is, whenever a Reduce has collected 
io.sort.factor number of segments on disk, it initiates a merge of those and 
creates one big segment. If the rate of copying is faster than the merge, we 
can probably have multiple threads doing parallel merges of independent sets of 
io.sort.factor number of segments. If the rate of copying is not as fast as 
merge, we stand to gain a lot - at the end of copying of all the map outputs, 
we will be left with a small number of segments for the final merge (which 
hopefully will feed the reduce directly (via the RawKeyValueIterator) without 
having to hit the disk for writing additional output segments).
If the disk bandwidth is higher than the network bandwidth, we have a good 
story, I guess, to do such a thing.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Reply via email to