Map output merge still uses unnecessary seeks
---------------------------------------------

                 Key: MAPREDUCE-902
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-902
             Project: Hadoop Map/Reduce
          Issue Type: Bug
          Components: task
    Affects Versions: 0.20.1
            Reporter: Christian Kunz


HADOOP-3638 improved the merge of the map output by caching the index files.

But why not also cache the data files?

In our use case (still on hadoop-0.18.3, though HADOOP-3638 would help only 
partially), an individual map task finishes in less than 30 minutes, but 
needs 4 hours to merge 70 spills for 20,000 partitions (with lzo compression), 
reading about 10kB at a time from each spill file, which is re-opened for 
every partition. As this is just a merge sort, there is no reason not to keep 
the input files open and eliminate seeks altogether by using sequential access.
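
For scale, 70 spills times 20,000 partitions is 1.4 million segment reads, so 
4 hours (14,400 seconds) works out to roughly 10 ms per open, seek, and read. 
The following is only a minimal sketch of the proposed access pattern, not the 
actual Hadoop merge code. It assumes a hypothetical spill layout (for each 
partition, in order, a record count followed by that many sorted int keys), 
and the names SequentialSpillMerge, writeSpill, and merge are made up for 
illustration. Each spill is opened once and read strictly front to back, so no 
index lookup or seek is needed between partitions.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Illustrative only: not Hadoop's merge implementation.
class SequentialSpillMerge {

    // Assumed (hypothetical) spill layout: for each partition in order,
    // an int record count followed by that many sorted int keys.
    static void writeSpill(Path file, int[][] partitions) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(Files.newOutputStream(file)))) {
            for (int[] keys : partitions) {
                out.writeInt(keys.length);
                for (int key : keys) {
                    out.writeInt(key);
                }
            }
        }
    }

    // Merges all spills partition by partition. Every spill is opened exactly
    // once and consumed sequentially; after a partition is finished, each
    // stream is already positioned at the start of the next partition.
    static List<List<Integer>> merge(List<Path> spills, int numPartitions)
            throws IOException {
        List<DataInputStream> inputs = new ArrayList<>();
        try {
            for (Path spill : spills) {
                inputs.add(new DataInputStream(
                        new BufferedInputStream(Files.newInputStream(spill))));
            }
            List<List<Integer>> merged = new ArrayList<>();
            int[] remaining = new int[inputs.size()];
            for (int part = 0; part < numPartitions; part++) {
                // Heap entries are {key, spill index}, ordered by key.
                PriorityQueue<int[]> heap =
                        new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
                for (int i = 0; i < inputs.size(); i++) {
                    remaining[i] = inputs.get(i).readInt();
                    if (remaining[i] > 0) {
                        heap.add(new int[] {inputs.get(i).readInt(), i});
                        remaining[i]--;
                    }
                }
                // A real merge would stream records to the output file; they
                // are collected in memory here to keep the sketch short.
                List<Integer> out = new ArrayList<>();
                while (!heap.isEmpty()) {
                    int[] top = heap.poll();
                    out.add(top[0]);
                    int i = top[1];
                    if (remaining[i] > 0) {
                        heap.add(new int[] {inputs.get(i).readInt(), i});
                        remaining[i]--;
                    }
                }
                merged.add(out);
            }
            return merged;
        } finally {
            for (DataInputStream in : inputs) {
                in.close();
            }
        }
    }
}
```

The cost of this approach is one open file handle and one read buffer per 
spill for the duration of the merge, which for 70 spills should be modest.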

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
