Improve CPU cache behavior in map side sort
-------------------------------------------

                 Key: MAPREDUCE-3235
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3235
             Project: Hadoop Map/Reduce
          Issue Type: Improvement
          Components: task
    Affects Versions: 0.23.0
            Reporter: Todd Lipcon
            Assignee: Todd Lipcon


When running oprofile on a terasort workload, I noticed that a large amount of 
CPU usage was going to MapTask$MapOutputBuffer.compare. Upon disassembling this 
and looking at cycle counters, most of the cycles were going to memory loads 
dereferencing into the array of key-value data -- implying expensive cache 
misses. This can be avoided as follows:
- rather than simply swapping indexes into the kv array, swap the entire meta 
entries in the meta array. Swapping 16 bytes is only negligibly slower than 
swapping 4 bytes. This requires adding the value-length into the meta array, 
since we used to rely on the previous-in-the-array meta entry to determine 
this. So we replace INDEX with VALUELEN and avoid one layer of indirection.
- introduce an interface which allows key types to provide a 4-byte comparison 
proxy. For string keys, this can simply be the first 4 bytes of the string. The 
idea is that, if stringCompare(key1.proxy(), key2.proxy()) != 0, then 
compare(key1, key2) should have the same result. If the proxies are equal, the 
normal comparison method is used. We then include the 4-byte proxy as part of 
the metadata entry, so that for many cases the indirection into the data buffer 
can be avoided.

On a terasort benchmark, these optimizations plus an optimization to 
WritableComparator.compareBytes dropped the aggregate mapside CPU millis by 
40%, and the compare() routine mostly dropped off the oprofile results.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Reply via email to