[ https://issues.apache.org/jira/browse/MAPREDUCE-3235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13296065#comment-13296065 ]
Hal Mo commented on MAPREDUCE-3235:
-----------------------------------
> Todd Lipcon added a comment - 15/Jun/12 21:15
>> Unit test passed, and got 10% improvement of cpu usage per node.
> What workload? Terasort or something else?
On a 9-slave cluster, running Terasort with each map task spilling only once (teragen input: 126 * 512 MB).
Parameters:
-Dmapred.child.java.opts=-Xmx1g
-Dio.sort.mb=647
-Ddfs.block.size=536870912
-Dio.sort.record.percent=0.167
-Dmapred.map.tasks=126
-Dmapred.reduce.tasks=126
-Dio.sort.factor=100
-Dmapred.compress.map.output=true
-Dmapred.map.output.compression.codec=org.apache.hadoop.io.compress.LzoCodec
-Dio.sort.spill.percent=0.95
-Dmapred.map.tasks.speculative.execution=false
-Dmapred.reduce.tasks.speculative.execution=false
Result:
(How I measured CPU use: on each node I ran
cat /proc/stat | grep "cpu " | awk '{print $2+$3+$4+$7+$8+$9}'
before and after the job and subtracted the two numbers. This sums the
user, nice, system, irq, softirq and steal counters and ignores $5 (idle)
and $6 (iowait).)
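For anyone who wants to repeat the measurement without awk, here is a minimal Python sketch of the same calculation. It is not what was run for the numbers in this comment; the function names `busy_ticks` and `cpu_used` are made up for illustration, and it assumes the usual Linux layout of the aggregate "cpu " line (user, nice, system, idle, iowait, irq, softirq, steal).

```python
def busy_ticks(stat_text: str) -> int:
    """Sum user+nice+system+irq+softirq+steal from the aggregate "cpu " line.

    Mirrors awk's $2+$3+$4+$7+$8+$9: idle ($5) and iowait ($6) are skipped.
    """
    for line in stat_text.splitlines():
        if line.startswith("cpu "):          # aggregate line, not cpu0, cpu1, ...
            values = [int(v) for v in line.split()[1:9]]
            values += [0] * (8 - len(values))  # old kernels may lack some fields
            user, nice, system, _idle, _iowait, irq, softirq, steal = values
            return user + nice + system + irq + softirq + steal
    raise ValueError("no aggregate 'cpu ' line found")


def cpu_used(before: str, after: str) -> int:
    """CPU ticks consumed between two /proc/stat snapshots."""
    return busy_ticks(after) - busy_ticks(before)


if __name__ == "__main__":
    # Linux only: print the current busy-tick counter for this node.
    with open("/proc/stat") as f:
        print(busy_ticks(f.read()))
```

Take one snapshot before the job and one after on each node, then pass both texts to `cpu_used`.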
CPU use (original), in /proc/stat ticks; "cluster" is the sum over the 9 nodes:
run    node1   node2   node3   node4   node5   node6   node7   node8   node9  cluster
#1     95934   99531   96278   97429   87239   98085   96376   97491   96159   864522
#2     93360   95457   94389   94158   83400   95927   98785   96480   97012   848968
#3     97022   94855  101510   94534   83064   95922   96947   97071   96707   857632
#4     94124   97135   95332   95126   86036   94352   97149  101701   95313   856268
#5     92429   94422   94107   90331   83487   91857   94130   94866   92747   828376
avg                                                                            851153
CPU use (with patch), in /proc/stat ticks; "cluster" is the sum over the 9 nodes:
run    node1   node2   node3   node4   node5   node6   node7   node8   node9  cluster
#1     84241   86630   84010   84752   76644   85126   84457   85035   84504   755399
#2     86727   82622   84231   84313   72410   87602   83546   87104   85624   754179
#3     84626   84467   81459   85049   71584   83746   83843   88444   80776   743994
#4     82072   83229   81670   84070   71931   86059   88080   84624   87980   749715
#5     80029   84439   87655   87411   73583   85899   84129   85077   86522   754744
avg                                                                            751606
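Improvement: 851153 / 751606 * 100% - 100% = 13.2%
(put the other way round, the patched runs use about 11.7% less CPU: 1 - 751606 / 851153)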
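The arithmetic can be re-checked from the per-run cluster totals listed above. The short Python sketch below only recomputes the two averages and the two ways of expressing the gain; the variable names are mine.

```python
# Per-run cluster totals copied from the two result tables.
original = [864522, 848968, 857632, 856268, 828376]
patched = [755399, 754179, 743994, 749715, 754744]


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


avg_original = mean(original)
avg_patched = mean(patched)

# Ratio form used in this comment: how much more CPU the original needs.
extra_cpu_pct = (avg_original / avg_patched - 1) * 100
# Same data, expressed as a saving relative to the original.
saved_cpu_pct = (1 - avg_patched / avg_original) * 100

print(f"avg original: {avg_original:.0f}")
print(f"avg patched:  {avg_patched:.0f}")
print(f"original needs {extra_cpu_pct:.1f}% more CPU than patched")
print(f"patched saves  {saved_cpu_pct:.1f}% of the original CPU")
```

Both percentages describe the same measurement; which one is quoted depends on whether the patched or the original run is taken as the baseline.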
> Improve CPU cache behavior in map side sort
> -------------------------------------------
>
> Key: MAPREDUCE-3235
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-3235
> Project: Hadoop Map/Reduce
> Issue Type: Improvement
> Components: performance, task
> Affects Versions: 0.23.0
> Reporter: Todd Lipcon
> Assignee: Todd Lipcon
> Attachments: map_sort_perf.diff, mr-3235-poc.txt
>
>
> When running oprofile on a terasort workload, I noticed that a large amount
> of CPU usage was going to MapTask$MapOutputBuffer.compare. Upon disassembling
> this and looking at cycle counters, most of the cycles were going to memory
> loads dereferencing into the array of key-value data -- implying expensive
> cache misses. This can be avoided as follows:
> - rather than simply swapping indexes into the kv array, swap the entire meta
> entries in the meta array. Swapping 16 bytes is only negligibly slower than
> swapping 4 bytes. This requires adding the value-length into the meta array,
> since we used to rely on the previous-in-the-array meta entry to determine
> this. So we replace INDEX with VALUELEN and avoid one layer of indirection.
> - introduce an interface which allows key types to provide a 4-byte
> comparison proxy. For string keys, this can simply be the first 4 bytes of
> the string. The idea is that, if stringCompare(key1.proxy(), key2.proxy()) !=
> 0, then compare(key1, key2) should have the same result. If the proxies are
> equal, the normal comparison method is used. We then include the 4-byte proxy
> as part of the metadata entry, so that for many cases the indirection into
> the data buffer can be avoided.
> On a terasort benchmark, these optimizations plus an optimization to
> WritableComparator.compareBytes dropped the aggregate mapside CPU millis by
> 40%, and the compare() routine mostly dropped off the oprofile results.
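To make the comparison-proxy idea in the description concrete, here is a toy Python sketch. It is not the patch (which is Java inside MapTask$MapOutputBuffer); the tuple layout `(proxy, key_start, key_len)` and the names `proxy`, `build_meta` and `sort_meta` are hypothetical. It assumes byte-string keys ordered lexicographically, takes the first 4 bytes as the proxy, and only touches the key buffer when two proxies are equal.

```python
from functools import cmp_to_key


def proxy(key: bytes) -> int:
    """First 4 key bytes as a big-endian unsigned int (zero-padded if short)."""
    return int.from_bytes(key[:4].ljust(4, b"\x00"), "big")


def build_meta(keys):
    """Pack keys into one buffer; return (buffer, list of meta entries).

    Each meta entry is (proxy, key_start, key_len), a hypothetical layout.
    """
    buf = bytearray()
    meta = []
    for key in keys:
        meta.append((proxy(key), len(buf), len(key)))
        buf += key
    return bytes(buf), meta


def sort_meta(buf, meta):
    """Sort whole meta entries; return (sorted entries, full-key compare count)."""
    full_compares = 0

    def compare(a, b):
        nonlocal full_compares
        if a[0] != b[0]:                      # proxies differ: no buffer access
            return -1 if a[0] < b[0] else 1
        full_compares += 1                    # proxies tie: fall back to the keys
        key_a = buf[a[1]:a[1] + a[2]]
        key_b = buf[b[1]:b[1] + b[2]]
        return (key_a > key_b) - (key_a < key_b)

    return sorted(meta, key=cmp_to_key(compare)), full_compares


if __name__ == "__main__":
    keys = [b"pear", b"apple", b"applesauce", b"fig", b"banana"]
    buf, meta = build_meta(keys)
    ordered, full = sort_meta(buf, meta)
    print([buf[s:s + n] for _, s, n in ordered], "full compares:", full)
```

In this toy version only keys sharing a 4-byte prefix (here "apple" and "applesauce") need the full comparison, which is the case the description expects to be rare for terasort-style keys.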