[
https://issues.apache.org/jira/browse/HADOOP-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12642828#action_12642828
]
Owen O'Malley commented on HADOOP-2774:
---------------------------------------
Sure, and my proposal includes the records written as intermediates in the
merge. So roughly it looks like:
case 1 = first level spill (10 m row writes) + second level (10 m row writes) +
final write (10 m row writes) = 30 m
case 2 = first level (10 m row writes) + final write (10 m row writes) = 20 m
which shows the 2 versus 3 levels.
However, consider case 3: a map writes 500 spills of 20,000 records each. It
would be merged as:
second level: 5 pieces + 100 pieces + 100 pieces + 100 pieces + 100 pieces
final level: 5 big pieces + 95 small pieces
total = first level (10 m) + second level (405 * 20 k = 8.1 m) + final write
(10 m) = 28.1 m
which is a much better indication of the performance than either the number of
levels (3) or the number of first-level spills (500).
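As a rough sanity check of the case 3 arithmetic, here is a minimal Python sketch. It assumes a merge factor of 100 (implied by the 100-piece merges above) and a hypothetical planning rule in which the first intermediate pass merges just enough of the smallest pieces that every later pass is a full 100-way merge. The function name records_written and its parameters are made up for illustration; this is not the actual merge code.

```python
import heapq


def records_written(num_spills, records_per_spill, factor):
    """Return (first_level, intermediate, final) record-write counts.

    Assumed planning rule (hypothetical, chosen to match the example above):
    the first intermediate pass merges just enough of the smallest segments
    that every later pass can merge exactly `factor` segments.
    """
    if factor < 2:
        raise ValueError("factor must be at least 2")
    total = num_spills * records_per_spill
    segments = [records_per_spill] * num_spills
    heapq.heapify(segments)  # min-heap: smallest segments are merged first

    intermediate = 0
    first_pass = True
    while len(segments) > factor:
        width = factor
        if first_pass:
            remainder = (len(segments) - 1) % (factor - 1)
            if remainder:
                width = remainder + 1
            first_pass = False
        merged = sum(heapq.heappop(segments) for _ in range(width))
        intermediate += merged  # every record in this pass is written again
        heapq.heappush(segments, merged)

    # Each record is written once in the first-level spills
    # and once more in the final output.
    return total, intermediate, total


first, second, final = records_written(500, 20_000, 100)
print(first, second, final, first + second + final)
```

Under those assumptions the intermediate passes come out as 5 + 100 + 100 + 100 + 100 = 405 pieces, leaving 5 big pieces and 95 small pieces for the final merge.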
> Add counters to show number of key/values that have been sorted and merged in
> the maps and reduces
> --------------------------------------------------------------------------------------------------
>
> Key: HADOOP-2774
> URL: https://issues.apache.org/jira/browse/HADOOP-2774
> Project: Hadoop Core
> Issue Type: Bug
> Reporter: Owen O'Malley
> Assignee: Ravi Gummadi
>
> For each *pass* of the sort and merge, I would like a count of the number of
> records. So for example, if the map output 100 records and they were sorted
> once, the counter would be 100. If it spilled twice and was merged together,
> it would be 200. Clearly in a multi-level merge, it may not be a multiple of
> the number of map output records. This would let the users easily see if they
> have values like io.sort.mb or io.sort.factor set too low.