[ https://issues.apache.org/jira/browse/HADOOP-5572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12699224#action_12699224 ]
Ravi Gummadi commented on HADOOP-5572:
--------------------------------------
We are planning to allocate 33% of the map task's progress to the final sort.
Since merge progress is currently not updated (on either the map side or the
reduce side), even if we allocate 33% of map task progress to the sort (merge),
map progress will be stuck at 66.7% until the sort (merge) finishes, and will
then jump from 66.7% to 100%. This could affect speculative execution.
Here is a proposal for updating sort/merge progress approximately.
In merge(), we consider the smallest io.sort.factor files for each merge. So we
assume that there is no combiner and calculate the denominator for
mergeProgress as follows, before the merges begin:
We maintain a sorted list of the sizes of the segments to be merged. We add up
the sizes of the smallest factor segments (those that are going to be merged
first), add the sum to the list, and remove those smallest factor sizes. We
repeat this until only 1 element is left in the list. This element is the
denominator for mergeProgress for the 1st merge.
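The denominator computation above can be sketched roughly as follows. This is
only an illustration, not the patch: the class and method names are
hypothetical, and it assumes the denominator is meant to be the total number of
bytes read across all merge passes, i.e. the running total of each pass's
input. (Taken literally, the single element left in the list always equals the
plain sum of the original sizes, so the sketch accumulates the per-pass sums
instead.)

```java
import java.util.List;
import java.util.PriorityQueue;

public class MergeDenominator {
    /**
     * Hypothetical helper: estimates the total bytes read over all merge
     * passes, assuming no combiner and that each pass merges the smallest
     * 'factor' segments (io.sort.factor).
     */
    public static long totalBytesToMerge(List<Long> segmentSizes, int factor) {
        if (factor < 2) {
            throw new IllegalArgumentException("factor must be at least 2");
        }
        // Min-heap plays the role of the sorted list of segment sizes.
        PriorityQueue<Long> sizes = new PriorityQueue<>(segmentSizes);
        long total = 0;
        while (sizes.size() > 1) {
            // Sum the smallest 'factor' segments (or all that remain).
            int n = Math.min(factor, sizes.size());
            long passInput = 0;
            for (int i = 0; i < n; i++) {
                passInput += sizes.poll();
            }
            total += passInput;   // assumption: every pass's input counts
            sizes.add(passInput); // merged output becomes a new segment
        }
        return total;
    }
}
```

For segment sizes 10, 20, 30, 40, 50 and a factor of 3, the first pass reads
10 + 20 + 30 = 60 and the second reads 40 + 50 + 60 = 150, so the sketch
returns 210.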
As the segments are read for a merge, the numerator is incremented based on
the position in the segment, and mergeProgress is updated.
The denominator is decreased by the difference (inputRecordsForThisMerge -
mergedRecordsInThisMerge). This gives a better approximation of mergeProgress
when the combiner is called in merges.
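A minimal sketch of that bookkeeping is below. The class and method names are
hypothetical, and it assumes the numerator, the denominator and the per-merge
input/output counts are all expressed in the same unit (the text above mixes
sizes and records, so a real implementation would have to pick one).

```java
public class MergeProgress {
    private long consumed;       // numerator: amount read so far
    private long expectedTotal;  // denominator: estimated total to be read

    public MergeProgress(long expectedTotal) {
        this.expectedTotal = expectedTotal;
    }

    /** Called as a segment reader advances by 'amount'. */
    public void advance(long amount) {
        consumed += amount;
    }

    /**
     * Called when one merge finishes. If the combiner shrank the output,
     * later merges will read less, so the denominator is reduced by
     * (input - merged), as proposed above.
     */
    public void finishMerge(long inputForThisMerge, long mergedInThisMerge) {
        expectedTotal -= (inputForThisMerge - mergedInThisMerge);
    }

    /** Progress in [0, 1]; clamped because the estimate is approximate. */
    public float get() {
        if (expectedTotal <= 0) {
            return 1.0f;
        }
        return Math.min(1.0f, (float) consumed / expectedTotal);
    }
}
```

The clamp in get() is a design choice of this sketch: since the denominator is
only an estimate, the ratio could otherwise drift slightly above 1.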
With the above approach, mergeProgress is not very accurate when a combiner is
used in merges, for 2 reasons:
(1) The total size of the data to be merged across all the merges cannot be
estimated exactly before the merges start when a combiner is present.
(2) The sizes of compressed segments and uncompressed segments (inMemory
segments) are treated alike.
This would also avoid the jump of reduce task progress from 33.3% to 66.7%. On
the reduce side, when estimating the total size of data that will be merged
(i.e. when computing the denominator for mergeProgress from the list of segment
sizes), we will have to avoid adding the sizes of the segments in the last
merge of factor segments, because the last merge is considered part of the 3rd
phase of the reduce task (i.e. the reduce phase).
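For the reduce side, one way to read this is to stop accumulating once the
remaining segments fit into a single merge. The sketch below is illustrative
only: the names are hypothetical, and it assumes the "last merge" is the one
that starts when at most factor segments remain.

```java
import java.util.List;
import java.util.PriorityQueue;

public class ReduceMergeDenominator {
    /**
     * Hypothetical helper: estimated bytes read by all merge passes except
     * the final one, which is accounted to the reduce phase instead.
     */
    public static long totalBytesBeforeFinalMerge(List<Long> segmentSizes,
                                                  int factor) {
        if (factor < 2) {
            throw new IllegalArgumentException("factor must be at least 2");
        }
        PriorityQueue<Long> sizes = new PriorityQueue<>(segmentSizes);
        long total = 0;
        // Once 'factor' or fewer segments remain, the next merge is the last
        // one, so its input is deliberately not added.
        while (sizes.size() > factor) {
            long passInput = 0;
            for (int i = 0; i < factor; i++) {
                passInput += sizes.poll();
            }
            total += passInput;
            sizes.add(passInput);
        }
        return total;
    }
}
```

For segment sizes 10, 20, 30, 40, 50 and a factor of 3, only the first pass
(10 + 20 + 30 = 60) is counted; the final merge of 40, 50 and 60 is left out.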
Thoughts?
> The map progress value should have a separate phase for doing the final sort.
> -----------------------------------------------------------------------------
>
> Key: HADOOP-5572
> URL: https://issues.apache.org/jira/browse/HADOOP-5572
> Project: Hadoop Core
> Issue Type: Improvement
> Components: mapred
> Reporter: Owen O'Malley
> Assignee: Ravi Gummadi
>
> Currently, the final spill and sort don't record any progress while they
> run, leading to the perception that the map is done, but "stuck".