[ 
https://issues.apache.org/jira/browse/HADOOP-5572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12699224#action_12699224
 ] 

Ravi Gummadi commented on HADOOP-5572:
--------------------------------------

We are planning to allocate 33% of the map task's progress to the final sort.

Since merge progress is currently not updated (on either the map side or the 
reduce side), even if we allocate 33% of mapTask progress to sort (merge), map 
progress will be stuck at 66.7% until the sort (merge) finishes, and will then 
jump from 66.7% to 100%. This could affect speculative execution.
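
To make the jump concrete, here is a minimal sketch of the weighting described 
above. The class and method names are hypothetical and are not taken from the 
Hadoop code; the only thing assumed from this comment is the 2/3 : 1/3 split 
between the map phase and the final sort.

```java
// Hypothetical illustration only; not the actual MapTask code.
final class MapTaskProgress {
    static final float MAP_WEIGHT = 2f / 3f;   // share given to the map phase
    static final float SORT_WEIGHT = 1f / 3f;  // share given to the final sort

    /** Both arguments are per-phase progress values in the range [0, 1]. */
    static float overall(float mapPhase, float sortPhase) {
        return MAP_WEIGHT * mapPhase + SORT_WEIGHT * sortPhase;
    }
}
```

If sortPhase is never updated, overall(1.0f, 0.0f) stays at about 0.667 for the 
whole sort and only reaches 1.0 when the task completes.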

Here is a proposal for updating sort/merge progress approximately.

In merge(), each merge takes the smallest io.sort.factor files. So, assuming 
that there is no combiner, we calculate the denominator for mergeProgress as 
follows before the merges begin:

We maintain a sorted list of the sizes of the segments to be merged. We add up 
the sizes of the smallest factor segments (the ones that are going to be merged 
first), remove those sizes from the list, and add their sum to the list. We 
repeat this until only 1 element is left in the list. This element is the 
denominator for mergeProgress for the 1st merge. 
As and when the segments are read for a merge, the numerator is incremented 
based on the position in the segment and mergeProgress is updated.
The denominator is decreased by the difference (inputRecordsForThisMerge - 
mergedRecordsInThisMerge). This gives a better approximation of mergeProgress 
when the combiner is called in merges.
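
The steps above could be sketched roughly as below. This is only an 
illustration under assumptions, not the proposed patch: the class and method 
names are hypothetical, sizes and the combiner adjustment are assumed to be in 
the same unit, and a priority queue stands in for the sorted list. Besides the 
last element left in the list, the sketch also accumulates the amount read by 
every merge, which is my reading of "total size of data going to be merged in 
all the merges" and is not spelled out in the description.

```java
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch; names are not from the Hadoop code base.
final class MergeProgressSketch {
    private long numerator;    // amount read so far across all merges
    private long denominator;  // estimated total amount to be read

    MergeProgressSketch(long estimatedTotal) {
        this.denominator = estimatedTotal;
    }

    /**
     * Repeatedly replaces the smallest 'factor' sizes by their sum.
     * Returns {last element left in the list, amount read in all merges}.
     */
    static long[] estimate(List<Long> segmentSizes, int factor) {
        if (factor < 2) {
            throw new IllegalArgumentException("factor must be at least 2");
        }
        PriorityQueue<Long> sizes = new PriorityQueue<>(segmentSizes);
        long readInAllMerges = 0;
        while (sizes.size() > 1) {
            int n = Math.min(factor, sizes.size());
            long merged = 0;
            for (int i = 0; i < n; i++) {
                merged += sizes.poll();   // remove the smallest sizes
            }
            readInAllMerges += merged;    // this merge reads all of its inputs
            sizes.add(merged);            // the merged output becomes a new segment
        }
        long last = sizes.isEmpty() ? 0L : sizes.peek();
        return new long[] {last, readInAllMerges};
    }

    /** Called as segments are read: advance by the change in position. */
    void advance(long delta) {
        numerator += delta;
    }

    /** Shrinks the estimate when the combiner made a merge's output smaller. */
    void mergeFinished(long input, long output) {
        denominator -= (input - output);
    }

    float progress() {
        if (denominator <= 0) {
            return 1.0f;
        }
        return Math.min(1.0f, (float) numerator / denominator);
    }
}
```

For segment sizes 10, 20, 30 and 40 with a factor of 2, the merges read 30, 60 
and 100, so the list ends with the single element 100 while the amount read 
over all merges is 190.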

mergeProgress is not very accurate in the above approach (when a combiner is 
used in merges) for 2 reasons:
(1) The total size of the data going to be merged in all the merges cannot be 
estimated exactly before the merges when a combiner is present.
(2) The sizes of compressed and uncompressed segments (inMemory segments) are 
treated alike.

This would also avoid the jump of reduce task progress from 33.3% to 66.7%. On 
the reduce side, for mergeProgress, we will have to avoid adding the sizes of 
the segments of the last merge of factor segments when estimating the total 
size of data that will be merged (the computation of the denominator from the 
list of segment sizes), because the last merge is considered part of the 3rd 
phase of the reduce task (i.e. the reduce phase).
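
A rough sketch of the reduce-side variant follows, again with hypothetical 
names and under the assumption that the "last merge" is the one performed once 
at most factor segments remain. It is meant only to illustrate the idea of 
leaving that merge out of the denominator.

```java
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch; names are not from the Hadoop code base.
final class ReduceSideMergeEstimate {
    /** Amount read by all merges except the final one that feeds the reducer. */
    static long bytesBeforeFinalMerge(List<Long> segmentSizes, int factor) {
        if (factor < 2) {
            throw new IllegalArgumentException("factor must be at least 2");
        }
        PriorityQueue<Long> sizes = new PriorityQueue<>(segmentSizes);
        long total = 0;
        // Stop as soon as the remaining segments fit into a single merge:
        // that merge is counted in the reduce phase, not the sort phase.
        while (sizes.size() > factor) {
            long merged = 0;
            for (int i = 0; i < factor; i++) {
                merged += sizes.poll();
            }
            total += merged;
            sizes.add(merged);
        }
        return total;
    }
}
```

With segment sizes 10, 20, 30 and 40 and a factor of 2, the intermediate merges 
read 30 and 60, so the sort-phase denominator would be 90 and the final merge 
of 100 is left to the reduce phase.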

Thoughts?

> The map progress value should have a separate phase for doing the final sort.
> -----------------------------------------------------------------------------
>
>                 Key: HADOOP-5572
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5572
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Owen O'Malley
>            Assignee: Ravi Gummadi
>
> Currently, the final spill and sort doesn't record any progress while it 
> runs, leading to the perception that the map is done, but "stuck".

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
