[ 
https://issues.apache.org/jira/browse/TEZ-2778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14730405#comment-14730405
 ] 

Bikas Saha commented on TEZ-2778:
---------------------------------

The approach is evolved to do the following
After every read error reported by an attempt, we store the next (last) data 
movement event.
So in the common case, for each attempt we have 1 last data event. But for 
attempts which see read errors, there will be N+1 data events where N is the 
number of read errors seen by that vertex.
Critical path calculation changes as follows
When an attempt is on the critical path then look at its last data event that 
immediately precedes the stop critical time of that attempt in that step on the 
critical path. And continue the normal (existing) critical path logic as is. 
This removes the need for special heuristic for read errors.
E.g. if TA0 completes. TB0 has read error from TA0. TA1 completes. TB0 
completes. Then TB0 will have 2 last data events. On critical path TB0 will 
first trace back using the last data event from TA1. This goes to TA1 and then 
returns to TB0 because TB0 caused the scheduling of TA1. At this time, the stop 
critical time of TB1 is before the last data event from TA1. So we use the last 
data event from TA0. This takes us to TA0 on the critical path.

> Improvements to handle read errors - part 2
> -------------------------------------------
>
>                 Key: TEZ-2778
>                 URL: https://issues.apache.org/jira/browse/TEZ-2778
>             Project: Apache Tez
>          Issue Type: Sub-task
>            Reporter: Bikas Saha
>            Assignee: Bikas Saha
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to