[
https://issues.apache.org/jira/browse/TEZ-2778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14730405#comment-14730405
]
Bikas Saha commented on TEZ-2778:
---------------------------------
The approach is evolved to do the following
After every read error reported by an attempt, we store the next (last) data
movement event.
So in the common case, for each attempt we have 1 last data event. But for
attempts which see read errors, there will be N+1 data events where N is the
number of read errors seen by that vertex.
Critical path calculation changes as follows
When an attempt is on the critical path then look at its last data event that
immediately precedes the stop critical time of that attempt in that step on the
critical path. And continue the normal (existing) critical path logic as is.
This removes the need for special heuristic for read errors.
E.g. if TA0 completes. TB0 has read error from TA0. TA1 completes. TB0
completes. Then TB0 will have 2 last data events. On critical path TB0 will
first trace back using the last data event from TA1. This goes to TA1 and then
returns to TB0 because TB0 caused the scheduling of TA1. At this time, the stop
critical time of TB1 is before the last data event from TA1. So we use the last
data event from TA0. This takes us to TA0 on the critical path.
> Improvements to handle read errors - part 2
> -------------------------------------------
>
> Key: TEZ-2778
> URL: https://issues.apache.org/jira/browse/TEZ-2778
> Project: Apache Tez
> Issue Type: Sub-task
> Reporter: Bikas Saha
> Assignee: Bikas Saha
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)