[ 
https://issues.apache.org/jira/browse/TEZ-2404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14526916#comment-14526916
 ] 

Bikas Saha commented on TEZ-2404:
---------------------------------

TEZ-1897 is not enabled yet, so we don't have to fix this immediately. We can 
use the time to explore other solutions that don't involve routing the same 
event twice. For example, when a task completes it sends an event to its vertex 
so that the vertex can increment its completed task count. Can the vertex use 
that event to mark the successful attempt as done in the history logs? 
Logically, from what I see, the vertex is using the task attempt completed 
event as a marker that the successful attempt's history events are complete, 
right? 
This approach may mean that an unsuccessful attempt will not have a completion 
marker. Will that be a problem? Maybe not, since we don't care about those 
attempts anyway. For a work-preserving AM restart we can discard these events 
if the running task has not reconnected with the AM. In the non-work-preserving 
AM restart case we can always discard these events.
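
To make the proposal concrete, here is a rough sketch of the idea, not actual 
Tez code. Every name in it (VertexCompletionSketch, onTaskCompleted, 
keepEvents, the "ATTEMPT_DONE" log line) is hypothetical and only illustrates 
the vertex writing the completion marker when it counts the finished task, 
plus the discard rule for the two restart cases.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; none of these names are Tez APIs.
public class VertexCompletionSketch {
    // Stand-in for the history log: one entry per recorded event.
    final List<String> historyLog = new ArrayList<>();
    int completedTaskCount = 0;

    // Called when a task notifies its vertex that it has completed.
    // The vertex increments its count and, in the same step, writes the
    // completion marker for the successful attempt only.
    void onTaskCompleted(String taskId, String successfulAttemptId) {
        completedTaskCount++;
        historyLog.add("ATTEMPT_DONE " + successfulAttemptId);
    }

    // Discard rule from the comment above: events are kept only for a
    // work-preserving restart where the running task has reconnected.
    static boolean keepEvents(boolean workPreservingRestart,
                              boolean taskReconnected) {
        return workPreservingRestart && taskReconnected;
    }
}
```

Under this sketch an unsuccessful attempt never gets a marker, which matches 
the trade-off described above.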

> Handle DataMovementEvent before its TaskAttemptCompletedEvent
> -------------------------------------------------------------
>
>                 Key: TEZ-2404
>                 URL: https://issues.apache.org/jira/browse/TEZ-2404
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Jeff Zhang
>            Assignee: Jeff Zhang
>         Attachments: TEZ-2404-1.patch, TEZ-2404-2.patch
>
>
> TEZ-2325 routes TASK_ATTEMPT_COMPLETED_EVENT directly to the attempt, but 
> this causes a recovery issue. Recovery needs the DataMovementEvent to be 
> handled before the TaskAttemptCompletedEvent; otherwise the DataMovementEvent 
> may be lost during recovery, causing its dependent tasks to hang.
> There are 2 ways to fix this issue:
> 1. Still route TaskAttemptCompletedEvent in Vertex
> 2. Route DataMovementEvent before TaskAttemptCompletedEvent in 
> TezTaskAttemptListener
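
The ordering constraint behind option 2 can be sketched as follows. This is an 
illustration only and is not taken from either attached patch: the class 
EventOrderingSketch, the Kind enum, and completionLast are hypothetical names, 
and the real TezTaskAttemptListener may batch and route events differently.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not the Tez implementation.
public class EventOrderingSketch {
    enum Kind { DATA_MOVEMENT, TASK_ATTEMPT_COMPLETED, OTHER }

    // Returns a copy of the batch with all completion events moved to the
    // end. The relative order of the remaining events is preserved, so every
    // DATA_MOVEMENT event is routed before any TASK_ATTEMPT_COMPLETED event.
    static List<Kind> completionLast(List<Kind> batch) {
        List<Kind> ordered = new ArrayList<>();
        List<Kind> completions = new ArrayList<>();
        for (Kind k : batch) {
            if (k == Kind.TASK_ATTEMPT_COMPLETED) {
                completions.add(k);
            } else {
                ordered.add(k);
            }
        }
        ordered.addAll(completions);
        return ordered;
    }
}
```

If events are then handled in the returned order, recovery would not see a 
completed attempt whose data movement events have not yet been handled.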



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
