[ 
https://issues.apache.org/jira/browse/TEZ-2404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14529432#comment-14529432
 ] 

Bikas Saha commented on TEZ-2404:
---------------------------------

I am afraid the approach in the patch would be a regression because it makes 
the task_completed_event double routed again, which is what TEZ-2325 fixed. 
With double routing, such an event is first added to the end of the event 
queue, handled by the vertex, put back at the end of the event queue, and only 
then handled by the task attempt.
We should look at fixing this in some other manner. One idea is the 
following. When a task succeeds, it chooses a successful attempt and terminates 
all other attempts. It could send all of these attempt ids to the vertex in the 
TaskCompletedEvent, and the vertex could write an end marker for the events 
from each attempt in this list. Recovery would then need to use this end 
marker, not the attempt completed marker, to determine that it has seen all 
events. Any other ideas?
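
A rough sketch of the end marker idea, only to show its shape. None of this is 
existing Tez API: EndMarkerSketch, its methods and the "DATA"/"END" string 
records are hypothetical stand-ins for the vertex handlers and the recovery 
log. It also assumes the vertex handles both kinds of events on a single 
thread, so a marker is written after the data movement events the vertex has 
already seen.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of a vertex writing to a recovery log; not Tez code.
final class EndMarkerSketch {
    // Plain strings stand in for real recovery records.
    private final List<String> log = new ArrayList<>();

    // The vertex logs each data movement event it handles.
    void onDataMovementEvent(String attemptId) {
        log.add("DATA " + attemptId);
    }

    // The task reports every attempt it ran (the successful one and the
    // terminated ones); the vertex writes one end marker per attempt.
    void onTaskCompleted(List<String> attemptIds) {
        for (String id : attemptIds) {
            log.add("END " + id);
        }
    }

    // Returns a copy of the log, as recovery would read it back.
    List<String> log() {
        return new ArrayList<>(log);
    }

    // Recovery: an attempt's events are complete only once its end marker
    // appears, regardless of any attempt completed marker.
    static Set<String> attemptsWithAllEvents(List<String> recoveredLog) {
        Set<String> done = new LinkedHashSet<>();
        for (String entry : recoveredLog) {
            if (entry.startsWith("END ")) {
                done.add(entry.substring("END ".length()));
            }
        }
        return done;
    }

    public static void main(String[] args) {
        EndMarkerSketch vertex = new EndMarkerSketch();
        vertex.onDataMovementEvent("attempt_0");
        vertex.onTaskCompleted(List.of("attempt_0", "attempt_1"));
        System.out.println(attemptsWithAllEvents(vertex.log()));
    }
}
```

The point of the sketch is that the marker is written by the same component 
that handles the data movement events, so the completion event does not need to 
be routed through the vertex a second time.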

> Handle DataMovementEvent before its TaskAttemptCompletedEvent
> -------------------------------------------------------------
>
>                 Key: TEZ-2404
>                 URL: https://issues.apache.org/jira/browse/TEZ-2404
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Jeff Zhang
>            Assignee: Jeff Zhang
>            Priority: Critical
>         Attachments: TEZ-2404-1.patch, TEZ-2404-2.patch
>
>
> TEZ-2325 routes TASK_ATTEMPT_COMPLETED_EVENT directly to the attempt, but this 
> causes a recovery issue. Recovery needs the DataMovementEvent to be handled 
> before the TaskAttemptCompletedEvent; otherwise the DataMovementEvent may be 
> lost during recovery, causing its dependent tasks to hang.
> There are 2 ways to fix this issue:
> 1. Still route TaskAttemptCompletedEvent in Vertex
> 2. Route DataMovementEvent before TaskAttemptCompletedEvent in 
> TezTaskAttemptListener



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
