[ 
https://issues.apache.org/jira/browse/TEZ-2404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14530062#comment-14530062
 ] 

Siddharth Seth commented on TEZ-2404:
-------------------------------------

I'm not fully familiar with how the recovery code works. I assume the 
TASK_FINISHED_EVENT triggers some kind of sync point in the Vertex that 
ensures all source events are serialized.
Wouldn't special-casing TASK_COMPLETE (DONE, FAILED, etc.) to go to VertexImpl 
and TASK_STATUS_UPDATE to go to TaskImpl work, as long as the 
TASK_STATUS_UPDATE goes out before the TASK_COMPLETE event?
Both would go out on the main dispatcher, so ordering is maintained.
This still gives us most of the benefits of TEZ-2325: TaskComplete events 
are received once per task, but TASK_STATUS_UPDATE events are received every 
100ms / heartbeat-interval, which can amount to a large number of events 
even for short-running tasks.
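
A rough sketch of the routing I have in mind, just to illustrate the ordering 
argument. Everything here is hypothetical (the RoutingSketch class, the Kind 
enum, the string "handlers"); it is not the actual Tez dispatcher or event 
code, only a single FIFO queue standing in for the main dispatcher.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical stand-in for the main dispatcher; not Tez code.
class RoutingSketch {
    // Assumed event kinds, named after the events discussed above.
    enum Kind { TASK_STATUS_UPDATE, TASK_COMPLETE }

    static final class Event {
        final Kind kind;
        final String attemptId;

        Event(Kind kind, String attemptId) {
            this.kind = kind;
            this.attemptId = attemptId;
        }
    }

    // One FIFO queue: events are handled in the order they were enqueued.
    private final Queue<Event> queue = new ArrayDeque<>();
    // Records which (hypothetical) target handled which event, in order.
    private final List<String> handled = new ArrayList<>();

    void enqueue(Event e) {
        queue.add(e);
    }

    // Drain the queue, special-casing the target by event kind.
    void drain() {
        Event e;
        while ((e = queue.poll()) != null) {
            switch (e.kind) {
                case TASK_STATUS_UPDATE:
                    // Frequent heartbeat updates go straight to the task.
                    handled.add("TaskImpl<-" + e.kind + "(" + e.attemptId + ")");
                    break;
                case TASK_COMPLETE:
                    // Once-per-task completion still goes through the vertex.
                    handled.add("VertexImpl<-" + e.kind + "(" + e.attemptId + ")");
                    break;
            }
        }
    }

    // Enqueue the status update before the completion event, then drain.
    static String demo() {
        RoutingSketch d = new RoutingSketch();
        d.enqueue(new Event(Kind.TASK_STATUS_UPDATE, "attempt_1"));
        d.enqueue(new Event(Kind.TASK_COMPLETE, "attempt_1"));
        d.drain();
        return String.join(", ", d.handled);
    }
}
```

Since both events pass through the same queue, the status update is handled 
before the completion event even though they end up at different targets.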

> Handle DataMovementEvent before its TaskAttemptCompletedEvent
> -------------------------------------------------------------
>
>                 Key: TEZ-2404
>                 URL: https://issues.apache.org/jira/browse/TEZ-2404
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Jeff Zhang
>            Assignee: Jeff Zhang
>            Priority: Critical
>         Attachments: TEZ-2404-1.patch, TEZ-2404-2.patch
>
>
> TEZ-2325 routes TASK_ATTEMPT_COMPLETED_EVENT directly to the attempt, but 
> this causes a recovery issue. Recovery needs the DataMovementEvent to be 
> handled before the TaskAttemptCompletedEvent; otherwise the DataMovementEvent 
> may be lost during recovery, causing its dependent tasks to hang.
> There are 2 ways to fix this issue:
> 1. Still route TaskAttemptCompletedEvent in Vertex
> 2. Route DataMovementEvent before TaskAttemptCompletedEvent in 
> TezTaskAttemptListener



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
