[
https://issues.apache.org/jira/browse/TEZ-2404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14530259#comment-14530259
]
Jeff Zhang edited comment on TEZ-2404 at 5/6/15 10:11 AM:
----------------------------------------------------------
bq. This does still give us most of the benefits of TEZ-2325, since
TaskComplete events are received once per task - but TASK_STATUS_UPDATES are
received every 100ms / heartbeat-interval - which can amount to a large number
of events for even short running tasks.
+1 on this. Recovery depends only on TaskAttemptFinishedEvent and
DataMovementEvent, and requires that DataMovementEvent be logged before
TaskAttemptFinishedEvent. The patch should be able to guarantee that
DataMovementEvent is logged before TaskAttemptFinishedEvent and that
TaskAttemptFinishedEvent is routed to TaskAttempt after the TaskStatusUpdate.
Do you have any other ordering issues in mind? [~bikassaha]
was (Author: zjffdu):
bq. This does still give us most of the benefits of TEZ-2325, since
TaskComplete events are received once per task - but TASK_STATUS_UPDATES are
received every 100ms / heartbeat-interval - which can amount to a large number
of events for even short running tasks.
+1 on this. Recovery depends only on TaskAttemptFinishedEvent and
DataMovementEvent, and requires that DataMovementEvent be logged before
TaskAttemptFinishedEvent. The patch should be able to guarantee that
TaskAttemptFinishedEvent is routed to TaskAttempt after the TaskStatusUpdate.
Do you have any other ordering issues in mind? [~bikassaha]
> Handle DataMovementEvent before its TaskAttemptCompletedEvent
> -------------------------------------------------------------
>
> Key: TEZ-2404
> URL: https://issues.apache.org/jira/browse/TEZ-2404
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Jeff Zhang
> Assignee: Jeff Zhang
> Priority: Critical
> Attachments: TEZ-2404-1.patch, TEZ-2404-2.patch
>
>
> TEZ-2325 routes TASK_ATTEMPT_COMPLETED_EVENT directly to the attempt, but this
> causes a recovery issue. Recovery needs the DataMovementEvent to be handled
> before the TaskAttemptCompletedEvent; otherwise the DataMovementEvent may be
> lost during recovery, causing its dependent tasks to hang.
> There are 2 ways to fix this issue:
> 1. Still route TaskAttemptCompletedEvent in Vertex
> 2. Route DataMovementEvent before TaskAttemptCompletedEvent in
> TezTaskAttemptListener
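
The second option can be illustrated with a minimal sketch. This is not the code in the attached patches and does not use the real Tez classes: the `EventOrdering` class, the `Kind` enum, and `orderForRouting` are hypothetical names chosen for illustration. It assumes that the events from one heartbeat arrive as a single batch, and it moves any completion event to the end of the batch while keeping the relative order of everything else.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the routing step; not the actual Tez API.
class EventOrdering {
    // Simplified event kinds, named after the events discussed in this issue.
    enum Kind { DATA_MOVEMENT, TASK_STATUS_UPDATE, TASK_ATTEMPT_COMPLETED }

    // Stable partition: every non-completion event keeps its original order
    // and is placed ahead of all completion events from the same batch.
    static List<Kind> orderForRouting(List<Kind> batch) {
        List<Kind> ordered = new ArrayList<>();
        List<Kind> completions = new ArrayList<>();
        for (Kind kind : batch) {
            if (kind == Kind.TASK_ATTEMPT_COMPLETED) {
                completions.add(kind);
            } else {
                ordered.add(kind);
            }
        }
        ordered.addAll(completions);
        return ordered;
    }

    public static void main(String[] args) {
        List<Kind> batch = Arrays.asList(
                Kind.TASK_ATTEMPT_COMPLETED,
                Kind.DATA_MOVEMENT,
                Kind.TASK_STATUS_UPDATE);
        // Prints [DATA_MOVEMENT, TASK_STATUS_UPDATE, TASK_ATTEMPT_COMPLETED]
        System.out.println(orderForRouting(batch));
    }
}
```

With this ordering, a consumer that handles and logs the events in list order sees the DataMovementEvent before the completion event for the same attempt, which is the property recovery relies on.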