[
https://issues.apache.org/jira/browse/TEZ-2404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14531791#comment-14531791
]
Bikas Saha commented on TEZ-2404:
---------------------------------
The failures that I was seeing are unrelated to this or TEZ-2325. I looked into
them further and opened TEZ-2426.
We can commit the current patch, keep TEZ-2418 open, and expand its scope to
move both completed and failed events back to being sent directly to the task
attempt. I will update TEZ-2418.
We should open a jira to make recovery resilient to ordering of these events
and make TEZ-2418 blocked on this jira. The change in this patch is creating a
nuanced routing where some events are single routed and some are double routed
with the implicit assumption that ordering is being maintained because the
double routed event was initially after the single routed event and all the
routing happens on the same thread. Double routing therefore only delays that
event further on the same thread, and we are safe. If the double-routed event
were actually ahead, this would break immediately. IMO this kind of nuanced
event routing is not something we should keep around for long.
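To make the ordering assumption concrete, here is a minimal sketch, not Tez
code. It assumes a single FIFO queue drained on one thread, where a
double-routed event is re-enqueued at the tail once before it is handled. The
class RoutingOrderSketch, the Event type, and the hop counts are all
hypothetical names made up for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

public class RoutingOrderSketch {
    // Hypothetical event: a name plus the number of hops it needs before
    // it reaches its final handler (1 = single routed, 2 = double routed).
    static final class Event {
        final String name;
        final int hopsLeft;

        Event(String name, int hopsLeft) {
            this.name = name;
            this.hopsLeft = hopsLeft;
        }
    }

    // Drains one FIFO queue on the calling thread and returns the order in
    // which events are finally handled. An event with hops remaining is
    // re-enqueued at the tail, which models the extra routing step.
    static List<String> dispatch(List<Event> arrivals) {
        Deque<Event> queue = new ArrayDeque<>(arrivals);
        List<String> handled = new ArrayList<>();
        while (!queue.isEmpty()) {
            Event e = queue.removeFirst();
            if (e.hopsLeft <= 1) {
                handled.add(e.name);
            } else {
                queue.addLast(new Event(e.name, e.hopsLeft - 1));
            }
        }
        return handled;
    }

    public static void main(String[] args) {
        // Double-routed event arrives second: arrival order is preserved.
        System.out.println(dispatch(Arrays.asList(
                new Event("single", 1), new Event("double", 2))));
        // prints [single, double]

        // Double-routed event arrives first: the single-routed event overtakes it.
        System.out.println(dispatch(Arrays.asList(
                new Event("double", 2), new Event("single", 1))));
        // prints [single, double]
    }
}
```

Both arrival orders end up handled in the same order, so the scheme is only
correct when the double-routed event was already the later one.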
> Handle DataMovementEvent before its TaskAttemptCompletedEvent
> -------------------------------------------------------------
>
> Key: TEZ-2404
> URL: https://issues.apache.org/jira/browse/TEZ-2404
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Jeff Zhang
> Assignee: Jeff Zhang
> Priority: Critical
> Attachments: TEZ-2404-1.patch, TEZ-2404-2.patch
>
>
> TEZ-2325 routes TASK_ATTEMPT_COMPLETED_EVENT directly to the attempt, but this
> causes a recovery issue. Recovery needs the DataMovementEvent to be handled
> before the TaskAttemptCompletedEvent; otherwise the DataMovementEvent may be
> lost during recovery and cause its dependent tasks to hang.
> There are 2 ways to fix this issue:
> 1. Still route TaskAttemptCompletedEvent in Vertex
> 2. Route DataMovementEvent before TaskAttemptCompletedEvent in
> TezTaskAttemptListener