[
https://issues.apache.org/jira/browse/TEZ-3758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Kuhu Shukla updated TEZ-3758:
-----------------------------
Attachment: TEZ-3758.003.patch
Revised patch that adds the new redundant transition only for the cases where
attempts are launched or succeed. Renamed the transition class accordingly.
Also made the analogous change for the case where the task state is FAILED.
The current inconsistency in the 'status' data structure does not affect us
when the task is marked failed, since the DAG would fail as well, but after
this change the status data structure at least reflects the correct values.
The KILLED task state transitions did not need this change, since they already
mark the statuses correctly before adding another attempt.
The previously failing tests pass after this change.
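
For readers following along, the status bookkeeping problem can be illustrated
with a small toy model. This is only a sketch and not the code in the patch:
the class TaskStatusSketch, its methods, and the recordRedundantSuccess flag
are hypothetical names made up for illustration, and the real task state
machine has many more states and transitions. It assumes a per-attempt map in
which false means "attempt not yet finished".

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of per-attempt status bookkeeping; hypothetical, not Tez's actual task class. */
final class TaskStatusSketch {
    // attempt id -> true once the attempt is known to have finished
    // (illustrative stand-in for the 'status' data structure)
    private final Map<Integer, Boolean> attemptFinished = new HashMap<>();
    // Assumption: true models the behavior after the change, false the behavior before it.
    private final boolean recordRedundantSuccess;
    private boolean succeeded = false;
    private int nextAttemptId = 0;

    TaskStatusSketch(boolean recordRedundantSuccess) {
        this.recordRedundantSuccess = recordRedundantSuccess;
    }

    /** Registers a new attempt as unfinished and returns its id. */
    int scheduleAttempt() {
        int id = nextAttemptId++;
        attemptFinished.put(id, false);
        return id;
    }

    void onAttemptSucceeded(int id) {
        if (!succeeded) {
            // First success: the task becomes successful and the attempt is marked finished.
            succeeded = true;
            attemptFinished.put(id, true);
        } else if (recordRedundantSuccess) {
            // Redundant success on an already successful task: still record it.
            attemptFinished.put(id, true);
        }
        // Otherwise the redundant success is ignored and the entry stays false.
    }

    /** Returns true if a replacement attempt was scheduled. */
    boolean onRetroactiveFailure(int id) {
        attemptFinished.put(id, true);
        succeeded = false;
        if (attemptFinished.containsValue(false)) {
            // The task believes another attempt is still running, so it waits.
            return false;
        }
        scheduleAttempt();
        return true;
    }

    /** Replays the race: two attempts succeed back to back, then the first fails retroactively. */
    static boolean reschedulesAfterRace(boolean recordRedundantSuccess) {
        TaskStatusSketch task = new TaskStatusSketch(recordRedundantSuccess);
        int first = task.scheduleAttempt();
        int second = task.scheduleAttempt();
        task.onAttemptSucceeded(first);
        task.onAttemptSucceeded(second); // arrives about 1 ms later
        return task.onRetroactiveFailure(first);
    }

    public static void main(String[] args) {
        System.out.println("without recording: rescheduled=" + reschedulesAfterRace(false));
        System.out.println("with recording:    rescheduled=" + reschedulesAfterRace(true));
    }
}
```

In this sketch, the run that ignores the redundant success never schedules a
replacement attempt, which corresponds to the hang; the run that records it
does schedule one.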
> Vertex can hang in RUNNING state when two task attempts finish very closely
> and have retroactive failures
> ---------------------------------------------------------------------------------------------------------
>
> Key: TEZ-3758
> URL: https://issues.apache.org/jira/browse/TEZ-3758
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.7.1, 0.9.0
> Reporter: Kuhu Shukla
> Assignee: Kuhu Shukla
> Attachments: TEZ-3758.001.patch, TEZ-3758.002.patch,
> TEZ-3758.003.patch
>
>
> A vertex's count of completed tasks can go wrong when two task attempts
> finish very close together, say within a millisecond of each other. We had
> a case where a task that was marked successful never scheduled another
> attempt after getting a retroactive failure, because it thought it already
> had one uncompleted task attempt. This happens because the attempt that
> finished 1 ms later transitioned to SUCCEEDED, but we take no action on the
> taskAttemptStatus data structure, so its entry stays false. This JIRA
> attempts to fix that race.