[ 
https://issues.apache.org/jira/browse/TEZ-3758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kuhu Shukla updated TEZ-3758:
-----------------------------
    Attachment: TEZ-3758.002.patch

[~jeagles], thank you for the review comments. I have updated the patch and 
also removed two unused import statements from the test; I hope that is ok. 
I have added a check that a failure of the attempt that did not contribute 
to the task being marked as succeeded does not cause more task attempts to 
be scheduled. Requesting review/comments. Thanks a lot!

> Vertex can hang in RUNNING state when two task attempts finish very closely 
> and have retroactive failures
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: TEZ-3758
>                 URL: https://issues.apache.org/jira/browse/TEZ-3758
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.7.1, 0.9.0
>            Reporter: Kuhu Shukla
>            Assignee: Kuhu Shukla
>         Attachments: TEZ-3758.001.patch, TEZ-3758.002.patch
>
>
> A vertex's count of completed tasks can become incorrect when two task 
> attempts finish very close together, say within a millisecond of each 
> other. We had a case where a task that had been marked successful never 
> scheduled another attempt after receiving a retroactive failure, because 
> it believed it already had one uncompleted task attempt. This happens 
> because the attempt that finished 1 ms later transitioned to SUCCEEDED, 
> but no action was taken on the taskAttemptStatus data structure, so its 
> entry stayed false. This JIRA attempts to resolve that race.
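
For readers following the thread, the race in the description can be modelled roughly as below. This is only a sketch under stated assumptions, not the Tez code or the attached patch: the class TaskSketch, its methods, and the recordLateSuccess flag are all hypothetical names made up for illustration. The map stands in for the per-attempt status structure mentioned in the description (attempt id -> completed).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified model of a task. NOT the actual Tez TaskImpl.
class TaskSketch {
    // attempt id -> true once the attempt is considered completed (assumed bookkeeping)
    final Map<Integer, Boolean> attemptStatus = new HashMap<>();
    // false reproduces the behaviour described in the report; true models one possible fix
    final boolean recordLateSuccess;
    Integer successfulAttempt = null;
    int nextAttemptId = 0;

    TaskSketch(boolean recordLateSuccess) {
        this.recordLateSuccess = recordLateSuccess;
    }

    int scheduleAttempt() {
        int id = nextAttemptId++;
        attemptStatus.put(id, false);
        return id;
    }

    long uncompletedAttempts() {
        return attemptStatus.values().stream().filter(done -> !done).count();
    }

    void onAttemptSucceeded(int id) {
        if (successfulAttempt == null) {
            // First success: the task is marked successful through this attempt.
            successfulAttempt = id;
            attemptStatus.put(id, true);
        } else if (recordLateSuccess) {
            // Task already succeeded; still record that this attempt has finished.
            attemptStatus.put(id, true);
        }
        // Otherwise the late success is ignored and its entry stays false.
    }

    // Retroactive failure of an attempt that had already finished.
    // Returns the id of the newly scheduled attempt, or -1 if none was scheduled.
    int onRetroactiveFailure(int id) {
        attemptStatus.put(id, true);
        if (successfulAttempt != null && successfulAttempt == id) {
            successfulAttempt = null; // the task is no longer successful
            if (uncompletedAttempts() == 0) {
                return scheduleAttempt();
            }
        }
        // A failure of an attempt that did not make the task succeed schedules nothing.
        return -1;
    }
}
```

With recordLateSuccess set to false, two attempts that both succeed leave one entry at false, so a retroactive failure of the first attempt schedules nothing and the task never finishes. With it set to true, the same sequence schedules a fresh attempt, while a failure of the non-contributing attempt still schedules none.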



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
