[jira] [Comment Edited] (TEZ-2304) InvalidStateTransitonException TA_SCHEDULE at START_WAIT during recovery

Jeff Zhang (JIRA) Sun, 24 May 2015 23:30:08 -0700

    [ 
https://issues.apache.org/jira/browse/TEZ-2304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14557994#comment-14557994
 ]


Jeff Zhang edited comment on TEZ-2304 at 5/25/15 6:28 AM:
----------------------------------------------------------

This log contains recovery events only for 
attempt_1428329756093_168563_1_00_006728_1 (attempt_1) and none for 
attempt_1428329756093_168563_1_00_006728_0 (attempt_0).
It is possible that attempt_0 was killed before it started, so no recovery 
events were logged for it. But we should log the TaskAttemptFinishedEvent even 
when there's no TaskAttemptStartedEvent (link this with TEZ-2456).
Otherwise, as in this case, attempt_0 is not recovered while attempt_1 is, 
and when a new attempt is scheduled its task attempt id is the same as that of 
attempt_1, because we create the task attempt id from attempts.size():
{code}
TaskAttempt attempt = createAttempt(attempts.size());
{code}

That's why we see the following weird transitions (from NEW to KILLED, 
and then from NEW to START_WAIT). These are actually 2 different task attempts 
with the same attempt id, so their state machines get mixed up. 
{noformat}
2015-04-09 20:05:42,055 INFO [AsyncDispatcher event handler] 
impl.TaskAttemptImpl: attempt_1428329756093_168563_1_00_006728_1 TaskAttempt 
Transitioned from NEW to KILLED due to event TA_RECOVER
{noformat}
{noformat}
2015-04-09 20:05:45,748 INFO [AsyncDispatcher event handler] 
impl.TaskAttemptImpl: attempt_1428329756093_168563_1_00_006728_1 TaskAttempt 
Transitioned from NEW to START_WAIT due to event TA_SCHEDULE
{noformat}
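
To make the id collision concrete, here is a minimal sketch of the situation, not the actual Tez code. The class AttemptIdCollisionSketch and its methods recover() and nextAttemptNumber() are hypothetical stand-ins; the only thing taken from Tez is that the new attempt number is derived from attempts.size().

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model of a task's attempt map; not Tez's TaskImpl.
class AttemptIdCollisionSketch {

    // Rebuilds the attempt map from the attempt numbers that have recovery
    // events. Attempts without any logged event are simply absent.
    static Map<Integer, String> recover(List<Integer> attemptNumbersWithEvents) {
        Map<Integer, String> attempts = new LinkedHashMap<>();
        for (int number : attemptNumbersWithEvents) {
            attempts.put(number, "KILLED"); // state restored by TA_RECOVER in the log
        }
        return attempts;
    }

    // Mirrors createAttempt(attempts.size()): the next attempt number is the
    // current size of the map, which assumes the numbers 0..size-1 are all present.
    static int nextAttemptNumber(Map<Integer, String> attempts) {
        return attempts.size();
    }

    public static void main(String[] args) {
        // Only attempt_1 has recovery events; attempt_0 was never logged.
        Map<Integer, String> attempts = recover(Arrays.asList(1));
        int next = nextAttemptNumber(attempts);
        System.out.println("next attempt number = " + next);
        System.out.println("collides with recovered attempt = "
                + attempts.containsKey(next));
    }
}
```

With only attempt_1 recovered the map has size 1, so the newly scheduled attempt is also numbered 1. If attempt_0 had been recovered as well, the size would be 2 and the new attempt would get the unused number 2.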



> InvalidStateTransitonException TA_SCHEDULE at START_WAIT during recovery
> ------------------------------------------------------------------------
>
>                 Key: TEZ-2304
>                 URL: https://issues.apache.org/jira/browse/TEZ-2304
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.6.0
>            Reporter: Jason Lowe
>         Attachments: 168563_recovery.gz
>
>
> I saw a Tez AM throw a few InvalidStateTransitonException (sic) instances 
> during recovery complaining about TA_SCHEDULE arriving at the START_WAIT 
> state.


