[
https://issues.apache.org/jira/browse/TEZ-2304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14557994#comment-14557994
]
Jeff Zhang edited comment on TEZ-2304 at 5/25/15 6:28 AM:
----------------------------------------------------------
In this log, there are recovery events only for
attempt_1428329756093_168563_1_00_006728_1 (attempt_1) and none for
attempt_1428329756093_168563_1_00_006728_0 (attempt_0).
It is possible that attempt_0 was killed before it started, so there are no
recovery events for it. But we should log the TaskAttemptFinishedEvent even
when there's no TaskAttemptStartedEvent. (link this with TEZ-2456)
Otherwise, as in this case, attempt_0 is not recovered while attempt_1 is, and
when a new attempt is scheduled its task attempt id is the same as attempt_1's,
because we create the task attempt id from attempts.size():
{code}
TaskAttempt attempt = createAttempt(attempts.size());
{code}
That's why we see the following weird transitions (from NEW to KILLED, and
then from NEW to START_WAIT). These are actually 2 different task attempts
with the same attempt id, so their state machines get mixed up.
{noformat}
2015-04-09 20:05:42,055 INFO [AsyncDispatcher event handler]
impl.TaskAttemptImpl: attempt_1428329756093_168563_1_00_006728_1 TaskAttempt
Transitioned from NEW to KILLED due to event TA_RECOVER
{noformat}
{noformat}
2015-04-09 20:05:45,748 INFO [AsyncDispatcher event handler]
impl.TaskAttemptImpl: attempt_1428329756093_168563_1_00_006728_1 TaskAttempt
Transitioned from NEW to START_WAIT due to event TA_SCHEDULE
{noformat}
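The collision described above can be reproduced with a small standalone sketch. It is not Tez code: the class name, the helper methods, and the map of attempt number to state label are all made up for illustration, and the only thing taken from the source is that the next attempt number comes from attempts.size().

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative stand-in, not Tez code: attempts are keyed by attempt number,
// and the value is just a state label.
class AttemptIdCollisionSketch {

    // Mirrors the quoted allocation: createAttempt(attempts.size()).
    static int nextAttemptNumber(Map<Integer, String> attempts) {
        return attempts.size();
    }

    // True when the number that would be handed out is already in use.
    static boolean collides(Map<Integer, String> attempts) {
        return attempts.containsKey(nextAttemptNumber(attempts));
    }

    public static void main(String[] args) {
        // Case from the log: only attempt_1 has recovery events.
        Map<Integer, String> recovered = new TreeMap<>();
        recovered.put(1, "KILLED");
        System.out.println("next=" + nextAttemptNumber(recovered)
                + " collides=" + collides(recovered)); // next=1 collides=true

        // If attempt_0's finish event had been logged, both would be recovered.
        recovered.put(0, "KILLED");
        System.out.println("next=" + nextAttemptNumber(recovered)
                + " collides=" + collides(recovered)); // next=2 collides=false
    }
}
```

With only attempt_1 recovered, size() is 1 and the new attempt reuses number 1. Once attempt_0 is recovered as well, size() is 2 and the numbers no longer clash, which is why logging the finish event for a never-started attempt avoids the problem in this scenario.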
was (Author: zjffdu):
In this log, there are recovery events only for
attempt_1428329756093_168563_1_00_006728_1 (attempt_1) and none for
attempt_1428329756093_168563_1_00_006728_0 (attempt_0).
It is possible that attempt_0 was killed before it started, so there are no
recovery events for it. We should log the TaskAttemptFinishedEvent even when
there's no TaskAttemptStartedEvent. (link this with TEZ-2456)
In this case, attempt_0 is not recovered while attempt_1 is, and when a new
attempt is scheduled its task attempt id is the same as attempt_1's, because
we create the task attempt id from attempts.size():
{code}
TaskAttempt attempt = createAttempt(attempts.size());
{code}
That's why we see the following weird transitions (from NEW to KILLED, and
then from NEW to START_WAIT). These are actually 2 different task attempts
with the same attempt id, so their state machines get mixed up.
{noformat}
2015-04-09 20:05:42,055 INFO [AsyncDispatcher event handler]
impl.TaskAttemptImpl: attempt_1428329756093_168563_1_00_006728_1 TaskAttempt
Transitioned from NEW to KILLED due to event TA_RECOVER
{noformat}
{noformat}
2015-04-09 20:05:45,748 INFO [AsyncDispatcher event handler]
impl.TaskAttemptImpl: attempt_1428329756093_168563_1_00_006728_1 TaskAttempt
Transitioned from NEW to START_WAIT due to event TA_SCHEDULE
{noformat}
> InvalidStateTransitonException TA_SCHEDULE at START_WAIT during recovery
> ------------------------------------------------------------------------
>
> Key: TEZ-2304
> URL: https://issues.apache.org/jira/browse/TEZ-2304
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.6.0
> Reporter: Jason Lowe
> Attachments: 168563_recovery.gz
>
>
> I saw a Tez AM throw a few InvalidStateTransitonException (sic) instances
> during recovery complaining about TA_SCHEDULE arriving at the START_WAIT
> state.