[
https://issues.apache.org/jira/browse/TEZ-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14168959#comment-14168959
]
Bikas Saha commented on TEZ-1470:
---------------------------------
{code}
+ successfulAttempt = null;
+ recoveredState = TaskState.RUNNING; // reset to RUNNING, may fail after SUCCEEDED
+ } else if (taskAttemptState.equals(TaskAttemptState.KILLED)) {
+ successfulAttempt = null;
+ recoveredState = TaskState.RUNNING; // reset to RUNNING, may
{code}
Does this need to take care of any ordering? With speculation, a successful
task attempt will be followed by a killed task attempt, so we will see a
successful attempt record and then a killed attempt record. This should not
end up making the task unsuccessful after recovery. I am not sure whether it
already works correctly. Can we please check and confirm? Thanks!
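
To make the ordering concern concrete, here is a minimal sketch of the behavior I would expect from recovery. It is not the code in the patch: the class RecoverySketch, the AttemptFinished record type, and the recover method are hypothetical names used only for illustration, and the assumption is that each recorded finished event carries the attempt id and its final state. The idea is that a KILLED or FAILED record resets the task to RUNNING only when it belongs to the attempt that had previously succeeded.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified model of task recovery; not the actual TaskImpl code.
final class RecoverySketch {
    enum TaskState { RUNNING, SUCCEEDED }
    enum TaskAttemptState { SUCCEEDED, KILLED, FAILED }

    /** Hypothetical stand-in for a recorded attempt-finished event. */
    static final class AttemptFinished {
        final String attemptId;
        final TaskAttemptState state;

        AttemptFinished(String attemptId, TaskAttemptState state) {
            this.attemptId = attemptId;
            this.state = state;
        }
    }

    /** Replays finished-attempt records in recorded order and returns the recovered task state. */
    static TaskState recover(List<AttemptFinished> events) {
        String successfulAttempt = null;
        TaskState recoveredState = TaskState.RUNNING;
        for (AttemptFinished e : events) {
            if (e.state == TaskAttemptState.SUCCEEDED) {
                successfulAttempt = e.attemptId;
                recoveredState = TaskState.SUCCEEDED;
            } else if (e.attemptId.equals(successfulAttempt)) {
                // The attempt that had succeeded was later killed or failed
                // (for example, node failure): the task must run again.
                successfulAttempt = null;
                recoveredState = TaskState.RUNNING;
            }
            // Otherwise a different attempt (for example, a speculative
            // duplicate) was killed or failed; the earlier success stands.
        }
        return recoveredState;
    }

    public static void main(String[] args) {
        // Speculation: attempt_0 succeeds, then the duplicate attempt_1 is killed.
        System.out.println(recover(Arrays.asList(
                new AttemptFinished("attempt_0", TaskAttemptState.SUCCEEDED),
                new AttemptFinished("attempt_1", TaskAttemptState.KILLED))));
        // Node failure: attempt_0 succeeds and is later killed itself.
        System.out.println(recover(Arrays.asList(
                new AttemptFinished("attempt_0", TaskAttemptState.SUCCEEDED),
                new AttemptFinished("attempt_0", TaskAttemptState.KILLED))));
    }
}
```

In this sketch the speculation case stays SUCCEEDED, while the node-failure case goes back to RUNNING. Comparing attempt ids makes the result independent of whether the killed record for the speculative attempt appears before or after the success record.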
> Recovery fails due to TaskAttemptFinishedEvent being recorded multiple times
> for the same task
> ----------------------------------------------------------------------------------------------
>
> Key: TEZ-1470
> URL: https://issues.apache.org/jira/browse/TEZ-1470
> Project: Apache Tez
> Issue Type: Sub-task
> Reporter: Jeff Zhang
> Assignee: Jeff Zhang
> Priority: Minor
> Fix For: 0.5.2
>
> Attachments: Tez-1470-2.patch, Tez-1470.patch
>
>
> TaskAttempt can move from SUCCEEDED to KILLED due to node failure. In this
> case TaskAttemptFinishedEvent may be recorded twice, which causes a
> failure in recovery.
> {code}
> 14-05-16 08:07:18,386 INFO [main] org.apache.hadoop.service.AbstractService: Service org.apache.tez.dag.app.DAGAppMaster failed in state STARTED; cause: org.apache.tez.dag.api.TezUncheckedException: Invalid recovery event for attempt finished, more completions than starts encountered, taskId=task_1400226928057_0001_1_05_000005, finishedAttempts=2, incompleteAttempts=-1
> org.apache.tez.dag.api.TezUncheckedException: Invalid recovery event for attempt finished, more completions than starts encountered, taskId=task_1400226928057_0001_1_05_000005, finishedAttempts=2, incompleteAttempts=-1
>     at org.apache.tez.dag.app.dag.impl.TaskImpl.restoreFromEvent(TaskImpl.java:592)
>     at org.apache.tez.dag.app.RecoveryParser.parseRecoveryData(RecoveryParser.java:814)
>     at org.apache.tez.dag.app.DAGAppMaster.recoverDAG(DAGAppMaster.java:1529)
>     at org.apache.tez.dag.app.DAGAppMaster.serviceStart(DAGAppMaster.java:1558)
>     at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>     at org.apache.tez.dag.app.DAGAppMaster$5.run(DAGAppMaster.java:1957)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:415)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1557)
>     at org.apache.tez.dag.app.DAGAppMaster.initAndStartAppMaster(DAGAppMaster.java:1953)
>     at org.apache.tez.dag.app.DAGAppMaster.main(DAGAppMaster.java:1792)
> {code}