[
https://issues.apache.org/jira/browse/TEZ-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14168990#comment-14168990
]
Jeff Zhang commented on TEZ-1470:
---------------------------------
[~bikassaha] Thanks for your careful review. Yes, this does cause an issue with
speculation. The following code may be better. But speculation is not supported
in Tez yet, right?
{code}
// reset to RUNNING, may fail after SUCCEEDED
if (successfulAttempt.equals(taskAttempt.getID())) {
  successfulAttempt = null;
  recoveredState = TaskState.RUNNING;
}
{code}
[~hitesh], [~bikassaha], besides that, I think restoring events one by one
may not be a good approach to recovery, considering there may be some
non-trivial cases (e.g. one start event, multiple finished events). I think
we could group the events first, then create a RecoveryData from the
grouped events and restore from that RecoveryData rather than restoring
events one by one. It would be much cleaner, and potential issues would be
easier to find by checking the grouped events or the RecoveryData. What do you
think?
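
A rough sketch of the grouping idea, only to illustrate it. This is not the
actual Tez API: GroupedRecoverySketch, AttemptEvent, AttemptRecoveryData and
group() are hypothetical names, and the rule that the last finished event
wins is an assumption, not something decided in this issue.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; none of these types exist in Tez under these names.
class GroupedRecoverySketch {

    enum EventType { STARTED, FINISHED }

    enum AttemptState { RUNNING, SUCCEEDED, KILLED, FAILED }

    /** Simplified stand-in for one recovery log entry of a task attempt. */
    static final class AttemptEvent {
        final String attemptId;
        final EventType type;
        final AttemptState finishState; // null for STARTED events

        AttemptEvent(String attemptId, EventType type, AttemptState finishState) {
            this.attemptId = attemptId;
            this.type = type;
            this.finishState = finishState;
        }
    }

    /** All events of one attempt, summarized before any state is restored. */
    static final class AttemptRecoveryData {
        final String attemptId;
        int startCount = 0;
        final List<AttemptState> finishStates = new ArrayList<>();

        AttemptRecoveryData(String attemptId) {
            this.attemptId = attemptId;
        }

        void add(AttemptEvent event) {
            if (event.type == EventType.STARTED) {
                startCount++;
            } else {
                finishStates.add(event.finishState);
            }
        }

        /** True when the attempt logged more than one finished event. */
        boolean hasMultipleFinishes() {
            return finishStates.size() > 1;
        }

        /** Assumption: the last finished event wins (SUCCEEDED then KILLED gives KILLED). */
        AttemptState recoveredState() {
            if (startCount != 1) {
                throw new IllegalStateException(
                    "Expected exactly one start event for " + attemptId + ", found " + startCount);
            }
            if (finishStates.isEmpty()) {
                return AttemptState.RUNNING;
            }
            return finishStates.get(finishStates.size() - 1);
        }
    }

    /** Groups events by attempt ID, preserving first-seen order. */
    static Map<String, AttemptRecoveryData> group(List<AttemptEvent> events) {
        Map<String, AttemptRecoveryData> grouped = new LinkedHashMap<>();
        for (AttemptEvent event : events) {
            grouped.computeIfAbsent(event.attemptId, AttemptRecoveryData::new).add(event);
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<AttemptEvent> log = Arrays.asList(
            new AttemptEvent("attempt_0", EventType.STARTED, null),
            new AttemptEvent("attempt_0", EventType.FINISHED, AttemptState.SUCCEEDED),
            new AttemptEvent("attempt_0", EventType.FINISHED, AttemptState.KILLED),
            new AttemptEvent("attempt_1", EventType.STARTED, null));

        for (AttemptRecoveryData data : group(log).values()) {
            System.out.println(data.attemptId + " -> " + data.recoveredState()
                + (data.hasMultipleFinishes() ? " (multiple finished events)" : ""));
        }
    }
}
```

With the events grouped this way, a duplicate finished event shows up as a
property of the per-attempt data (hasMultipleFinishes) instead of surfacing as
a counter mismatch in the middle of a one-by-one replay.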
> Recovery fails due to TaskAttemptFinishedEvent being recorded multiple times
> for the same task
> ----------------------------------------------------------------------------------------------
>
> Key: TEZ-1470
> URL: https://issues.apache.org/jira/browse/TEZ-1470
> Project: Apache Tez
> Issue Type: Sub-task
> Reporter: Jeff Zhang
> Assignee: Jeff Zhang
> Priority: Minor
> Fix For: 0.5.2
>
> Attachments: Tez-1470-2.patch, Tez-1470.patch
>
>
> TaskAttempt can move from SUCCEEDED to KILLED due to node failure. In this
> case TaskAttemptFinishedEvent may be recorded twice, which will cause a
> failure in recovery.
> {code}
> 14-05-16 08:07:18,386 INFO [main] org.apache.hadoop.service.AbstractService:
> Service org.apache.tez.dag.app.DAGAppMaster failed in state STARTED; cause:
> org.apache.tez.dag.api.TezUncheckedException: Invalid recovery event for
> attempt finished, more completions than starts encountered,
> taskId=task_1400226928057_0001_1_05_000005, finishedAttempts=2,
> incompleteAttempts=-1
> org.apache.tez.dag.api.TezUncheckedException: Invalid recovery event for
> attempt finished, more completions than starts encountered,
> taskId=task_1400226928057_0001_1_05_000005, finishedAttempts=2,
> incompleteAttempts=-1
> at org.apache.tez.dag.app.dag.impl.TaskImpl.restoreFromEvent(TaskImpl.java:592)
> at org.apache.tez.dag.app.RecoveryParser.parseRecoveryData(RecoveryParser.java:814)
> at org.apache.tez.dag.app.DAGAppMaster.recoverDAG(DAGAppMaster.java:1529)
> at org.apache.tez.dag.app.DAGAppMaster.serviceStart(DAGAppMaster.java:1558)
> at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> at org.apache.tez.dag.app.DAGAppMaster$5.run(DAGAppMaster.java:1957)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1557)
> at org.apache.tez.dag.app.DAGAppMaster.initAndStartAppMaster(DAGAppMaster.java:1953)
> at org.apache.tez.dag.app.DAGAppMaster.main(DAGAppMaster.java:1792)
> {code}