[ 
https://issues.apache.org/jira/browse/TEZ-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14169012#comment-14169012
 ] 

Bikas Saha commented on TEZ-1470:
---------------------------------

+1 for creating a consolidated VertexRecoveryData and recovering from it. This 
would follow the pattern where the RecoveryService only loads this data from 
the store. Then passes this data to the application, which can then pass the 
data to the DAG, which can then pass the relevant parts to each vertex and so 
on. Each vertex then transitions to its correct state based on that data.

> Recovery fails due to TaskAttemptFinishedEvent being recorded multiple times 
> for the same task
> ----------------------------------------------------------------------------------------------
>
>                 Key: TEZ-1470
>                 URL: https://issues.apache.org/jira/browse/TEZ-1470
>             Project: Apache Tez
>          Issue Type: Sub-task
>            Reporter: Jeff Zhang
>            Assignee: Jeff Zhang
>            Priority: Minor
>             Fix For: 0.5.2
>
>         Attachments: Tez-1470-2.patch, Tez-1470.patch
>
>
> TaskAttempt can move from SUCCEEDED to KILLED due to node failure. In this 
> case TaskAttemptFinishedEvent may been recorded 2 times,and will cause 
> failure in recovery.
> {code}
> 14-05-16 08:07:18,386 INFO [main] org.apache.hadoop.service.AbstractService: 
> Service org.apache.tez.dag.app.DAGAppMaster failed in state STARTED; cause: 
> org.apache.tez.dag.api.TezUncheckedException: Invalid recovery event for 
> attempt finished, more completions than starts encountered, 
> taskId=task_1400226928057_0001_1_05_000005, finishedAttempts=2, 
> incompleteAttempts=-1
> org.apache.tez.dag.api.TezUncheckedException: Invalid recovery event for 
> attempt finished, more completions than starts encountered, 
> taskId=task_1400226928057_0001_1_05_000005, finishedAttempts=2, 
> incompleteAttempts=-1
>       at 
> org.apache.tez.dag.app.dag.impl.TaskImpl.restoreFromEvent(TaskImpl.java:592)
>       at 
> org.apache.tez.dag.app.RecoveryParser.parseRecoveryData(RecoveryParser.java:814)
>       at 
> org.apache.tez.dag.app.DAGAppMaster.recoverDAG(DAGAppMaster.java:1529)
>       at 
> org.apache.tez.dag.app.DAGAppMaster.serviceStart(DAGAppMaster.java:1558)
>       at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>       at org.apache.tez.dag.app.DAGAppMaster$5.run(DAGAppMaster.java:1957)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at javax.security.auth.Subject.doAs(Subject.java:415)
>       at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1557)
>       at 
> org.apache.tez.dag.app.DAGAppMaster.initAndStartAppMaster(DAGAppMaster.java:1953)
>       at org.apache.tez.dag.app.DAGAppMaster.main(DAGAppMaster.java:1792)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to