[ 
https://issues.apache.org/jira/browse/TEZ-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14166373#comment-14166373
 ] 

Hitesh Shah commented on TEZ-1470:
----------------------------------

Looking at the patch in general, I had a question:
   - Do we need to change all the code in TaskImpl to use the taskAttemptStatus 
map instead of the original int counters?
   - Could we instead limit the change so that the taskAttemptStatus map is used 
only in the recovery path: populate the map as events are restored, and then, in 
the recover transition, use the map to set up the counters and act as needed?
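
To make the second option concrete, here is a minimal sketch of what a recovery-only map might look like. It is not taken from the patch or from TaskImpl: the class name, the AttemptState enum, the method names and the exact counter semantics are all assumptions made for illustration. The point is only that a second finished event for the same attempt overwrites the map entry instead of bumping a counter twice, and the int counters are derived once in the recover step.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: none of these names come from TaskImpl or the patch.
class TaskRecoverySketch {
    enum AttemptState { STARTED, SUCCEEDED, FAILED, KILLED }

    // Populated only while recovery events are replayed.
    private final Map<String, AttemptState> taskAttemptStatus = new LinkedHashMap<>();

    // Stand-ins for the original int counters; the normal (non-recovery)
    // code path would keep updating them directly, as before.
    int finishedAttempts;
    int incompleteAttempts;

    // Replay of an "attempt started" event.
    void restoreAttemptStarted(String attemptId) {
        taskAttemptStatus.putIfAbsent(attemptId, AttemptState.STARTED);
    }

    // Replay of an "attempt finished" event. If the same attempt was recorded
    // twice (for example SUCCEEDED, then KILLED after a node failure), the
    // later state replaces the earlier one.
    void restoreAttemptFinished(String attemptId, AttemptState finalState) {
        taskAttemptStatus.put(attemptId, finalState);
    }

    // Recover step: derive the counters from the map in a single pass.
    void recover() {
        finishedAttempts = 0;
        incompleteAttempts = 0;
        for (AttemptState state : taskAttemptStatus.values()) {
            if (state == AttemptState.STARTED) {
                incompleteAttempts++;
            } else {
                finishedAttempts++;
            }
        }
    }
}
```

With this shape, replaying one start and two finished events for the same attempt leaves finishedAttempts at 1 and incompleteAttempts at 0, so the "more completions than starts" check would have nothing to trip on.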



> Recovery fail due to TaskAttemptFinishedEvent is recorded multiple times for 
> the same task
> ------------------------------------------------------------------------------------------
>
>                 Key: TEZ-1470
>                 URL: https://issues.apache.org/jira/browse/TEZ-1470
>             Project: Apache Tez
>          Issue Type: Sub-task
>            Reporter: Jeff Zhang
>            Assignee: Jeff Zhang
>            Priority: Minor
>         Attachments: Tez-1470.patch
>
>
> TaskAttempt can move from SUCCEEDED to KILLED due to node failure. In this 
> case TaskAttemptFinishedEvent may be recorded twice, which causes recovery 
> to fail.
> {code}
> 14-05-16 08:07:18,386 INFO [main] org.apache.hadoop.service.AbstractService: 
> Service org.apache.tez.dag.app.DAGAppMaster failed in state STARTED; cause: 
> org.apache.tez.dag.api.TezUncheckedException: Invalid recovery event for 
> attempt finished, more completions than starts encountered, 
> taskId=task_1400226928057_0001_1_05_000005, finishedAttempts=2, 
> incompleteAttempts=-1
> org.apache.tez.dag.api.TezUncheckedException: Invalid recovery event for 
> attempt finished, more completions than starts encountered, 
> taskId=task_1400226928057_0001_1_05_000005, finishedAttempts=2, 
> incompleteAttempts=-1
>       at org.apache.tez.dag.app.dag.impl.TaskImpl.restoreFromEvent(TaskImpl.java:592)
>       at org.apache.tez.dag.app.RecoveryParser.parseRecoveryData(RecoveryParser.java:814)
>       at org.apache.tez.dag.app.DAGAppMaster.recoverDAG(DAGAppMaster.java:1529)
>       at org.apache.tez.dag.app.DAGAppMaster.serviceStart(DAGAppMaster.java:1558)
>       at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>       at org.apache.tez.dag.app.DAGAppMaster$5.run(DAGAppMaster.java:1957)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at javax.security.auth.Subject.doAs(Subject.java:415)
>       at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1557)
>       at org.apache.tez.dag.app.DAGAppMaster.initAndStartAppMaster(DAGAppMaster.java:1953)
>       at org.apache.tez.dag.app.DAGAppMaster.main(DAGAppMaster.java:1792)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
