[ 
https://issues.apache.org/jira/browse/TEZ-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14167974#comment-14167974
 ] 

Jeff Zhang commented on TEZ-1470:
---------------------------------

[~hitesh] Attached a new patch.

bq. can taskAttemptStatus map be Map<Integer, Boolean>
Changed.

bq. also have you checked thread safe access to it in all cases? Maybe change 
to concurrent hash map?
Yes. The map is only modified in the state machine thread (AsyncDispatcher), 
except in restoreFromEvent, which is called before the state machine starts.

bq. should getUncompletedAttemptsCount and getFinishedAttemptsCount have 
appropriate locking as they traverse the map?
Added a readLock.
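
For illustration only, here is a minimal sketch of what the read-locked counters could look like. It is not the code from the patch: the class name AttemptStatusTracker and the methods attemptStarted and attemptFinished are hypothetical, and it assumes the map goes from attempt number to a "finished" flag. Only taskAttemptStatus, getUncompletedAttemptsCount and getFinishedAttemptsCount are names taken from the review comments.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for the attempt bookkeeping discussed above; not the Tez code.
class AttemptStatusTracker {
    // Assumed meaning: attempt number -> true once the attempt has finished.
    private final Map<Integer, Boolean> taskAttemptStatus = new HashMap<>();
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    // Writers take the write lock (hypothetical method).
    void attemptStarted(int attemptNumber) {
        lock.writeLock().lock();
        try {
            // Do not reset an attempt that is already marked finished.
            taskAttemptStatus.putIfAbsent(attemptNumber, false);
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Recording the same finish twice leaves the map unchanged (hypothetical method).
    void attemptFinished(int attemptNumber) {
        lock.writeLock().lock();
        try {
            taskAttemptStatus.put(attemptNumber, true);
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Readers traverse the map under the read lock.
    int getFinishedAttemptsCount() {
        return count(true);
    }

    int getUncompletedAttemptsCount() {
        return count(false);
    }

    private int count(boolean wantFinished) {
        lock.readLock().lock();
        try {
            int n = 0;
            for (boolean finished : taskAttemptStatus.values()) {
                if (finished == wantFinished) {
                    n++;
                }
            }
            return n;
        } finally {
            lock.readLock().unlock();
        }
    }
}
```

With a single writer thread the write lock is rarely contended; the read lock mainly protects callers on other threads that traverse the map while it is being updated.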

> Recovery fail due to TaskAttemptFinishedEvent is recorded multiple times for 
> the same task
> ------------------------------------------------------------------------------------------
>
>                 Key: TEZ-1470
>                 URL: https://issues.apache.org/jira/browse/TEZ-1470
>             Project: Apache Tez
>          Issue Type: Sub-task
>            Reporter: Jeff Zhang
>            Assignee: Jeff Zhang
>            Priority: Minor
>         Attachments: Tez-1470-2.patch, Tez-1470.patch
>
>
> A TaskAttempt can move from SUCCEEDED to KILLED due to node failure. In this 
> case TaskAttemptFinishedEvent may be recorded twice, which causes recovery 
> to fail.
> {code}
> 14-05-16 08:07:18,386 INFO [main] org.apache.hadoop.service.AbstractService: 
> Service org.apache.tez.dag.app.DAGAppMaster failed in state STARTED; cause: 
> org.apache.tez.dag.api.TezUncheckedException: Invalid recovery event for 
> attempt finished, more completions than starts encountered, 
> taskId=task_1400226928057_0001_1_05_000005, finishedAttempts=2, 
> incompleteAttempts=-1
> org.apache.tez.dag.api.TezUncheckedException: Invalid recovery event for 
> attempt finished, more completions than starts encountered, 
> taskId=task_1400226928057_0001_1_05_000005, finishedAttempts=2, 
> incompleteAttempts=-1
>       at org.apache.tez.dag.app.dag.impl.TaskImpl.restoreFromEvent(TaskImpl.java:592)
>       at org.apache.tez.dag.app.RecoveryParser.parseRecoveryData(RecoveryParser.java:814)
>       at org.apache.tez.dag.app.DAGAppMaster.recoverDAG(DAGAppMaster.java:1529)
>       at org.apache.tez.dag.app.DAGAppMaster.serviceStart(DAGAppMaster.java:1558)
>       at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>       at org.apache.tez.dag.app.DAGAppMaster$5.run(DAGAppMaster.java:1957)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at javax.security.auth.Subject.doAs(Subject.java:415)
>       at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1557)
>       at org.apache.tez.dag.app.DAGAppMaster.initAndStartAppMaster(DAGAppMaster.java:1953)
>       at org.apache.tez.dag.app.DAGAppMaster.main(DAGAppMaster.java:1792)
> {code}
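
To make the failure mode concrete, here is a small self-contained sketch of replaying recovery events. It assumes recovery kept plain counters, which the error message suggests but this thread does not show. The class RecoveryReplaySketch, the event encoding and both replay methods are hypothetical and are not Tez APIs.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of the bug described above; not the Tez recovery code.
class RecoveryReplaySketch {
    static final int STARTED = 0;
    static final int FINISHED = 1;

    // Each event is {type, attemptNumber}. Returns {finishedAttempts, incompleteAttempts}.
    // Assumed counter-based bookkeeping: every finish event decrements the incomplete count.
    static int[] replayWithCounters(int[][] events) {
        int finished = 0;
        int incomplete = 0;
        for (int[] e : events) {
            if (e[0] == STARTED) {
                incomplete++;
            } else {
                finished++;
                incomplete--;
            }
        }
        return new int[] {finished, incomplete};
    }

    // Per-attempt status map: a second finish event for the same attempt changes nothing.
    static int[] replayWithStatusMap(int[][] events) {
        Map<Integer, Boolean> taskAttemptStatus = new HashMap<>();
        for (int[] e : events) {
            if (e[0] == STARTED) {
                taskAttemptStatus.putIfAbsent(e[1], false);
            } else {
                taskAttemptStatus.put(e[1], true);
            }
        }
        int finished = 0;
        for (boolean done : taskAttemptStatus.values()) {
            if (done) {
                finished++;
            }
        }
        return new int[] {finished, taskAttemptStatus.size() - finished};
    }

    public static void main(String[] args) {
        // One attempt starts, then its finish is recorded twice
        // (for example once for SUCCEEDED and once for KILLED).
        int[][] events = {{STARTED, 0}, {FINISHED, 0}, {FINISHED, 0}};
        int[] c = replayWithCounters(events);
        int[] m = replayWithStatusMap(events);
        System.out.println("counters:   finishedAttempts=" + c[0] + ", incompleteAttempts=" + c[1]);
        System.out.println("status map: finishedAttempts=" + m[0] + ", incompleteAttempts=" + m[1]);
    }
}
```

With the duplicated finish event, the counter version ends at finishedAttempts=2, incompleteAttempts=-1, the same values as in the exception above, while the map version ends at finishedAttempts=1, incompleteAttempts=0.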



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
