[
https://issues.apache.org/jira/browse/TEZ-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14167974#comment-14167974
]
Jeff Zhang commented on TEZ-1470:
---------------------------------
[~hitesh] Attached a new patch.
bq. can taskAttemptStatus map be Map<Integer, Boolean>
Changed.
bq. also have you checked thread safe access to it in all cases? Maybe change
to concurrent hash map?
Yes. It is only modified in the state machine thread (AsyncDispatcher), except
in restoreFromEvent, which is called before the state machine starts.
bq. should getUncompletedAttemptsCount and getFinishedAttemptsCount have
appropriate locking as they traverse the map?
Added a readLock to both methods.
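
For readers following the review points above, here is a minimal sketch of the idea, not the code in the patch. The class name AttemptTrackerSketch and the methods attemptStarted and attemptFinished are hypothetical; only taskAttemptStatus, getFinishedAttemptsCount and getUncompletedAttemptsCount come from the discussion. It assumes writers take a write lock, which may differ from how TaskImpl actually guards its updates. Keying the map by attempt number makes a second finish record for the same attempt a no-op, so the counts cannot go negative.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for the attempt bookkeeping; not the actual TaskImpl code.
class AttemptTrackerSketch {
    // attempt number -> true once the attempt has finished
    private final Map<Integer, Boolean> taskAttemptStatus = new HashMap<>();
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    void attemptStarted(int attemptId) {
        lock.writeLock().lock();
        try {
            // Do not reset an attempt that is already marked finished.
            taskAttemptStatus.putIfAbsent(attemptId, false);
        } finally {
            lock.writeLock().unlock();
        }
    }

    void attemptFinished(int attemptId) {
        lock.writeLock().lock();
        try {
            // Idempotent: recording the same attempt twice changes nothing.
            taskAttemptStatus.put(attemptId, true);
        } finally {
            lock.writeLock().unlock();
        }
    }

    int getFinishedAttemptsCount() {
        return count(true);
    }

    int getUncompletedAttemptsCount() {
        return count(false);
    }

    // Traverses the map under the read lock, as suggested in the review.
    private int count(boolean finished) {
        lock.readLock().lock();
        try {
            int n = 0;
            for (boolean status : taskAttemptStatus.values()) {
                if (status == finished) {
                    n++;
                }
            }
            return n;
        } finally {
            lock.readLock().unlock();
        }
    }

    public static void main(String[] args) {
        AttemptTrackerSketch tracker = new AttemptTrackerSketch();
        tracker.attemptStarted(0);
        tracker.attemptFinished(0); // attempt reaches SUCCEEDED
        tracker.attemptFinished(0); // same attempt recorded again after moving to KILLED
        // prints "finished=1 uncompleted=0"
        System.out.println("finished=" + tracker.getFinishedAttemptsCount()
                + " uncompleted=" + tracker.getUncompletedAttemptsCount());
    }
}
```

With plain counters, the same sequence (one start, two finishes) would give finishedAttempts=2 and incompleteAttempts=-1, which matches the values in the exception quoted below.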
> Recovery fail due to TaskAttemptFinishedEvent is recorded multiple times for
> the same task
> ------------------------------------------------------------------------------------------
>
> Key: TEZ-1470
> URL: https://issues.apache.org/jira/browse/TEZ-1470
> Project: Apache Tez
> Issue Type: Sub-task
> Reporter: Jeff Zhang
> Assignee: Jeff Zhang
> Priority: Minor
> Attachments: Tez-1470-2.patch, Tez-1470.patch
>
>
> TaskAttempt can move from SUCCEEDED to KILLED due to node failure. In this
> case TaskAttemptFinishedEvent may be recorded twice, which causes recovery
> to fail.
> {code}
> 14-05-16 08:07:18,386 INFO [main] org.apache.hadoop.service.AbstractService:
> Service org.apache.tez.dag.app.DAGAppMaster failed in state STARTED; cause:
> org.apache.tez.dag.api.TezUncheckedException: Invalid recovery event for
> attempt finished, more completions than starts encountered,
> taskId=task_1400226928057_0001_1_05_000005, finishedAttempts=2,
> incompleteAttempts=-1
> org.apache.tez.dag.api.TezUncheckedException: Invalid recovery event for
> attempt finished, more completions than starts encountered,
> taskId=task_1400226928057_0001_1_05_000005, finishedAttempts=2,
> incompleteAttempts=-1
> at org.apache.tez.dag.app.dag.impl.TaskImpl.restoreFromEvent(TaskImpl.java:592)
> at org.apache.tez.dag.app.RecoveryParser.parseRecoveryData(RecoveryParser.java:814)
> at org.apache.tez.dag.app.DAGAppMaster.recoverDAG(DAGAppMaster.java:1529)
> at org.apache.tez.dag.app.DAGAppMaster.serviceStart(DAGAppMaster.java:1558)
> at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> at org.apache.tez.dag.app.DAGAppMaster$5.run(DAGAppMaster.java:1957)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1557)
> at org.apache.tez.dag.app.DAGAppMaster.initAndStartAppMaster(DAGAppMaster.java:1953)
> at org.apache.tez.dag.app.DAGAppMaster.main(DAGAppMaster.java:1792)
> {code}