[
https://issues.apache.org/jira/browse/TEZ-2303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14504220#comment-14504220
]
Jeff Zhang commented on TEZ-2303:
---------------------------------
bq. should we disable the dag client service until all the recovery data is
read completely and the recover event is sent to the dag?
I think this could be resolved by the DAGAppMaster state machine in TEZ-1273.
(We should also disallow submitDAG while recovering.)
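To illustrate the idea of rejecting submitDAG while recovering, here is a minimal sketch of a state guard. It is not the actual DAGAppMaster code or the TEZ-1273 design: the class AppMasterGuard, its State enum, and the recoveryFinished method are hypothetical names used only for illustration.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stand-in for an application master; not the real DAGAppMaster.
class AppMasterGuard {
    enum State { RECOVERING, IDLE, RUNNING }

    // Assumption: the AM starts in RECOVERING when a previous attempt exists.
    private final AtomicReference<State> state =
        new AtomicReference<>(State.RECOVERING);

    // Called once all recovery data has been read and replayed.
    void recoveryFinished() {
        state.compareAndSet(State.RECOVERING, State.IDLE);
    }

    // Accepts a DAG only when the AM is IDLE; any other state is rejected.
    String submitDAG(String dagName) {
        if (!state.compareAndSet(State.IDLE, State.RUNNING)) {
            throw new IllegalStateException(
                "Cannot submit DAG " + dagName + " in state " + state.get());
        }
        return dagName;
    }
}
```

With a guard like this, a submitDAG call that arrives during recovery fails fast instead of racing with the replay of recovery events.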
bq. next, even after we send the recover event to dag, the recovery process is
asynchronous so a client can query dag status so do we need to build in any
additional checks to guard against getStatus/getProgress while the recovered
data is being re-built?
The issue with calling getStatus/getProgress during recovery is that the client
may get the wrong status. It's not easy to decide whether the DAG has completed
its recovery: even if the DAG completes its RecoveryTransition, its
vertices/tasks/task attempts may still be in Recovering. We could iterate
through all of them on each getDAGStatus call to check whether each has
finished recovering, but that may not be a good idea for large jobs. I think
TEZ-1657 can solve this issue (TEZ-1657 tries to recover the DAG in one step
rather than separating it into 2 steps: restoreFromEvent & RecoveryTransition.
That means, combined with TEZ-1273, when DAGAppMaster completes its recovery
the DAG also completes its recovery). So I think for this ticket we could just
first add a write lock in restoreFromEvent (at least it won't cause the AM to
shut down, although the client may still get the wrong DAG status). For the
long term, TEZ-1273 & TEZ-1657 can resolve this issue. [~hitesh] Any thoughts?
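As a rough sketch of the short-term fix described above (a write lock in restoreFromEvent), the following shows the general read/write lock pattern. It is not the actual Tez DAGImpl: the class RecoverableDag, the recoveredVertexNames field, and the method signatures are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for a DAG implementation; not the real Tez DAGImpl.
class RecoverableDag {
    private final ReentrantReadWriteLock rwLock = new ReentrantReadWriteLock();
    private final Lock readLock = rwLock.readLock();
    private final Lock writeLock = rwLock.writeLock();

    // State rebuilt from recovery events and read by client status calls.
    private final List<String> recoveredVertexNames = new ArrayList<>();

    // Called by the recovery thread for each replayed event.
    void restoreFromEvent(String vertexName) {
        writeLock.lock();
        try {
            recoveredVertexNames.add(vertexName);
        } finally {
            writeLock.unlock();
        }
    }

    // Called by client-facing threads; returns a snapshot taken under the
    // read lock so iteration never overlaps a concurrent modification.
    List<String> getStatus() {
        readLock.lock();
        try {
            return new ArrayList<>(recoveredVertexNames);
        } finally {
            readLock.unlock();
        }
    }
}
```

This only prevents the concurrent modification; a status call made in the middle of recovery can still observe partially restored state, which matches the caveat above that the client may get the wrong DAG status.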
> ConcurrentModificationException while processing recovery
> ---------------------------------------------------------
>
> Key: TEZ-2303
> URL: https://issues.apache.org/jira/browse/TEZ-2303
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.6.0
> Reporter: Jason Lowe
> Assignee: Jeff Zhang
>
> Saw a Tez AM log a few ConcurrentModificationException messages while trying
> to recover from a previous attempt that crashed. Exception details to follow.