[ 
https://issues.apache.org/jira/browse/TEZ-2303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14504220#comment-14504220
 ] 

Jeff Zhang commented on TEZ-2303:
---------------------------------


bq. should we disable the dag client service until all the recovery data is 
read completely and the recover event is sent to the dag?
I think this could be resolved by the DAGAppMaster state machine in TEZ-1273. (We 
should also disallow submitDAG while recovering.)
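
A minimal sketch of that idea, assuming the AM tracks a single coarse state and rejects submissions until recovery has finished. The class AmStateGate, its State enum, and its methods are hypothetical names for illustration only; they are not the DAGAppMaster API or the TEZ-1273 design.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical illustration only; not the actual Tez DAGAppMaster code.
final class AmStateGate {
    enum State { RECOVERING, IDLE, RUNNING }

    // Assumption: the AM starts in RECOVERING when a previous attempt exists.
    private final AtomicReference<State> state =
        new AtomicReference<>(State.RECOVERING);

    /** Called once all recovery data has been read and replayed. */
    void recoveryFinished() {
        state.compareAndSet(State.RECOVERING, State.IDLE);
    }

    /** Accepts a DAG only when the AM is IDLE; otherwise rejects it. */
    String submitDag(String dagName) {
        if (!state.compareAndSet(State.IDLE, State.RUNNING)) {
            throw new IllegalStateException(
                "Cannot submit " + dagName + " while AM is " + state.get());
        }
        return dagName;
    }

    State current() {
        return state.get();
    }
}
```

With a gate like this, a submitDag call that arrives during recovery fails fast instead of racing with the replay of recovery events.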

bq. next, even after we send the recover event to dag, the recovery process is 
asynchronous so a client can query dag status so do we need to build in any 
additional checks to guard against getStatus/getProgress while the recovered 
data is being re-built?
The problem with calling getStatus/getProgress during recovery is that the 
client may get the wrong status. It is not easy to decide whether the DAG has 
completed its recovery (even if the DAG completes its RecoveryTransition, its 
vertices/tasks/task attempts may still be in Recovering). We could iterate 
through all of them on each getDAGStatus call to check whether each one has 
finished recovering, but that may not be a good idea for large jobs. I think 
TEZ-1657 can solve this issue (TEZ-1657 tries to recover the DAG in one step 
rather than splitting it into 2 steps: restoreFromEvent & RecoveryTransition. 
That means that, combined with TEZ-1273, when the DAGAppMaster completes its 
recovery, the DAG also completes its recovery). So I think for this ticket we 
could just first add a write lock in restoreFromEvent (at least the AM won't 
shut down, although the client may still get the wrong DAG status). In the long 
term, TEZ-1273 & TEZ-1657 can resolve this issue. [~hitesh] Any thoughts?
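
The short-term fix could look roughly like the sketch below, assuming the DAG guards its mutable state with a read/write lock. RecoverableDag, its fields, and the method signatures are hypothetical stand-ins for illustration, not the actual Tez DAG implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical illustration only; not the actual Tez DAG implementation.
final class RecoverableDag {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final List<String> vertexNames = new ArrayList<>();
    private String state = "NEW";

    /** Replays one recovery event; the write lock excludes concurrent readers. */
    void restoreFromEvent(String recoveredState, List<String> recoveredVertices) {
        lock.writeLock().lock();
        try {
            state = recoveredState;
            vertexNames.addAll(recoveredVertices);
        } finally {
            lock.writeLock().unlock();
        }
    }

    /** Client-facing query; iterates the vertex list under the read lock. */
    String getStatus() {
        lock.readLock().lock();
        try {
            StringBuilder sb = new StringBuilder(state);
            for (String v : vertexNames) {
                sb.append(' ').append(v);
            }
            return sb.toString();
        } finally {
            lock.readLock().unlock();
        }
    }
}
```

Holding the write lock only prevents a reader from iterating the collection while it is being modified, which is what avoids the ConcurrentModificationException. A reader that runs between two recovery events can still observe a partially recovered status.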

> ConcurrentModificationException while processing recovery
> ---------------------------------------------------------
>
>                 Key: TEZ-2303
>                 URL: https://issues.apache.org/jira/browse/TEZ-2303
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.6.0
>            Reporter: Jason Lowe
>            Assignee: Jeff Zhang
>
> Saw a Tez AM log a few ConcurrentModificationException messages while trying 
> to recover from a previous attempt that crashed.  Exception details to follow.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
