[ https://issues.apache.org/jira/browse/TEZ-2303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14516200#comment-14516200 ]
Jeff Zhang commented on TEZ-2303:
---------------------------------

[~hitesh] Yes, I think it makes sense as the short-term fix; at least it fixes the ConcurrentModificationException. Regarding the issue of not providing info to clients until the recovery phase is over, I think there are 2 main scenarios:

* The ClientHandler RPC is started but the recovery log has not been read yet. In this case the AM throws a "No running dag" exception, which has no effect on the client side, so I think it is OK.
{code}
2015-04-28 09:32:02,054 INFO [IPC Server handler 0 on 6000] ipc.Server: IPC Server handler 0 on 6000, call org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolBlockingPB.getDAGStatus from 127.0.0.1:63539 Call#9557 Retry#0
org.apache.tez.dag.api.TezException: No running dag at present
	at org.apache.tez.dag.api.client.DAGClientHandler.getDAG(DAGClientHandler.java:89)
	at org.apache.tez.dag.api.client.DAGClientHandler.getACLManager(DAGClientHandler.java:156)
	at org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolBlockingPBServerImpl.getDAGStatus(DAGClientAMProtocolBlockingPBServerImpl.java:95)
	at org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolRPC$DAGClientAMProtocol$2.callBlockingMethod(DAGClientAMProtocolRPC.java:7465)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
{code}
* The second scenario is that even after the recovery log has been read, the RecoveryTransition may not have completed. The client side may then still get the wrong DAG status. As I mentioned, fixing this may need some big changes to the recovery.
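
To make the two scenarios concrete, here is a minimal sketch of what gating client status calls on recovery completion could look like. It is not Tez code: the class {{RecoveryGate}}, the {{Phase}} enum and the callback names are all hypothetical, and plain strings stand in for the real DAG status objects. The only thing borrowed from the log above is the "No running dag at present" message.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical sketch; none of these names come from the Tez code base. */
class RecoveryGate {
    /** Illustrative phases an AM might pass through after a restart. */
    enum Phase { LOG_NOT_READ, LOG_READ, RECOVERED }

    private final AtomicReference<Phase> phase =
            new AtomicReference<>(Phase.LOG_NOT_READ);
    private volatile String dagStatus;

    /** Recovery log parsed, but the recovery transition has not finished. */
    void onRecoveryLogRead(String provisionalStatus) {
        dagStatus = provisionalStatus;
        phase.set(Phase.LOG_READ);
    }

    /** Recovery transition finished; the status can now be trusted. */
    void onRecoveryTransitionDone(String finalStatus) {
        dagStatus = finalStatus;
        phase.set(Phase.RECOVERED);
    }

    /** Rejects client calls in both scenarios instead of exposing a stale status. */
    String getDAGStatus() {
        if (phase.get() != Phase.RECOVERED) {
            throw new IllegalStateException("No running dag at present");
        }
        return dagStatus;
    }
}
```

With a gate like this, the second scenario would behave like the first one from the client's point of view: the call fails until recovery has fully completed, rather than returning a status that may still change.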
We can leave that larger change for the future and take it into account when refactoring the recovery code.

> ConcurrentModificationException while processing recovery
> ---------------------------------------------------------
>
>                 Key: TEZ-2303
>                 URL: https://issues.apache.org/jira/browse/TEZ-2303
>             Project: Apache Tez
>          Issue Type: Bug
>   Affects Versions: 0.6.0
>            Reporter: Jason Lowe
>            Assignee: Jeff Zhang
>         Attachments: TEZ-2303-1.patch, TEZ-2303-2.patch, TEZ-2303-4.patch
>
>
> Saw a Tez AM log a few ConcurrentModificationException messages while trying
> to recover from a previous attempt that crashed. Exception details to follow.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)