[
https://issues.apache.org/jira/browse/TEZ-2303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14516200#comment-14516200
]
Jeff Zhang commented on TEZ-2303:
---------------------------------
[~hitesh] Yes, I think it makes sense as a short-term fix; at least it fixes the
ConcurrentModificationException.
Regarding the issue of not providing info to clients until the recovery phase
is over, I think there are two main scenarios:
* The ClientHandler RPC is started but the recovery log has not been read yet.
In this case, the AM throws a "No running dag at present" exception, which has
no effect on the client side, so I think it is OK.
{code}
2015-04-28 09:32:02,054 INFO [IPC Server handler 0 on 6000] ipc.Server: IPC Server handler 0 on 6000, call org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolBlockingPB.getDAGStatus from 127.0.0.1:63539 Call#9557 Retry#0
org.apache.tez.dag.api.TezException: No running dag at present
	at org.apache.tez.dag.api.client.DAGClientHandler.getDAG(DAGClientHandler.java:89)
	at org.apache.tez.dag.api.client.DAGClientHandler.getACLManager(DAGClientHandler.java:156)
	at org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolBlockingPBServerImpl.getDAGStatus(DAGClientAMProtocolBlockingPBServerImpl.java:95)
	at org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolRPC$DAGClientAMProtocol$2.callBlockingMethod(DAGClientAMProtocolRPC.java:7465)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
{code}
* The second scenario is that even after the recovery log is read, the
RecoveryTransition may not have completed. The client side may then still get
the wrong dag status. As I mentioned, fixing this may need some big changes to
the recovery. We can leave it for the future and take it into account when
refactoring the recovery code.
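To make the two scenarios concrete, here is a minimal sketch of how status requests could be gated until recovery has fully finished. It is only an illustration under assumptions: the names RecoveryGateSketch, RecoveryPhase and NotReadyException are hypothetical and are not Tez classes, and the real change would have to hook into the actual recovery code.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical sketch; none of these names are actual Tez classes. */
public class RecoveryGateSketch {

    /** Phases the AM goes through while recovering from a previous attempt. */
    enum RecoveryPhase { LOG_NOT_READ, LOG_READ_TRANSITION_PENDING, RECOVERED }

    /** Stand-in for the exception returned to a client that asks too early. */
    static class NotReadyException extends Exception {
        NotReadyException(String message) { super(message); }
    }

    private final AtomicReference<RecoveryPhase> phase =
            new AtomicReference<>(RecoveryPhase.LOG_NOT_READ);
    private volatile String dagStatus = "UNKNOWN";

    /** Scenario 2 starts here: the log is read, but the status may still be stale. */
    void onRecoveryLogRead(String statusFromLog) {
        dagStatus = statusFromLog;
        phase.set(RecoveryPhase.LOG_READ_TRANSITION_PENDING);
    }

    /** Called once the recovery transition has completed. */
    void onRecoveryTransitionDone(String finalStatus) {
        dagStatus = finalStatus;          // publish the status first
        phase.set(RecoveryPhase.RECOVERED); // then open the gate
    }

    /** Rejects the request in both scenarios instead of returning a stale status. */
    String getDagStatus() throws NotReadyException {
        RecoveryPhase current = phase.get();
        if (current != RecoveryPhase.RECOVERED) {
            throw new NotReadyException("DAG status unavailable, recovery phase: " + current);
        }
        return dagStatus;
    }

    public static void main(String[] args) throws Exception {
        RecoveryGateSketch gate = new RecoveryGateSketch();
        gate.onRecoveryLogRead("RUNNING");
        gate.onRecoveryTransitionDone("SUCCEEDED");
        System.out.println(gate.getDagStatus());
    }
}
```

With a gate like this the client would see an explicit "not ready" error in both scenarios rather than a possibly wrong status in the second one; whether clients should retry on that error is a separate design question.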
> ConcurrentModificationException while processing recovery
> ---------------------------------------------------------
>
> Key: TEZ-2303
> URL: https://issues.apache.org/jira/browse/TEZ-2303
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.6.0
> Reporter: Jason Lowe
> Assignee: Jeff Zhang
> Attachments: TEZ-2303-1.patch, TEZ-2303-2.patch, TEZ-2303-4.patch
>
>
> Saw a Tez AM log a few ConcurrentModificationException messages while trying
> to recover from a previous attempt that crashed. Exception details to follow.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)