[ 
https://issues.apache.org/jira/browse/TEZ-2303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14516200#comment-14516200
 ] 

Jeff Zhang commented on TEZ-2303:
---------------------------------

[~hitesh] Yes, I think it makes sense as a short-term fix, since at least it 
fixes the ConcurrentModificationException. 

Regarding the issue of not providing info to clients until the recovery phase 
is over, I think there are two main scenarios:

* The ClientHandler RPC is started but the recovery log has not been read yet. 
In this case, the AM throws a "No running dag at present" exception, which has 
no effect on the client side, so I think it is OK.
{code}
2015-04-28 09:32:02,054 INFO [IPC Server handler 0 on 6000] ipc.Server: IPC Server handler 0 on 6000, call org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolBlockingPB.getDAGStatus from 127.0.0.1:63539 Call#9557 Retry#0
org.apache.tez.dag.api.TezException: No running dag at present
    at org.apache.tez.dag.api.client.DAGClientHandler.getDAG(DAGClientHandler.java:89)
    at org.apache.tez.dag.api.client.DAGClientHandler.getACLManager(DAGClientHandler.java:156)
    at org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolBlockingPBServerImpl.getDAGStatus(DAGClientAMProtocolBlockingPBServerImpl.java:95)
    at org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolRPC$DAGClientAMProtocol$2.callBlockingMethod(DAGClientAMProtocolRPC.java:7465)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
{code}

* The second scenario is that even after the recovery log is read, the 
RecoveryTransition may not have completed, so the client may still get the 
wrong DAG status. As I mentioned, this may need some big changes to the 
recovery. We can leave it for the future and take it into account when 
refactoring the recovery code. 
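
To make the second scenario concrete, here is a minimal sketch of one way 
status requests could be held back until recovery has fully completed. It is 
not the Tez implementation and not what the patch does: RecoveryGate, 
finishRecovery and getDagStatus are hypothetical names used only for 
illustration, and a CountDownLatch is just one possible mechanism.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical gate (not a Tez class): hides DAG status until recovery is done. */
class RecoveryGate {
    // Opened exactly once, when the recovery transition has completed.
    private final CountDownLatch recovered = new CountDownLatch(1);
    private volatile String dagStatus = "UNKNOWN";

    /** Called by the recovery thread once the final recovered state is known. */
    void finishRecovery(String recoveredStatus) {
        dagStatus = recoveredStatus;
        recovered.countDown();
    }

    /** Called by an RPC handler thread; fails instead of returning a stale status. */
    String getDagStatus(long timeoutMillis) {
        try {
            if (!recovered.await(timeoutMillis, TimeUnit.MILLISECONDS)) {
                throw new IllegalStateException("Recovery still in progress");
            }
        } catch (InterruptedException e) {
            // Preserve the interrupt flag and report the request as not served.
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for recovery", e);
        }
        return dagStatus;
    }
}
```

With a gate like this, a request that arrives before finishRecovery is called 
fails with an explicit "recovery in progress" error rather than a wrong 
status, which is similar in effect to the "No running dag at present" case in 
the first scenario.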


> ConcurrentModificationException while processing recovery
> ---------------------------------------------------------
>
>                 Key: TEZ-2303
>                 URL: https://issues.apache.org/jira/browse/TEZ-2303
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.6.0
>            Reporter: Jason Lowe
>            Assignee: Jeff Zhang
>         Attachments: TEZ-2303-1.patch, TEZ-2303-2.patch, TEZ-2303-4.patch
>
>
> Saw a Tez AM log a few ConcurrentModificationException messages while trying 
> to recover from a previous attempt that crashed.  Exception details to follow.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
