[ 
https://issues.apache.org/jira/browse/TEZ-2303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14516200#comment-14516200
 ] 

Jeff Zhang edited comment on TEZ-2303 at 4/28/15 2:22 AM:
----------------------------------------------------------

[~hitesh] Yes, I think it makes sense as a short-term fix: at least it fixes the 
ConcurrentModificationException, so the recovery process can keep going. 

Regarding the issue of not providing info to clients until the recovery phase 
is over, I think there are 2 main scenarios:

* The ClientHandler RPC is started but the recovery log has not been read yet. 
In this case, the AM throws a "No running dag at present" exception, which has 
no effect on the client side, so I think it is OK.
{code}
2015-04-28 09:32:02,054 INFO [IPC Server handler 0 on 6000] ipc.Server: IPC Server handler 0 on 6000, call org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolBlockingPB.getDAGStatus from 127.0.0.1:63539 Call#9557 Retry#0
org.apache.tez.dag.api.TezException: No running dag at present
    at org.apache.tez.dag.api.client.DAGClientHandler.getDAG(DAGClientHandler.java:89)
    at org.apache.tez.dag.api.client.DAGClientHandler.getACLManager(DAGClientHandler.java:156)
    at org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolBlockingPBServerImpl.getDAGStatus(DAGClientAMProtocolBlockingPBServerImpl.java:95)
    at org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolRPC$DAGClientAMProtocol$2.callBlockingMethod(DAGClientAMProtocolRPC.java:7465)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
{code}

* The second scenario is that even after the recovery log is read, the 
RecoveryTransition may not have completed, so the client may still get the 
wrong DAG status. As I mentioned, fixing this may need some big changes to the 
recovery code. We can leave it for the future and take it into account when 
refactoring the recovery code. 
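
To make the second scenario concrete, here is a minimal sketch of one way to 
hold back status answers until recovery has fully completed. It is only an 
illustration: RecoveryGate, RecoveryInProgressException, markRecovered and 
getDagStatus are hypothetical names, not existing Tez classes or methods, and 
the real change would have to fit into the recovery refactoring.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical gate in front of client status calls; not part of Tez. */
class RecoveryGate {

    /** Hypothetical signal telling the client to retry later. */
    static class RecoveryInProgressException extends RuntimeException {
        RecoveryInProgressException(String msg) {
            super(msg);
        }
    }

    // False until the recovery thread reports that every transition is done.
    private final AtomicBoolean recovered = new AtomicBoolean(false);
    private volatile String dagStatus;

    /** Called once by the recovery thread after the last transition completes. */
    void markRecovered(String status) {
        dagStatus = status;   // publish the recovered status first
        recovered.set(true);  // then open the gate for client calls
    }

    /** Called from the RPC handler thread on behalf of a client. */
    String getDagStatus() {
        if (!recovered.get()) {
            // Refuse to answer rather than return a half-recovered status.
            throw new RecoveryInProgressException("AM is still recovering, retry later");
        }
        return dagStatus;
    }
}
```

With something like this the client would get an explicit "still recovering" 
error that it can retry on, instead of a status taken from a partially 
recovered DAG.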




> ConcurrentModificationException while processing recovery
> ---------------------------------------------------------
>
>                 Key: TEZ-2303
>                 URL: https://issues.apache.org/jira/browse/TEZ-2303
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.6.0
>            Reporter: Jason Lowe
>            Assignee: Jeff Zhang
>         Attachments: TEZ-2303-1.patch, TEZ-2303-2.patch, TEZ-2303-4.patch
>
>
> Saw a Tez AM log a few ConcurrentModificationException messages while trying 
> to recover from a previous attempt that crashed.  Exception details to follow.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
