[
https://issues.apache.org/jira/browse/TEZ-2307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15090008#comment-15090008
]
Siddharth Seth commented on TEZ-2307:
-------------------------------------
[~zjffdu] - not sure whether this patch is ready for review yet. I was
looking at the code, and I think there's another problem with the way the
state transitions happen.
If a DAG is accepted before the AM has finished all of its state transitions,
there's a possibility that the DAG_FINISHED event and the subsequent
DAG_CLEANUP have not been processed yet. If DAG_CLEANUP is processed after a
new DAG is submitted, we may see additional errors with that DAG, since
cleanup notifies components about the previous DAG finishing and also empties
the ID caches. This could result in all kinds of strange errors with the newly
submitted DAG. There's a small chance that synchronization is taking care of
this, but I have my doubts: 'submitDAG' holds the lock on the AppMaster, so
just allowing a new DAG to be submitted may guarantee out-of-order execution
of the previous DAG's cleanup, instead of throwing the exception that it
throws today.
I think we need to make the new DAG submission wait till the previous DAG has
been cleaned up.
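A minimal sketch of what "wait till the previous DAG has been cleaned up" could look like, purely for illustration. The class DagSubmissionGate and its methods are hypothetical names and are not part of the Tez code base or the attached patch; the sketch assumes the dispatcher can call back once DAG_CLEANUP has been processed.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical gate that serializes DAG submission behind the previous DAG's cleanup. */
final class DagSubmissionGate {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition cleanedUp = lock.newCondition();

    // True from the moment a DAG is accepted until its cleanup has been processed.
    private boolean previousDagPending = false;

    /**
     * Called on the submitting (RPC handler) thread before a new DAG is accepted.
     * Returns true once the previous DAG has been cleaned up, or false on timeout.
     */
    boolean acquireForSubmit(long timeout, TimeUnit unit) throws InterruptedException {
        long nanosLeft = unit.toNanos(timeout);
        lock.lock();
        try {
            while (previousDagPending) {
                if (nanosLeft <= 0L) {
                    return false; // cleanup did not finish in time; caller can report an error
                }
                // awaitNanos releases the gate's lock while waiting and returns the time remaining.
                nanosLeft = cleanedUp.awaitNanos(nanosLeft);
            }
            previousDagPending = true; // the new DAG now owns the AM
            return true;
        } finally {
            lock.unlock();
        }
    }

    /** Called by whatever handles DAG_CLEANUP, after caches are emptied and components notified. */
    void onCleanupProcessed() {
        lock.lock();
        try {
            previousDagPending = false;
            cleanedUp.signalAll();
        } finally {
            lock.unlock();
        }
    }
}
```

One caveat with this approach: if the cleanup handling needs the same AppMaster lock that 'submitDAG' holds, the wait would probably have to happen before taking that lock; otherwise the submitter could block the very cleanup it is waiting for.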
> Possible wrong error message when submitting new dag
> ----------------------------------------------------
>
> Key: TEZ-2307
> URL: https://issues.apache.org/jira/browse/TEZ-2307
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Jeff Zhang
> Assignee: Jeff Zhang
> Attachments: TEZ-2307-1.patch
>
>
> In the following 2 cases, the AM would propagate the wrong error message to
> the client ("App master already running a DAG"):
> * The last DAG has completed but the AM is still in the RUNNING state.
> * The AM is shutting down.
> {code}
> 2015-04-10 06:01:50,369 INFO [IPC Server handler 0 on 46821] ipc.Server (Server.java:run(2070)) - IPC Server handler 0 on 46821, call org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolBlockingPB.submitDAG from 10.0.0.223:48581 Call#411 Retry#0
> org.apache.tez.dag.api.TezException: App master already running a DAG
> at org.apache.tez.dag.app.DAGAppMaster.submitDAGToAppMaster(DAGAppMaster.java:1131)
> at org.apache.tez.dag.api.client.DAGClientHandler.submitDAG(DAGClientHandler.java:118)
> at org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolBlockingPBServerImpl.submitDAG(DAGClientAMProtocolBlockingPBServerImpl.java:163)
> at org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolRPC$DAGClientAMProtocol$2.callBlockingMethod(DAGClientAMProtocolRPC.java:7471)
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
> {code}