[
https://issues.apache.org/jira/browse/TEZ-2307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15123898#comment-15123898
]
Siddharth Seth commented on TEZ-2307:
-------------------------------------
bq. I thought about that. but it would make user confused that the last dag is
completed but he still can not submit another dag due to AM is still in RUNNING.
I thought this is what this jira is fixing? Run the new DAG after the previous
one is complete, taking into account errors from the new dag and cleanup of the
old dag.
bq. For now it seems dag clean up won't take too much, have you thought to put
it in DAGImpl.finish ?
Cleanup sends messages to user plugins. Calling it within finish would mean a
dag status lookup from the plugins would get the state as RUNNING, instead of
the actual final state. DAG_CLEANUP was added as a new state in the
DAGAppMaster state machine to allow any events which are pending in the
queue after "DAGAppMasterEventDAGFinished" to get processed. If you think
there are no other events there, the DAG_CLEANUP state can be collapsed into
DAG_FINISHED, in which case DAGAppMasterState.IDLE will be reached after
cleanup. Otherwise, I think it's better to move the transition to the IDLE
state into DAG_CLEANUP handling. In either case, notify after the state is
IDLE, so that the new submission can proceed after the old dag is cleaned up.
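To make the ordering concrete, here is a rough sketch of what I mean. It is not the actual DAGAppMaster code: the class AmStateSketch, its methods, the reduced set of states and the error messages are all made up for illustration. The only point is that the waiting submitter is notified in the IDLE transition, not when the DAG finishes.

```java
/** Hypothetical sketch only; this is NOT the real Tez DAGAppMaster. */
final class AmStateSketch {
    // Reduced, illustrative set of states (assumption, not the full Tez enum).
    enum State { IDLE, RUNNING, DAG_CLEANUP, SHUTTING_DOWN }

    private State state = State.IDLE;

    synchronized State getState() {
        return state;
    }

    /**
     * Waits up to timeoutMs for the previous DAG to finish and be cleaned up,
     * then moves to RUNNING. Throws IllegalStateException with a message that
     * matches the real reason the submission was rejected.
     */
    synchronized void submitDag(long timeoutMs) {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (state == State.RUNNING || state == State.DAG_CLEANUP) {
            long remaining = deadline - System.currentTimeMillis();
            if (remaining <= 0) {
                throw new IllegalStateException(
                    "Timed out waiting for previous DAG, AM state=" + state);
            }
            try {
                wait(remaining); // woken by onCleanupDone() or shutdown()
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while waiting to submit");
            }
        }
        if (state == State.SHUTTING_DOWN) {
            throw new IllegalStateException("App master is shutting down");
        }
        state = State.RUNNING;
    }

    /** DAG finished: enter cleanup, but do NOT wake submitters yet. */
    synchronized void onDagFinished() {
        state = State.DAG_CLEANUP;
    }

    /** Cleanup done: only now become IDLE and notify waiting submitters. */
    synchronized void onCleanupDone() {
        state = State.IDLE;
        notifyAll();
    }

    synchronized void shutdown() {
        state = State.SHUTTING_DOWN;
        notifyAll();
    }
}
```

With this ordering, a submitDag call that arrives while the old dag is in DAG_CLEANUP either proceeds once onCleanupDone has run, or fails with a message that names the real cause, instead of "App master already running a DAG".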
> Possible wrong error message when submitting new dag
> ----------------------------------------------------
>
> Key: TEZ-2307
> URL: https://issues.apache.org/jira/browse/TEZ-2307
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Jeff Zhang
> Assignee: Jeff Zhang
> Attachments: TEZ-2307-1.patch, TEZ-2307-2.patch, TEZ-2307-3.patch,
> TEZ-2307-4.patch
>
>
> In the following 2 cases, the AM would propagate the wrong error message to the client
> ("App master already running a DAG"):
> * The last dag is completed but the AM is still in the RUNNING state
> * The AM is shutting down.
> {code}
> 2015-04-10 06:01:50,369 INFO [IPC Server handler 0 on 46821] ipc.Server (Server.java:run(2070)) - IPC Server handler 0 on 46821, call org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolBlockingPB.submitDAG from 10.0.0.223:48581 Call#411 Retry#0
> org.apache.tez.dag.api.TezException: App master already running a DAG
>     at org.apache.tez.dag.app.DAGAppMaster.submitDAGToAppMaster(DAGAppMaster.java:1131)
>     at org.apache.tez.dag.api.client.DAGClientHandler.submitDAG(DAGClientHandler.java:118)
>     at org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolBlockingPBServerImpl.submitDAG(DAGClientAMProtocolBlockingPBServerImpl.java:163)
>     at org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolRPC$DAGClientAMProtocol$2.callBlockingMethod(DAGClientAMProtocolRPC.java:7471)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
>     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
>     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:415)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
> {code}