[ 
https://issues.apache.org/jira/browse/TEZ-2307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15122063#comment-15122063
 ] 

Siddharth Seth commented on TEZ-2307:
-------------------------------------

bq. I think making the submit RPC call wait might not be a good option, because 
it is confusing that the user cannot submit a new DAG even after the previous 
DAG has completed. So I suggest that the user can still submit a new DAG, but 
the DAG is kept in NEW state until the cleanup of the previous DAG is done.

This is an option, but a couple of things need to be considered. 
The user will consider submitDag successful. What happens if there's an 
error during the cleanup of the previous DAG? That would have to be sent back 
as part of DAG status monitoring. This can get fairly confusing for users: the 
DAG is accepted, but they are then notified of a failure caused by a cleanup 
error from the previous DAG.

On the patch itself:
Instead of using a field (dagCleanupDone), I think it would be better to move 
the DAGAppMaster into IDLE state only after the cleanup is done. My bad here; I 
should have fixed this in the patch which added the cleanup state. submitDag 
can then wait on the AM entering IDLE state instead of waiting on dagCleanup. A 
notification can be sent out once the DAG enters cleanup state. This also gets 
rid of the call from DAGImpl to set the dagCleanupedFlag to false.
- In the current patch, calling setDagCleanupDone races with the handling of 
the DAGCleanupEvent if concurrent dispatchers are used. It would be better to 
avoid this now, for when concurrent dispatchers become the default.
- A boolean field (maybe volatile) is sufficient instead of an AtomicBoolean 
since we're synchronizing on it.
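
A minimal sketch of the IDLE-state approach suggested above, under stated 
assumptions: AmStateGate, its State enum and all method names are hypothetical 
and are not taken from the Tez code base or from the attached patches. It only 
illustrates a submit path that blocks until the state becomes IDLE, with the 
state kept in a plain field because every access happens under the same lock.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for the AM state handling; not the real DAGAppMaster. */
final class AmStateGate {
    enum State { IDLE, RUNNING, CLEANUP }

    private final Object lock = new Object();
    // Plain field, no AtomicBoolean: all reads and writes hold 'lock'.
    private State state = State.IDLE;

    void dagStarted() {
        synchronized (lock) { state = State.RUNNING; }
    }

    // The DAG has finished, but the AM is not IDLE until cleanup completes.
    void dagFinished() {
        synchronized (lock) { state = State.CLEANUP; }
    }

    // Called by the cleanup handler; the only transition back to IDLE.
    void cleanupDone() {
        synchronized (lock) {
            state = State.IDLE;
            lock.notifyAll(); // wake any submit call waiting in awaitIdle
        }
    }

    /** Blocks until IDLE or the timeout elapses; returns true if IDLE was reached. */
    boolean awaitIdle(long timeoutMillis) throws InterruptedException {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        synchronized (lock) {
            while (state != State.IDLE) {
                long remainingNanos = deadline - System.nanoTime();
                if (remainingNanos <= 0) {
                    return false;
                }
                // wait(0) would block forever, so wait at least 1 ms.
                lock.wait(Math.max(1L, TimeUnit.NANOSECONDS.toMillis(remainingNanos)));
            }
            return true;
        }
    }

    State current() {
        synchronized (lock) { return state; }
    }

    public static void main(String[] args) throws InterruptedException {
        AmStateGate gate = new AmStateGate();
        gate.dagStarted();
        gate.dagFinished();
        Thread cleaner = new Thread(() -> {
            try {
                Thread.sleep(50); // simulate cleanup work for the previous DAG
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            gate.cleanupDone();
        });
        cleaner.start();
        System.out.println("idle reached: " + gate.awaitIdle(5000));
        cleaner.join();
    }
}
```

In this sketch a submit handler would call awaitIdle before accepting a new 
DAG, so there is a single state transition to wait on rather than a separate 
cleanup flag that another component has to reset.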

> Possible wrong error message when submitting new dag
> ----------------------------------------------------
>
>                 Key: TEZ-2307
>                 URL: https://issues.apache.org/jira/browse/TEZ-2307
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Jeff Zhang
>            Assignee: Jeff Zhang
>         Attachments: TEZ-2307-1.patch, TEZ-2307-2.patch, TEZ-2307-3.patch, 
> TEZ-2307-4.patch
>
>
> In the following 2 cases, the AM propagates the wrong error message to the 
> client ("App master already running a DAG"):
> * The last DAG has completed but the AM is still in RUNNING state
> * The AM is shutting down
> {code}
> 2015-04-10 06:01:50,369 INFO  [IPC Server handler 0 on 46821] ipc.Server (Server.java:run(2070)) - IPC Server handler 0 on 46821, call org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolBlockingPB.submitDAG from 10.0.0.223:48581 Call#411 Retry#0
> org.apache.tez.dag.api.TezException: App master already running a DAG
>       at org.apache.tez.dag.app.DAGAppMaster.submitDAGToAppMaster(DAGAppMaster.java:1131)
>       at org.apache.tez.dag.api.client.DAGClientHandler.submitDAG(DAGClientHandler.java:118)
>       at org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolBlockingPBServerImpl.submitDAG(DAGClientAMProtocolBlockingPBServerImpl.java:163)
>       at org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolRPC$DAGClientAMProtocol$2.callBlockingMethod(DAGClientAMProtocolRPC.java:7471)
>       at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
>       at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
>       at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
>       at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at javax.security.auth.Subject.doAs(Subject.java:415)
>       at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>       at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
