[jira] [Commented] (TEZ-2300) TezClient.stop() takes a lot of time or does not work sometimes

Bikas Saha (JIRA) Tue, 25 Aug 2015 15:07:06 -0700

    [ https://issues.apache.org/jira/browse/TEZ-2300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14712064#comment-14712064 ]


Bikas Saha commented on TEZ-2300:
---------------------------------

I apologize; it looks like I misread the code. If we call kill DAG and also 
stop the AM, there will likely be many conflicts, because the kill events are 
processed while services are being stopped at the same time. So the v4 patch 
is good from that point of view. 
We should still keep this code refactoring though.
{code}
         LOG.info("Sending a kill event to the current DAG"
             + ", dagId=" + currentDAG.getID());
-        try {
-          logDAGKillRequestEvent(currentDAG.getID(), true);
-        } catch (IOException e) {
-          throw new TezException(e);
-        }
-        sendEvent(new DAGEvent(currentDAG.getID(), DAGEventType.DAG_KILL));
+        tryKillDAG(currentDAG);
{code}

This would still be an incompatible change though and I am not sure how 
Hive/others would be affected by this. We could add a config that determines 
this behavior with a default matching the current behavior.
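
A minimal sketch of what such a compatibility switch could look like. The key name and the class are hypothetical (this is not an existing Tez property), and plain java.util.Properties stands in for the real configuration object. The only point is that an unset key falls back to the current behavior.

```java
import java.util.Properties;

public class ShutdownCompatConfigSketch {
  // Hypothetical key for illustration only; not an existing Tez property.
  static final String USE_NEW_SHUTDOWN = "example.am.shutdown.use-new-behavior";

  // Default matches the current behavior, so existing users see no change.
  static final boolean USE_NEW_SHUTDOWN_DEFAULT = false;

  // Returns true only when the user has explicitly opted in to the new behavior.
  static boolean useNewShutdown(Properties conf) {
    String raw = conf.getProperty(USE_NEW_SHUTDOWN);
    if (raw == null) {
      return USE_NEW_SHUTDOWN_DEFAULT;
    }
    return Boolean.parseBoolean(raw.trim());
  }
}
```

Hive or other clients that never set the key would keep the old code path; only callers that opt in would get the changed shutdown.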

However, the shutdown would happen as follows. The DAG will first be killed; 
if there are many events in the queue (unlikely after TEZ-776 for large jobs, 
but possible for other reasons), it will take some time to drain them. 
Next, once the DAG is killed, shutdown will be initiated. This will first sleep 
for some time (5s) to allow clients to get the final status. Then it will stop 
the AM, which releases resources and drains ATS.
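
The ordering described above can be sketched as a small simulation. This is not the Tez AM code: the class, the event strings and the step labels are made up for illustration, and the grace period is a parameter so the 5s figure is not hard-coded.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

public class ShutdownSequenceSketch {
  // Simulates the order of steps only; nothing here talks to a real AM.
  static List<String> shutdown(Queue<String> pendingEvents, long graceMillis) {
    List<String> steps = new ArrayList<>();

    // 1. The kill request queues behind whatever events are already pending.
    pendingEvents.add("DAG_KILL");

    // 2. Drain the queue; the DAG is only killed once its event is reached.
    String event;
    while ((event = pendingEvents.poll()) != null) {
      steps.add("processed " + event);
    }
    steps.add("dag killed");

    // 3. Grace period so clients can still fetch the final status.
    try {
      Thread.sleep(graceMillis);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    steps.add("grace period elapsed");

    // 4. Only now stop the AM (release resources, drain history events).
    steps.add("am stopped");
    return steps;
  }
}
```

The time a client waits is therefore the drain time plus the grace period plus the AM stop time, not the grace period alone.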
However, currently the initiateStop() logic in the scheduler does not handle AM 
side interactions. So it will continue to service AM allocation/deallocation 
requests and potentially pass them on to the RM. That needs to get handled as a 
follow up of TEZ-2687. /cc [~zjffdu]. Only after that can we safely add this 
code from the v5 patch (with the ordering reversed)
{code}
   public void shutdownTezAM() throws TezException {
     sessionStopped.set(true);
     synchronized (this) {
+      this.taskSchedulerEventHandler.initiateStop();
       // Ideally call this before initiateStop().
       this.taskSchedulerEventHandler.setShouldUnregisterFlag();
{code}
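
To illustrate the gap in initiateStop() mentioned above, here is a rough sketch of a scheduler that stops forwarding AM-side requests once a stop has been initiated. The class and method names are hypothetical and are not the Tez scheduler API; how TEZ-2687 and its follow-up actually handle this may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

public class StoppableSchedulerSketch {
  private final AtomicBoolean stopInitiated = new AtomicBoolean(false);
  // Stand-in for requests that would be passed on to the RM.
  private final List<String> forwardedToRm = new ArrayList<>();

  // After this call, new allocation requests are no longer serviced.
  public void initiateStop() {
    stopInitiated.set(true);
  }

  // Returns false (and forwards nothing) once stop has been initiated.
  public synchronized boolean allocate(String taskId) {
    if (stopInitiated.get()) {
      return false;
    }
    forwardedToRm.add(taskId);
    return true;
  }

  // Defensive copy of what was forwarded before the stop.
  public synchronized List<String> forwarded() {
    return new ArrayList<>(forwardedToRm);
  }
}
```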

Given that this jira has been hanging around for a while, to make progress we 
could commit the v5 patch (minus the initiateStop and stop calls), add the 
config for compatibility, and call it a day. Also, given that the AM will sleep 
5s before actually shutting down, the hard stop in the client of 10s may be too 
short?
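
As a back-of-the-envelope check on the two figures above (10s client hard stop, 5s AM sleep), the helper below is purely illustrative and assumes the AM sleep falls entirely inside the client's wait:

```java
public class StopBudgetSketch {
  // Time left for everything other than the AM's sleep (killing the DAG,
  // draining events, stopping services) before the client gives up.
  static long remainingBudgetMillis(long clientHardStopMillis, long amSleepMillis) {
    return Math.max(0L, clientHardStopMillis - amSleepMillis);
  }
}
```

With the numbers from this comment, remainingBudgetMillis(10_000, 5_000) leaves 5000 ms for the DAG kill, event draining and AM stop combined.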

> TezClient.stop() takes a lot of time or does not work sometimes
> ---------------------------------------------------------------
>
>                 Key: TEZ-2300
>                 URL: https://issues.apache.org/jira/browse/TEZ-2300
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Rohini Palaniswamy
>            Assignee: Jonathan Eagles
>         Attachments: TEZ-2300.1.patch, TEZ-2300.2.patch, TEZ-2300.3.patch, 
> TEZ-2300.4.patch, TEZ-2300.5.patch, syslog_dag_1428329756093_325099_1_post 
>
>
>   Noticed this with a couple of pig scripts which were not behaving well (AM 
> close to OOM, etc) and even with some that were running fine. Pig calls 
> TezClient.stop() in shutdown hook. Ctrl+C to the pig script either exits 
> immediately or is hung. In both cases it takes a long time for the 
> yarn application to go to KILLED state. Many times I just end up calling yarn 
> application -kill separately after waiting for 5 mins or more for it to get 
> killed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
