[ 
https://issues.apache.org/jira/browse/TEZ-2300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14498230#comment-14498230
 ] 

Rohini Palaniswamy commented on TEZ-2300:
-----------------------------------------

There are a couple of issues with the behavior, after talking to [~jlowe] and 
comparing with what is done in MR:
   - Kill is put in the event queue and is processed like any other event. 
When there are millions of events in the queue it takes a long time to get to 
it, and I see the AM even scheduling new tasks in the meantime. MR also does 
it this way. The problem is too many events, and TEZ-776 should reduce that. 
But with large jobs there are still going to be many events in the queue.
   - TezClient.stop() returns immediately after sending the kill. It should 
instead poll and wait on the client side. MR does that.
   - If the DAG is not killed and the session is not shut down even after a 
certain timeout, yarn kill should be called. MR does that.
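
A rough sketch of the client-side behavior described in the last two points 
(request the stop, poll until the application is terminal, fall back to a 
forced kill after a timeout). This is not Tez or MR code: the class 
StopAndWait, the method stopAndWait, and its callback parameters are all 
hypothetical stand-ins for whatever the client would use to send the kill, 
check the application state, and issue the yarn kill.

```java
import java.util.function.BooleanSupplier;

public class StopAndWait {

    /**
     * Hypothetical helper, not part of the Tez API.
     *
     * @param requestStop  sends the graceful kill/shutdown request
     * @param isTerminated returns true once the application is in a terminal state
     * @param forceKill    fallback, e.g. the equivalent of "yarn application -kill"
     * @return true if the application stopped on its own within the timeout,
     *         false if the forced kill had to be issued
     */
    public static boolean stopAndWait(Runnable requestStop,
                                      BooleanSupplier isTerminated,
                                      Runnable forceKill,
                                      long timeoutMillis,
                                      long pollIntervalMillis)
            throws InterruptedException {
        requestStop.run();

        // Poll until the deadline instead of returning right after the request.
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (System.nanoTime() - deadline < 0) {
            if (isTerminated.getAsBoolean()) {
                return true;
            }
            Thread.sleep(pollIntervalMillis);
        }

        // One last check before escalating to the forced kill.
        if (isTerminated.getAsBoolean()) {
            return true;
        }
        forceKill.run();
        return false;
    }
}
```

With something along these lines, a shutdown hook would block until the 
application is actually gone (or the forced kill has been issued), so the 
caller does not start a new run while the old one is still alive.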

This is an important issue, as people might kill a script, think the 
application is killed, and proceed with running a new one, which could cause 
a lot of issues while the old one is still running. So the kill needs to be 
synchronous and reliable.

> TezClient.stop() takes a lot of time or does not work sometimes
> ---------------------------------------------------------------
>
>                 Key: TEZ-2300
>                 URL: https://issues.apache.org/jira/browse/TEZ-2300
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Rohini Palaniswamy
>         Attachments: syslog_dag_1428329756093_325099_1_post 
>
>
>   Noticed this with a couple of pig scripts which were not behaving well (AM 
> close to OOM, etc.) and even with some that were running fine. Pig calls 
> TezClient.stop() in a shutdown hook. Ctrl+C to the pig script either exits 
> immediately or hangs. In both cases it takes a long time for the yarn 
> application to go to KILLED state. Many times I just end up calling yarn 
> application -kill separately after waiting for 5 mins or more for it to get 
> killed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
