[jira] [Commented] (TEZ-2300) TezClient.stop() takes a lot of time or does not work sometimes

Hitesh Shah (JIRA) Thu, 13 Aug 2015 13:35:30 -0700

    [ 
https://issues.apache.org/jira/browse/TEZ-2300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14695892#comment-14695892
 ]


Hitesh Shah commented on TEZ-2300:
----------------------------------

This change introduces a number of behavioural changes, especially in the 
scenario where a hard kill gets triggered. 

A couple of questions: 
   - Is the main concern that the AM needs to be killed, or that the DAG needs 
to be killed? Can we clarify the requirement? The comments seem to suggest that 
killing the DAG completely is the bigger concern. 
   - ATS flushes are slow. Agreed. We need to address that (either by using 
HDFS as a logger or with a faster ATSv2), but killing the AM will just mean 
incomplete data in history.
   - Is it a concern that the AM holds onto its containers for a long time 
while the shutdown is in progress? Maybe we can introduce a short-circuit to 
release all containers faster when a shutdown is invoked?
   - As for an async wait versus a sync wait: given that this is a behavioural 
change, we could add a new explicit wait for a sync stop. However, to avoid 
code changes, and even though I don't like this approach, we could consider a 
config knob (the horror of one more config knob to add to the 100s we already 
have) that determines which stop implementation is applied by default when the 
current stop API is invoked.
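
To make the last point concrete, here is a minimal Java sketch of a single 
stop entry point that switches between an async and a sync stop depending on 
a configured default. Everything in it is hypothetical: the Session interface, 
the StopMode enum, forceKill() and the timeout are illustrative stand-ins, not 
Tez APIs or existing config keys, and the hard kill on timeout is only one 
possible policy.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class StopModeSketch {

    /** Hypothetical value of the config knob that picks the default stop behaviour. */
    enum StopMode { ASYNC, SYNC }

    /** Hypothetical stand-in for a session client; not the real TezClient. */
    interface Session {
        /** Asks the AM to shut down; the future completes once it has exited. */
        CompletableFuture<Void> requestShutdown();

        /** Hard kill, used only when a graceful shutdown does not finish in time. */
        void forceKill();
    }

    /**
     * Single stop entry point whose behaviour is driven by the configured mode.
     * Returns "REQUESTED" (async, no wait), "STOPPED" (graceful exit observed)
     * or "KILLED" (fell back to a hard kill).
     */
    static String stop(Session session, StopMode mode, long timeoutMs) {
        CompletableFuture<Void> done = session.requestShutdown();
        if (mode == StopMode.ASYNC) {
            // Async stop: send the request and return without waiting.
            return "REQUESTED";
        }
        try {
            // Sync stop: wait for the shutdown, but only up to the timeout.
            done.get(timeoutMs, TimeUnit.MILLISECONDS);
            return "STOPPED";
        } catch (TimeoutException | ExecutionException e) {
            session.forceKill();
            return "KILLED";
        } catch (InterruptedException e) {
            // Preserve the interrupt status for the caller, then fall back.
            Thread.currentThread().interrupt();
            session.forceKill();
            return "KILLED";
        }
    }
}
```

With a knob like this, existing callers keep calling the same stop method and 
only the configured default decides whether they block. The trade-off raised 
above still applies: a hard kill can leave history data incomplete.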




 

> TezClient.stop() takes a lot of time or does not work sometimes
> ---------------------------------------------------------------
>
>                 Key: TEZ-2300
>                 URL: https://issues.apache.org/jira/browse/TEZ-2300
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Rohini Palaniswamy
>            Assignee: Jonathan Eagles
>         Attachments: TEZ-2300.1.patch, TEZ-2300.2.patch, TEZ-2300.3.patch, 
> TEZ-2300.4.patch, syslog_dag_1428329756093_325099_1_post 
>
>
>   Noticed this with a couple of pig scripts which were not behaving well (AM 
> close to OOM, etc) and even with some that were running fine. Pig calls 
> TezClient.stop() in a shutdown hook. Ctrl+C to the pig script either exits 
> immediately or hangs. In both cases it takes a long time for the yarn 
> application to go to the KILLED state. Many times I just end up calling yarn 
> application -kill separately after waiting for 5 mins or more for it to get 
> killed.



