[ 
https://issues.apache.org/jira/browse/TEZ-4063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yingda Chen reassigned TEZ-4063:
--------------------------------

    Assignee: Ying Han

> DAGClient:tryKillDAG taking long time
> -------------------------------------
>
>                 Key: TEZ-4063
>                 URL: https://issues.apache.org/jira/browse/TEZ-4063
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Ganesha Shreedhara
>            Assignee: Ying Han
>            Priority: Major
>
> Hive uses DAGClient:tryKillDAG() to kill the Tez application. The kill is 
> slow when many tasks are being processed, because the kill event is appended 
> to the eventQueue and has to wait behind every event that was queued before 
> it.
> I have a job with ~3L (~300,000) mappers, ~5K reducers and ~1000 tasks 
> running in parallel.
> When the Hive query is killed in the middle of this job, it takes ~6 mins 
> before the tasks start getting killed: ~3 mins for the kill event to get 
> from the AM to the DAG, and another ~3 mins for it to get from the DAG to 
> the vertex.
>  
> *Below is the log for the same:* 
> {code:java}
> 2019-04-10 15:11:35,776 [INFO] [IPC Server handler 0 on 44129] 
> |app.DAGAppMaster|: Sending a kill event to the current DAG, 
> dagId=dag_1554789825317_0535_1
>  2019-04-10 15:11:35,785 [INFO] [IPC Server handler 0 on 44129] 
> |history.HistoryEventHandler|: 
> [HISTORY][DAG:dag_1554789825317_0535_1][Event:DAG_KILL_REQUEST]: 
> org.apache.tez.dag.history.events.DAGKillRequestEvent@731f79f4
>  .
>  .
>  ~ 3 mins of delay
>  .
>  .
>  2019-04-10 15:14:34,171 [INFO] [Dispatcher thread \{Central}] 
> |impl.DAGImpl|: Dag received [DAG_TERMINATE, DAG_KILL] in RUNNING state
>  .
>  .
>  ~ 3 mins of delay
>  .
>  .
>  2019-04-10 15:17:52,434 [INFO] [Dispatcher thread \{Central}] 
> |impl.VertexImpl|: Killing tasks in vertex: vertex_1554789825317_0535_1_01 
> [Reducer 2] due to trigger: DAG_TERMINATED
>  2019-04-10 15:17:52,439 [INFO] [Dispatcher thread \{Central}] 
> |impl.VertexImpl|: Killing tasks in vertex: vertex_1554789825317_0535_1_00 
> [Map 1] due to trigger: DAG_TERMINATED{code}
>  
> Pig uses the TezClient:stop() method, which kills the application 
> asynchronously. It also uses the tez.client.timeout-ms configuration, which 
> can be set so that the YARN application is killed once the client timeout 
> exceeds a threshold value. 
>  
> Is it expected behaviour that the kill event is added to the eventQueue and 
> processed synchronously when DAGClient:tryKillDAG() is called? 
> Can we process the kill event immediately (maybe when a configuration is 
> enabled) if the user doesn't want the earlier events to be processed? 
>  
>  
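
For reference, the two delays in the quoted log can be worked out directly from the three timestamps (kill requested at the AM, DAG_KILL received by the DAG, vertices starting to kill tasks). The snippet below is only a small sketch for that arithmetic; the class and method names (KillLatency, between) are made up for illustration and are not part of Tez.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

class KillLatency {
    // Timestamp layout used by the log lines quoted in the issue.
    static final DateTimeFormatter LOG_TIME =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

    // Elapsed time between two log timestamps.
    static Duration between(String from, String to) {
        return Duration.between(LocalDateTime.parse(from, LOG_TIME),
                                LocalDateTime.parse(to, LOG_TIME));
    }

    public static void main(String[] args) {
        String killRequested = "2019-04-10 15:11:35,776"; // DAGAppMaster: sending kill event
        String dagReceived   = "2019-04-10 15:14:34,171"; // DAGImpl: received DAG_KILL
        String vertexKilling = "2019-04-10 15:17:52,434"; // VertexImpl: killing tasks

        System.out.println("AM -> DAG:     " + between(killRequested, dagReceived).toMillis() + " ms");
        System.out.println("DAG -> vertex: " + between(dagReceived, vertexKilling).toMillis() + " ms");
        System.out.println("Total:         " + between(killRequested, vertexKilling).toMillis() + " ms");
    }
}
```

That gives roughly 2 min 58 s for the first hop and 3 min 18 s for the second, about 6 min 17 s in total, which matches the ~3 mins + ~3 mins described in the issue.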


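To illustrate the question raised in the issue, here is a toy model of a single dispatcher draining one queue. It is not the Tez dispatcher code; the class name, the event strings and the killJumpsQueue flag are all hypothetical. It only shows why a kill event appended to the tail waits behind the whole backlog, and what changes if it is put at the head instead.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class KillQueueSketch {
    // Returns how many events the single dispatcher handles before it reaches KILL.
    static int eventsHandledBeforeKill(int backlog, boolean killJumpsQueue) {
        Deque<String> queue = new ArrayDeque<>();
        for (int i = 0; i < backlog; i++) {
            queue.addLast("TASK_EVENT_" + i); // events already waiting
        }
        if (killJumpsQueue) {
            queue.addFirst("KILL"); // hypothetical "process immediately" option
        } else {
            queue.addLast("KILL");  // behaviour described in the issue
        }
        int handled = 0;
        while (!queue.isEmpty()) {
            String event = queue.pollFirst();
            if (event.equals("KILL")) {
                return handled;
            }
            handled++; // stands in for the real cost of handling one event
        }
        throw new IllegalStateException("KILL was never enqueued");
    }

    public static void main(String[] args) {
        System.out.println(eventsHandledBeforeKill(300_000, false)); // prints 300000
        System.out.println(eventsHandledBeforeKill(300_000, true));  // prints 0
    }
}
```

In this model the delay grows linearly with the backlog when the kill is appended, and disappears when it is inserted at the head. Whether skipping ahead of earlier events is safe for the real DAG state machine is exactly the open question in the issue.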

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
