[
https://issues.apache.org/jira/browse/TEZ-4063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yingda Chen reassigned TEZ-4063:
--------------------------------
Assignee: Ying Han
> DAGClient:tryKillDAG taking long time
> -------------------------------------
>
> Key: TEZ-4063
> URL: https://issues.apache.org/jira/browse/TEZ-4063
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Ganesha Shreedhara
> Assignee: Ying Han
> Priority: Major
>
> Hive uses DAGClient:tryKillDAG() to kill a Tez application. The kill takes a
> long time when many tasks are being processed, because the kill event is
> appended to the eventQueue and is handled only after all the events queued
> ahead of it.
> I have a job with ~3 lakh (~300,000) mappers, ~5K reducers and ~1000 tasks
> running in parallel.
> When the Hive query is killed while this job is being processed, it takes
> ~6 mins for the tasks to start getting killed: ~3 mins for the kill event
> from the AM to reach the DAG, and another ~3 mins for the kill event from
> the DAG to reach the vertex.
>
> *Below is the log for the same:*
> {code:java}
> 2019-04-10 15:11:35,776 [INFO] [IPC Server handler 0 on 44129]
> |app.DAGAppMaster|: Sending a kill event to the current DAG,
> dagId=dag_1554789825317_0535_1
> 2019-04-10 15:11:35,785 [INFO] [IPC Server handler 0 on 44129]
> |history.HistoryEventHandler|:
> [HISTORY][DAG:dag_1554789825317_0535_1][Event:DAG_KILL_REQUEST]:
> org.apache.tez.dag.history.events.DAGKillRequestEvent@731f79f4
> .
> .
> ~ 3 mins of delay
> .
> .
> 2019-04-10 15:14:34,171 [INFO] [Dispatcher thread \{Central}]
> |impl.DAGImpl|: Dag received [DAG_TERMINATE, DAG_KILL] in RUNNING state
> .
> .
> ~ 3 mins of delay
> .
> .
> 2019-04-10 15:17:52,434 [INFO] [Dispatcher thread \{Central}]
> |impl.VertexImpl|: Killing tasks in vertex: vertex_1554789825317_0535_1_01
> [Reducer 2] due to trigger: DAG_TERMINATED
> 2019-04-10 15:17:52,439 [INFO] [Dispatcher thread \{Central}]
> |impl.VertexImpl|: Killing tasks in vertex: vertex_1554789825317_0535_1_00
> [Map 1] due to trigger: DAG_TERMINATED{code}
>
> Pig uses the TezClient:stop() method, which kills the application
> asynchronously. It also uses the tez.client.timeout-ms configuration, which
> can be configured to kill the YARN application if the client timeout exceeds
> a threshold value.
>
> Is it expected behaviour that the kill event is added to the eventQueue and
> processed synchronously when DAGClient:tryKillDAG() is called?
> Can we process the kill event immediately (maybe when a configuration is
> enabled) if the user doesn't want the earlier events to be processed?
>
>
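The queueing effect described in the report can be illustrated with a small, self-contained Java sketch. It does not use Tez's dispatcher or any Tez API: the class name, the method, and the event strings are all hypothetical, and a plain in-memory deque stands in for the eventQueue. It only counts how many backlog events are handled before the kill event is reached, once when the kill is appended at the tail and once when it is placed at the head.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical illustration only; none of these names are Tez classes. */
public class KillEventQueueSketch {

    /**
     * Simulates a single-threaded dispatcher draining a queue and returns
     * how many backlog events are handled before the kill event is reached.
     */
    public static int eventsHandledBeforeKill(int backlog, boolean killJumpsQueue) {
        Deque<String> queue = new ArrayDeque<>();
        for (int i = 0; i < backlog; i++) {
            queue.addLast("TASK_EVENT");   // events already waiting
        }
        if (killJumpsQueue) {
            queue.addFirst("DAG_KILL");    // assumed "process immediately" mode
        } else {
            queue.addLast("DAG_KILL");     // plain FIFO, as in the report
        }

        int handled = 0;
        while (!queue.isEmpty()) {
            String event = queue.pollFirst();
            if (event.equals("DAG_KILL")) {
                return handled;
            }
            handled++;                     // stand-in for real event handling
        }
        return handled;
    }

    public static void main(String[] args) {
        int backlog = 100_000;
        System.out.println(eventsHandledBeforeKill(backlog, false)); // 100000
        System.out.println(eventsHandledBeforeKill(backlog, true));  // 0
    }
}
```

In the FIFO case the wait grows with the backlog, which matches the delay reported above. Whether a real dispatcher could safely let a kill event overtake earlier events depends on ordering guarantees that this sketch does not model.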