[
https://issues.apache.org/jira/browse/TEZ-2300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14712064#comment-14712064
]
Bikas Saha commented on TEZ-2300:
---------------------------------
I apologize. Looks like I misread the code. My bad. If we kill the DAG and
stop the AM at the same time, there will likely be many conflicts, because the
kill events would be processed while services are being stopped. So the v4
patch is good from that point of view.
We should still keep this code refactoring though.
{code}
  LOG.info("Sending a kill event to the current DAG"
      + ", dagId=" + currentDAG.getID());
- try {
-   logDAGKillRequestEvent(currentDAG.getID(), true);
- } catch (IOException e) {
-   throw new TezException(e);
- }
- sendEvent(new DAGEvent(currentDAG.getID(), DAGEventType.DAG_KILL));
+ tryKillDAG(currentDAG);
{code}
This would still be an incompatible change though and I am not sure how
Hive/others would be affected by this. We could add a config that determines
this behavior with a default matching the current behavior.
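A minimal sketch of what such a compatibility switch could look like, assuming a plain java.util.Properties in place of the real Tez configuration object. The key name and the helper are hypothetical and are not existing Tez settings; the only point is that the default keeps the current behavior.

```java
import java.util.Properties;

class ShutdownCompatSketch {
    // Hypothetical key, for illustration only; not an actual Tez config name.
    static final String NEW_SHUTDOWN_KEY = "example.am.shutdown.new-behavior";
    // Default is false so that existing users (Hive/others) see no change.
    static final boolean NEW_SHUTDOWN_DEFAULT = false;

    // Returns true only when the user has explicitly opted in.
    static boolean useNewShutdown(Properties conf) {
        String raw = conf.getProperty(NEW_SHUTDOWN_KEY,
            String.valueOf(NEW_SHUTDOWN_DEFAULT));
        return Boolean.parseBoolean(raw);
    }
}
```

With an empty configuration useNewShutdown() returns false, so the new code path is only taken when the key is explicitly set to true.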
However, the shutdown would then proceed as follows. The DAG is killed first;
if there are many events in the queue (unlikely after TEZ-776 for large jobs,
but possible for other reasons), it will take some time to drain them.
Once the DAG is killed, shutdown is initiated. This first sleeps for some time
(5s) to allow clients to get the final status, and then stops the AM, which
releases resources and drains ATS.
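The ordering described above can be sketched roughly as follows. This is a toy model and not the actual DAGAppMaster code: the class, the event strings and the step labels are made up for illustration, and the grace period is passed in as a parameter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

class ShutdownSequenceSketch {
    // Returns the steps in the order they happen, for a given backlog of events.
    static List<String> run(Queue<String> pendingEvents, long clientGraceMillis) {
        List<String> steps = new ArrayList<>();

        // 1. The kill event is queued behind whatever is already pending.
        pendingEvents.add("DAG_KILL");
        steps.add("kill event queued");

        // 2. The backlog has to drain before the kill itself is processed.
        while (!pendingEvents.isEmpty()) {
            steps.add("processed " + pendingEvents.poll());
        }
        steps.add("DAG killed, shutdown initiated");

        // 3. Grace period so clients can fetch the final status (5s in the AM).
        try {
            Thread.sleep(clientGraceMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        steps.add("client grace period elapsed");

        // 4. Only now is the AM stopped (release resources, drain ATS).
        steps.add("AM stopped");
        return steps;
    }
}
```

The total time a client waits is therefore the drain time plus the grace period plus the time taken to stop the AM.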
However, currently the initiateStop() logic in the scheduler does not handle AM
side interactions. So it will continue to service AM allocation/deallocation
requests and potentially pass them on to the RM. That needs to get handled as a
follow up of TEZ-2687. /cc [~zjffdu]. Only after that can we safely add this
code from the v5 patch (with the ordering reversed)
{code}
public void shutdownTezAM() throws TezException {
sessionStopped.set(true);
synchronized (this) {
+ this.taskSchedulerEventHandler.initiateStop();
this.taskSchedulerEventHandler.setShouldUnregisterFlag(); // <<<< ideally call this before initiateStop().
{code}
Given that this jira has been hanging around for a while, to make progress we
could commit v5 patch (minus the initiateStop and stop calls) and add the
config for compatibility and call it a day. Also, given that the AM will sleep
5s before actually shutting down, the client's hard stop of 10s may be too
short?
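To make the timing concern concrete, here is a small sketch of the arithmetic, assuming the two values mentioned above (10s client hard stop, 5s AM sleep). The class and method names are hypothetical.

```java
class StopBudgetSketch {
    // Time left for draining events and stopping the AM once the AM's
    // fixed sleep has been subtracted from the client's hard-stop timeout.
    static long remainingMillis(long clientHardStopMillis, long amSleepMillis) {
        return Math.max(0L, clientHardStopMillis - amSleepMillis);
    }
}
```

With the current numbers, remainingMillis(10_000, 5_000) leaves 5s for draining the event queue and stopping the AM before the client gives up.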
> TezClient.stop() takes a lot of time or does not work sometimes
> ---------------------------------------------------------------
>
> Key: TEZ-2300
> URL: https://issues.apache.org/jira/browse/TEZ-2300
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Rohini Palaniswamy
> Assignee: Jonathan Eagles
> Attachments: TEZ-2300.1.patch, TEZ-2300.2.patch, TEZ-2300.3.patch,
> TEZ-2300.4.patch, TEZ-2300.5.patch, syslog_dag_1428329756093_325099_1_post
>
>
> Noticed this with a couple of pig scripts which were not behaving well (AM
> close to OOM, etc) and even with some that were running fine. Pig calls
> TezClient.stop() in a shutdown hook. Ctrl+C to the pig script either exits
> immediately or hangs. In both cases it takes a long time for the
> yarn application to go to KILLED state. Many times I just end up calling yarn
> application -kill separately after waiting for 5 mins or more for it to get
> killed.