[jira] [Commented] (FLINK-16279) Per job Yarn application leak in normal execution mode.

Kostas Kloudas (Jira) Mon, 02 Mar 2020 02:47:20 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-16279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17049092#comment-17049092
 ]


Kostas Kloudas commented on FLINK-16279:
----------------------------------------

If I understand the use case correctly, a job was submitted in YARN per-job 
mode, attached, using {{executeAsync()}}. The job failed, but the cluster was 
still waiting for a final request for the job result before shutting down, and 
that request never came.

[~tison] For a session cluster, the cluster lifecycle is independent of that 
of the job, so I guess that no action should be taken in this case.

For the per-job cluster, the {{shutdownOnExit}} could maybe work because as 
soon as the client disconnects, it will issue a (best-effort) shutdown cluster 
command. 

Another (maybe cleaner) solution could be to always tear down the cluster when 
the job reaches a terminal state which is NOT normal termination. The benefit 
of this method is that it is the dispatcher that aligns the lifecycle of the 
job with that of the cluster, and not the client, which is controlled by the 
user. This would require changes to {{jobReachedGloballyTerminalState()}} in 
the {{MiniDispatcher}}. Of course, this leaves open the scenario where the 
client fails/disconnects. In that case we will have a "zombie" cluster, and we 
may still need {{shutdownOnExit}}.
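
To make the dispatcher-side idea above more concrete, here is a minimal, 
self-contained sketch. It is NOT the actual Flink code: the class 
{{MiniDispatcherSketch}}, its enums and the method {{jobResultRetrieved()}} are 
hypothetical stand-ins, and it assumes that detached mode already shuts down 
as soon as the job terminates. The only point is the extra condition on a 
non-normal terminal state.

```java
import java.util.concurrent.CompletableFuture;

// Hypothetical stand-in for the dispatcher; names only mirror the discussion.
class MiniDispatcherSketch {
    enum ExecutionMode { NORMAL, DETACHED }
    enum JobStatus { FINISHED, FAILED, CANCELED }
    enum ApplicationStatus { SUCCEEDED, FAILED, CANCELED }

    private final ExecutionMode executionMode;

    // Completing this future stands for "tear down the per-job cluster".
    final CompletableFuture<ApplicationStatus> shutDownFuture = new CompletableFuture<>();

    MiniDispatcherSketch(ExecutionMode executionMode) {
        this.executionMode = executionMode;
    }

    // Called when the job reaches a globally terminal state.
    void jobReachedGloballyTerminalState(JobStatus status) {
        boolean abnormalTermination = status != JobStatus.FINISHED;
        // Assumption: detached mode never waits for a client.
        // Proposed addition: abnormal termination does not wait either.
        if (executionMode == ExecutionMode.DETACHED || abnormalTermination) {
            shutDownFuture.complete(toApplicationStatus(status));
        }
        // NORMAL mode with FINISHED: keep waiting for the client to fetch the result.
    }

    // Called when a client has retrieved the job result (attached mode).
    void jobResultRetrieved(JobStatus status) {
        // complete() is a no-op if the future was already completed above.
        shutDownFuture.complete(toApplicationStatus(status));
    }

    static ApplicationStatus toApplicationStatus(JobStatus status) {
        switch (status) {
            case FINISHED:
                return ApplicationStatus.SUCCEEDED;
            case CANCELED:
                return ApplicationStatus.CANCELED;
            default:
                return ApplicationStatus.FAILED;
        }
    }
}
```

With this shape, a failed job in attached mode completes {{shutDownFuture}} 
without any client request, while a normally finished job still waits for the 
result to be retrieved, which is exactly the remaining "zombie" case if the 
client goes away.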

But these are just thoughts that I have not yet investigated 100%.

What are your thoughts on that? 

> Per job Yarn application leak in normal execution mode.
> -------------------------------------------------------
>
>                 Key: FLINK-16279
>                 URL: https://issues.apache.org/jira/browse/FLINK-16279
>             Project: Flink
>          Issue Type: Bug
>          Components: Client / Job Submission, Runtime / Coordination
>    Affects Versions: 1.10.0
>            Reporter: Wenlong Lyu
>            Priority: Major
>
> I ran a job in YARN per-job mode using {{env.executeAsync}}; the job failed 
> but the YARN cluster was not destroyed.
> After some research on the code, I found that:
> when running in attached mode, MiniDispatcher will never complete 
> {{shutDownFuture}} before it receives a request from the job client. 
> {code}
>               if (executionMode == ClusterEntrypoint.ExecutionMode.NORMAL) {
>                       // terminate the MiniDispatcher once we served the first JobResult successfully
>                       jobResultFuture.thenAccept((JobResult result) -> {
>                               ApplicationStatus status = result.getSerializedThrowable().isPresent() ?
>                                               ApplicationStatus.FAILED : ApplicationStatus.SUCCEEDED;
>                               LOG.debug("Shutting down per-job cluster because someone retrieved the job result.");
>                               shutDownFuture.complete(status);
>                       });
>               }
> {code}
> However, when running in async mode (submitting the job via 
> env.executeAsync), there may be no request from the job client, because once 
> a user finds out from the job client that the job has failed, they may never 
> request the result again.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
