[ https://issues.apache.org/jira/browse/FLINK-21008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17270260#comment-17270260 ]
Till Rohrmann commented on FLINK-21008:
---------------------------------------

Wouldn't it work if {{ClusterEntrypoint.closeAsync()}} calls

{code}
shutDownAsync(
    ApplicationStatus.UNKNOWN,
    "Cluster entrypoint has been closed externally.",
    false)
{code}

so that a SIGTERM won't clean up the HA data? If the cluster entrypoint itself wants to shut down (e.g. because the job of the per-job cluster has finished), it will call {{shutDownAsync(.., .., true)}}, which will clean up the HA data.

> ClusterEntrypoint#shutDownAsync may not be fully executed
> ---------------------------------------------------------
>
>                 Key: FLINK-21008
>                 URL: https://issues.apache.org/jira/browse/FLINK-21008
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.11.3, 1.12.1
>            Reporter: Yang Wang
>            Assignee: Yang Wang
>            Priority: Critical
>             Fix For: 1.13.0
>
>
> Recently, in our internal use case for native K8s integration with K8s HA
> enabled, we found that the leader-related ConfigMaps could be left behind in
> some corner cases.
> After some investigation, I think this is possibly caused by an inappropriate
> shutdown process.
> In {{ClusterEntrypoint#shutDownAsync}}, we first call
> {{closeClusterComponent}}, which also deregisters the Flink
> application from the cluster management (e.g. Yarn, K8s). Then we call
> {{stopClusterServices}} and {{cleanupDirectories}}. If the cluster
> management deregisters the application very quickly, the JobManager process
> may receive signal 15 (SIGTERM) before or while it is executing
> {{stopClusterServices}} and {{cleanupDirectories}}. The JVM process then
> exits directly, so the two methods may not be executed.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
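
For illustration, the split proposed in the comment (an external close keeps the HA data, a self-initiated shutdown removes it) could look roughly like the sketch below. {{EntrypointSketch}}, {{Status}}, {{onJobFinished}} and the recorded step strings are hypothetical stand-ins and not Flink classes. The name {{cleanupHaData}} for the boolean parameter is also an assumption taken from the comment's wording.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Hypothetical stand-in for ClusterEntrypoint; NOT the real Flink class.
class EntrypointSketch {
    // Assumed, simplified replacement for ApplicationStatus.
    enum Status { SUCCEEDED, UNKNOWN }

    // Records which shutdown steps ran so the behaviour can be inspected.
    final List<String> steps = new ArrayList<>();

    // External close (e.g. a shutdown hook triggered by SIGTERM): keep HA data.
    CompletableFuture<Void> closeAsync() {
        return shutDownAsync(
                Status.UNKNOWN,
                "Cluster entrypoint has been closed externally.",
                false);
    }

    // Self-initiated shutdown (e.g. the job of a per-job cluster finished):
    // clean up HA data. Method name is made up for this sketch.
    CompletableFuture<Void> onJobFinished() {
        return shutDownAsync(Status.SUCCEEDED, "Job finished.", true);
    }

    // Mirrors the order described in the issue: close the cluster component,
    // stop the cluster services, then clean up directories.
    CompletableFuture<Void> shutDownAsync(
            Status status, String diagnostics, boolean cleanupHaData) {
        steps.add("closeClusterComponent:" + status);
        steps.add("stopClusterServices:cleanupHaData=" + cleanupHaData);
        steps.add("cleanupDirectories");
        return CompletableFuture.completedFuture(null);
    }
}
```

With this shape, only the path that the entrypoint triggers itself ever passes {{true}}, so a process killed by the cluster manager during shutdown cannot remove HA data that a restarted JobManager may still need.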