[ 
https://issues.apache.org/jira/browse/FLINK-21008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17271059#comment-17271059
 ] 

Yang Wang commented on FLINK-21008:
-----------------------------------

[~trohrmann] Thanks for your suggestion.

However, I do not think it resolves the current problem (residual 
ConfigMaps/ZooKeeper nodes) if {{ClusterEntrypoint.closeAsync()}} is executed 
with {{cleanupHaData = false}}.

Let me try to describe how this problem can happen.

When the only existing Flink job in the application is cancelled, 
{{shutDownAsync(applicationStatus, null, true)}} is called to shut down the 
services and do the cleanup. However, before 
{{haServices.closeAndCleanupAllData()}} is executed, the cluster entrypoint 
receives a SIGTERM.

In this situation, residual HA-related ConfigMaps and ZooKeeper nodes are 
left behind. So I suggest adding a shutdown hook for the cleanup in 
{{shutDownAsync}}. The shutdown hook should be removed after 
{{haServices.closeAndCleanupAllData()}} has finished.
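
A minimal sketch of this idea, assuming a simplified entrypoint. The names 
{{HaCleanup}} and {{CleanupShutdownHookSketch}} are hypothetical stand-ins 
and not Flink's actual classes; only {{closeAndCleanupAllData()}} mirrors the 
method named above. The cleanup is synchronized so that a hook triggered by 
SIGTERM waits for a cleanup that is already in progress instead of letting 
the JVM exit in the middle of it.

```java
/** Hypothetical stand-in for the HA services; not Flink's real interface. */
interface HaCleanup {
    void closeAndCleanupAllData() throws Exception;
}

/** Illustrative only: registers a cleanup hook for the duration of shutdown. */
final class CleanupShutdownHookSketch {
    private final HaCleanup haServices;
    private boolean cleanedUp = false; // guarded by "this"

    CleanupShutdownHookSketch(HaCleanup haServices) {
        this.haServices = haServices;
    }

    /** Runs the HA cleanup at most once, from either the normal path or the hook. */
    synchronized void cleanupOnce() {
        if (cleanedUp) {
            return;
        }
        cleanedUp = true;
        try {
            haServices.closeAndCleanupAllData();
        } catch (Exception e) {
            System.err.println("HA data cleanup failed: " + e);
        }
    }

    /** Simplified counterpart of shutDownAsync(..., cleanupHaData). */
    void shutDown(boolean cleanupHaData) {
        if (!cleanupHaData) {
            return; // nothing to clean up, so no hook is needed
        }
        // Register the hook first: from here on, a SIGTERM still triggers the cleanup.
        Thread hook = new Thread(this::cleanupOnce, "ha-cleanup-hook");
        Runtime.getRuntime().addShutdownHook(hook);

        // ... stop cluster components and services here (omitted) ...

        cleanupOnce();

        // The cleanup has finished, so the hook is no longer needed.
        try {
            Runtime.getRuntime().removeShutdownHook(hook);
        } catch (IllegalStateException e) {
            // The JVM is already shutting down; the hook will run and do nothing.
        }
    }
}
```

With this structure the decision whether to clean up the HA data is made 
inside the shutdown path, where {{cleanupHaData}} is known, rather than at 
the time the signal handler is registered.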

All in all, when it receives a SIGTERM, the cluster entrypoint can exit 
directly and leave the cleanup to the shutdown hook, because at the time we 
register the signal handler we cannot know whether the HA data should be 
cleaned up or not.

 

> Residual HA related Kubernetes ConfigMaps and ZooKeeper nodes when cluster 
> entrypoint received SIGTERM in shutdown
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-21008
>                 URL: https://issues.apache.org/jira/browse/FLINK-21008
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.11.3, 1.12.1
>            Reporter: Yang Wang
>            Assignee: Yang Wang
>            Priority: Critical
>             Fix For: 1.13.0
>
>
> Recently, in our internal use case for native K8s integration with K8s HA 
> enabled, we found that the leader-related ConfigMaps could be left behind in 
> some corner cases.
> After some investigation, I think it is possibly caused by an inappropriate 
> shutdown process.
> In {{ClusterEntrypoint#shutDownAsync}}, we first call 
> {{closeClusterComponent}}, which also includes deregistering the Flink 
> application from the cluster management (e.g. Yarn, K8s). Then we call 
> {{stopClusterServices}} and {{cleanupDirectories}}. Imagine that the cluster 
> management does the deregistration very fast, so the JobManager process 
> receives SIGNAL 15 before or while executing {{stopClusterServices}} and 
> {{cleanupDirectories}}. The JVM process will then exit directly, so the two 
> methods may not be executed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
