[
https://issues.apache.org/jira/browse/FLINK-25098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17451230#comment-17451230
]
Till Rohrmann commented on FLINK-25098:
---------------------------------------
How exactly are you tearing down the initial cluster?
That Flink does not delete the HA ConfigMaps is by design and is documented
[here|https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/deployment/ha/kubernetes_ha/#high-availability-data-clean-up].
What should not happen is that killing the Flink cluster deletes the submitted
{{JobGraph}} but does not remove the entry from the HA ConfigMap.
When tearing down the initial cluster, are you also deleting the PVC or the PV?
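For context, here is a minimal sketch of the Kubernetes HA settings this question assumes. It is not taken from the reporter's actual configuration, and {{<cluster-id>}} and {{<dir>}} are illustrative placeholders. In this kind of setup the HA ConfigMaps only hold pointers to the job metadata, while the {{JobGraph}} itself is stored under {{high-availability.storageDir}}, so it matters whether the volume backing that directory survives the teardown:

```yaml
# flink-conf.yaml fragment (sketch; values are placeholders, not the reporter's settings)
kubernetes.cluster-id: <cluster-id>
high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
# Directory on the shared volume mounted by all JobManager replicas;
# the HA ConfigMaps store pointers to files written here.
high-availability.storageDir: file:///<dir>
```

If the PVC or PV is deleted while the HA ConfigMaps are retained, the pointers in the ConfigMaps may reference files that no longer exist.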
> Jobmanager CrashLoopBackOff in HA configuration
> -----------------------------------------------
>
> Key: FLINK-25098
> URL: https://issues.apache.org/jira/browse/FLINK-25098
> Project: Flink
> Issue Type: Bug
> Components: Deployment / Kubernetes
> Affects Versions: 1.13.2, 1.13.3
> Environment: Reproduced with:
> * Persistent jobs storage provided by the rocks-cephfs storage class.
> * OpenShift 4.9.5.
> Reporter: Adrian Vasiliu
> Priority: Critical
> Attachments: jm-flink-ha-jobmanager-log.txt,
> jm-flink-ha-tls-proxy-log.txt
>
>
> In a Kubernetes deployment of Flink 1.13.2 (also reproduced with Flink
> 1.13.3), enabling Flink HA with 3 replicas of the jobmanager leads to
> CrashLoopBackOff for all replicas.
> Attaching the full logs of the {{jobmanager}} and {{tls-proxy}} containers of
> the jobmanager pod:
> [^jm-flink-ha-jobmanager-log.txt]
> [^jm-flink-ha-tls-proxy-log.txt]
> Reproduced with:
> * Persistent jobs storage provided by the {{rocks-cephfs}} storage class
> (shared by all replicas - ReadWriteMany) and mount path set via
> {{high-availability.storageDir: file///<dir>}}.
> * OpenShift 4.9.5 and also 4.8.x. Reproduced in several clusters, so it is
> not a one-off problem.
> Remarks:
> * This is a follow-up of
> https://issues.apache.org/jira/browse/FLINK-22014?focusedCommentId=17450524&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17450524.
>
> * Picked Critical severity as HA is critical for our product.