[
https://issues.apache.org/jira/browse/FLINK-25098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17451817#comment-17451817
]
Till Rohrmann commented on FLINK-25098:
---------------------------------------
If you are terminating the cluster before the job has properly terminated,
then this explains the situation. If the job is not terminated and you are only
killing the process, then the job won't be removed from Flink's HA state.
Hence, when recovering, Flink assumes that the job's data is still there.
However, the PVs in use have been cleaned up in the meantime and the data is
gone. The problem is that the storage you are using is not as persistent as
Flink needs it to be.
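The failure mode described above can be illustrated with a small toy model.
This is only a sketch under stated assumptions, not Flink's actual
implementation: HaStore, submit, terminate_job and recover are hypothetical
names, a temporary directory stands in for the PV, and the HA state is reduced
to a map of pointers into that storage.

```python
import tempfile
from pathlib import Path


class HaStore:
    """Hypothetical stand-in for the HA state: it holds only pointers to job data."""

    def __init__(self):
        self.job_pointers = {}  # job id -> path of the persisted job data


def submit(ha, storage_dir, job_id):
    # Persist the job data in the storage directory and register a pointer to it.
    path = Path(storage_dir) / f"{job_id}.jobgraph"
    path.write_text("serialized job graph")
    ha.job_pointers[job_id] = path


def terminate_job(ha, job_id):
    # Proper termination removes both the pointer and the data it refers to.
    path = ha.job_pointers.pop(job_id)
    path.unlink(missing_ok=True)


def recover(ha):
    # Recovery trusts the HA state: every registered pointer must still resolve.
    recovered = []
    for job_id, path in ha.job_pointers.items():
        if not path.exists():
            raise FileNotFoundError(f"HA state references missing data for {job_id}")
        recovered.append(job_id)
    return recovered


def kill_process_then_wipe_storage():
    # The job is never terminated, so its pointer stays in the HA state,
    # while the storage (the temporary directory) is cleaned up.
    ha = HaStore()
    with tempfile.TemporaryDirectory() as storage_dir:
        submit(ha, storage_dir, "job-1")
    try:
        recover(ha)
    except FileNotFoundError:
        return "recovery failed"
    return "recovered"


def terminate_job_then_wipe_storage():
    # The job is terminated first, so nothing is left to recover.
    ha = HaStore()
    with tempfile.TemporaryDirectory() as storage_dir:
        submit(ha, storage_dir, "job-1")
        terminate_job(ha, "job-1")
    return recover(ha)
```

In this model, kill_process_then_wipe_storage() ends in a failed recovery
because the pointer outlives the data, while terminate_job_then_wipe_storage()
recovers cleanly with no jobs.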
> Jobmanager CrashLoopBackOff in HA configuration
> -----------------------------------------------
>
> Key: FLINK-25098
> URL: https://issues.apache.org/jira/browse/FLINK-25098
> Project: Flink
> Issue Type: Bug
> Components: Deployment / Kubernetes
> Affects Versions: 1.13.2, 1.13.3
> Environment: Reproduced with:
> * Persistent jobs storage provided by the rocks-cephfs storage class.
> * OpenShift 4.9.5.
> Reporter: Adrian Vasiliu
> Priority: Critical
> Attachments: jm-flink-ha-jobmanager-log.txt,
> jm-flink-ha-tls-proxy-log.txt
>
>
> In a Kubernetes deployment of Flink 1.13.2 (also reproduced with Flink
> 1.13.3), switching to Flink HA by using 3 replicas of the jobmanager leads to
> CrashLoopBackOff for all replicas.
> Attaching the full logs of the {{jobmanager}} and {{tls-proxy}} containers of
> the jobmanager pod:
> [^jm-flink-ha-jobmanager-log.txt]
> [^jm-flink-ha-tls-proxy-log.txt]
> Reproduced with:
> * Persistent jobs storage provided by the {{rocks-cephfs}} storage class
> (shared by all replicas - ReadWriteMany) and mount path set via
> {{high-availability.storageDir: file///<dir>}}.
> * OpenShift 4.9.5 and also 4.8.x - reproduced in several clusters, so it is
> not a one-off problem.
> Remarks:
> * This is a follow-up of
> https://issues.apache.org/jira/browse/FLINK-22014?focusedCommentId=17450524&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17450524.
>
> * Picked Critical severity as HA is critical for our product.
--
This message was sent by Atlassian Jira
(v8.20.1#820001)