[
https://issues.apache.org/jira/browse/FLINK-25098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17452441#comment-17452441
]
Enrique Lacal commented on FLINK-25098:
---------------------------------------
Hi Till,
I want to differentiate between a reinstall and this happening while the Flink
cluster is running. I agree that for a full uninstall it is safer to manually
delete the ConfigMaps; that is what Adrian saw above, with the ConfigMaps
pointing to an outdated job graph. Deleting the ConfigMaps solves the reinstall
case, but not the one that occurs while the cluster is running.
The other problem we are seeing is that, while a Flink cluster is running, the
ConfigMap for one of the jobs becomes inconsistent for some reason, and when the
leader JM goes down and a new follower tries to restore that job from the
ConfigMap, it cannot find the checkpoint referenced there. (This is what Neeraj
has shared through the logs.) I've done some investigation, watching the
ConfigMap and the filesystem simultaneously, and it seems that Flink updates the
ConfigMap before creating the `completedCheckpoint` file in the directory set by
`high-availability.storageDir`. My assumption is that the leader JM goes down
after the ConfigMap is updated but before the `completedCheckpoint` is written,
leaving the state inconsistent. Flink then cannot recover from this state
without manual intervention, which is a significant problem. Another idea might
be that the checkpoint fails but the ConfigMap is prematurely updated; I think
this is less likely.
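To illustrate the ordering I would have expected, here is a minimal sketch (not Flink's actual code; `publishCheckpoint` and `updateLeaderConfigMap` are hypothetical helpers): the checkpoint metadata is persisted to the HA storage directory first, and the ConfigMap pointer is only updated once that write has succeeded, so a crash in between leaves a stale but still resolvable pointer rather than a dangling one.
{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class CheckpointPointerOrdering {

    /**
     * Hypothetical "safe" ordering: persist the completedCheckpoint file to
     * high-availability.storageDir first, then publish its location in the
     * HA ConfigMap. If the JobManager dies between the two steps, the
     * ConfigMap still points at the previous, existing checkpoint.
     */
    static void publishCheckpoint(Path haStorageDir, long checkpointId, byte[] metadata)
            throws IOException {
        // 1. Write the metadata to a temporary file and atomically move it into place.
        Path tmp = Files.createTempFile(haStorageDir, "completedCheckpoint", ".tmp");
        Files.write(tmp, metadata);
        Path finalFile = haStorageDir.resolve("completedCheckpoint-" + checkpointId);
        Files.move(tmp, finalFile, StandardCopyOption.ATOMIC_MOVE);

        // 2. Only now update the pointer that followers read on recovery.
        updateLeaderConfigMap(checkpointId, finalFile.toString());
    }

    /** Placeholder for the ConfigMap update (e.g. via the Kubernetes API). */
    static void updateLeaderConfigMap(long checkpointId, String location) {
        // In the failure mode described above, this step appears to happen
        // before step 1, so a crash in between leaves a dangling reference.
    }
}
{code}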
I understand you want to see logs from when the checkpoint fails in order to
understand the root cause, so we have set up persistence for our logs and will
share them as soon as we can reproduce the issue. I also couldn't find a way of
reproducing this by crashing the job, killing the leader pod, etc., and since
the interval between the ConfigMap being updated and the file appearing in the
filesystem is so short, it's hard to crash the pod at exactly that moment. From
my observation, after the Flink cluster is in this unrecoverable state the
actual checkpoint data is present in `state.checkpoints.dir` (e.g. a
`chk-<number>` directory), but the `completedCheckpoint` file doesn't exist,
which means the checkpoint itself was taken correctly but the reference to it
is missing. I believe this is the way it works, but I'm not 100% sure.
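For anyone hitting the same state, a quick way to confirm the mismatch is to list both directories side by side; a rough sketch, assuming both volumes are mounted locally (the paths below are placeholders):
{code:java}
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class CheckpointConsistencyCheck {
    public static void main(String[] args) throws IOException {
        // Placeholder paths: point these at state.checkpoints.dir and
        // high-availability.storageDir as mounted in the JobManager pod.
        Path checkpointsDir = Paths.get("/mnt/flink/checkpoints/<job-id>");
        Path haStorageDir = Paths.get("/mnt/flink/ha");

        // chk-<n> directories: the checkpoint data itself.
        try (DirectoryStream<Path> chks = Files.newDirectoryStream(checkpointsDir, "chk-*")) {
            chks.forEach(p -> System.out.println("checkpoint data: " + p.getFileName()));
        }

        // completedCheckpoint* files: the references the HA ConfigMap points at.
        try (DirectoryStream<Path> refs = Files.newDirectoryStream(haStorageDir, "completedCheckpoint*")) {
            refs.forEach(p -> System.out.println("HA reference:    " + p.getFileName()));
        }
        // In the broken state described above, the first listing shows a recent
        // chk-<number> while the second has no matching completedCheckpoint file.
    }
}
{code}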
Just in case, these are the parameters used for checkpointing:
|Checkpointing Mode|Exactly Once|
|Checkpoint Storage|FileSystemCheckpointStorage|
|State Backend|EmbeddedRocksDBStateBackend|
|Interval|5s|
|Timeout|10m 0s|
|Minimum Pause Between Checkpoints|0ms|
|Maximum Concurrent Checkpoints|1|
|Unaligned Checkpoints|Disabled|
|Persist Checkpoints Externally|Enabled (retain on cancellation)|
|Tolerable Failed Checkpoints|0|
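For completeness, this is roughly how the settings above map onto the DataStream checkpointing API in 1.13 (a sketch with our values; the job itself and the checkpoint path are placeholders):
{code:java}
import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.CheckpointConfig.ExternalizedCheckpointCleanup;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointSettings {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Interval 5s, exactly-once mode.
        env.enableCheckpointing(5_000, CheckpointingMode.EXACTLY_ONCE);

        // RocksDB state backend with filesystem checkpoint storage
        // (state.checkpoints.dir); the path here is a placeholder.
        env.setStateBackend(new EmbeddedRocksDBStateBackend());
        env.getCheckpointConfig().setCheckpointStorage("file:///mnt/flink/checkpoints");

        // Timeout 10m, no minimum pause between checkpoints, one at a time.
        env.getCheckpointConfig().setCheckpointTimeout(10 * 60 * 1000);
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(0);
        env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);

        // Unaligned checkpoints disabled; retain externalized checkpoints on cancellation.
        env.getCheckpointConfig().enableUnalignedCheckpoints(false);
        env.getCheckpointConfig().enableExternalizedCheckpoints(
                ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);

        // No tolerated checkpoint failures.
        env.getCheckpointConfig().setTolerableCheckpointFailureNumber(0);
    }
}
{code}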
Do you think there is a workaround for this issue? Maybe changing the above
configuration to be less strict?
Thanks for your time!
> Jobmanager CrashLoopBackOff in HA configuration
> -----------------------------------------------
>
> Key: FLINK-25098
> URL: https://issues.apache.org/jira/browse/FLINK-25098
> Project: Flink
> Issue Type: Bug
> Components: Deployment / Kubernetes
> Affects Versions: 1.13.2, 1.13.3
> Environment: Reproduced with:
> * Persistent jobs storage provided by the rocks-cephfs storage class.
> * OpenShift 4.9.5.
> Reporter: Adrian Vasiliu
> Priority: Critical
> Attachments:
> iaf-insights-engine--7fc4-eve-29ee-ep-jobmanager-1-jobmanager.log,
> jm-flink-ha-jobmanager-log.txt, jm-flink-ha-tls-proxy-log.txt
>
>
> In a Kubernetes deployment of Flink 1.13.2 (also reproduced with Flink
> 1.13.3), turning to Flink HA by using 3 replicas of the jobmanager leads to
> CrashLoopBackoff for all replicas.
> Attaching the full logs of the {{jobmanager}} and {{tls-proxy}} containers of
> the jobmanager pod:
> [^jm-flink-ha-jobmanager-log.txt]
> [^jm-flink-ha-tls-proxy-log.txt]
> Reproduced with:
> * Persistent jobs storage provided by the {{rocks-cephfs}} storage class
> (shared by all replicas - ReadWriteMany) and mount path set via
> {{high-availability.storageDir: file:///<dir>}}.
> * OpenShift 4.9.5 and also 4.8.x - reproduced in several clusters, it's not
> a "one-shot" trouble.
> Remarks:
> * This is a follow-up of
> https://issues.apache.org/jira/browse/FLINK-22014?focusedCommentId=17450524&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17450524.
>
> * Picked Critical severity as HA is critical for our product.