Victor Xu created FLINK-25958:
---------------------------------
Summary: OOME Checkpoints & Savepoints were shown as COMPLETE in
Flink UI
Key: FLINK-25958
URL: https://issues.apache.org/jira/browse/FLINK-25958
Project: Flink
Issue Type: Bug
Components: Runtime / Checkpointing
Affects Versions: 1.13.5
Environment: Ververica Platform 2.6.2
Flink 1.13.5
Reporter: Victor Xu
Attachments: JIRA-1.jpg
The Flink job kept running, but every checkpoint and savepoint failed with an
OutOfMemoryError (Java heap space) while being finalized. Despite this, the
Flink UI showed those checkpoints and savepoints as COMPLETE.
For example (checkpoint 39 & 40):
{noformat}
2022-01-27 02:41:39,969 INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering
checkpoint 39 (type=CHECKPOINT) @ 1643251299952 for job
ab2217e5ce144087bbddf6bd6c3668eb.
2022-01-27 02:43:19,678 WARN org.apache.flink.runtime.jobmaster.JobMaster
[] - Error while processing AcknowledgeCheckpoint message
org.apache.flink.runtime.checkpoint.CheckpointException: Could not complete the
pending checkpoint 39. Failure reason: Failure to finalize checkpoint.
at
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:1227)
~[flink-dist_2.12-1.13.5-stream2.jar:1.13.5-stream2]
at
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveAcknowledgeMessage(CheckpointCoordinator.java:1072)
~[flink-dist_2.12-1.13.5-stream2.jar:1.13.5-stream2]
at
org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$acknowledgeCheckpoint$1(ExecutionGraphHandler.java:89)
~[flink-dist_2.12-1.13.5-stream2.jar:1.13.5-stream2]
at
org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$processCheckpointCoordinatorMessage$3(ExecutionGraphHandler.java:119)
~[flink-dist_2.12-1.13.5-stream2.jar:1.13.5-stream2]
at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
[?:?]
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
[?:?]
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
[?:?]
at java.lang.Thread.run(Thread.java:829) [?:?]
Caused by: java.lang.IllegalArgumentException: Self-suppression not permitted
at java.lang.Throwable.addSuppressed(Throwable.java:1054) ~[?:?]
at
org.apache.flink.util.InstantiationUtil.serializeObject(InstantiationUtil.java:627)
~[flink-dist_2.12-1.13.5-stream2.jar:1.13.5-stream2]
at
com.ververica.platform.flink.ha.kubernetes.KubernetesHaCheckpointStore.serializeCheckpoint(KubernetesHaCheckpointStore.java:204)
~[vvp-flink-ha-kubernetes-flink113-1.4-20211013.091138-2.jar:?]
at
com.ververica.platform.flink.ha.kubernetes.KubernetesHaCheckpointStore.addCheckpoint(KubernetesHaCheckpointStore.java:83)
~[vvp-flink-ha-kubernetes-flink113-1.4-20211013.091138-2.jar:?]
at
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:1209)
~[flink-dist_2.12-1.13.5-stream2.jar:1.13.5-stream2]
... 9 more
Caused by: java.lang.OutOfMemoryError: Java heap space
2022-01-27 03:41:39,970 INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering
checkpoint 40 (type=CHECKPOINT) @ 1643254899952 for job
ab2217e5ce144087bbddf6bd6c3668eb.
2022-01-27 03:43:22,326 WARN org.apache.flink.runtime.jobmaster.JobMaster
[] - Error while processing AcknowledgeCheckpoint message
org.apache.flink.runtime.checkpoint.CheckpointException: Could not complete the
pending checkpoint 40. Failure reason: Failure to finalize checkpoint.
at
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:1227)
~[flink-dist_2.12-1.13.5-stream2.jar:1.13.5-stream2]
at
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveAcknowledgeMessage(CheckpointCoordinator.java:1072)
~[flink-dist_2.12-1.13.5-stream2.jar:1.13.5-stream2]
at
org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$acknowledgeCheckpoint$1(ExecutionGraphHandler.java:89)
~[flink-dist_2.12-1.13.5-stream2.jar:1.13.5-stream2]
at
org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$processCheckpointCoordinatorMessage$3(ExecutionGraphHandler.java:119)
~[flink-dist_2.12-1.13.5-stream2.jar:1.13.5-stream2]
at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
[?:?]
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
[?:?]
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
[?:?]
at java.lang.Thread.run(Thread.java:829) [?:?]
Caused by: java.lang.IllegalArgumentException: Self-suppression not permitted
at java.lang.Throwable.addSuppressed(Throwable.java:1054) ~[?:?]
at
org.apache.flink.util.InstantiationUtil.serializeObject(InstantiationUtil.java:627)
~[flink-dist_2.12-1.13.5-stream2.jar:1.13.5-stream2]
at
com.ververica.platform.flink.ha.kubernetes.KubernetesHaCheckpointStore.serializeCheckpoint(KubernetesHaCheckpointStore.java:204)
~[vvp-flink-ha-kubernetes-flink113-1.4-20211013.091138-2.jar:?]
at
com.ververica.platform.flink.ha.kubernetes.KubernetesHaCheckpointStore.addCheckpoint(KubernetesHaCheckpointStore.java:83)
~[vvp-flink-ha-kubernetes-flink113-1.4-20211013.091138-2.jar:?]
at
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:1209)
~[flink-dist_2.12-1.13.5-stream2.jar:1.13.5-stream2]
... 9 more
Caused by: java.lang.OutOfMemoryError: Java heap space{noformat}
The attached screenshot of the Flink UI (JIRA-1.jpg) shows both checkpoints 39
and 40 as COMPLETE.
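For readers puzzled by the "Self-suppression not permitted" layer in the traces above, here is a minimal, self-contained sketch of one way that exception chain can arise. It is not taken from Flink or the Ververica HA store. The class names SelfSuppressionDemo and RethrowingResource are hypothetical, and the OutOfMemoryError is simulated, not a real heap exhaustion. The sketch assumes the same Throwable instance is thrown both by the body of a try-with-resources block and by close(). In that case the JDK calls addSuppressed on the exception with itself as the argument, which is rejected with an IllegalArgumentException whose cause is the original error.

```java
public class SelfSuppressionDemo {

    /** Hypothetical resource whose close() rethrows the very same Error instance. */
    static final class RethrowingResource implements AutoCloseable {
        private final Error failure;

        RethrowingResource(Error failure) {
            this.failure = failure;
        }

        @Override
        public void close() {
            // Throws the identical object that the try body throws below.
            throw failure;
        }
    }

    /** Runs the scenario and returns whatever Throwable escapes the try-with-resources. */
    static Throwable run() {
        // Simulated error for illustration only; no memory is actually exhausted.
        Error simulated = new OutOfMemoryError("simulated: Java heap space");
        try {
            try (RethrowingResource resource = new RethrowingResource(simulated)) {
                throw simulated;
            }
        } catch (Throwable t) {
            // The JDK tried simulated.addSuppressed(simulated), which is not allowed,
            // so an IllegalArgumentException wrapping the simulated error arrives here.
            return t;
        }
    }

    public static void main(String[] args) {
        Throwable t = run();
        System.out.println(t);
        System.out.println("Caused by: " + t.getCause());
    }
}
```

If this matches what happens inside InstantiationUtil.serializeObject (an assumption, not confirmed by the logs), the IllegalArgumentException is only a wrapper and the OutOfMemoryError at the bottom of each trace is the failure to investigate.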
--
This message was sent by Atlassian Jira
(v8.20.1#820001)