[jira] [Commented] (FLINK-25958) OOME Checkpoints & Savepoints were shown as COMPLETE in Flink UI

Piotr Nowojski (Jira) Fri, 04 Feb 2022 02:28:33 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-25958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17486941#comment-17486941
 ]


Piotr Nowojski commented on FLINK-25958:
----------------------------------------

It looks like the problem is caused by prematurely reporting the checkpoint 
as completed via the {{PendingCheckpointStats#reportCompletedCheckpoint}} call 
in {{PendingCheckpoint#finalizeCheckpoint}}. That's when {{PendingCheckpoint}} 
is converted to {{CompletedCheckpoint}}, however this doesn't mean the 
checkpoint will indeed complete. For example, adding it to the checkpoint store 
can still fail.
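
To make the ordering issue concrete, here is a minimal sketch of it. It is not 
Flink code: {{StatsTracker}}, {{CheckpointStore}} and 
{{completePendingCheckpoint}} below are hypothetical stand-ins that only mimic 
the order of calls described above.

```java
import java.util.ArrayList;
import java.util.List;

public class PrematureReportSketch {

    /** Hypothetical stand-in for the stats that back the UI. */
    static class StatsTracker {
        final List<Long> completed = new ArrayList<>();

        void reportCompleted(long checkpointId) {
            completed.add(checkpointId);
        }
    }

    /** Hypothetical store whose add can fail, e.g. while serializing. */
    interface CheckpointStore {
        void add(long checkpointId) throws Exception;
    }

    /** Mimics the problematic order: report first, persist afterwards. */
    static boolean completePendingCheckpoint(
            long checkpointId, StatsTracker stats, CheckpointStore store) {
        // "Finalize" step: the stats are already updated here.
        stats.reportCompleted(checkpointId);
        try {
            // Persisting the checkpoint can still fail after the report.
            store.add(checkpointId);
            return true;
        } catch (Exception e) {
            // The checkpoint is discarded, but the stats are never corrected.
            return false;
        }
    }

    public static void main(String[] args) {
        StatsTracker stats = new StatsTracker();
        CheckpointStore failingStore = id -> {
            throw new IllegalStateException("simulated store failure");
        };

        boolean succeeded = completePendingCheckpoint(39L, stats, failingStore);

        // prints "succeeded=false, shownAsCompleted=true"
        System.out.println(
                "succeeded=" + succeeded
                        + ", shownAsCompleted=" + stats.completed.contains(39L));
    }
}
```

In this toy model the checkpoint fails, yet it stays in the list of completed 
checkpoints, which matches what the UI showed for checkpoints 39 and 40.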

It looks like there is no good reason behind this behaviour and it's just an 
unintentional artefact of the FLINK-4410 changes, which added support for 
displaying failed/pending checkpoint stats.

The most naive solution might be just moving 
{{PendingCheckpointStats#reportCompletedCheckpoint}} to the end of the 
checkpointing process. However, one thing to consider is that this could 
inadvertently create the opposite problem, where the checkpoint has completed 
but a failure while reporting prevents it from being marked as such.
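
One possible shape of that reordering, as a rough sketch only: persist first, 
report afterwards, and keep a reporting failure from changing the checkpoint's 
outcome. All names here ({{CheckpointStore}}, {{StatsReporter}}, {{Outcome}}, 
{{complete}}) are hypothetical and not part of Flink, and this is not 
necessarily how the issue should be fixed.

```java
public class ReportAfterStoreSketch {

    /** Hypothetical store whose add can fail. */
    interface CheckpointStore {
        void add(long checkpointId) throws Exception;
    }

    /** Hypothetical reporter that may itself fail. */
    interface StatsReporter {
        void reportCompleted(long checkpointId);
    }

    enum Outcome { FAILED, COMPLETED_AND_REPORTED, COMPLETED_BUT_UNREPORTED }

    static Outcome complete(long checkpointId, CheckpointStore store, StatsReporter stats) {
        try {
            // Persist first: nothing is reported unless this succeeds.
            store.add(checkpointId);
        } catch (Exception e) {
            return Outcome.FAILED;
        }
        try {
            stats.reportCompleted(checkpointId);
            return Outcome.COMPLETED_AND_REPORTED;
        } catch (RuntimeException e) {
            // The "opposite problem": the checkpoint is durable, but the stats
            // do not show it. It is surfaced here instead of failing the checkpoint.
            return Outcome.COMPLETED_BUT_UNREPORTED;
        }
    }

    public static void main(String[] args) {
        CheckpointStore okStore = id -> { };
        CheckpointStore failingStore = id -> {
            throw new IllegalStateException("simulated store failure");
        };
        StatsReporter okStats = id -> { };
        StatsReporter failingStats = id -> {
            throw new IllegalStateException("simulated reporting failure");
        };

        System.out.println(complete(39L, failingStore, okStats)); // FAILED
        System.out.println(complete(40L, okStore, okStats));      // COMPLETED_AND_REPORTED
        System.out.println(complete(41L, okStore, failingStats)); // COMPLETED_BUT_UNREPORTED
    }
}
```

The third case is the one to watch: whether it should be logged, retried, or 
handled differently is a design decision this sketch leaves open.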

> OOME Checkpoints & Savepoints were shown as COMPLETE in Flink UI
> ----------------------------------------------------------------
>
>                 Key: FLINK-25958
>                 URL: https://issues.apache.org/jira/browse/FLINK-25958
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.15.0, 1.12.7, 1.13.5, 1.14.3
>         Environment: Ververica Platform 2.6.2
> Flink 1.13.5
>            Reporter: Victor Xu
>            Priority: Major
>         Attachments: JIRA-1.jpg
>
>
> Flink job was running but the checkpoints & savepoints were failing all the 
> time due to OOM Exception. However, the Flink UI showed COMPLETE for those 
> checkpoints & savepoints.
> For example (checkpoint 39 & 40):
> {noformat}
> 2022-01-27 02:41:39,969 INFO  
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Triggering 
> checkpoint 39 (type=CHECKPOINT) @ 1643251299952 for job 
> ab2217e5ce144087bbddf6bd6c3668eb.
> 2022-01-27 02:43:19,678 WARN  org.apache.flink.runtime.jobmaster.JobMaster    
>              [] - Error while processing AcknowledgeCheckpoint message
> org.apache.flink.runtime.checkpoint.CheckpointException: Could not complete 
> the pending checkpoint 39. Failure reason: Failure to finalize checkpoint.
>         at 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:1227)
>  ~[flink-dist_2.12-1.13.5-stream2.jar:1.13.5-stream2]
>         at 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveAcknowledgeMessage(CheckpointCoordinator.java:1072)
>  ~[flink-dist_2.12-1.13.5-stream2.jar:1.13.5-stream2]
>         at 
> org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$acknowledgeCheckpoint$1(ExecutionGraphHandler.java:89)
>  ~[flink-dist_2.12-1.13.5-stream2.jar:1.13.5-stream2]
>         at 
> org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$processCheckpointCoordinatorMessage$3(ExecutionGraphHandler.java:119)
>  ~[flink-dist_2.12-1.13.5-stream2.jar:1.13.5-stream2]
>         at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
>         at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
>         at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
>  [?:?]
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>  [?:?]
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>  [?:?]
>         at java.lang.Thread.run(Thread.java:829) [?:?]
> Caused by: java.lang.IllegalArgumentException: Self-suppression not permitted
>         at java.lang.Throwable.addSuppressed(Throwable.java:1054) ~[?:?]
>         at 
> org.apache.flink.util.InstantiationUtil.serializeObject(InstantiationUtil.java:627)
>  ~[flink-dist_2.12-1.13.5-stream2.jar:1.13.5-stream2]
>         at 
> com.ververica.platform.flink.ha.kubernetes.KubernetesHaCheckpointStore.serializeCheckpoint(KubernetesHaCheckpointStore.java:204)
>  ~[vvp-flink-ha-kubernetes-flink113-1.4-20211013.091138-2.jar:?]
>         at 
> com.ververica.platform.flink.ha.kubernetes.KubernetesHaCheckpointStore.addCheckpoint(KubernetesHaCheckpointStore.java:83)
>  ~[vvp-flink-ha-kubernetes-flink113-1.4-20211013.091138-2.jar:?]
>         at 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:1209)
>  ~[flink-dist_2.12-1.13.5-stream2.jar:1.13.5-stream2]
>         ... 9 more
> Caused by: java.lang.OutOfMemoryError: Java heap space
> 2022-01-27 03:41:39,970 INFO  
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Triggering 
> checkpoint 40 (type=CHECKPOINT) @ 1643254899952 for job 
> ab2217e5ce144087bbddf6bd6c3668eb.
> 2022-01-27 03:43:22,326 WARN  org.apache.flink.runtime.jobmaster.JobMaster    
>              [] - Error while processing AcknowledgeCheckpoint message
> org.apache.flink.runtime.checkpoint.CheckpointException: Could not complete 
> the pending checkpoint 40. Failure reason: Failure to finalize checkpoint.
>         at 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:1227)
>  ~[flink-dist_2.12-1.13.5-stream2.jar:1.13.5-stream2]
>         at 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveAcknowledgeMessage(CheckpointCoordinator.java:1072)
>  ~[flink-dist_2.12-1.13.5-stream2.jar:1.13.5-stream2]
>         at 
> org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$acknowledgeCheckpoint$1(ExecutionGraphHandler.java:89)
>  ~[flink-dist_2.12-1.13.5-stream2.jar:1.13.5-stream2]
>         at 
> org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$processCheckpointCoordinatorMessage$3(ExecutionGraphHandler.java:119)
>  ~[flink-dist_2.12-1.13.5-stream2.jar:1.13.5-stream2]
>         at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
>         at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
>         at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
>  [?:?]
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>  [?:?]
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>  [?:?]
>         at java.lang.Thread.run(Thread.java:829) [?:?]
> Caused by: java.lang.IllegalArgumentException: Self-suppression not permitted
>         at java.lang.Throwable.addSuppressed(Throwable.java:1054) ~[?:?]
>         at 
> org.apache.flink.util.InstantiationUtil.serializeObject(InstantiationUtil.java:627)
>  ~[flink-dist_2.12-1.13.5-stream2.jar:1.13.5-stream2]
>         at 
> com.ververica.platform.flink.ha.kubernetes.KubernetesHaCheckpointStore.serializeCheckpoint(KubernetesHaCheckpointStore.java:204)
>  ~[vvp-flink-ha-kubernetes-flink113-1.4-20211013.091138-2.jar:?]
>         at 
> com.ververica.platform.flink.ha.kubernetes.KubernetesHaCheckpointStore.addCheckpoint(KubernetesHaCheckpointStore.java:83)
>  ~[vvp-flink-ha-kubernetes-flink113-1.4-20211013.091138-2.jar:?]
>         at 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:1209)
>  ~[flink-dist_2.12-1.13.5-stream2.jar:1.13.5-stream2]
>         ... 9 more
> Caused by: java.lang.OutOfMemoryError: Java heap space{noformat}
> Please find attached a screenshot of the Flink UI (both 39 & 40 were 
> COMPLETE).
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
