[jira] [Commented] (FLINK-35499) EventTimeWindowCheckpointingITCase times out due to Checkpoint expired before completing

Matthias Pohl (Jira) Tue, 17 Sep 2024 00:10:50 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-35499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882277#comment-17882277
 ]


Matthias Pohl commented on FLINK-35499:
---------------------------------------

https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=62173&view=logs&j=baf26b34-3c6a-54e8-f93f-cf269b32f802&t=8c9d126d-57d2-5a9e-a8c8-ff53f7b35cd9&l=9047

> EventTimeWindowCheckpointingITCase times out due to Checkpoint expired before 
> completing
> ----------------------------------------------------------------------------------------
>
>                 Key: FLINK-35499
>                 URL: https://issues.apache.org/jira/browse/FLINK-35499
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.20.0
>            Reporter: Ryan Skraba
>            Priority: Critical
>              Labels: test-stability
>         Attachments: FLINK-35499.AdaptiveScheduler.log
>
>
> * 1.20 AdaptiveScheduler / Test (module: tests) 
> https://github.com/apache/flink/actions/runs/9311892945/job/25632037990#step:10:8702
> * 1.20 Default (Java 8) / Test (module: tests) 
> https://github.com/apache/flink/actions/runs/9275522134/job/25520829730#step:10:8264
> Going into the logs, we see the following error occurs:
> {code:java}
> ================================================================================
> Test testTumblingTimeWindow[statebackend type =ROCKSDB_INCREMENTAL, 
> buffersPerChannel = 
> 2](org.apache.flink.test.checkpointing.EventTimeWindowCheckpointingITCase) is 
> running.
> --------------------------------------------------------------------------------
> <...>
> 20:24:23,562 [    Checkpoint Timer] INFO  
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Triggering 
> checkpoint 22 (type=CheckpointType{name='Checkpoint', 
> sharingFilesStrategy=FORWARD_BACKWARD}) @ 1716927863562 for job 
> 15d0a663cb415b09b9a68ccc40640c6d.
> 20:24:23,609 [jobmanager-io-thread-2] INFO  
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Completed 
> checkpoint 22 for job 15d0a663cb415b09b9a68ccc40640c6d (2349132 bytes, 
> checkpointDuration=43 ms, finalizationTime=4 ms).
> 20:24:23,610 [    Checkpoint Timer] INFO  
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Triggering 
> checkpoint 23 (type=CheckpointType{name='Checkpoint', 
> sharingFilesStrategy=FORWARD_BACKWARD}) @ 1716927863610 for job 
> 15d0a663cb415b09b9a68ccc40640c6d.
> 20:24:23,620 [jobmanager-io-thread-2] WARN  
> org.apache.flink.runtime.jobmaster.JobMaster                 [] - Error while 
> processing AcknowledgeCheckpoint message
> java.lang.IllegalStateException: Attempt to reference unknown state: 
> a9a90973-4ee5-384f-acef-58a7c7560920
>       at 
> org.apache.flink.util.Preconditions.checkState(Preconditions.java:193) 
> ~[flink-core-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
>       at 
> org.apache.flink.runtime.state.SharedStateRegistryImpl.registerReference(SharedStateRegistryImpl.java:97)
>  ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
>       at 
> org.apache.flink.runtime.state.SharedStateRegistry.registerReference(SharedStateRegistry.java:53)
>  ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
>       at 
> org.apache.flink.runtime.state.IncrementalRemoteKeyedStateHandle.registerSharedStates(IncrementalRemoteKeyedStateHandle.java:289)
>  ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
>       at 
> org.apache.flink.runtime.checkpoint.OperatorSubtaskState.registerSharedState(OperatorSubtaskState.java:243)
>  ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
>       at 
> org.apache.flink.runtime.checkpoint.OperatorSubtaskState.registerSharedStates(OperatorSubtaskState.java:226)
>  ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
>       at 
> org.apache.flink.runtime.checkpoint.TaskStateSnapshot.registerSharedStates(TaskStateSnapshot.java:193)
>  ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
>       at 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveAcknowledgeMessage(CheckpointCoordinator.java:1245)
>  ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
>       at 
> org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$acknowledgeCheckpoint$2(ExecutionGraphHandler.java:109)
>  ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
>       at 
> org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$processCheckpointCoordinatorMessage$4(ExecutionGraphHandler.java:139)
>  ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
>       at 
> org.apache.flink.util.MdcUtils.lambda$wrapRunnable$1(MdcUtils.java:64) 
> ~[flink-core-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  [?:1.8.0_392]
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  [?:1.8.0_392]
>       at java.lang.Thread.run(Thread.java:750) [?:1.8.0_392]
> 20:24:23,663 [Source: Custom Source (1/1)#1] INFO  
> org.apache.flink.runtime.taskmanager.Task                    [] - Source: 
> Custom Source (1/1)#1 
> (bc4de0d149fba0ca825771ff7eeae08d_bc764cd8ddf7a0cff126f51c16239658_0_1) 
> switched from RUNNING to FINISHED.
> 20:24:23,663 [Source: Custom Source (1/1)#1] INFO  
> org.apache.flink.runtime.taskmanager.Task                    [] - Freeing 
> task resources for Source: Custom Source (1/1)#1 
> (bc4de0d149fba0ca825771ff7eeae08d_bc764cd8ddf7a0cff126f51c16239658_0_1).
> 20:24:23,663 [flink-pekko.actor.default-dispatcher-8] INFO  
> org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - 
> Un-registering task and sending final execution state FINISHED to JobManager 
> for task Source: Custom Source (1/1)#1 
> bc4de0d149fba0ca825771ff7eeae08d_bc764cd8ddf7a0cff126f51c16239658_0_1.
> 20:24:23,663 [flink-pekko.actor.default-dispatcher-8] INFO  
> org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source: 
> Custom Source (1/1) 
> (bc4de0d149fba0ca825771ff7eeae08d_bc764cd8ddf7a0cff126f51c16239658_0_1) 
> switched from RUNNING to FINISHED.
> 20:24:23,663 [flink-pekko.actor.default-dispatcher-8] INFO  
> org.apache.flink.runtime.jobmaster.JobMaster                 [] - Triggering 
> a manual checkpoint for job 15d0a663cb415b09b9a68ccc40640c6d.
> 20:24:23,664 [TumblingEventTimeWindows (3/4)#1] INFO  
> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend [] - Closed 
> RocksDB State Backend. Cleaning up RocksDB working directory 
> /tmp/junit4600599408056781950/junit4118530363439384731/job_15d0a663cb415b09b9a68ccc40640c6d_op_WindowOperator_0a448493b4782967b150582570326227__3_4__uuid_f9b935a1-38cc-4cb3-92f0-97bf257d499d.
> 20:24:23,665 [TumblingEventTimeWindows (3/4)#1] INFO  
> org.apache.flink.runtime.taskmanager.Task                    [] - 
> TumblingEventTimeWindows (3/4)#1 
> (bc4de0d149fba0ca825771ff7eeae08d_0a448493b4782967b150582570326227_2_1) 
> switched from RUNNING to FINISHED.
> 20:24:23,665 [TumblingEventTimeWindows (3/4)#1] INFO  
> org.apache.flink.runtime.taskmanager.Task                    [] - Freeing 
> task resources for TumblingEventTimeWindows (3/4)#1 
> (bc4de0d149fba0ca825771ff7eeae08d_0a448493b4782967b150582570326227_2_1).
> 20:24:23,665 [flink-pekko.actor.default-dispatcher-8] INFO  
> org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - 
> Un-registering task and sending final execution state FINISHED to JobManager 
> for task TumblingEventTimeWindows (3/4)#1 
> bc4de0d149fba0ca825771ff7eeae08d_0a448493b4782967b150582570326227_2_1.
> 20:24:23,665 [flink-pekko.actor.default-dispatcher-8] INFO  
> org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - 
> TumblingEventTimeWindows (3/4) 
> (bc4de0d149fba0ca825771ff7eeae08d_0a448493b4782967b150582570326227_2_1) 
> switched from RUNNING to FINISHED.
> 20:24:23,665 [flink-pekko.actor.default-dispatcher-10] INFO  
> org.apache.flink.runtime.resourcemanager.slotmanager.FineGrainedSlotManager 
> [] - Received resource requirements from job 
> 15d0a663cb415b09b9a68ccc40640c6d: 
> [ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN}, 
> numberOfRequiredSlots=3}]
> 20:26:01,189 [flink-pekko.actor.default-dispatcher-16] INFO  
> org.apache.flink.runtime.jobmaster.slotpool.DefaultDeclarativeSlotPool [] - 
> Releasing slot [27b0503b2930b17041ffd3635dfb819f].
> 20:26:01,190 [flink-pekko.actor.default-dispatcher-16] INFO  
> org.apache.flink.runtime.taskexecutor.slot.TaskSlotTableImpl [] - Free slot 
> TaskSlot(index:3, state:ACTIVE, resource profile: 
> ResourceProfile{taskHeapMemory=512.000gb (549755813888 bytes), 
> taskOffHeapMemory=512.000gb (549755813888 bytes), managedMemory=64.000mb 
> (67108864 bytes), networkMemory=32.000mb (33554432 bytes)}, allocationId: 
> 27b0503b2930b17041ffd3635dfb819f, jobId: 15d0a663cb415b09b9a68ccc40640c6d).
> 20:26:01,205 [flink-pekko.actor.default-dispatcher-16] INFO  
> org.apache.flink.runtime.resourcemanager.slotmanager.DefaultSlotStatusSyncer 
> [] - Freeing slot 27b0503b2930b17041ffd3635dfb819f.
> 20:34:23,610 [    Checkpoint Timer] INFO  
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Checkpoint 
> 23 of job 15d0a663cb415b09b9a68ccc40640c6d expired before completing.
> 20:34:23,611 [    Checkpoint Timer] WARN  
> org.apache.flink.runtime.checkpoint.CheckpointFailureManager [] - Failed to 
> trigger or complete checkpoint 23 for job 15d0a663cb415b09b9a68ccc40640c6d. 
> (0 consecutive failed attempts so far)
> org.apache.flink.runtime.checkpoint.CheckpointException: Checkpoint expired 
> before completing.
>       at 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator$CheckpointCanceller.run(CheckpointCoordinator.java:2346)
>  ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
>       at 
> org.apache.flink.util.MdcUtils.lambda$wrapRunnable$1(MdcUtils.java:64) 
> ~[flink-core-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
>       at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
> [?:1.8.0_392]
>       at java.util.concurrent.FutureTask.run(FutureTask.java:266) 
> [?:1.8.0_392]
>       at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>  [?:1.8.0_392]
>       at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>  [?:1.8.0_392]
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  [?:1.8.0_392]
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  [?:1.8.0_392]
>       at java.lang.Thread.run(Thread.java:750) [?:1.8.0_392]
> 20:34:23,611 [    Checkpoint Timer] INFO  
> org.apache.flink.runtime.checkpoint.CheckpointRequestDecider [] - checkpoint 
> request time in queue: 599948
> 20:34:23,612 [flink-pekko.actor.default-dispatcher-130] INFO  
> org.apache.flink.runtime.jobmaster.JobMaster                 [] - Trying to 
> recover from a global failure.
> org.apache.flink.util.FlinkRuntimeException: Exceeded checkpoint tolerable 
> failure threshold. The latest checkpoint failed due to Checkpoint expired 
> before completing., view the Checkpoint History tab or the Job Manager log to 
> find out why continuous checkpoints failed.
>       at 
> org.apache.flink.runtime.checkpoint.CheckpointFailureManager.checkFailureAgainstCounter(CheckpointFailureManager.java:212)
>  ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
>       at 
> org.apache.flink.runtime.checkpoint.CheckpointFailureManager.handleJobLevelCheckpointException(CheckpointFailureManager.java:169)
>  ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
>       at 
> org.apache.flink.runtime.checkpoint.CheckpointFailureManager.handleCheckpointException(CheckpointFailureManager.java:122)
>  ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
>       at 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoint(CheckpointCoordinator.java:2281)
>  ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
>       at 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoint(CheckpointCoordinator.java:2260)
>  ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
>       at 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.access$1200(CheckpointCoordinator.java:102)
>  ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
>       at 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator$CheckpointCanceller.run(CheckpointCoordinator.java:2346)
>  ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
>       at 
> org.apache.flink.util.MdcUtils.lambda$wrapRunnable$1(MdcUtils.java:64) 
> ~[flink-core-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
>       at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
> ~[?:1.8.0_392]
>       at java.util.concurrent.FutureTask.run(FutureTask.java:266) 
> ~[?:1.8.0_392]
>       at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>  ~[?:1.8.0_392]
>       at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>  ~[?:1.8.0_392]
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  ~[?:1.8.0_392]
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  ~[?:1.8.0_392]
>       at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_392]
> 20:34:23,613 [flink-pekko.actor.default-dispatcher-130] INFO  
> org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Job 
> Tumbling Window Test (15d0a663cb415b09b9a68ccc40640c6d) switched from state 
> RUNNING to FAILING.
> org.apache.flink.runtime.JobException: Recovery is suppressed by 
> FixedDelayRestartBackoffTimeStrategy(maxNumberRestartAttempts=1, 
> backoffTimeMS=0)
>       at 
> org.apache.flink.runtime.executiongraph.failover.ExecutionFailureHandler.handleFailure(ExecutionFailureHandler.java:219)
>  ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
>       at 
> org.apache.flink.runtime.executiongraph.failover.ExecutionFailureHandler.handleFailureAndReport(ExecutionFailureHandler.java:166)
>  ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
>       at 
> org.apache.flink.runtime.executiongraph.failover.ExecutionFailureHandler.getGlobalFailureHandlingResult(ExecutionFailureHandler.java:140)
>  ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
>       at 
> org.apache.flink.runtime.scheduler.DefaultScheduler.handleGlobalFailure(DefaultScheduler.java:324)
>  ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
>       at 
> org.apache.flink.runtime.scheduler.UpdateSchedulerNgOnInternalFailuresListener.notifyGlobalFailure(UpdateSchedulerNgOnInternalFailuresListener.java:57)
>  ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
>       at 
> org.apache.flink.runtime.executiongraph.DefaultExecutionGraph.failGlobal(DefaultExecutionGraph.java:1092)
>  ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
>       at 
> org.apache.flink.runtime.executiongraph.DefaultExecutionGraph$1.lambda$failJob$0(DefaultExecutionGraph.java:477)
>  ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
>       at 
> org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.lambda$handleRunAsync$4(PekkoRpcActor.java:460)
>  ~[flink-rpc-akka44fe192f-1cb4-4526-9c09-a5d6b8d70d87.jar:1.20-SNAPSHOT]
>       at 
> org.apache.flink.runtime.concurrent.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:68)
>  ~[flink-rpc-core-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> <...> 
> {code}
> Notably: {{org.apache.flink.util.FlinkRuntimeException: Exceeded checkpoint 
> tolerable failure threshold. The latest checkpoint failed due to Checkpoint 
> expired before completing., view the Checkpoint History tab or the Job 
> Manager log to find out why continuous checkpoints failed.}}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (FLINK-35499) EventTimeWindowCheckpointingITCase times out due to Checkpoint expired before completing

Reply via email to