[
https://issues.apache.org/jira/browse/FLINK-35499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882277#comment-17882277
]
Matthias Pohl commented on FLINK-35499:
---------------------------------------
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=62173&view=logs&j=baf26b34-3c6a-54e8-f93f-cf269b32f802&t=8c9d126d-57d2-5a9e-a8c8-ff53f7b35cd9&l=9047
> EventTimeWindowCheckpointingITCase times out due to Checkpoint expired before
> completing
> ----------------------------------------------------------------------------------------
>
> Key: FLINK-35499
> URL: https://issues.apache.org/jira/browse/FLINK-35499
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Checkpointing
> Affects Versions: 1.20.0
> Reporter: Ryan Skraba
> Priority: Critical
> Labels: test-stability
> Attachments: FLINK-35499.AdaptiveScheduler.log
>
>
> * 1.20 AdaptiveScheduler / Test (module: tests)
> https://github.com/apache/flink/actions/runs/9311892945/job/25632037990#step:10:8702
> * 1.20 Default (Java 8) / Test (module: tests)
> https://github.com/apache/flink/actions/runs/9275522134/job/25520829730#step:10:8264
> Going into the logs, we see the following error occurs:
> {code:java}
> ================================================================================
> Test testTumblingTimeWindow[statebackend type =ROCKSDB_INCREMENTAL,
> buffersPerChannel =
> 2](org.apache.flink.test.checkpointing.EventTimeWindowCheckpointingITCase) is
> running.
> --------------------------------------------------------------------------------
> <...>
> 20:24:23,562 [ Checkpoint Timer] INFO
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering
> checkpoint 22 (type=CheckpointType{name='Checkpoint',
> sharingFilesStrategy=FORWARD_BACKWARD}) @ 1716927863562 for job
> 15d0a663cb415b09b9a68ccc40640c6d.
> 20:24:23,609 [jobmanager-io-thread-2] INFO
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed
> checkpoint 22 for job 15d0a663cb415b09b9a68ccc40640c6d (2349132 bytes,
> checkpointDuration=43 ms, finalizationTime=4 ms).
> 20:24:23,610 [ Checkpoint Timer] INFO
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering
> checkpoint 23 (type=CheckpointType{name='Checkpoint',
> sharingFilesStrategy=FORWARD_BACKWARD}) @ 1716927863610 for job
> 15d0a663cb415b09b9a68ccc40640c6d.
> 20:24:23,620 [jobmanager-io-thread-2] WARN
> org.apache.flink.runtime.jobmaster.JobMaster [] - Error while
> processing AcknowledgeCheckpoint message
> java.lang.IllegalStateException: Attempt to reference unknown state:
> a9a90973-4ee5-384f-acef-58a7c7560920
> at
> org.apache.flink.util.Preconditions.checkState(Preconditions.java:193)
> ~[flink-core-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> at
> org.apache.flink.runtime.state.SharedStateRegistryImpl.registerReference(SharedStateRegistryImpl.java:97)
> ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> at
> org.apache.flink.runtime.state.SharedStateRegistry.registerReference(SharedStateRegistry.java:53)
> ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> at
> org.apache.flink.runtime.state.IncrementalRemoteKeyedStateHandle.registerSharedStates(IncrementalRemoteKeyedStateHandle.java:289)
> ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> at
> org.apache.flink.runtime.checkpoint.OperatorSubtaskState.registerSharedState(OperatorSubtaskState.java:243)
> ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> at
> org.apache.flink.runtime.checkpoint.OperatorSubtaskState.registerSharedStates(OperatorSubtaskState.java:226)
> ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> at
> org.apache.flink.runtime.checkpoint.TaskStateSnapshot.registerSharedStates(TaskStateSnapshot.java:193)
> ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> at
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveAcknowledgeMessage(CheckpointCoordinator.java:1245)
> ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> at
> org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$acknowledgeCheckpoint$2(ExecutionGraphHandler.java:109)
> ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> at
> org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$processCheckpointCoordinatorMessage$4(ExecutionGraphHandler.java:139)
> ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> at
> org.apache.flink.util.MdcUtils.lambda$wrapRunnable$1(MdcUtils.java:64)
> ~[flink-core-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> [?:1.8.0_392]
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> [?:1.8.0_392]
> at java.lang.Thread.run(Thread.java:750) [?:1.8.0_392]
> 20:24:23,663 [Source: Custom Source (1/1)#1] INFO
> org.apache.flink.runtime.taskmanager.Task [] - Source:
> Custom Source (1/1)#1
> (bc4de0d149fba0ca825771ff7eeae08d_bc764cd8ddf7a0cff126f51c16239658_0_1)
> switched from RUNNING to FINISHED.
> 20:24:23,663 [Source: Custom Source (1/1)#1] INFO
> org.apache.flink.runtime.taskmanager.Task [] - Freeing
> task resources for Source: Custom Source (1/1)#1
> (bc4de0d149fba0ca825771ff7eeae08d_bc764cd8ddf7a0cff126f51c16239658_0_1).
> 20:24:23,663 [flink-pekko.actor.default-dispatcher-8] INFO
> org.apache.flink.runtime.taskexecutor.TaskExecutor [] -
> Un-registering task and sending final execution state FINISHED to JobManager
> for task Source: Custom Source (1/1)#1
> bc4de0d149fba0ca825771ff7eeae08d_bc764cd8ddf7a0cff126f51c16239658_0_1.
> 20:24:23,663 [flink-pekko.actor.default-dispatcher-8] INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Source:
> Custom Source (1/1)
> (bc4de0d149fba0ca825771ff7eeae08d_bc764cd8ddf7a0cff126f51c16239658_0_1)
> switched from RUNNING to FINISHED.
> 20:24:23,663 [flink-pekko.actor.default-dispatcher-8] INFO
> org.apache.flink.runtime.jobmaster.JobMaster [] - Triggering
> a manual checkpoint for job 15d0a663cb415b09b9a68ccc40640c6d.
> 20:24:23,664 [TumblingEventTimeWindows (3/4)#1] INFO
> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend [] - Closed
> RocksDB State Backend. Cleaning up RocksDB working directory
> /tmp/junit4600599408056781950/junit4118530363439384731/job_15d0a663cb415b09b9a68ccc40640c6d_op_WindowOperator_0a448493b4782967b150582570326227__3_4__uuid_f9b935a1-38cc-4cb3-92f0-97bf257d499d.
> 20:24:23,665 [TumblingEventTimeWindows (3/4)#1] INFO
> org.apache.flink.runtime.taskmanager.Task [] -
> TumblingEventTimeWindows (3/4)#1
> (bc4de0d149fba0ca825771ff7eeae08d_0a448493b4782967b150582570326227_2_1)
> switched from RUNNING to FINISHED.
> 20:24:23,665 [TumblingEventTimeWindows (3/4)#1] INFO
> org.apache.flink.runtime.taskmanager.Task [] - Freeing
> task resources for TumblingEventTimeWindows (3/4)#1
> (bc4de0d149fba0ca825771ff7eeae08d_0a448493b4782967b150582570326227_2_1).
> 20:24:23,665 [flink-pekko.actor.default-dispatcher-8] INFO
> org.apache.flink.runtime.taskexecutor.TaskExecutor [] -
> Un-registering task and sending final execution state FINISHED to JobManager
> for task TumblingEventTimeWindows (3/4)#1
> bc4de0d149fba0ca825771ff7eeae08d_0a448493b4782967b150582570326227_2_1.
> 20:24:23,665 [flink-pekko.actor.default-dispatcher-8] INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph [] -
> TumblingEventTimeWindows (3/4)
> (bc4de0d149fba0ca825771ff7eeae08d_0a448493b4782967b150582570326227_2_1)
> switched from RUNNING to FINISHED.
> 20:24:23,665 [flink-pekko.actor.default-dispatcher-10] INFO
> org.apache.flink.runtime.resourcemanager.slotmanager.FineGrainedSlotManager
> [] - Received resource requirements from job
> 15d0a663cb415b09b9a68ccc40640c6d:
> [ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN},
> numberOfRequiredSlots=3}]
> 20:26:01,189 [flink-pekko.actor.default-dispatcher-16] INFO
> org.apache.flink.runtime.jobmaster.slotpool.DefaultDeclarativeSlotPool [] -
> Releasing slot [27b0503b2930b17041ffd3635dfb819f].
> 20:26:01,190 [flink-pekko.actor.default-dispatcher-16] INFO
> org.apache.flink.runtime.taskexecutor.slot.TaskSlotTableImpl [] - Free slot
> TaskSlot(index:3, state:ACTIVE, resource profile:
> ResourceProfile{taskHeapMemory=512.000gb (549755813888 bytes),
> taskOffHeapMemory=512.000gb (549755813888 bytes), managedMemory=64.000mb
> (67108864 bytes), networkMemory=32.000mb (33554432 bytes)}, allocationId:
> 27b0503b2930b17041ffd3635dfb819f, jobId: 15d0a663cb415b09b9a68ccc40640c6d).
> 20:26:01,205 [flink-pekko.actor.default-dispatcher-16] INFO
> org.apache.flink.runtime.resourcemanager.slotmanager.DefaultSlotStatusSyncer
> [] - Freeing slot 27b0503b2930b17041ffd3635dfb819f.
> 20:34:23,610 [ Checkpoint Timer] INFO
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Checkpoint
> 23 of job 15d0a663cb415b09b9a68ccc40640c6d expired before completing.
> 20:34:23,611 [ Checkpoint Timer] WARN
> org.apache.flink.runtime.checkpoint.CheckpointFailureManager [] - Failed to
> trigger or complete checkpoint 23 for job 15d0a663cb415b09b9a68ccc40640c6d.
> (0 consecutive failed attempts so far)
> org.apache.flink.runtime.checkpoint.CheckpointException: Checkpoint expired
> before completing.
> at
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator$CheckpointCanceller.run(CheckpointCoordinator.java:2346)
> ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> at
> org.apache.flink.util.MdcUtils.lambda$wrapRunnable$1(MdcUtils.java:64)
> ~[flink-core-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> [?:1.8.0_392]
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> [?:1.8.0_392]
> at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
> [?:1.8.0_392]
> at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
> [?:1.8.0_392]
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> [?:1.8.0_392]
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> [?:1.8.0_392]
> at java.lang.Thread.run(Thread.java:750) [?:1.8.0_392]
> 20:34:23,611 [ Checkpoint Timer] INFO
> org.apache.flink.runtime.checkpoint.CheckpointRequestDecider [] - checkpoint
> request time in queue: 599948
> 20:34:23,612 [flink-pekko.actor.default-dispatcher-130] INFO
> org.apache.flink.runtime.jobmaster.JobMaster [] - Trying to
> recover from a global failure.
> org.apache.flink.util.FlinkRuntimeException: Exceeded checkpoint tolerable
> failure threshold. The latest checkpoint failed due to Checkpoint expired
> before completing., view the Checkpoint History tab or the Job Manager log to
> find out why continuous checkpoints failed.
> at
> org.apache.flink.runtime.checkpoint.CheckpointFailureManager.checkFailureAgainstCounter(CheckpointFailureManager.java:212)
> ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> at
> org.apache.flink.runtime.checkpoint.CheckpointFailureManager.handleJobLevelCheckpointException(CheckpointFailureManager.java:169)
> ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> at
> org.apache.flink.runtime.checkpoint.CheckpointFailureManager.handleCheckpointException(CheckpointFailureManager.java:122)
> ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> at
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoint(CheckpointCoordinator.java:2281)
> ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> at
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoint(CheckpointCoordinator.java:2260)
> ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> at
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.access$1200(CheckpointCoordinator.java:102)
> ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> at
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator$CheckpointCanceller.run(CheckpointCoordinator.java:2346)
> ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> at
> org.apache.flink.util.MdcUtils.lambda$wrapRunnable$1(MdcUtils.java:64)
> ~[flink-core-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> ~[?:1.8.0_392]
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> ~[?:1.8.0_392]
> at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
> ~[?:1.8.0_392]
> at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
> ~[?:1.8.0_392]
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> ~[?:1.8.0_392]
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> ~[?:1.8.0_392]
> at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_392]
> 20:34:23,613 [flink-pekko.actor.default-dispatcher-130] INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Job
> Tumbling Window Test (15d0a663cb415b09b9a68ccc40640c6d) switched from state
> RUNNING to FAILING.
> org.apache.flink.runtime.JobException: Recovery is suppressed by
> FixedDelayRestartBackoffTimeStrategy(maxNumberRestartAttempts=1,
> backoffTimeMS=0)
> at
> org.apache.flink.runtime.executiongraph.failover.ExecutionFailureHandler.handleFailure(ExecutionFailureHandler.java:219)
> ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> at
> org.apache.flink.runtime.executiongraph.failover.ExecutionFailureHandler.handleFailureAndReport(ExecutionFailureHandler.java:166)
> ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> at
> org.apache.flink.runtime.executiongraph.failover.ExecutionFailureHandler.getGlobalFailureHandlingResult(ExecutionFailureHandler.java:140)
> ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> at
> org.apache.flink.runtime.scheduler.DefaultScheduler.handleGlobalFailure(DefaultScheduler.java:324)
> ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> at
> org.apache.flink.runtime.scheduler.UpdateSchedulerNgOnInternalFailuresListener.notifyGlobalFailure(UpdateSchedulerNgOnInternalFailuresListener.java:57)
> ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> at
> org.apache.flink.runtime.executiongraph.DefaultExecutionGraph.failGlobal(DefaultExecutionGraph.java:1092)
> ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> at
> org.apache.flink.runtime.executiongraph.DefaultExecutionGraph$1.lambda$failJob$0(DefaultExecutionGraph.java:477)
> ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> at
> org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.lambda$handleRunAsync$4(PekkoRpcActor.java:460)
> ~[flink-rpc-akka44fe192f-1cb4-4526-9c09-a5d6b8d70d87.jar:1.20-SNAPSHOT]
> at
> org.apache.flink.runtime.concurrent.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:68)
> ~[flink-rpc-core-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> <...>
> {code}
> Notably: {{org.apache.flink.util.FlinkRuntimeException: Exceeded checkpoint
> tolerable failure threshold. The latest checkpoint failed due to Checkpoint
> expired before completing., view the Checkpoint History tab or the Job
> Manager log to find out why continuous checkpoints failed.}}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)