[ 
https://issues.apache.org/jira/browse/FLINK-38967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18060023#comment-18060023
 ] 

Martijn Visser commented on FLINK-38967:
----------------------------------------

[~fcsaky] [~mateczagany] I saw a test instability that might be related to this one:

{code:java}
2026-02-21T06:06:24.3774754Z "Channel state writer" #135 daemon prio=5 os_prio=0 cpu=1223785.83ms elapsed=12432.27s tid=0x00007fa7f9c79000 nid=0x7d84 runnable  [0x00007fa7bbe7c000]
2026-02-21T06:06:24.3775409Z    java.lang.Thread.State: RUNNABLE
2026-02-21T06:06:24.3775894Z    at java.io.FileDescriptor.sync([email protected]/Native Method)
2026-02-21T06:06:24.3776547Z    at org.apache.flink.core.fs.local.LocalDataOutputStream.sync(LocalDataOutputStream.java:86)
2026-02-21T06:06:24.3777292Z    at org.apache.flink.core.fs.FSDataOutputStreamWrapper.sync(FSDataOutputStreamWrapper.java:50)
2026-02-21T06:06:24.3778247Z    at org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory$FsCheckpointStateOutputStream.sync(FsCheckpointStreamFactory.java:335)
2026-02-21T06:06:24.3779265Z    at org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory$FsCheckpointStateOutputStream.closeAndGetHandle(FsCheckpointStreamFactory.java:408)
2026-02-21T06:06:24.3780513Z    - locked <0x00000000ae9135b0> (a org.apache.flink.test.checkpointing.UnalignedCheckpointFailureHandlingITCase$FailingOnceFsCheckpointOutputStream)
2026-02-21T06:06:24.3781705Z    at org.apache.flink.test.checkpointing.UnalignedCheckpointFailureHandlingITCase$FailingOnceFsCheckpointOutputStream.closeAndGetHandle(UnalignedCheckpointFailureHandlingITCase.java:355)
2026-02-21T06:06:24.3782772Z    at org.apache.flink.runtime.checkpoint.channel.ChannelStateCheckpointWriter.finishWriteAndResult(ChannelStateCheckpointWriter.java:246)
2026-02-21T06:06:24.3783661Z    at org.apache.flink.runtime.checkpoint.channel.ChannelStateCheckpointWriter$$Lambda$2202/0x0000000100b1d440.run(Unknown Source)
2026-02-21T06:06:24.3784550Z    at org.apache.flink.runtime.checkpoint.channel.ChannelStateCheckpointWriter.doComplete(ChannelStateCheckpointWriter.java:255)
2026-02-21T06:06:24.3785481Z    at org.apache.flink.runtime.checkpoint.channel.ChannelStateCheckpointWriter.lambda$tryFinishResult$3(ChannelStateCheckpointWriter.java:236)
2026-02-21T06:06:24.3786369Z    at org.apache.flink.runtime.checkpoint.channel.ChannelStateCheckpointWriter$$Lambda$2201/0x0000000100b1d040.run(Unknown Source)
2026-02-21T06:06:24.3787217Z    at org.apache.flink.runtime.checkpoint.channel.ChannelStateCheckpointWriter.runWithChecks(ChannelStateCheckpointWriter.java:274)
2026-02-21T06:06:24.3788156Z    at org.apache.flink.runtime.checkpoint.channel.ChannelStateCheckpointWriter.tryFinishResult(ChannelStateCheckpointWriter.java:236)
2026-02-21T06:06:24.3789072Z    at org.apache.flink.runtime.checkpoint.channel.ChannelStateCheckpointWriter.completeInput(ChannelStateCheckpointWriter.java:208)
2026-02-21T06:06:24.3790097Z    at org.apache.flink.runtime.checkpoint.channel.ChannelStateWriteRequest.lambda$completeInput$0(ChannelStateWriteRequest.java:114)
2026-02-21T06:06:24.3791020Z    at org.apache.flink.runtime.checkpoint.channel.ChannelStateWriteRequest$$Lambda$2122/0x0000000100ad0840.accept(Unknown Source)
2026-02-21T06:06:24.3791919Z    at org.apache.flink.runtime.checkpoint.channel.CheckpointInProgressRequest.execute(ChannelStateWriteRequest.java:366)
2026-02-21T06:06:24.3792913Z    at org.apache.flink.runtime.checkpoint.channel.ChannelStateWriteRequestDispatcherImpl.handleCheckpointInProgressRequest(ChannelStateWriteRequestDispatcherImpl.java:175)
2026-02-21T06:06:24.3793991Z    at org.apache.flink.runtime.checkpoint.channel.ChannelStateWriteRequestDispatcherImpl.dispatchInternal(ChannelStateWriteRequestDispatcherImpl.java:125)
2026-02-21T06:06:24.3795010Z    at org.apache.flink.runtime.checkpoint.channel.ChannelStateWriteRequestDispatcherImpl.dispatch(ChannelStateWriteRequestDispatcherImpl.java:92)
2026-02-21T06:06:24.3796002Z    at org.apache.flink.runtime.checkpoint.channel.ChannelStateWriteRequestExecutorImpl.loop(ChannelStateWriteRequestExecutorImpl.java:182)
2026-02-21T06:06:24.3796950Z    at org.apache.flink.runtime.checkpoint.channel.ChannelStateWriteRequestExecutorImpl.run(ChannelStateWriteRequestExecutorImpl.java:136)
2026-02-21T06:06:24.3797829Z    at org.apache.flink.runtime.checkpoint.channel.ChannelStateWriteRequestExecutorImpl$$Lambda$1716/0x00000001009b8840.run(Unknown Source)
2026-02-21T06:06:24.3798516Z    at java.lang.Thread.run([email protected]/Thread.java:829)
{code}

https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=72555&view=logs&j=2c3cbe13-dee0-5837-cf47-3053da9a8a78&t=d102aafb-3bbd-55e4-a35f-e8935afffc31&l=43361

> UnalignedCheckpointFailureHandlingITCase fails intermittently
> -------------------------------------------------------------
>
>                 Key: FLINK-38967
>                 URL: https://issues.apache.org/jira/browse/FLINK-38967
>             Project: Flink
>          Issue Type: Improvement
>          Components: Tests
>            Reporter: Mate Czagany
>            Assignee: Mate Czagany
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 2.3.0
>
>
> {code:java}
> Jan 23 09:35:49 09:35:49.853 [ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 8.913 s <<< FAILURE! -- in org.apache.flink.test.checkpointing.UnalignedCheckpointFailureHandlingITCase
> Jan 23 09:35:49 09:35:49.853 [ERROR] org.apache.flink.test.checkpointing.UnalignedCheckpointFailureHandlingITCase.testCheckpointSuccessAfterFailure -- Time elapsed: 8.869 s <<< ERROR!
> Jan 23 09:35:49 java.util.concurrent.ExecutionException: org.apache.flink.runtime.checkpoint.CheckpointException: Asynchronous task checkpoint failed.
> Jan 23 09:35:49       at java.base/java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:396)
> Jan 23 09:35:49       at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2073)
> Jan 23 09:35:49       at org.apache.flink.test.checkpointing.UnalignedCheckpointFailureHandlingITCase.testCheckpointSuccessAfterFailure(UnalignedCheckpointFailureHandlingITCase.java:123)
> Jan 23 09:35:49       at java.base/java.lang.reflect.Method.invoke(Method.java:568)
> Jan 23 09:35:49 Caused by: org.apache.flink.runtime.checkpoint.CheckpointException: Asynchronous task checkpoint failed.
> Jan 23 09:35:49       at org.apache.flink.runtime.checkpoint.PendingCheckpoint.abort(PendingCheckpoint.java:561)
> Jan 23 09:35:49       at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoint(CheckpointCoordinator.java:2274)
> Jan 23 09:35:49       at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveDeclineMessage(CheckpointCoordinator.java:1175)
> Jan 23 09:35:49       at org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$declineCheckpoint$3(ExecutionGraphHandler.java:123)
> Jan 23 09:35:49       at org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$processCheckpointCoordinatorMessage$4(ExecutionGraphHandler.java:139)
> Jan 23 09:35:49       at org.apache.flink.util.MdcUtils.lambda$wrapRunnable$1(MdcUtils.java:70)
> Jan 23 09:35:49       at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
> Jan 23 09:35:49       at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
> Jan 23 09:35:49       at java.base/java.lang.Thread.run(Thread.java:833)
> Jan 23 09:35:49 Caused by: org.apache.flink.runtime.checkpoint.CheckpointException: org.apache.flink.runtime.checkpoint.CheckpointException: Asynchronous task checkpoint failed.
> Jan 23 09:35:49       at org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.handleExecutionException(AsyncCheckpointRunnable.java:320)
> Jan 23 09:35:49       at org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.run(AsyncCheckpointRunnable.java:155)
> Jan 23 09:35:49       ... 4 more
> Jan 23 09:35:49 Caused by: java.lang.Exception: java.lang.Exception: Could not materialize checkpoint 2 for operator Source: num-source (2/2)#0.
> Jan 23 09:35:49       at org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.handleExecutionException(AsyncCheckpointRunnable.java:298)
> Jan 23 09:35:49       ... 5 more
> Jan 23 09:35:49 Caused by: java.util.concurrent.ExecutionException: java.util.concurrent.ExecutionException: org.apache.flink.test.checkpointing.UnalignedCheckpointFailureHandlingITCase$TestException: failure from closeAndGetHandle
> Jan 23 09:35:49       at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122)
> Jan 23 09:35:49       at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:191)
> Jan 23 09:35:49       at org.apache.flink.util.concurrent.FutureUtils.runIfNotDoneAndGet(FutureUtils.java:511)
> Jan 23 09:35:49       at org.apache.flink.streaming.api.operators.OperatorSnapshotFinalizer.create(OperatorSnapshotFinalizer.java:60)
> Jan 23 09:35:49       at org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.finalizeNonFinishedSnapshots(AsyncCheckpointRunnable.java:192)
> Jan 23 09:35:49       at org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.run(AsyncCheckpointRunnable.java:124)
> Jan 23 09:35:49       ... 4 more
> Jan 23 09:35:49 Caused by: org.apache.flink.test.checkpointing.UnalignedCheckpointFailureHandlingITCase$TestException: org.apache.flink.test.checkpointing.UnalignedCheckpointFailureHandlingITCase$TestException: failure from closeAndGetHandle
> Jan 23 09:35:49       at org.apache.flink.test.checkpointing.UnalignedCheckpointFailureHandlingITCase$FailingOnceFsCheckpointOutputStream.closeAndGetHandle(UnalignedCheckpointFailureHandlingITCase.java:337)
> Jan 23 09:35:49       at org.apache.flink.runtime.state.DefaultOperatorStateBackendSnapshotStrategy.lambda$asyncSnapshot$2(DefaultOperatorStateBackendSnapshotStrategy.java:218)
> Jan 23 09:35:49       at org.apache.flink.runtime.state.SnapshotStrategyRunner$1.callInternal(SnapshotStrategyRunner.java:91)
> Jan 23 09:35:49       at org.apache.flink.runtime.state.SnapshotStrategyRunner$1.callInternal(SnapshotStrategyRunner.java:88)
> Jan 23 09:35:49       at org.apache.flink.runtime.state.AsyncSnapshotCallable.call(AsyncSnapshotCallable.java:78)
> Jan 23 09:35:49       at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
> Jan 23 09:35:49       at org.apache.flink.util.concurrent.FutureUtils.runIfNotDoneAndGet(FutureUtils.java:508)
> Jan 23 09:35:49       ... 7 more
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
