[ 
https://issues.apache.org/jira/browse/FLINK-31133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691693#comment-17691693
 ] 

Roman Khachatryan edited comment on FLINK-31133 at 2/21/23 4:26 PM:
--------------------------------------------------------------------

I think the issue is actually PartiallyFinishedSourcesITCase.

It starts at 3:40
{code:java}
03:40:55,702 [                main] INFO  
org.apache.flink.runtime.operators.lifecycle.PartiallyFinishedSourcesITCase [] -
================================================================================
Test test[complex graph ALL_SUBTASKS, failover: true, strategy: 
full](org.apache.flink.runtime.operators.lifecycle.PartiallyFinishedSourcesITCase)
 is running.
{code}
then a checkpoint fails because of a timeout:
{code:java}
03:41:10,775 [ChangelogRetryScheduler-1] INFO  
org.apache.flink.changelog.fs.RetryingExecutor               [] - failed with 3 
attempts: Attempt 3 timed out after 1000ms
03:41:10,777 [AsyncOperations-thread-1] INFO  
org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable [] - 
transform-2-keyed (4/4)#0 - asynchronous part of checkpoint 2 could not be 
completed.
java.util.concurrent.CompletionException: java.io.IOException: 
java.util.concurrent.TimeoutException: Attempt 3 timed out after 1000ms
        at 
java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
 ~[?:1.8.0_292]
        at 
java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
 ~[?:1.8.0_292]
        at 
java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:607) 
~[?:1.8.0_292]
        at 
java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
 ~[?:1.8.0_292]
        at 
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) 
~[?:1.8.0_292]
        at 
java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
 ~[?:1.8.0_292]
        at 
org.apache.flink.changelog.fs.FsStateChangelogWriter$UploadCompletionListener.onFailure(FsStateChangelogWriter.java:383)
 ~[flink-dstl-dfs-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
        at 
org.apache.flink.changelog.fs.FsStateChangelogWriter.lambda$null$0(FsStateChangelogWriter.java:223)
 ~[flink-dstl-dfs-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
        at java.util.ArrayList.removeIf(ArrayList.java:1415) ~[?:1.8.0_292]
        at 
org.apache.flink.changelog.fs.FsStateChangelogWriter.lambda$handleUploadFailure$4(FsStateChangelogWriter.java:222)
 ~[flink-dstl-dfs-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
        at 
org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:50)
 ~[flink-streaming-java-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
        at 
org.apache.flink.streaming.runtime.tasks.mailbox.Mail.run(Mail.java:90) 
~[flink-streaming-java-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
        at 
org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMailsWhenDefaultActionUnavailable(MailboxProcessor.java:338)
 ~[flink-streaming-java-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
        at 
org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMail(MailboxProcessor.java:324)
 ~[flink-streaming-java-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
        at 
org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:201)
 ~[flink-streaming-java-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
        at 
org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:807)
 ~[flink-streaming-java-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
        at 
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:756) 
~[flink-streaming-java-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
        at 
org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:948)
 ~[flink-runtime-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
        at 
org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:927) 
~[flink-runtime-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
        at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:741) 
~[flink-runtime-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
        at org.apache.flink.runtime.taskmanager.Task.run(Task.java:563) 
~[flink-runtime-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_292]
Caused by: java.io.IOException: java.util.concurrent.TimeoutException: Attempt 
3 timed out after 1000ms
        at 
org.apache.flink.changelog.fs.FsStateChangelogWriter$UploadCompletionListener.onFailure(FsStateChangelogWriter.java:377)
 ~[flink-dstl-dfs-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
        ... 15 more
Caused by: java.util.concurrent.TimeoutException: Attempt 3 timed out after 
1000ms
        at 
org.apache.flink.changelog.fs.RetryingExecutor$RetriableActionAttempt.fmtError(RetryingExecutor.java:285)
 ~[flink-dstl-dfs-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
        at 
org.apache.flink.changelog.fs.RetryingExecutor$RetriableActionAttempt.lambda$scheduleTimeout$1(RetryingExecutor.java:280)
 ~[flink-dstl-dfs-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
        at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
~[?:1.8.0_292]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) 
~[?:1.8.0_292]
        at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
 ~[?:1.8.0_292]
        at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
 ~[?:1.8.0_292]
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
[?:1.8.0_292]
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
[?:1.8.0_292]
{code}
 

After which it runs normally, ~1.4K checkpoints succeed.

At 6:56, it finally reches no space left on device:
{code:java}
06:56:53,713 [jobmanager-io-thread-1] INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Completed 
checkpoint 1396 for job 1e2e4e86643f8249324da01fe5f8a04a (34948548049 bytes, 
checkpointDuration=6869 ms, finalizationTime=12 ms).
06:56:53,715 [    Checkpoint Timer] INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Triggering 
checkpoint 1397 (type=CheckpointType{name='Checkpoint', 
sharingFilesStrategy=FORWARD_BACKWARD}) @ 1676876213715 for job 
1e2e4e86643f8249324da01fe5f8a04a.
...
06:57:02,402 [AsyncOperations-thread-1] INFO  
org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable [] - 
MultipleInputOperator (1/4)#1 - asynchronous part of checkpoint 1397 could not 
be completed.
java.util.concurrent.ExecutionException: java.io.IOException: No space left on 
device
        at 
java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357) 
~[?:1.8.0_292]
        at 
java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908) 
~[?:1.8.0_292]
        at 
org.apache.flink.streaming.api.operators.OperatorSnapshotFinalizer.<init>(OperatorSnapshotFinalizer.java:66)
 ~[flink-streaming-java-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
        at 
org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.finalizeNonFinishedSnapshots(AsyncCheckpointRunnable.java:191)
 ~[flink-streaming-java-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
        at 
org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.run(AsyncCheckpointRunnable.java:124)
 [flink-streaming-java-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
[?:1.8.0_292]
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
[?:1.8.0_292]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_292]
Caused by: java.io.IOException: No space left on device
        at java.io.FileOutputStream.writeBytes(Native Method) ~[?:1.8.0_292]
        at java.io.FileOutputStream.write(FileOutputStream.java:326) 
~[?:1.8.0_292]
        at 
org.apache.flink.core.fs.local.LocalDataOutputStream.write(LocalDataOutputStream.java:68)
 ~[flink-core-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
        at 
org.apache.flink.core.fs.FSDataOutputStreamWrapper.write(FSDataOutputStreamWrapper.java:65)
 ~[flink-core-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
        at 
org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory$FsCheckpointStateOutputStream.write(FsCheckpointStreamFactory.java:296)
 ~[flink-runtime-1.15-SNAPSHOT.jar:1.15-SNAP  SHOT]
        at java.io.DataOutputStream.write(DataOutputStream.java:107) 
~[?:1.8.0_292]
        at java.io.FilterOutputStream.write(FilterOutputStream.java:97) 
~[?:1.8.0_292]
        at 
org.apache.flink.runtime.io.network.buffer.NetworkBuffer.getBytes(NetworkBuffer.java:397)
 ~[flink-runtime-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
        at 
org.apache.flink.shaded.netty4.io.netty.buffer.AbstractUnpooledSlicedByteBuf.getBytes(AbstractUnpooledSlicedByteBuf.java:392)
 ~[flink-shaded-netty-4.1.70.Final-15.0.jar:?]
        at 
org.apache.flink.shaded.netty4.io.netty.buffer.SlicedByteBuf.getBytes(SlicedByteBuf.java:26)
 ~[flink-shaded-netty-4.1.70.Final-15.0.jar:?]
        at 
org.apache.flink.shaded.netty4.io.netty.buffer.ReadOnlyByteBuf.getBytes(ReadOnlyByteBuf.java:264)
 ~[flink-shaded-netty-4.1.70.Final-15.0.jar:?]
        at 
org.apache.flink.runtime.checkpoint.channel.ChannelStateSerializerImpl.writeData(ChannelStateSerializer.java:164)
 ~[flink-runtime-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
        at 
org.apache.flink.runtime.checkpoint.channel.ChannelStateCheckpointWriter.lambda$write$2(ChannelStateCheckpointWriter.java:171)
 ~[flink-runtime-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
        at 
org.apache.flink.runtime.checkpoint.channel.ChannelStateCheckpointWriter.runWithChecks(ChannelStateCheckpointWriter.java:295)
 ~[flink-runtime-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
        at 
org.apache.flink.runtime.checkpoint.channel.ChannelStateCheckpointWriter.write(ChannelStateCheckpointWriter.java:165)
 ~[flink-runtime-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
        at 
org.apache.flink.runtime.checkpoint.channel.ChannelStateCheckpointWriter.writeInput(ChannelStateCheckpointWriter.java:138)
 ~[flink-runtime-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
        at 
org.apache.flink.runtime.checkpoint.channel.ChannelStateWriteRequest.lambda$write$0(ChannelStateWriteRequest.java:63)
 ~[flink-runtime-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
        at 
org.apache.flink.runtime.checkpoint.channel.ChannelStateWriteRequest.lambda$buildWriteRequest$2(ChannelStateWriteRequest.java:92)
 ~[flink-runtime-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
        at 
org.apache.flink.runtime.checkpoint.channel.CheckpointInProgressRequest.execute(ChannelStateWriteRequest.java:211)
 ~[flink-runtime-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
        at 
org.apache.flink.runtime.checkpoint.channel.ChannelStateWriteRequestDispatcherImpl.dispatchInternal(ChannelStateWriteRequestDispatcherImpl.java:85)
 ~[flink-runtime-1.15-SNAPSHOT.jar:1  .15-SNAPSHOT]
        at 
org.apache.flink.runtime.checkpoint.channel.ChannelStateWriteRequestDispatcherImpl.dispatch(ChannelStateWriteRequestDispatcherImpl.java:62)
 ~[flink-runtime-1.15-SNAPSHOT.jar:1.15-SNAP  SHOT]
        at 
org.apache.flink.runtime.checkpoint.channel.ChannelStateWriteRequestExecutorImpl.loop(ChannelStateWriteRequestExecutorImpl.java:96)
 ~[flink-runtime-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
        at 
org.apache.flink.runtime.checkpoint.channel.ChannelStateWriteRequestExecutorImpl.run(ChannelStateWriteRequestExecutorImpl.java:75)
 ~[flink-runtime-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
        ... 1 more
{code}
 

Some other tests failed as a result with serialization failures or "no such 
file"

 

I'm going to change the priority to Major because it's a pure test issue.


was (Author: roman_khachatryan):
I think the issue is actually PartiallyFinishedSourcesITCase.

It starts at 3:40
{code:java}
03:40:55,702 [                main] INFO  
org.apache.flink.runtime.operators.lifecycle.PartiallyFinishedSourcesITCase [] -
================================================================================
Test test[complex graph ALL_SUBTASKS, failover: true, strategy: 
full](org.apache.flink.runtime.operators.lifecycle.PartiallyFinishedSourcesITCase)
 is running.
{code}
then a checkpoint fails because of a timeout:
{code:java}
03:41:10,775 [ChangelogRetryScheduler-1] INFO  
org.apache.flink.changelog.fs.RetryingExecutor               [] - failed with 3 
attempts: Attempt 3 timed out after 1000ms
03:41:10,777 [AsyncOperations-thread-1] INFO  
org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable [] - 
transform-2-keyed (4/4)#0 - asynchronous part of checkpoint 2 could not be 
completed.
java.util.concurrent.CompletionException: java.io.IOException: 
java.util.concurrent.TimeoutException: Attempt 3 timed out after 1000ms
        at 
java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
 ~[?:1.8.0_292]
        at 
java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
 ~[?:1.8.0_292]
        at 
java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:607) 
~[?:1.8.0_292]
        at 
java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
 ~[?:1.8.0_292]
        at 
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) 
~[?:1.8.0_292]
        at 
java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
 ~[?:1.8.0_292]
        at 
org.apache.flink.changelog.fs.FsStateChangelogWriter$UploadCompletionListener.onFailure(FsStateChangelogWriter.java:383)
 ~[flink-dstl-dfs-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
        at 
org.apache.flink.changelog.fs.FsStateChangelogWriter.lambda$null$0(FsStateChangelogWriter.java:223)
 ~[flink-dstl-dfs-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
        at java.util.ArrayList.removeIf(ArrayList.java:1415) ~[?:1.8.0_292]
        at 
org.apache.flink.changelog.fs.FsStateChangelogWriter.lambda$handleUploadFailure$4(FsStateChangelogWriter.java:222)
 ~[flink-dstl-dfs-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
        at 
org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:50)
 ~[flink-streaming-java-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
        at 
org.apache.flink.streaming.runtime.tasks.mailbox.Mail.run(Mail.java:90) 
~[flink-streaming-java-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
        at 
org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMailsWhenDefaultActionUnavailable(MailboxProcessor.java:338)
 ~[flink-streaming-java-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
        at 
org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMail(MailboxProcessor.java:324)
 ~[flink-streaming-java-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
        at 
org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:201)
 ~[flink-streaming-java-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
        at 
org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:807)
 ~[flink-streaming-java-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
        at 
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:756) 
~[flink-streaming-java-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
        at 
org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:948)
 ~[flink-runtime-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
        at 
org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:927) 
~[flink-runtime-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
        at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:741) 
~[flink-runtime-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
        at org.apache.flink.runtime.taskmanager.Task.run(Task.java:563) 
~[flink-runtime-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_292]
Caused by: java.io.IOException: java.util.concurrent.TimeoutException: Attempt 
3 timed out after 1000ms
        at 
org.apache.flink.changelog.fs.FsStateChangelogWriter$UploadCompletionListener.onFailure(FsStateChangelogWriter.java:377)
 ~[flink-dstl-dfs-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
        ... 15 more
Caused by: java.util.concurrent.TimeoutException: Attempt 3 timed out after 
1000ms
        at 
org.apache.flink.changelog.fs.RetryingExecutor$RetriableActionAttempt.fmtError(RetryingExecutor.java:285)
 ~[flink-dstl-dfs-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
        at 
org.apache.flink.changelog.fs.RetryingExecutor$RetriableActionAttempt.lambda$scheduleTimeout$1(RetryingExecutor.java:280)
 ~[flink-dstl-dfs-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
        at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
~[?:1.8.0_292]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) 
~[?:1.8.0_292]
        at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
 ~[?:1.8.0_292]
        at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
 ~[?:1.8.0_292]
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
[?:1.8.0_292]
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
[?:1.8.0_292]
{code}
 

After which it runs normally, ~1.4K checkpoints succeed.

At 6:56, it finally reches no space left on device:
{code:java}
06:56:53,713 [jobmanager-io-thread-1] INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Completed 
checkpoint 1396 for job 1e2e4e86643f8249324da01fe5f8a04a (34948548049 bytes, 
checkpointDuration=6869 ms, finalizationTime=12 ms).
06:56:53,715 [    Checkpoint Timer] INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Triggering 
checkpoint 1397 (type=CheckpointType{name='Checkpoint', 
sharingFilesStrategy=FORWARD_BACKWARD}) @ 1676876213715 for job 
1e2e4e86643f8249324da01fe5f8a04a.
...
06:57:02,402 [AsyncOperations-thread-1] INFO  
org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable [] - 
MultipleInputOperator (1/4)#1 - asynchronous part of checkpoint 1397 could not 
be completed.
java.util.concurrent.ExecutionException: java.io.IOException: No space left on 
device
        at 
java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357) 
~[?:1.8.0_292]
        at 
java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908) 
~[?:1.8.0_292]
        at 
org.apache.flink.streaming.api.operators.OperatorSnapshotFinalizer.<init>(OperatorSnapshotFinalizer.java:66)
 ~[flink-streaming-java-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
        at 
org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.finalizeNonFinishedSnapshots(AsyncCheckpointRunnable.java:191)
 ~[flink-streaming-java-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
        at 
org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.run(AsyncCheckpointRunnable.java:124)
 [flink-streaming-java-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
[?:1.8.0_292]
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
[?:1.8.0_292]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_292]
Caused by: java.io.IOException: No space left on device
        at java.io.FileOutputStream.writeBytes(Native Method) ~[?:1.8.0_292]
        at java.io.FileOutputStream.write(FileOutputStream.java:326) 
~[?:1.8.0_292]
        at 
org.apache.flink.core.fs.local.LocalDataOutputStream.write(LocalDataOutputStream.java:68)
 ~[flink-core-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
        at 
org.apache.flink.core.fs.FSDataOutputStreamWrapper.write(FSDataOutputStreamWrapper.java:65)
 ~[flink-core-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
        at 
org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory$FsCheckpointStateOutputStream.write(FsCheckpointStreamFactory.java:296)
 ~[flink-runtime-1.15-SNAPSHOT.jar:1.15-SNAP  SHOT]
        at java.io.DataOutputStream.write(DataOutputStream.java:107) 
~[?:1.8.0_292]
        at java.io.FilterOutputStream.write(FilterOutputStream.java:97) 
~[?:1.8.0_292]
        at 
org.apache.flink.runtime.io.network.buffer.NetworkBuffer.getBytes(NetworkBuffer.java:397)
 ~[flink-runtime-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
        at 
org.apache.flink.shaded.netty4.io.netty.buffer.AbstractUnpooledSlicedByteBuf.getBytes(AbstractUnpooledSlicedByteBuf.java:392)
 ~[flink-shaded-netty-4.1.70.Final-15.0.jar:?]
        at 
org.apache.flink.shaded.netty4.io.netty.buffer.SlicedByteBuf.getBytes(SlicedByteBuf.java:26)
 ~[flink-shaded-netty-4.1.70.Final-15.0.jar:?]
        at 
org.apache.flink.shaded.netty4.io.netty.buffer.ReadOnlyByteBuf.getBytes(ReadOnlyByteBuf.java:264)
 ~[flink-shaded-netty-4.1.70.Final-15.0.jar:?]
        at 
org.apache.flink.runtime.checkpoint.channel.ChannelStateSerializerImpl.writeData(ChannelStateSerializer.java:164)
 ~[flink-runtime-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
        at 
org.apache.flink.runtime.checkpoint.channel.ChannelStateCheckpointWriter.lambda$write$2(ChannelStateCheckpointWriter.java:171)
 ~[flink-runtime-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
        at 
org.apache.flink.runtime.checkpoint.channel.ChannelStateCheckpointWriter.runWithChecks(ChannelStateCheckpointWriter.java:295)
 ~[flink-runtime-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
        at 
org.apache.flink.runtime.checkpoint.channel.ChannelStateCheckpointWriter.write(ChannelStateCheckpointWriter.java:165)
 ~[flink-runtime-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
        at 
org.apache.flink.runtime.checkpoint.channel.ChannelStateCheckpointWriter.writeInput(ChannelStateCheckpointWriter.java:138)
 ~[flink-runtime-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
        at 
org.apache.flink.runtime.checkpoint.channel.ChannelStateWriteRequest.lambda$write$0(ChannelStateWriteRequest.java:63)
 ~[flink-runtime-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
        at 
org.apache.flink.runtime.checkpoint.channel.ChannelStateWriteRequest.lambda$buildWriteRequest$2(ChannelStateWriteRequest.java:92)
 ~[flink-runtime-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
        at 
org.apache.flink.runtime.checkpoint.channel.CheckpointInProgressRequest.execute(ChannelStateWriteRequest.java:211)
 ~[flink-runtime-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
        at 
org.apache.flink.runtime.checkpoint.channel.ChannelStateWriteRequestDispatcherImpl.dispatchInternal(ChannelStateWriteRequestDispatcherImpl.java:85)
 ~[flink-runtime-1.15-SNAPSHOT.jar:1  .15-SNAPSHOT]
        at 
org.apache.flink.runtime.checkpoint.channel.ChannelStateWriteRequestDispatcherImpl.dispatch(ChannelStateWriteRequestDispatcherImpl.java:62)
 ~[flink-runtime-1.15-SNAPSHOT.jar:1.15-SNAP  SHOT]
        at 
org.apache.flink.runtime.checkpoint.channel.ChannelStateWriteRequestExecutorImpl.loop(ChannelStateWriteRequestExecutorImpl.java:96)
 ~[flink-runtime-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
        at 
org.apache.flink.runtime.checkpoint.channel.ChannelStateWriteRequestExecutorImpl.run(ChannelStateWriteRequestExecutorImpl.java:75)
 ~[flink-runtime-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
        ... 1 more
{code}
 

> AdaptiveSchedulerITCase took extraordinary long to finish
> ---------------------------------------------------------
>
>                 Key: FLINK-31133
>                 URL: https://issues.apache.org/jira/browse/FLINK-31133
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.15.3
>            Reporter: Matthias Pohl
>            Assignee: Roman Khachatryan
>            Priority: Major
>              Labels: test-stability
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=46299&view=logs&j=39d5b1d5-3b41-54dc-6458-1e2ddd1cdcf3&t=0c010d0c-3dec-5bf1-d408-7b18988b1b2b
> This build ran into a timeout. Based on the stacktraces reported, it was 
> either caused by 
> [SnapshotMigrationTestBase.restoreAndExecute|https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=46299&view=logs&j=39d5b1d5-3b41-54dc-6458-1e2ddd1cdcf3&t=0c010d0c-3dec-5bf1-d408-7b18988b1b2b&l=13475]:
> {code}
> "main" #1 prio=5 os_prio=0 tid=0x00007f23d800b800 nid=0x60cdd waiting on 
> condition [0x00007f23e1c0d000]
>    java.lang.Thread.State: TIMED_WAITING (sleeping)
>       at java.lang.Thread.sleep(Native Method)
>       at 
> org.apache.flink.test.checkpointing.utils.SnapshotMigrationTestBase.restoreAndExecute(SnapshotMigrationTestBase.java:382)
>       at 
> org.apache.flink.test.migration.TypeSerializerSnapshotMigrationITCase.testSnapshot(TypeSerializerSnapshotMigrationITCase.java:172)
>       at sun.reflect.GeneratedMethodAccessor47.invoke(Unknown Source)
> [...]
> {code}
> or 
> [PartiallyFinishedSourcesITCase.test|https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=46299&view=logs&j=39d5b1d5-3b41-54dc-6458-1e2ddd1cdcf3&t=0c010d0c-3dec-5bf1-d408-7b18988b1b2b&l=10401]:
> {code}
> 2023-02-20T07:13:05.6084711Z "main" #1 prio=5 os_prio=0 
> tid=0x00007fd35c00b800 nid=0x8c8a waiting on condition [0x00007fd363d0f000]
> 2023-02-20T07:13:05.6085149Z    java.lang.Thread.State: TIMED_WAITING 
> (sleeping)
> 2023-02-20T07:13:05.6085487Z  at java.lang.Thread.sleep(Native Method)
> 2023-02-20T07:13:05.6085925Z  at 
> org.apache.flink.runtime.testutils.CommonTestUtils.waitUntilCondition(CommonTestUtils.java:145)
> 2023-02-20T07:13:05.6086512Z  at 
> org.apache.flink.runtime.testutils.CommonTestUtils.waitUntilCondition(CommonTestUtils.java:138)
> 2023-02-20T07:13:05.6087103Z  at 
> org.apache.flink.runtime.testutils.CommonTestUtils.waitForSubtasksToFinish(CommonTestUtils.java:291)
> 2023-02-20T07:13:05.6087730Z  at 
> org.apache.flink.runtime.operators.lifecycle.TestJobExecutor.waitForSubtasksToFinish(TestJobExecutor.java:226)
> 2023-02-20T07:13:05.6088410Z  at 
> org.apache.flink.runtime.operators.lifecycle.PartiallyFinishedSourcesITCase.test(PartiallyFinishedSourcesITCase.java:138)
> 2023-02-20T07:13:05.6088957Z  at 
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> [...]
> {code}
> Still, it sounds odd: Based on a code analysis it's quite unlikely that those 
> two caused the issue. The former one has a 5 min timeout (see related code in 
> [SnapshotMigrationTestBase:382|https://github.com/apache/flink/blob/release-1.15/flink-tests/src/test/java/org/apache/flink/test/checkpointing/utils/SnapshotMigrationTestBase.java#L382]).
>  For the other one, we found it being not responsible in the past when some 
> other concurrent test caused the issue (see FLINK-30261).
> An investigation on where we lose the time for the timeout revealed that 
> {{AdaptiveSchedulerITCase}} took 2980s to finish (see [build 
> logs|https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=46299&view=logs&j=39d5b1d5-3b41-54dc-6458-1e2ddd1cdcf3&t=0c010d0c-3dec-5bf1-d408-7b18988b1b2b&l=5265]).
> {code}
> 2023-02-20T03:43:55.4546050Z Feb 20 03:43:55 [ERROR] Picked up 
> JAVA_TOOL_OPTIONS: -XX:+HeapDumpOnOutOfMemoryError
> 2023-02-20T03:43:58.0448506Z Feb 20 03:43:58 [INFO] Running 
> org.apache.flink.test.scheduling.AdaptiveSchedulerITCase
> 2023-02-20T04:33:38.6824634Z Feb 20 04:33:38 [INFO] Tests run: 6, Failures: 
> 0, Errors: 0, Skipped: 0, Time elapsed: 2,980.445 s - in 
> org.apache.flink.test.scheduling.AdaptiveSchedulerITCase
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to