[jira] [Comment Edited] (FLINK-31036) StateCheckpointedITCase timed out due to deadlock

Piotr Nowojski (Jira) Thu, 16 Feb 2023 08:52:16 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-31036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17689834#comment-17689834
 ]


Piotr Nowojski edited comment on FLINK-31036 at 2/16/23 4:51 PM:
-----------------------------------------------------------------

In this recent log I see two issues

# {{checkpoint 20}} is failing due to "Size of the state is larger than the 
maximum permitted memory-backed state. Size=5621456, maxSize=5242880. Consider 
using a different checkpoint storage, like the FileSystemCheckpointStorage"
# recovery from {{checkpoint 19}} is failing because 
"java.lang.RuntimeException: Test failed due to unexpected recovered state size 
0"



# Is probably caused by FLINK-26803, probably a benign configuration issue
# is just a minor bug/unsupported case in this test, since shortly before 
{{checkpoint 19}}, some tasks have finished. 
{{StateCheckpointedITCase.StringRichFilterFunction#restoreState}} simply 
doesn't support that. This test was created before FLIP-147 and doesn't expect 
the second failover caused by the 1.


was (Author: pnowojski):
In this recent log I see three issues

1. {{checkpoint 20}} is failing due to "Size of the state is larger than the 
maximum permitted memory-backed state. Size=5621456, maxSize=5242880. Consider 
using a different checkpoint storage, like the FileSystemCheckpointStorage"
2. recovery from {{checkpoint 19}} is failing because 
"java.lang.RuntimeException: Test failed due to unexpected recovered state size 
0"
3. test enters endless recovery loop from {{checkpoint 19}} but after some time 
(10 minutes?) it enters a deadlock with subtasks blocked on either requesting 
or releasing memory segments


1. Is probably caused by FLINK-26803, probably a benign configuration issue
2. is just a minor bug/unsupported case in this test, since shortly before 
{{checkpoint 19}}, some tasks have finished. 
{{StateCheckpointedITCase.StringRichFilterFunction#restoreState}} simply 
doesn't support that. This test was created before FLIP-147 and doesn't expect 
the second failover caused by the 1.
3. I don't know what's causing this atm.

The first reported failure I expect has hit the same problem, but from the 
stack trace it seems like the 3. has never happened in that case, so the logs 
grew waaayyy tooo large and that's why log upload has timed out.

> StateCheckpointedITCase timed out due to deadlock
> -------------------------------------------------
>
>                 Key: FLINK-31036
>                 URL: https://issues.apache.org/jira/browse/FLINK-31036
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.17.0
>            Reporter: Matthias Pohl
>            Assignee: Rui Fan
>            Priority: Blocker
>              Labels: test-stability
>         Attachments: image-2023-02-16-20-29-52-050.png
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=46023&view=logs&j=39d5b1d5-3b41-54dc-6458-1e2ddd1cdcf3&t=0c010d0c-3dec-5bf1-d408-7b18988b1b2b&l=10608
> {code}
> "Legacy Source Thread - Source: Custom Source -> Filter (6/12)#69980" 
> #13718026 prio=5 os_prio=0 tid=0x00007f05f44f0800 nid=0x128157 waiting on 
> condition [0x00007f059feef000]
>    java.lang.Thread.State: WAITING (parking)
>       at sun.misc.Unsafe.park(Native Method)
>       - parking to wait for  <0x00000000f0a974e8> (a 
> java.util.concurrent.CompletableFuture$Signaller)
>       at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>       at 
> java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1707)
>       at 
> java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
>       at 
> java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1742)
>       at 
> java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
>       at 
> org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestMemorySegmentBlocking(LocalBufferPool.java:384)
>       at 
> org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestBufferBuilderBlocking(LocalBufferPool.java:356)
>       at 
> org.apache.flink.runtime.io.network.partition.BufferWritingResultPartition.requestNewBufferBuilderFromPool(BufferWritingResultPartition.java:414)
>       at 
> org.apache.flink.runtime.io.network.partition.BufferWritingResultPartition.requestNewUnicastBufferBuilder(BufferWritingResultPartition.java:390)
>       at 
> org.apache.flink.runtime.io.network.partition.BufferWritingResultPartition.appendUnicastDataForRecordContinuation(BufferWritingResultPartition.java:328)
>       at 
> org.apache.flink.runtime.io.network.partition.BufferWritingResultPartition.emitRecord(BufferWritingResultPartition.java:161)
>       at 
> org.apache.flink.runtime.io.network.api.writer.RecordWriter.emit(RecordWriter.java:107)
>       at 
> org.apache.flink.runtime.io.network.api.writer.ChannelSelectorRecordWriter.emit(ChannelSelectorRecordWriter.java:55)
>       at 
> org.apache.flink.streaming.runtime.io.RecordWriterOutput.pushToRecordWriter(RecordWriterOutput.java:105)
>       at 
> org.apache.flink.streaming.runtime.io.RecordWriterOutput.collect(RecordWriterOutput.java:91)
>       at 
> org.apache.flink.streaming.runtime.io.RecordWriterOutput.collect(RecordWriterOutput.java:45)
>       at 
> org.apache.flink.streaming.api.operators.CountingOutput.collect(CountingOutput.java:59)
>       at 
> org.apache.flink.streaming.api.operators.CountingOutput.collect(CountingOutput.java:31)
>       at 
> org.apache.flink.streaming.api.operators.StreamFilter.processElement(StreamFilter.java:39)
>       at 
> org.apache.flink.streaming.runtime.io.RecordProcessorUtils$$Lambda$1311/1256184070.accept(Unknown
>  Source)
>       at 
> org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.pushToOperator(CopyingChainingOutput.java:75)
>       at 
> org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:50)
>       at 
> org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:29)
>       at 
> org.apache.flink.streaming.api.operators.StreamSourceContexts$ManualWatermarkContext.processAndCollect(StreamSourceContexts.java:418)
>       at 
> org.apache.flink.streaming.api.operators.StreamSourceContexts$WatermarkContext.collect(StreamSourceContexts.java:513)
>       - locked <0x00000000d55035c0> (a java.lang.Object)
>       at 
> org.apache.flink.streaming.api.operators.StreamSourceContexts$SwitchingOnClose.collect(StreamSourceContexts.java:103)
>       at 
> org.apache.flink.test.checkpointing.StateCheckpointedITCase$StringGeneratingSourceFunction.run(StateCheckpointedITCase.java:178)
>       - locked <0x00000000d55035c0> (a java.lang.Object)
>       at 
> org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:110)
>       at 
> org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:67)
>       at 
> org.apache.flink.streaming.runtime.tasks.SourceStreamTask$LegacySourceFunctionThread.run(SourceStreamTask.java:333)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Comment Edited] (FLINK-31036) StateCheckpointedITCase timed out due to deadlock

Reply via email to