[
https://issues.apache.org/jira/browse/FLINK-31036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17689834#comment-17689834
]
Piotr Nowojski edited comment on FLINK-31036 at 2/16/23 4:51 PM:
-----------------------------------------------------------------
In this recent log I see two issues:
# {{checkpoint 20}} is failing due to "Size of the state is larger than the
maximum permitted memory-backed state. Size=5621456, maxSize=5242880. Consider
using a different checkpoint storage, like the FileSystemCheckpointStorage"
# recovery from {{checkpoint 19}} is failing because
"java.lang.RuntimeException: Test failed due to unexpected recovered state size
0"
Regarding their causes:
# The first issue is probably caused by FLINK-26803 and is probably a benign
configuration issue.
# The second issue is just a minor bug/unsupported case in this test: shortly
before {{checkpoint 19}}, some tasks had finished, and
{{StateCheckpointedITCase.StringRichFilterFunction#restoreState}} simply
doesn't support that. This test was created before FLIP-147 and doesn't expect
the second failover, which is caused by the first issue.
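To put the numbers from the first error message in perspective, here is a minimal, dependency-free sketch of the arithmetic. It does not reproduce Flink's actual check; the class name {{MemoryStateLimitCheck}} and the method {{overflowBytes}} are hypothetical and exist only for illustration. The only inputs taken from the log are Size=5621456 and maxSize=5242880.

```java
public class MemoryStateLimitCheck {

    // 5 MiB expressed in bytes; this equals the maxSize=5242880 reported in the log.
    public static final long MAX_STATE_SIZE_BYTES = 5L * 1024 * 1024;

    /**
     * Hypothetical helper: returns by how many bytes a state of the given size
     * exceeds the given limit, or 0 if it fits.
     */
    public static long overflowBytes(long stateSizeBytes, long maxSizeBytes) {
        return Math.max(0L, stateSizeBytes - maxSizeBytes);
    }

    public static void main(String[] args) {
        long stateSize = 5_621_456L; // Size=5621456 from the log message
        System.out.println("limit in bytes:    " + MAX_STATE_SIZE_BYTES);
        System.out.println("overflow in bytes: " + overflowBytes(stateSize, MAX_STATE_SIZE_BYTES));
    }
}
```

Under these assumptions the state is over the limit by a few hundred KiB, which fits the reading that this is a configuration matter rather than runaway state growth.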
was (Author: pnowojski):
In this recent log I see three issues
1. {{checkpoint 20}} is failing due to "Size of the state is larger than the
maximum permitted memory-backed state. Size=5621456, maxSize=5242880. Consider
using a different checkpoint storage, like the FileSystemCheckpointStorage"
2. recovery from {{checkpoint 19}} is failing because
"java.lang.RuntimeException: Test failed due to unexpected recovered state size
0"
3. test enters endless recovery loop from {{checkpoint 19}} but after some time
(10 minutes?) it enters a deadlock with subtasks blocked on either requesting
or releasing memory segments
1. is probably caused by FLINK-26803 and is probably a benign configuration
issue.
2. is just a minor bug/unsupported case in this test: shortly before
{{checkpoint 19}}, some tasks had finished, and
{{StateCheckpointedITCase.StringRichFilterFunction#restoreState}} simply
doesn't support that. This test was created before FLIP-147 and doesn't expect
the second failover, which is caused by issue 1.
3. I don't know what's causing this at the moment.
I expect the first reported failure hit the same problem, but from the stack
trace it seems that issue 3 never happened in that case, so the logs grew far
too large, which is why the log upload timed out.
> StateCheckpointedITCase timed out due to deadlock
> -------------------------------------------------
>
> Key: FLINK-31036
> URL: https://issues.apache.org/jira/browse/FLINK-31036
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Checkpointing
> Affects Versions: 1.17.0
> Reporter: Matthias Pohl
> Assignee: Rui Fan
> Priority: Blocker
> Labels: test-stability
> Attachments: image-2023-02-16-20-29-52-050.png
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=46023&view=logs&j=39d5b1d5-3b41-54dc-6458-1e2ddd1cdcf3&t=0c010d0c-3dec-5bf1-d408-7b18988b1b2b&l=10608
> {code}
> "Legacy Source Thread - Source: Custom Source -> Filter (6/12)#69980" #13718026 prio=5 os_prio=0 tid=0x00007f05f44f0800 nid=0x128157 waiting on condition [0x00007f059feef000]
> java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for <0x00000000f0a974e8> (a java.util.concurrent.CompletableFuture$Signaller)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1707)
> at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
> at java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1742)
> at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
> at org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestMemorySegmentBlocking(LocalBufferPool.java:384)
> at org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestBufferBuilderBlocking(LocalBufferPool.java:356)
> at org.apache.flink.runtime.io.network.partition.BufferWritingResultPartition.requestNewBufferBuilderFromPool(BufferWritingResultPartition.java:414)
> at org.apache.flink.runtime.io.network.partition.BufferWritingResultPartition.requestNewUnicastBufferBuilder(BufferWritingResultPartition.java:390)
> at org.apache.flink.runtime.io.network.partition.BufferWritingResultPartition.appendUnicastDataForRecordContinuation(BufferWritingResultPartition.java:328)
> at org.apache.flink.runtime.io.network.partition.BufferWritingResultPartition.emitRecord(BufferWritingResultPartition.java:161)
> at org.apache.flink.runtime.io.network.api.writer.RecordWriter.emit(RecordWriter.java:107)
> at org.apache.flink.runtime.io.network.api.writer.ChannelSelectorRecordWriter.emit(ChannelSelectorRecordWriter.java:55)
> at org.apache.flink.streaming.runtime.io.RecordWriterOutput.pushToRecordWriter(RecordWriterOutput.java:105)
> at org.apache.flink.streaming.runtime.io.RecordWriterOutput.collect(RecordWriterOutput.java:91)
> at org.apache.flink.streaming.runtime.io.RecordWriterOutput.collect(RecordWriterOutput.java:45)
> at org.apache.flink.streaming.api.operators.CountingOutput.collect(CountingOutput.java:59)
> at org.apache.flink.streaming.api.operators.CountingOutput.collect(CountingOutput.java:31)
> at org.apache.flink.streaming.api.operators.StreamFilter.processElement(StreamFilter.java:39)
> at org.apache.flink.streaming.runtime.io.RecordProcessorUtils$$Lambda$1311/1256184070.accept(Unknown Source)
> at org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.pushToOperator(CopyingChainingOutput.java:75)
> at org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:50)
> at org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:29)
> at org.apache.flink.streaming.api.operators.StreamSourceContexts$ManualWatermarkContext.processAndCollect(StreamSourceContexts.java:418)
> at org.apache.flink.streaming.api.operators.StreamSourceContexts$WatermarkContext.collect(StreamSourceContexts.java:513)
> - locked <0x00000000d55035c0> (a java.lang.Object)
> at org.apache.flink.streaming.api.operators.StreamSourceContexts$SwitchingOnClose.collect(StreamSourceContexts.java:103)
> at org.apache.flink.test.checkpointing.StateCheckpointedITCase$StringGeneratingSourceFunction.run(StateCheckpointedITCase.java:178)
> - locked <0x00000000d55035c0> (a java.lang.Object)
> at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:110)
> at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:67)
> at org.apache.flink.streaming.runtime.tasks.SourceStreamTask$LegacySourceFunctionThread.run(SourceStreamTask.java:333)
> {code}