[
https://issues.apache.org/jira/browse/FLINK-31036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17690125#comment-17690125
]
Rui Fan commented on FLINK-31036:
---------------------------------
Hi [~pnowojski] , thanks a lot for the analysis, sounds makes sense to me.
{quote}{{checkpoint 20}} is failing due to "Size of the state is larger than
the maximum permitted memory-backed state. Size=5621456, maxSize=5242880.
Consider using a different checkpoint storage, like the
FileSystemCheckpointStorage"
Is probably caused by FLINK-26803, probably a benign configuration issue
{quote}
The maxSize of memory checkpoint stream cannot be changed, and I see some
comments in the `JobManagerCheckpointStorage`, it means we cannot increase the
maxSize due to the limitation in the PRC side.
{code:java}
<p><b>WARNING:</b> Increasing the size of this value beyond the default value
({@value
* #DEFAULT_MAX_STATE_SIZE}) should be done with care. The checkpointed state
needs to be send
* to the JobManager via limited size RPC messages, and there and the JobManager
needs to be
* able to hold all aggregated state in its memory.{code}
Also, I think all tests may have this problem, especially some tests with high
parallelism. It is not introduced by FLINK-26803. If the test of the high
parallelism job does not merge the channel state file, this problem may also
exist.
So should we use the `FileSystemCheckpointStorage` instead of
`JobManagerCheckpointStorage`?
{quote}Recovery from {{checkpoint 19}} is failing because
"java.lang.RuntimeException: Test failed due to unexpected recovered state size
0"
Is just a minor bug/unsupported case in this test, since shortly before
{{{}checkpoint 19{}}}, some tasks have finished.
{{StateCheckpointedITCase.StringRichFilterFunction#restoreState}} simply
doesn't support that. This test was created before FLIP-147 and doesn't expect
the second failover caused by the 1.
{quote}
As I understand, when problem 1 is solved, this problem will also be solved
incidentally.
However, I'm not sure if these tests are robust after FLIP-147. When
OnceFailingAggregator throw exception after some source tasks are finished,
StateCheckpointedITCase will also fail(This is more likely to happen when the
dataset is small.). Therefore, I guess there may be many old checkpoint-related
tests that have similar risks after FLIP-147.
> StateCheckpointedITCase timed out
> ---------------------------------
>
> Key: FLINK-31036
> URL: https://issues.apache.org/jira/browse/FLINK-31036
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Checkpointing
> Affects Versions: 1.17.0
> Reporter: Matthias Pohl
> Assignee: Rui Fan
> Priority: Major
> Labels: test-stability
> Attachments: image-2023-02-16-20-29-52-050.png
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=46023&view=logs&j=39d5b1d5-3b41-54dc-6458-1e2ddd1cdcf3&t=0c010d0c-3dec-5bf1-d408-7b18988b1b2b&l=10608
> {code}
> "Legacy Source Thread - Source: Custom Source -> Filter (6/12)#69980"
> #13718026 prio=5 os_prio=0 tid=0x00007f05f44f0800 nid=0x128157 waiting on
> condition [0x00007f059feef000]
> java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for <0x00000000f0a974e8> (a
> java.util.concurrent.CompletableFuture$Signaller)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> at
> java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1707)
> at
> java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
> at
> java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1742)
> at
> java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
> at
> org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestMemorySegmentBlocking(LocalBufferPool.java:384)
> at
> org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestBufferBuilderBlocking(LocalBufferPool.java:356)
> at
> org.apache.flink.runtime.io.network.partition.BufferWritingResultPartition.requestNewBufferBuilderFromPool(BufferWritingResultPartition.java:414)
> at
> org.apache.flink.runtime.io.network.partition.BufferWritingResultPartition.requestNewUnicastBufferBuilder(BufferWritingResultPartition.java:390)
> at
> org.apache.flink.runtime.io.network.partition.BufferWritingResultPartition.appendUnicastDataForRecordContinuation(BufferWritingResultPartition.java:328)
> at
> org.apache.flink.runtime.io.network.partition.BufferWritingResultPartition.emitRecord(BufferWritingResultPartition.java:161)
> at
> org.apache.flink.runtime.io.network.api.writer.RecordWriter.emit(RecordWriter.java:107)
> at
> org.apache.flink.runtime.io.network.api.writer.ChannelSelectorRecordWriter.emit(ChannelSelectorRecordWriter.java:55)
> at
> org.apache.flink.streaming.runtime.io.RecordWriterOutput.pushToRecordWriter(RecordWriterOutput.java:105)
> at
> org.apache.flink.streaming.runtime.io.RecordWriterOutput.collect(RecordWriterOutput.java:91)
> at
> org.apache.flink.streaming.runtime.io.RecordWriterOutput.collect(RecordWriterOutput.java:45)
> at
> org.apache.flink.streaming.api.operators.CountingOutput.collect(CountingOutput.java:59)
> at
> org.apache.flink.streaming.api.operators.CountingOutput.collect(CountingOutput.java:31)
> at
> org.apache.flink.streaming.api.operators.StreamFilter.processElement(StreamFilter.java:39)
> at
> org.apache.flink.streaming.runtime.io.RecordProcessorUtils$$Lambda$1311/1256184070.accept(Unknown
> Source)
> at
> org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.pushToOperator(CopyingChainingOutput.java:75)
> at
> org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:50)
> at
> org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:29)
> at
> org.apache.flink.streaming.api.operators.StreamSourceContexts$ManualWatermarkContext.processAndCollect(StreamSourceContexts.java:418)
> at
> org.apache.flink.streaming.api.operators.StreamSourceContexts$WatermarkContext.collect(StreamSourceContexts.java:513)
> - locked <0x00000000d55035c0> (a java.lang.Object)
> at
> org.apache.flink.streaming.api.operators.StreamSourceContexts$SwitchingOnClose.collect(StreamSourceContexts.java:103)
> at
> org.apache.flink.test.checkpointing.StateCheckpointedITCase$StringGeneratingSourceFunction.run(StateCheckpointedITCase.java:178)
> - locked <0x00000000d55035c0> (a java.lang.Object)
> at
> org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:110)
> at
> org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:67)
> at
> org.apache.flink.streaming.runtime.tasks.SourceStreamTask$LegacySourceFunctionThread.run(SourceStreamTask.java:333)
> {code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)