[
https://issues.apache.org/jira/browse/FLINK-31036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17689697#comment-17689697
]
Piotr Nowojski edited comment on FLINK-31036 at 2/16/23 11:13 AM:
------------------------------------------------------------------
I will try to take a quick look.
When trying to reproduce it locally, keep in mind that some test
parameters/configuration options are randomised, AFAIR based on the git
commit. Most likely one of the randomised parameters is whether unaligned
checkpoints are turned on or off, so if the failures happen in only one of
those modes, make sure you are using the same setting locally. If the
randomisation is based on the git commit hash, then it's probably best to
just loop the test on the same commit that failed in the CI.
Sometimes it also helps to stress the local machine much more, e.g. by
looping the same test in 4 or 8 runs in parallel.
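A minimal sketch of such a parallel loop, in case it helps. The class
StressLoop and its methods are made up for illustration and are not part of
Flink. The command that runs the test once is passed in as arguments, so
nothing here assumes a particular Maven or IDE invocation. A run that
exceeds the timeout is killed and counted as a failure, since a deadlock
shows up as a hang rather than as a non-zero exit code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of Flink.
class StressLoop {

    /** Runs the command once; a non-zero exit or a timeout (possible deadlock) counts as failure. */
    static boolean runOnce(List<String> command, long timeoutSeconds) throws Exception {
        Process process = new ProcessBuilder(command).inheritIO().start();
        if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            process.destroyForcibly(); // still running after the timeout: kill it and report failure
            return false;
        }
        return process.exitValue() == 0;
    }

    /**
     * Starts `workers` loops in parallel. Each loop repeats `singleRun` up to
     * `maxIterations` times and stops at its first failing run.
     * Returns the number of workers that saw a failure.
     */
    static int countFailingWorkers(Callable<Boolean> singleRun, int workers, int maxIterations)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<Boolean>> outcomes = new ArrayList<>();
            for (int w = 0; w < workers; w++) {
                outcomes.add(pool.submit(() -> {
                    for (int i = 0; i < maxIterations; i++) {
                        if (!singleRun.call()) {
                            return false;
                        }
                    }
                    return true;
                }));
            }
            int failing = 0;
            for (Future<Boolean> outcome : outcomes) {
                if (!outcome.get()) {
                    failing++;
                }
            }
            return failing;
        } finally {
            pool.shutdown();
        }
    }

    // Arguments: <workers> <maxIterations> <timeoutSeconds> <command that runs the test once...>
    public static void main(String[] args) throws Exception {
        int workers = Integer.parseInt(args[0]);
        int maxIterations = Integer.parseInt(args[1]);
        long timeoutSeconds = Long.parseLong(args[2]);
        List<String> command = Arrays.asList(args).subList(3, args.length);
        int failing = countFailingWorkers(() -> runOnce(command, timeoutSeconds), workers, maxIterations);
        System.out.println(failing + " of " + workers + " workers hit a failure or timeout");
    }
}
```

Run it from the checkout of the commit that failed in the CI, with 4 or 8
workers and a timeout somewhat above the normal duration of the test.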
was (Author: pnowojski):
I will take a look.
When trying to reproduce it locally, keep in mind that we have some
parameters/configuration randomisation implemented AFAIR based on the git
commit. Most likely one of the parameters that gets randomised is unaligned
checkpoints turned on/off, so if the failures are happening only in one of
those modes, make sure that locally you are using the same setting. If it's
randomised based on the git commit hash, then it's probably best to just loop
the test on the same commit that has failed in the CI.
Sometimes it also helps to stress the local machine much more, like loop the
same test 4 or 8 times running in parallel.
> StateCheckpointedITCase timed out due to deadlock
> -------------------------------------------------
>
> Key: FLINK-31036
> URL: https://issues.apache.org/jira/browse/FLINK-31036
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Checkpointing
> Affects Versions: 1.17.0
> Reporter: Matthias Pohl
> Assignee: Rui Fan
> Priority: Blocker
> Labels: test-stability
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=46023&view=logs&j=39d5b1d5-3b41-54dc-6458-1e2ddd1cdcf3&t=0c010d0c-3dec-5bf1-d408-7b18988b1b2b&l=10608
> {code}
> "Legacy Source Thread - Source: Custom Source -> Filter (6/12)#69980" #13718026 prio=5 os_prio=0 tid=0x00007f05f44f0800 nid=0x128157 waiting on condition [0x00007f059feef000]
>    java.lang.Thread.State: WAITING (parking)
>     at sun.misc.Unsafe.park(Native Method)
>     - parking to wait for <0x00000000f0a974e8> (a java.util.concurrent.CompletableFuture$Signaller)
>     at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>     at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1707)
>     at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
>     at java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1742)
>     at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
>     at org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestMemorySegmentBlocking(LocalBufferPool.java:384)
>     at org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestBufferBuilderBlocking(LocalBufferPool.java:356)
>     at org.apache.flink.runtime.io.network.partition.BufferWritingResultPartition.requestNewBufferBuilderFromPool(BufferWritingResultPartition.java:414)
>     at org.apache.flink.runtime.io.network.partition.BufferWritingResultPartition.requestNewUnicastBufferBuilder(BufferWritingResultPartition.java:390)
>     at org.apache.flink.runtime.io.network.partition.BufferWritingResultPartition.appendUnicastDataForRecordContinuation(BufferWritingResultPartition.java:328)
>     at org.apache.flink.runtime.io.network.partition.BufferWritingResultPartition.emitRecord(BufferWritingResultPartition.java:161)
>     at org.apache.flink.runtime.io.network.api.writer.RecordWriter.emit(RecordWriter.java:107)
>     at org.apache.flink.runtime.io.network.api.writer.ChannelSelectorRecordWriter.emit(ChannelSelectorRecordWriter.java:55)
>     at org.apache.flink.streaming.runtime.io.RecordWriterOutput.pushToRecordWriter(RecordWriterOutput.java:105)
>     at org.apache.flink.streaming.runtime.io.RecordWriterOutput.collect(RecordWriterOutput.java:91)
>     at org.apache.flink.streaming.runtime.io.RecordWriterOutput.collect(RecordWriterOutput.java:45)
>     at org.apache.flink.streaming.api.operators.CountingOutput.collect(CountingOutput.java:59)
>     at org.apache.flink.streaming.api.operators.CountingOutput.collect(CountingOutput.java:31)
>     at org.apache.flink.streaming.api.operators.StreamFilter.processElement(StreamFilter.java:39)
>     at org.apache.flink.streaming.runtime.io.RecordProcessorUtils$$Lambda$1311/1256184070.accept(Unknown Source)
>     at org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.pushToOperator(CopyingChainingOutput.java:75)
>     at org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:50)
>     at org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:29)
>     at org.apache.flink.streaming.api.operators.StreamSourceContexts$ManualWatermarkContext.processAndCollect(StreamSourceContexts.java:418)
>     at org.apache.flink.streaming.api.operators.StreamSourceContexts$WatermarkContext.collect(StreamSourceContexts.java:513)
>     - locked <0x00000000d55035c0> (a java.lang.Object)
>     at org.apache.flink.streaming.api.operators.StreamSourceContexts$SwitchingOnClose.collect(StreamSourceContexts.java:103)
>     at org.apache.flink.test.checkpointing.StateCheckpointedITCase$StringGeneratingSourceFunction.run(StateCheckpointedITCase.java:178)
>     - locked <0x00000000d55035c0> (a java.lang.Object)
>     at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:110)
>     at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:67)
>     at org.apache.flink.streaming.runtime.tasks.SourceStreamTask$LegacySourceFunctionThread.run(SourceStreamTask.java:333)
> {code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)