[ 
https://issues.apache.org/jira/browse/FLINK-17640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17105899#comment-17105899
 ] 

Zhijiang commented on FLINK-17640:
----------------------------------

The purpose of this test is to guarantee that there are no deadlocks or buffer 
leaks when the following three actions race: 
 # The task main thread is processing the recovered state buffers.
 # The unspilling IO thread is reading the recovered state and inserting the 
buffers into the input channel queue.
 # The canceler thread is closing the `SingleInputGate` and releasing the 
`RecoveredInputChannel`.

All three can happen concurrently, so this test is necessary to verify the 
scenario. Ordinary unit tests cannot find the potential bugs in such a 
complicated scenario.
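
To make the three-way race above concrete, here is a minimal, self-contained 
sketch. It is not the Flink test: `ToyChannel`, `RaceSketch`, and all method 
names are hypothetical stand-ins, and a plain counter of outstanding buffers 
stands in for the real buffer pool. It only illustrates the shape of the 
check: run the three roles concurrently, require that all of them finish (no 
deadlock), and require that every allocated buffer was recycled (no leak).

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class RaceSketch {

    /** Hypothetical stand-in for a recovered input channel; not Flink's class. */
    static final class ToyChannel {
        private final ArrayDeque<byte[]> queue = new ArrayDeque<>();
        private final AtomicInteger outstanding; // allocated but not yet recycled
        private boolean released;                // guarded by the queue lock

        ToyChannel(AtomicInteger outstanding) {
            this.outstanding = outstanding;
        }

        /** IO thread: enqueue the buffer, or recycle it at once if already released. */
        void onRecoveredBuffer(byte[] buffer) {
            synchronized (queue) {
                if (!released) {
                    queue.add(buffer);
                    return;
                }
            }
            outstanding.decrementAndGet();
        }

        /** Task thread: returns null if the queue is empty, throws once released. */
        byte[] getNextBuffer() {
            synchronized (queue) {
                if (released) {
                    throw new IllegalStateException("released");
                }
                return queue.poll();
            }
        }

        /** Canceler thread: mark as released and recycle everything still queued. */
        void release() {
            synchronized (queue) {
                released = true;
                outstanding.addAndGet(-queue.size());
                queue.clear();
            }
        }
    }

    /** Runs the three roles concurrently and returns the number of leaked buffers. */
    static int runOnce(int numBuffers) {
        AtomicInteger outstanding = new AtomicInteger();
        ToyChannel channel = new ToyChannel(outstanding);
        ExecutorService pool = Executors.newFixedThreadPool(3);
        try {
            Callable<Void> io = () -> {
                for (int i = 0; i < numBuffers; i++) {
                    outstanding.incrementAndGet(); // "allocate" a buffer
                    channel.onRecoveredBuffer(new byte[8]);
                }
                return null;
            };
            Callable<Void> task = () -> {
                try {
                    for (int seen = 0; seen < numBuffers; ) {
                        if (channel.getNextBuffer() != null) {
                            outstanding.decrementAndGet(); // "recycle" after processing
                            seen++;
                        }
                    }
                } catch (IllegalStateException expected) {
                    // The channel was released while we were still reading.
                }
                return null;
            };
            Callable<Void> canceler = () -> {
                channel.release();
                return null;
            };
            // A timeout turns a deadlock into a failure instead of a hang.
            for (Future<Void> f :
                    pool.invokeAll(Arrays.asList(io, task, canceler), 30, TimeUnit.SECONDS)) {
                f.get();
            }
            return outstanding.get();
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("leaked buffers: " + runOnce(1000));
    }
}
```

In this sketch the reader thread accepts the release by exception type only, 
so the check does not depend on the wording of a message. Whether that 
approach fits the real test is an open question, not something the sketch 
settles.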

Actually, the initial version of this test was stable. After I picked up the 
commit from [~pnowojski]'s branch, which checks the `isReleased` state in 
`RecoveredInputChannel#getNextRecoveredStateBuffer`, the custom message 
changed, and I forgot to adjust the expected message in this test 
accordingly. If we change the expected message to "Trying to read from released 
RecoveredInputChannel", the test is still stable.

But as [~pnowojski] mentioned, it may be fragile to rely on this state to 
verify the result. I will think of another way to avoid relying on it.

> RecoveredInputChannelTest.testConcurrentReadStateAndProcessAndRelease() failed
> ------------------------------------------------------------------------------
>
>                 Key: FLINK-17640
>                 URL: https://issues.apache.org/jira/browse/FLINK-17640
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Network
>            Reporter: Arvid Heise
>            Priority: Blocker
>              Labels: test-stability
>             Fix For: 1.11.0
>
>
> Here is the instance 
> [https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=1093&view=logs&j=0da23115-68bb-5dcd-192c-bd4c8adebde1&t=4ed44b66-cdd6-5dcf-5f6a-88b07dda665d].
> Easy to reproduce locally by running the test a few hundred times.
> {noformat}
> java.util.concurrent.ExecutionException: java.lang.AssertionError
>       at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>       at java.util.concurrent.FutureTask.get(FutureTask.java:192)
>       at 
> org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannelTest.submitTasksAndWaitForResults(RemoteInputChannelTest.java:1228)
>       at 
> org.apache.flink.runtime.io.network.partition.consumer.RecoveredInputChannelTest.testConcurrentReadStateAndProcessAndRelease(RecoveredInputChannelTest.java:215)
>       at 
> org.apache.flink.runtime.io.network.partition.consumer.RecoveredInputChannelTest.testConcurrentReadStateAndProcessAndRelease(RecoveredInputChannelTest.java:82)
>       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>       at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>       at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>       at java.lang.reflect.Method.invoke(Method.java:498)
>       at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>       at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>       at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>       at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>       at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>       at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>       at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>       at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>       at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>       at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>       at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>       at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>       at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>       at org.junit.runners.Suite.runChild(Suite.java:128)
>       at org.junit.runners.Suite.runChild(Suite.java:27)
>       at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>       at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>       at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>       at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>       at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>       at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>       at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
>       at 
> com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68)
>       at 
> com.intellij.rt.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:33)
>       at 
> com.intellij.rt.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:230)
>       at com.intellij.rt.junit.JUnitStarter.main(JUnitStarter.java:58)
> Caused by: java.lang.AssertionError
>       at org.junit.Assert.fail(Assert.java:86)
>       at org.junit.Assert.assertTrue(Assert.java:41)
>       at org.junit.Assert.assertTrue(Assert.java:52)
>       at 
> org.apache.flink.runtime.io.network.partition.consumer.RecoveredInputChannelTest.lambda$processRecoveredBufferTask$1(RecoveredInputChannelTest.java:257)
>       at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>       at java.lang.Thread.run(Thread.java:748){noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
