[ 
https://issues.apache.org/jira/browse/FLINK-27169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17542043#comment-17542043
 ] 

Roman Khachatryan commented on FLINK-27169:
-------------------------------------------

Thanks for looking into the issue [~chesnay].
This is what I believe causes the test to hang:
 # Checkpoint 1 completes
 # Several subsequent checkpoints fail due to a timeout while writing changelog 
segments (all 3 configured attempts exhausted)
 # The job graph gets restarted due to the failure
 # The FINISH_SOURCES command gets lost as a result
 # TestJobExecutor then hangs in waitForSubtasksToFinish

I suspect the root cause is an intermittent failure of the local disk. I'm 
going to increase the timeout and the number of attempts in the test.
To prevent the test from hanging, I'm going to add a timeout (restarts cannot 
be disabled because they are required by the test scenario, and they cannot be 
detected easily).
To ease debugging, I'm going to raise the log level of TestJobExecutor to INFO.
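
For illustration only, a bounded wait could look roughly like the sketch below. The class and method names (BoundedWait, waitUntil) are hypothetical and are not the actual CommonTestUtils API; the point is only that the polling loop gives up with a TimeoutException after a deadline instead of sleeping forever.

```java
import java.time.Duration;
import java.util.concurrent.TimeoutException;
import java.util.function.BooleanSupplier;

// Hypothetical helper, not part of Flink: polls a condition with a deadline.
final class BoundedWait {
    private BoundedWait() {}

    /**
     * Polls the condition until it returns true.
     * Throws TimeoutException if the timeout elapses first.
     */
    static void waitUntil(BooleanSupplier condition, Duration timeout, Duration pollInterval)
            throws InterruptedException, TimeoutException {
        // nanoTime is monotonic, so the deadline is not affected by wall-clock changes.
        final long deadlineNanos = System.nanoTime() + timeout.toNanos();
        while (!condition.getAsBoolean()) {
            if (System.nanoTime() - deadlineNanos >= 0) {
                throw new TimeoutException("Condition not met within " + timeout);
            }
            Thread.sleep(pollInterval.toMillis());
        }
    }
}
```

With something like this, a lost FINISH_SOURCES command would surface as a test failure after the timeout rather than as a hung build.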

 

I'll open a PR with the above changes.

> PartiallyFinishedSourcesITCase.test hangs on azure
> --------------------------------------------------
>
>                 Key: FLINK-27169
>                 URL: https://issues.apache.org/jira/browse/FLINK-27169
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.16.0
>            Reporter: Yun Gao
>            Assignee: Roman Khachatryan
>            Priority: Major
>              Labels: test-stability
>
> {code:java}
> Apr 10 08:32:18 "main" #1 prio=5 os_prio=0 tid=0x00007f553400b800 nid=0x8345 waiting on condition [0x00007f553be60000]
> Apr 10 08:32:18    java.lang.Thread.State: TIMED_WAITING (sleeping)
> Apr 10 08:32:18       at java.lang.Thread.sleep(Native Method)
> Apr 10 08:32:18       at org.apache.flink.runtime.testutils.CommonTestUtils.waitUntilCondition(CommonTestUtils.java:145)
> Apr 10 08:32:18       at org.apache.flink.runtime.testutils.CommonTestUtils.waitUntilCondition(CommonTestUtils.java:138)
> Apr 10 08:32:18       at org.apache.flink.runtime.testutils.CommonTestUtils.waitForSubtasksToFinish(CommonTestUtils.java:291)
> Apr 10 08:32:18       at org.apache.flink.runtime.operators.lifecycle.TestJobExecutor.waitForSubtasksToFinish(TestJobExecutor.java:226)
> Apr 10 08:32:18       at org.apache.flink.runtime.operators.lifecycle.PartiallyFinishedSourcesITCase.test(PartiallyFinishedSourcesITCase.java:138)
> Apr 10 08:32:18       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> Apr 10 08:32:18       at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> Apr 10 08:32:18       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> Apr 10 08:32:18       at java.lang.reflect.Method.invoke(Method.java:498)
> Apr 10 08:32:18       at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
> Apr 10 08:32:18       at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> Apr 10 08:32:18       at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
> Apr 10 08:32:18       at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> Apr 10 08:32:18       at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
> Apr 10 08:32:18       at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
> Apr 10 08:32:18       at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:54)
> Apr 10 08:32:18       at org.apache.flink.util.TestNameProvider$1.evaluate(TestNameProvider.java:45)
> Apr 10 08:32:18       at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:61)
> Apr 10 08:32:18       at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
> Apr 10 08:32:18       at org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
> Apr 10 08:32:18       at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)
> Apr 10 08:32:18       at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
> Apr 10 08:32:18       at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
> Apr 10 08:32:18       at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
> Apr 10 08:32:18       at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
> Apr 10 08:32:18       at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
> Apr 10 08:32:18       at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
> Apr 10 08:32:18       at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
> Apr 10 08:32:18       at org.junit.runners.ParentRunner.run(ParentRunner.java:413)
> Apr 10 08:32:18       at org.junit.runners.Suite.runChild(Suite.java:128)
> Apr 10 08:32:18       at org.junit.runners.Suite.runChild(Suite.java:27)
> Apr 10 08:32:18       at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
> {code}
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=34484&view=logs&j=39d5b1d5-3b41-54dc-6458-1e2ddd1cdcf3&t=0c010d0c-3dec-5bf1-d408-7b18988b1b2b&l=6757



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
