[ https://issues.apache.org/jira/browse/FLINK-29618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Matthias Pohl updated FLINK-29618:
----------------------------------
    Description:
We experienced a [build failure|https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=41931&view=logs&j=fc5181b0-e452-5c8f-68de-1097947f6483&t=995c650b-6573-581c-9ce6-7ad4cc038461&l=30284] that was caused (exclusively) by {{YARNSessionFIFOSecuredITCase.testDetachedMode}} running into a timeout. The test-specific logs, which were extracted from the build's logs, are attached to this Jira issue.

JUnit tries to stop the thread running the test but fails to do so because it's interrupting a sleep. The {{InterruptedException}} is not properly handled in [YarnTestBase:744|https://github.com/apache/flink/blob/573ed922346c791760d27653543c2b8df56f51f7/flink-yarn-tests/src/test/java/org/apache/flink/yarn/YarnTestBase.java#L744] (it doesn't forward the exception). Therefore, we only see the warning being logged after 60s:
{code}
11:33:51,124 [ForkJoinPool-1-worker-25] WARN org.apache.flink.yarn.YarnTestBase [] - Interruped
java.lang.InterruptedException: sleep interrupted
	at java.lang.Thread.sleep(Native Method) ~[?:1.8.0_292]
	at org.apache.flink.yarn.YarnTestBase.sleep(YarnTestBase.java:716) ~[test-classes/:?]
	at org.apache.flink.yarn.YarnTestBase.startWithArgs(YarnTestBase.java:906) ~[test-classes/:?]
	at org.apache.flink.yarn.YARNSessionFIFOITCase.runDetachedModeTest(YARNSessionFIFOITCase.java:141) ~[test-classes/:?]
	at org.apache.flink.yarn.YARNSessionFIFOSecuredITCase.lambda$testDetachedMode$2(YARNSessionFIFOSecuredITCase.java:173) ~[test-classes/:?]
	at org.apache.flink.yarn.YarnTestBase.runTest(YarnTestBase.java:288) ~[test-classes/:?]
	at org.apache.flink.yarn.YARNSessionFIFOSecuredITCase.testDetachedMode(YARNSessionFIFOSecuredITCase.java:160) ~[test-classes/:?]
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_292]
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_292]
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_292]
	at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_292]
[...]
{code}
The test code itself eventually continues and succeeds (despite the interruption). The job submission takes suspiciously long, though.

Removing the timeout from the test (as this is the desired approach for tests in general now) should solve this test instability.

  was:
We experienced a [build failure|https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=41931&view=logs&j=fc5181b0-e452-5c8f-68de-1097947f6483&t=995c650b-6573-581c-9ce6-7ad4cc038461&l=30284] that was caused (exclusively) by {{YARNSessionFIFOSecuredITCase.testDetachedMode}} running into a timeout. The actual issue might be that the test thread failed due to an {{InterruptedException}} while waiting for the job to be submitted:
{code}
11:33:51,124 [ForkJoinPool-1-worker-25] WARN org.apache.flink.yarn.YarnTestBase [] - Interruped
java.lang.InterruptedException: sleep interrupted
	at java.lang.Thread.sleep(Native Method) ~[?:1.8.0_292]
	at org.apache.flink.yarn.YarnTestBase.sleep(YarnTestBase.java:716) ~[test-classes/:?]
	at org.apache.flink.yarn.YarnTestBase.startWithArgs(YarnTestBase.java:906) ~[test-classes/:?]
	at org.apache.flink.yarn.YARNSessionFIFOITCase.runDetachedModeTest(YARNSessionFIFOITCase.java:141) ~[test-classes/:?]
	at org.apache.flink.yarn.YARNSessionFIFOSecuredITCase.lambda$testDetachedMode$2(YARNSessionFIFOSecuredITCase.java:173) ~[test-classes/:?]
	at org.apache.flink.yarn.YarnTestBase.runTest(YarnTestBase.java:288) ~[test-classes/:?]
	at org.apache.flink.yarn.YARNSessionFIFOSecuredITCase.testDetachedMode(YARNSessionFIFOSecuredITCase.java:160) ~[test-classes/:?]
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_292]
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_292]
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_292]
	at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_292]
[...]
{code}
The test-specific logs, which were extracted from the build's logs, are attached to this Jira issue.


> YARNSessionFIFOSecuredITCase.testDetachedMode timed out in Azure CI
> -------------------------------------------------------------------
>
>                 Key: FLINK-29618
>                 URL: https://issues.apache.org/jira/browse/FLINK-29618
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / YARN, Tests
>    Affects Versions: 1.17.0
>            Reporter: Matthias Pohl
>            Priority: Major
>              Labels: starter, test-stability
>         Attachments: build-20221012.7.YARNSessionFIFOSecuredITCase.testDetachedMode.log
>
>
> We experienced a [build
> failure|https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=41931&view=logs&j=fc5181b0-e452-5c8f-68de-1097947f6483&t=995c650b-6573-581c-9ce6-7ad4cc038461&l=30284]
> that was caused (exclusively) by
> {{YARNSessionFIFOSecuredITCase.testDetachedMode}} running into a timeout.
> The test-specific logs, which were extracted from the build's logs, are
> attached to this Jira issue.
> JUnit tries to stop the thread running the test but fails to do so because
> it's interrupting a sleep. The {{InterruptedException}} is not properly
> handled in
> [YarnTestBase:744|https://github.com/apache/flink/blob/573ed922346c791760d27653543c2b8df56f51f7/flink-yarn-tests/src/test/java/org/apache/flink/yarn/YarnTestBase.java#L744]
> (it doesn't forward the exception).
> Therefore, we only see the warning being logged after 60s:
> {code}
> 11:33:51,124 [ForkJoinPool-1-worker-25] WARN org.apache.flink.yarn.YarnTestBase [] - Interruped
> java.lang.InterruptedException: sleep interrupted
>       at java.lang.Thread.sleep(Native Method) ~[?:1.8.0_292]
>       at org.apache.flink.yarn.YarnTestBase.sleep(YarnTestBase.java:716) ~[test-classes/:?]
>       at org.apache.flink.yarn.YarnTestBase.startWithArgs(YarnTestBase.java:906) ~[test-classes/:?]
>       at org.apache.flink.yarn.YARNSessionFIFOITCase.runDetachedModeTest(YARNSessionFIFOITCase.java:141) ~[test-classes/:?]
>       at org.apache.flink.yarn.YARNSessionFIFOSecuredITCase.lambda$testDetachedMode$2(YARNSessionFIFOSecuredITCase.java:173) ~[test-classes/:?]
>       at org.apache.flink.yarn.YarnTestBase.runTest(YarnTestBase.java:288) ~[test-classes/:?]
>       at org.apache.flink.yarn.YARNSessionFIFOSecuredITCase.testDetachedMode(YARNSessionFIFOSecuredITCase.java:160) ~[test-classes/:?]
>       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_292]
>       at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_292]
>       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_292]
>       at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_292]
> [...]
> {code}
> The test code itself eventually continues and succeeds (despite the
> interruption). The job submission takes suspiciously long, though.
> Removing the timeout from the test (as this is the desired approach for tests
> in general now) should solve this test instability.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
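
For illustration, the interrupt handling that the description says is missing could look roughly like the sketch below. This is not the actual {{YarnTestBase}} code: the class {{InterruptAwarePolling}} and both method names are hypothetical, and the sketch only assumes a polling loop built around {{Thread.sleep}}. The idea is to either let the {{InterruptedException}} propagate, or restore the interrupt flag and stop waiting, instead of only logging a warning and continuing.

```java
import java.util.function.BooleanSupplier;

// Hypothetical helper, not part of Flink: shows two common ways to honor an
// interrupt inside a sleep-based polling loop.
final class InterruptAwarePolling {

    private InterruptAwarePolling() {}

    // Variant 1: forward the exception. Thread.sleep throws
    // InterruptedException, which simply propagates to the caller.
    static void waitUntil(BooleanSupplier condition, long pollIntervalMillis)
            throws InterruptedException {
        while (!condition.getAsBoolean()) {
            Thread.sleep(pollIntervalMillis);
        }
    }

    // Variant 2: for callers that cannot declare the checked exception.
    // Returns true if the condition was met, false if the wait was interrupted.
    static boolean waitUntilUnchecked(BooleanSupplier condition, long pollIntervalMillis) {
        try {
            waitUntil(condition, pollIntervalMillis);
            return true;
        } catch (InterruptedException e) {
            // Thread.sleep cleared the flag when it threw; set it again so
            // code further up the stack can still see the interrupt.
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        // Simulate the test runner interrupting the test thread.
        Thread.currentThread().interrupt();
        boolean completed = waitUntilUnchecked(() -> false, 50L);
        System.out.println(
                "completed=" + completed
                        + ", interrupted=" + Thread.currentThread().isInterrupted());
        // prints "completed=false, interrupted=true"
    }
}
```

With either variant the waiting loop ends as soon as the thread is interrupted, so a test-runner timeout would stop the test thread right away rather than having the interrupt swallowed by the sleep helper.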