[
https://issues.apache.org/jira/browse/FLINK-29618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Matthias Pohl updated FLINK-29618:
----------------------------------
Description:
We experienced a [build
failure|https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=41931&view=logs&j=fc5181b0-e452-5c8f-68de-1097947f6483&t=995c650b-6573-581c-9ce6-7ad4cc038461&l=30284]
that was caused (exclusively) by
{{YARNSessionFIFOSecuredITCase.testDetachedMode}} running into a timeout.
The test-specific logs, which were extracted from the build, are attached to
this Jira issue.
JUnit tries to stop the thread running the test but fails to do so because
the interrupt hits a sleep. The {{InterruptedException}} is not properly handled
in
[YarnTestBase:744|https://github.com/apache/flink/blob/573ed922346c791760d27653543c2b8df56f51f7/flink-yarn-tests/src/test/java/org/apache/flink/yarn/YarnTestBase.java#L744]
(it doesn't forward the exception). Therefore, we only see the warning being
logged after 60s:
{code}
11:33:51,124 [ForkJoinPool-1-worker-25] WARN
org.apache.flink.yarn.YarnTestBase [] - Interruped
java.lang.InterruptedException: sleep interrupted
at java.lang.Thread.sleep(Native Method) ~[?:1.8.0_292]
at org.apache.flink.yarn.YarnTestBase.sleep(YarnTestBase.java:716)
~[test-classes/:?]
at
org.apache.flink.yarn.YarnTestBase.startWithArgs(YarnTestBase.java:906)
~[test-classes/:?]
at
org.apache.flink.yarn.YARNSessionFIFOITCase.runDetachedModeTest(YARNSessionFIFOITCase.java:141)
~[test-classes/:?]
at
org.apache.flink.yarn.YARNSessionFIFOSecuredITCase.lambda$testDetachedMode$2(YARNSessionFIFOSecuredITCase.java:173)
~[test-classes/:?]
at org.apache.flink.yarn.YarnTestBase.runTest(YarnTestBase.java:288)
~[test-classes/:?]
at
org.apache.flink.yarn.YARNSessionFIFOSecuredITCase.testDetachedMode(YARNSessionFIFOSecuredITCase.java:160)
~[test-classes/:?]
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
~[?:1.8.0_292]
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
~[?:1.8.0_292]
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
~[?:1.8.0_292]
at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_292]
[...]
{code}
The test code itself eventually continues and succeeds (despite the
interruption). The job submission takes suspiciously long, though.
Removing the timeout from the test (as this is the desired approach for tests
in general now) should solve this test instability.
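To illustrate the interrupt handling described above, here is a minimal sketch. It is not the code in {{YarnTestBase}}: the class and method names are hypothetical, and the polling loop only stands in for a test that waits on a condition. It assumes that "forwarding" means restoring the thread's interrupt flag, which is one common way to propagate an interruption.

```java
/** Hypothetical demo class; none of these names exist in Flink. */
final class InterruptAwareSleep {

    /** Logs the interruption and returns, so the caller never learns about it. */
    static void sleepSwallowing(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            // Thread.sleep cleared the interrupt flag before throwing; it stays cleared here.
            System.out.println("WARN Interrupted: " + e.getMessage());
        }
    }

    /** Forwards the interruption by restoring the thread's interrupt flag. */
    static void sleepForwarding(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Polls a condition that never becomes true; returns the number of polls performed. */
    static int poll(boolean forwarding, int maxPolls) {
        int polls = 0;
        do {
            polls++;
            if (forwarding) {
                sleepForwarding(10);
            } else {
                sleepSwallowing(10);
            }
        } while (polls < maxPolls && !Thread.currentThread().isInterrupted());
        return polls;
    }

    /** Sets a pending interrupt (as a test runner enforcing a timeout would) and then polls. */
    static int pollsAfterInterrupt(boolean forwarding) {
        Thread.currentThread().interrupt();
        try {
            return poll(forwarding, 5);
        } finally {
            // Clear the flag so the demo does not leak an interrupt to its caller.
            Thread.interrupted();
        }
    }
}
```

With the swallowing variant, {{pollsAfterInterrupt(false)}} returns 5: the loop logs one warning and keeps polling until it reaches its own limit. With the forwarding variant, {{pollsAfterInterrupt(true)}} returns 1 because the loop sees the restored flag and stops.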
was:
We experienced a [build
failure|https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=41931&view=logs&j=fc5181b0-e452-5c8f-68de-1097947f6483&t=995c650b-6573-581c-9ce6-7ad4cc038461&l=30284]
that was caused (exclusively) by
{{YARNSessionFIFOSecuredITCase.testDetachedMode}} running into a timeout.
The actual issue might be that the test thread failed due to an
{{InterruptedException}} while waiting for the job to be submitted:
{code}
11:33:51,124 [ForkJoinPool-1-worker-25] WARN
org.apache.flink.yarn.YarnTestBase [] - Interruped
java.lang.InterruptedException: sleep interrupted
at java.lang.Thread.sleep(Native Method) ~[?:1.8.0_292]
at org.apache.flink.yarn.YarnTestBase.sleep(YarnTestBase.java:716)
~[test-classes/:?]
at
org.apache.flink.yarn.YarnTestBase.startWithArgs(YarnTestBase.java:906)
~[test-classes/:?]
at
org.apache.flink.yarn.YARNSessionFIFOITCase.runDetachedModeTest(YARNSessionFIFOITCase.java:141)
~[test-classes/:?]
at
org.apache.flink.yarn.YARNSessionFIFOSecuredITCase.lambda$testDetachedMode$2(YARNSessionFIFOSecuredITCase.java:173)
~[test-classes/:?]
at org.apache.flink.yarn.YarnTestBase.runTest(YarnTestBase.java:288)
~[test-classes/:?]
at
org.apache.flink.yarn.YARNSessionFIFOSecuredITCase.testDetachedMode(YARNSessionFIFOSecuredITCase.java:160)
~[test-classes/:?]
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
~[?:1.8.0_292]
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
~[?:1.8.0_292]
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
~[?:1.8.0_292]
at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_292]
[...]
{code}
The test-specific logs, which were extracted from the build, are attached to
this Jira issue.
> YARNSessionFIFOSecuredITCase.testDetachedMode timed out in Azure CI
> -------------------------------------------------------------------
>
> Key: FLINK-29618
> URL: https://issues.apache.org/jira/browse/FLINK-29618
> Project: Flink
> Issue Type: Bug
> Components: Deployment / YARN, Tests
> Affects Versions: 1.17.0
> Reporter: Matthias Pohl
> Priority: Major
> Labels: starter, test-stability
> Attachments:
> build-20221012.7.YARNSessionFIFOSecuredITCase.testDetachedMode.log
--
This message was sent by Atlassian Jira
(v8.20.10#820010)