[
https://issues.apache.org/jira/browse/FLINK-29618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17721655#comment-17721655
]
Matthias Pohl edited comment on FLINK-29618 at 5/11/23 7:21 AM:
----------------------------------------------------------------
I guess you're right. The test timed out after 60 seconds (which is where the
log warning about the {{InterruptedException}} occurred). But the test itself
continues because we're not forwarding the interrupt exception within the sleep
call (see
[YarnTestBase:744|https://github.com/apache/flink/blob/573ed922346c791760d27653543c2b8df56f51f7/flink-yarn-tests/src/test/java/org/apache/flink/yarn/YarnTestBase.java#L744])
and succeeds eventually.
Removing the timeout would have helped here to avoid test instabilities. Even
so, it would have been interesting to investigate why the job submission
took that long. Unfortunately, the build artifacts are already gone.
I updated the ticket's description.
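To illustrate why the test keeps running after the timeout fires, here is a minimal sketch of the two ways a sleep helper can react to an interrupt. This is not the actual {{YarnTestBase}} code: the class {{InterruptHandlingSketch}} and the helpers {{sleepSwallowing}}, {{sleepRestoring}} and {{interruptFlagSurvives}} are hypothetical names made up for this example. The only assumption taken from the ticket is that the sleep call logs the {{InterruptedException}} without forwarding it.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;

public class InterruptHandlingSketch {

    // Hypothetical helper: logs the interrupt and carries on.
    // Thread.sleep clears the interrupt flag when it throws, so the caller
    // can no longer tell that someone asked the thread to stop.
    static void sleepSwallowing(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            System.out.println("Interrupted (swallowed): " + e.getMessage());
        }
    }

    // Hypothetical alternative: re-set the interrupt flag so that callers
    // (or the next blocking call) still observe the interruption.
    static void sleepRestoring(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Runs the sleeper in a worker thread, interrupts that thread once it is
    // running, and reports whether the interrupt flag is still set after the
    // sleeper returned.
    static boolean interruptFlagSurvives(Runnable sleeper) throws InterruptedException {
        AtomicBoolean flagAfterSleep = new AtomicBoolean();
        CountDownLatch running = new CountDownLatch(1);
        Thread worker = new Thread(() -> {
            running.countDown();
            sleeper.run();
            flagAfterSleep.set(Thread.currentThread().isInterrupted());
        });
        worker.start();
        running.await();
        worker.interrupt(); // roughly what a test timeout does to the test thread
        worker.join();
        return flagAfterSleep.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("swallowing keeps flag: "
                + interruptFlagSurvives(() -> sleepSwallowing(5_000)));
        System.out.println("restoring keeps flag:  "
                + interruptFlagSurvives(() -> sleepRestoring(5_000)));
    }
}
```

With the swallowing variant the interrupt is lost after the warning is logged, so code following the sleep simply continues, which matches what we saw in the logs. Restoring the flag (or rethrowing) is one common way to let the interruption reach the caller; whether that is the right change for the test base is a separate question from removing the timeout.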
> YARNSessionFIFOSecuredITCase.testDetachedMode timed out in Azure CI
> -------------------------------------------------------------------
>
> Key: FLINK-29618
> URL: https://issues.apache.org/jira/browse/FLINK-29618
> Project: Flink
> Issue Type: Bug
> Components: Deployment / YARN, Tests
> Affects Versions: 1.17.0
> Reporter: Matthias Pohl
> Priority: Major
> Labels: starter, test-stability
> Attachments:
> build-20221012.7.YARNSessionFIFOSecuredITCase.testDetachedMode.log
>
>
> We experienced a [build
> failure|https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=41931&view=logs&j=fc5181b0-e452-5c8f-68de-1097947f6483&t=995c650b-6573-581c-9ce6-7ad4cc038461&l=30284]
> that was caused (exclusively) by
> {{YARNSessionFIFOSecuredITCase.testDetachedMode}} running into a timeout.
> The test-specific logs that were extracted from the build are attached to
> this Jira issue.
> JUnit tries to stop the thread running the test but fails to do so because
> it's interrupting a sleep. The {{InterruptedException}} is not properly
> handled in
> [YarnTestBase:744|https://github.com/apache/flink/blob/573ed922346c791760d27653543c2b8df56f51f7/flink-yarn-tests/src/test/java/org/apache/flink/yarn/YarnTestBase.java#L744]
> (it doesn't forward the exception). Therefore, we only see the warning being
> logged after 60s:
> {code}
> 11:33:51,124 [ForkJoinPool-1-worker-25] WARN
> org.apache.flink.yarn.YarnTestBase [] - Interruped
> java.lang.InterruptedException: sleep interrupted
> at java.lang.Thread.sleep(Native Method) ~[?:1.8.0_292]
> at org.apache.flink.yarn.YarnTestBase.sleep(YarnTestBase.java:716)
> ~[test-classes/:?]
> at
> org.apache.flink.yarn.YarnTestBase.startWithArgs(YarnTestBase.java:906)
> ~[test-classes/:?]
> at
> org.apache.flink.yarn.YARNSessionFIFOITCase.runDetachedModeTest(YARNSessionFIFOITCase.java:141)
> ~[test-classes/:?]
> at
> org.apache.flink.yarn.YARNSessionFIFOSecuredITCase.lambda$testDetachedMode$2(YARNSessionFIFOSecuredITCase.java:173)
> ~[test-classes/:?]
> at org.apache.flink.yarn.YarnTestBase.runTest(YarnTestBase.java:288)
> ~[test-classes/:?]
> at
> org.apache.flink.yarn.YARNSessionFIFOSecuredITCase.testDetachedMode(YARNSessionFIFOSecuredITCase.java:160)
> ~[test-classes/:?]
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> ~[?:1.8.0_292]
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> ~[?:1.8.0_292]
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> ~[?:1.8.0_292]
> at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_292]
> [...]
> {code}
> The test code itself eventually continues and succeeds (despite the
> interruption). The job submission takes suspiciously long, though.
> Removing the timeout from the test (as this is the desired approach for tests
> in general now) should solve this test instability.