[jira] [Comment Edited] (FLINK-23611) YARNSessionCapacitySchedulerITCase.testVCoresAreSetCorrectlyAndJobManagerHostnameAreShownInWebInterfaceAndDynamicPropertiesAndYarnApplicationNameAndTaskManagerSlots hangs on azure

Matthias (Jira) Wed, 25 Aug 2021 23:00:34 -0700


    [ https://issues.apache.org/jira/browse/FLINK-23611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17404949#comment-17404949 ]


Matthias edited comment on FLINK-23611 at 8/26/21, 5:59 AM:
------------------------------------------------------------

We ran into a timeout because the {{main}} thread is stuck in the YARN session cluster's 
[join|https://github.com/XComp/flink/blob/646ff2d36f40704f5dca017b8fffed78bd51b307/flink-yarn-tests/src/test/java/org/apache/flink/yarn/YARNSessionCapacitySchedulerITCase.java#L357]
 call, waiting for the runner thread to finish (see {{jps-traces.0}}):
{code:java}
"main" #1 prio=5 os_prio=0 tid=0x00007fedec00b800 nid=0x52d6 in Object.wait() [0x00007fedf5f8b000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at java.lang.Thread.join(Thread.java:1252)
        - locked <0x0000000095e06048> (a org.apache.flink.yarn.YarnTestBase$Runner)
        at java.lang.Thread.join(Thread.java:1326)
        at org.apache.flink.yarn.YARNSessionCapacitySchedulerITCase.lambda$testVCoresAreSetCorrectlyAndJobManagerHostnameAreShownInWebInterfaceAndDynamicPropertiesAndYarnApplicationNameAndTaskManagerSlots$3(YARNSessionCapacitySchedulerITCase.java:357)
        at org.apache.flink.yarn.YARNSessionCapacitySchedulerITCase$$Lambda$492/592858578.run(Unknown Source)
        at org.apache.flink.yarn.YarnTestBase.runTest(YarnTestBase.java:288)
        at org.apache.flink.yarn.YARNSessionCapacitySchedulerITCase.testVCoresAreSetCorrectlyAndJobManagerHostnameAreShownInWebInterfaceAndDynamicPropertiesAndYarnApplicationNameAndTaskManagerSlots(YARNSessionCapacitySchedulerITCase.java:293)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
[...] {code}
This {{join}} call should have returned because of the {{sendStop}} call that 
was triggered beforehand.

The job submission thread seems to have terminated: only one thread named 
"Frontend (CLI/YARN Client) runner thread (startWithArgs())." is listed.
{code:java}
"Frontend (CLI/YARN Client) runner thread (startWithArgs())." #1731 prio=5 os_prio=0 tid=0x00007fedee5b0800 nid=0x682b waiting on condition [0x00007fe9bc670000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at org.apache.flink.yarn.cli.FlinkYarnSessionCli.repStep(FlinkYarnSessionCli.java:934)
        at org.apache.flink.yarn.cli.FlinkYarnSessionCli.runInteractiveCli(FlinkYarnSessionCli.java:910)
        at org.apache.flink.yarn.cli.FlinkYarnSessionCli.run(FlinkYarnSessionCli.java:657)
        at org.apache.flink.yarn.YarnTestBase$Runner.run(YarnTestBase.java:1132) {code}
I have to do another round to double-check, but I suspect the stop call never 
reached the thread: the previous failure of the job submission Runner reset 
the System input/output streams, which also cut off the communication 
between the {{main}} thread and the YARN session cluster thread.
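
To make the suspected mechanism concrete, here is a minimal, self-contained sketch. It is not the Flink test code: the class {{StdinResetSketch}}, the {{PollingRunner}} thread and the {{stopArrives}} helper are hypothetical names, and the sketch simply assumes that the stop command travels over a pipe installed as {{System.in}} and that the runner polls whatever stream is currently installed there. How the real CLI obtains its input is not shown in the traces above. To keep the outcome deterministic, the sketch resets {{System.in}} before the runner starts polling.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical model of the suspected hang; none of these names exist in Flink.
final class StdinResetSketch {

    /** Polls the stream currently installed as System.in until it has read "stop". */
    static final class PollingRunner extends Thread {
        @Override
        public void run() {
            StringBuilder received = new StringBuilder();
            try {
                while (!isInterrupted()) {
                    InputStream in = System.in; // assumption: looked up on every poll
                    while (in.available() > 0) {
                        received.append((char) in.read());
                    }
                    if (received.toString().contains("stop")) {
                        return; // stop command arrived, terminate normally
                    }
                    Thread.sleep(20); // poll interval
                }
            } catch (IOException | InterruptedException e) {
                // either one ends the runner in this sketch
            }
        }
    }

    /**
     * Sends "stop" through a pipe that was installed as System.in and reports
     * whether the runner terminated within the join timeout.
     */
    static boolean stopArrives(boolean resetStdinBeforeStop) {
        InputStream originalIn = System.in;
        PollingRunner runner = new PollingRunner();
        runner.setDaemon(true);
        try {
            PipedOutputStream toRunner = new PipedOutputStream();
            System.setIn(new PipedInputStream(toRunner));
            if (resetStdinBeforeStop) {
                // Models another component resetting System.in after a failure.
                System.setIn(new ByteArrayInputStream(new byte[0]));
            }
            runner.start();
            toRunner.write("stop\n".getBytes(StandardCharsets.UTF_8));
            toRunner.flush();
            runner.join(500); // bounded join; an unbounded join() would hang here
            return !runner.isAlive();
        } catch (IOException | InterruptedException e) {
            throw new IllegalStateException(e);
        } finally {
            runner.interrupt(); // clean up the runner if the stop was lost
            System.setIn(originalIn);
        }
    }

    public static void main(String[] args) {
        System.out.println("intact pipe, runner stopped: " + stopArrives(false));
        System.out.println("reset stdin, runner stopped: " + stopArrives(true));
    }
}
```

With the pipe intact the runner sees the command and the join returns; once {{System.in}} points somewhere else, the command sits unread in the old pipe and only the timeout ends the wait. If this model matches the real failure, a bounded join in the test would at least turn the hang into a reportable failure.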


was (Author: mapohl):
We ran into a timeout because of the YARN Session cluster's 
[join|https://github.com/XComp/flink/blob/646ff2d36f40704f5dca017b8fffed78bd51b307/flink-yarn-tests/src/test/java/org/apache/flink/yarn/YARNSessionCapacitySchedulerITCase.java#L357]
 call waiting for the thread to finish (see {{jps-traces.0}}):
{code:java}
"main" #1 prio=5 os_prio=0 tid=0x00007fedec00b800 nid=0x52d6 in Object.wait() [0x00007fedf5f8b000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at java.lang.Thread.join(Thread.java:1252)
        - locked <0x0000000095e06048> (a org.apache.flink.yarn.YarnTestBase$Runner)
        at java.lang.Thread.join(Thread.java:1326)
        at org.apache.flink.yarn.YARNSessionCapacitySchedulerITCase.lambda$testVCoresAreSetCorrectlyAndJobManagerHostnameAreShownInWebInterfaceAndDynamicPropertiesAndYarnApplicationNameAndTaskManagerSlots$3(YARNSessionCapacitySchedulerITCase.java:357)
        at org.apache.flink.yarn.YARNSessionCapacitySchedulerITCase$$Lambda$492/592858578.run(Unknown Source)
        at org.apache.flink.yarn.YarnTestBase.runTest(YarnTestBase.java:288)
        at org.apache.flink.yarn.YARNSessionCapacitySchedulerITCase.testVCoresAreSetCorrectlyAndJobManagerHostnameAreShownInWebInterfaceAndDynamicPropertiesAndYarnApplicationNameAndTaskManagerSlots(YARNSessionCapacitySchedulerITCase.java:293)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
[...] {code}
This {{join}} call should have returned because of the {{sendStop}} call that 
was triggered beforehand.

I have to do another round to double-check, but I suspect the stop call never 
reached the thread: the previous failure of the job submission Runner reset 
the System input/output streams, which also cut off the communication 
between the {{main}} thread and the YARN session cluster thread.

> YARNSessionCapacitySchedulerITCase.testVCoresAreSetCorrectlyAndJobManagerHostnameAreShownInWebInterfaceAndDynamicPropertiesAndYarnApplicationNameAndTaskManagerSlots
>  hangs on azure
> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-23611
>                 URL: https://issues.apache.org/jira/browse/FLINK-23611
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / YARN
>    Affects Versions: 1.14.0, 1.12.5
>            Reporter: Xintong Song
>            Assignee: Matthias
>            Priority: Major
>              Labels: pull-request-available, test-stability
>             Fix For: 1.14.0, 1.12.6
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=21439&view=logs&j=245e1f2e-ba5b-5570-d689-25ae21e5302f&t=e7f339b2-a7c3-57d9-00af-3712d4b15354&l=28959



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
