[
https://issues.apache.org/jira/browse/FLINK-24960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17544256#comment-17544256
]
Niklas Semmler commented on FLINK-24960:
----------------------------------------
OK, so I think what happens here is roughly the following.

The test submits a packaged program. This program is invoked
[here|https://github.com/apache/flink/blob/c6997c97c575d334679915c328792b8a3067cfb5/flink-clients/src/main/java/org/apache/flink/client/ClientUtils.java#L114].
A thread-local context determines how the program is executed. If that
context is set up correctly, the program runs through a Yarn-specific code
path (using the {{YarnSessionClusterExecutorFactory}} and
{{YarnClusterDescriptor}}). In this case, the REST address of the job manager
is retrieved via {{YarnClusterDescriptor}}, and all is well.

If the context is not set up correctly, I _think_ the Yarn-specific code path
is not taken, and during program execution the REST address is read from the
default config file instead. This leads to the use of the wrong {{localhost}}
address.
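
To illustrate the mechanism described above, here is a minimal Java sketch. It
is not Flink's actual code: the class {{ThreadLocalContextSketch}}, its
methods, and both addresses are hypothetical stand-ins. It only shows how a
thread-local context can select a cluster-specific lookup, how a missing
context falls back to a default config value, and the general Java fact that a
plain {{ThreadLocal}} set in one thread is not visible in another. Whether that
last point plays any role in the failing runs is not established here.

```java
import java.util.function.Supplier;

// Hypothetical sketch, NOT Flink code: all names and values are assumptions.
final class ThreadLocalContextSketch {

    // Stand-in for the REST address found in the default config file.
    static final String DEFAULT_CONFIG_REST_ADDRESS = "localhost:8081";

    // Stand-in for the thread-local context that selects the execution path.
    private static final ThreadLocal<Supplier<String>> CONTEXT = new ThreadLocal<>();

    // Installs a cluster-specific address lookup for the calling thread only.
    static void setContext(Supplier<String> clusterAddressLookup) {
        CONTEXT.set(clusterAddressLookup);
    }

    // Removes the context from the calling thread.
    static void clearContext() {
        CONTEXT.remove();
    }

    // Uses the cluster-specific lookup if the calling thread has a context,
    // otherwise falls back to the default config value.
    static String resolveRestAddress() {
        Supplier<String> lookup = CONTEXT.get();
        return lookup != null ? lookup.get() : DEFAULT_CONFIG_REST_ADDRESS;
    }

    // Resolves the address from a freshly started thread, which does not
    // see the context set by the calling thread.
    static String resolveInNewThread() {
        String[] result = new String[1];
        Thread worker = new Thread(() -> result[0] = resolveRestAddress());
        worker.start();
        try {
            worker.join(); // join() makes the worker's write visible here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return result[0];
    }

    public static void main(String[] args) {
        setContext(() -> "jobmanager-host:12345"); // hypothetical address
        try {
            System.out.println("same thread:  " + resolveRestAddress());
            System.out.println("other thread: " + resolveInNewThread());
        } finally {
            clearContext();
        }
        System.out.println("no context:   " + resolveRestAddress());
    }
}
```

In this sketch, only the thread that called {{setContext}} gets the
cluster-specific address; any caller without a context silently receives the
{{localhost}} default rather than an error.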

Now the big question is: why is the Yarn-specific code path not set up
correctly in the failing executions? Again, I _think_ this has something to
do with the timing of the different threads, but exactly how is still unclear
to me.
> YARNSessionCapacitySchedulerITCase.testVCoresAreSetCorrectlyAndJobManagerHostnameAreShownInWebInterfaceAndDynamicPropertiesAndYarnApplicationNameAndTaskManagerSlots
> hangs on azure
> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: FLINK-24960
> URL: https://issues.apache.org/jira/browse/FLINK-24960
> Project: Flink
> Issue Type: Bug
> Components: Deployment / YARN
> Affects Versions: 1.15.0, 1.14.3
> Reporter: Yun Gao
> Assignee: Niklas Semmler
> Priority: Critical
> Labels: pull-request-available, test-stability
> Fix For: 1.16.0
>
>
> {code:java}
> Nov 18 22:37:08
> ================================================================================
> Nov 18 22:37:08 Test
> testVCoresAreSetCorrectlyAndJobManagerHostnameAreShownInWebInterfaceAndDynamicPropertiesAndYarnApplicationNameAndTaskManagerSlots(org.apache.flink.yarn.YARNSessionCapacitySchedulerITCase)
> is running.
> Nov 18 22:37:08
> --------------------------------------------------------------------------------
> Nov 18 22:37:25 22:37:25,470 [ main] INFO
> org.apache.flink.yarn.YARNSessionCapacitySchedulerITCase [] - Extracted
> hostname:port: 5718b812c7ab:38622
> Nov 18 22:52:36
> ==============================================================================
> Nov 18 22:52:36 Process produced no output for 900 seconds.
> Nov 18 22:52:36
> ==============================================================================
> {code}
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=26722&view=logs&j=f450c1a5-64b1-5955-e215-49cb1ad5ec88&t=cc452273-9efa-565d-9db8-ef62a38a0c10&l=36395