[ 
https://issues.apache.org/jira/browse/FLINK-24960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17506780#comment-17506780
 ] 

Niklas Semmler commented on FLINK-24960:
----------------------------------------

[~mapohl] and I debugged this some more.

It looks like the external address of the rest server is set by the 
[YarnClusterDescriptor|https://github.com/apache/flink/blob/c16e4b4ce20704a0ad4387591894f13105d5e530/flink-yarn/src/main/java/org/apache/flink/yarn/YarnClusterDescriptor.java#L1801].
 In short, the MiniYarnCluster propagates the YarnClusterDescriptor into the 
execution process of the RestClusterClient. The address is then set via leader 
retrieval, but there is no actual leader retrieval taking place. Instead, the 
StandaloneHaServices returns the preconfigured rest server address.
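
To illustrate what "no actual leader retrieval" means, here is a minimal 
sketch of a retrieval service that simply hands back a preconfigured address. 
All class, interface, and method names are hypothetical stand-ins and not 
Flink's actual API, and the address is made up.

```java
/** Hypothetical sketch, not Flink code: "leader retrieval" without any election. */
final class FixedAddressRetrievalSketch {

    /** Hypothetical listener that is told the leader's address. */
    interface LeaderListener {
        void notifyLeaderAddress(String address);
    }

    /** Reports the address it was constructed with; nothing is discovered. */
    static final class FixedRetrieval {
        private final String preconfiguredAddress;

        FixedRetrieval(String preconfiguredAddress) {
            this.preconfiguredAddress = preconfiguredAddress;
        }

        void start(LeaderListener listener) {
            // No lookup and no election: the listener gets the configured value back.
            listener.notifyLeaderAddress(preconfiguredAddress);
        }
    }

    /** Returns whatever address the listener was notified with. */
    static String resolveRestAddress(String configuredAddress) {
        String[] holder = new String[1];
        new FixedRetrieval(configuredAddress).start(address -> holder[0] = address);
        return holder[0];
    }

    public static void main(String[] args) {
        // The address is a made-up example value.
        System.out.println(resolveRestAddress("http://example-host:8081"));
    }
}
```

In such a setup the client can only be as correct as the configured value, 
which is why the place where the address is written matters.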

In principle, this should never return a "localhost" address. To make future 
occurrences of this bug easier to debug, we are adding a PR that ensures that 
the log line in the code above is always printed. If it prints "localhost", 
then something is wrong with the address included in the YARN application 
report. If, instead, it prints an external address but the RestClusterClient 
still cannot connect, then we missed another place where this property is set. 
Finally, if the log line does not appear at all, then we need to figure out 
whether there is yet another code path.
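
The three cases above can be written down as a small triage helper. This is 
only a sketch: the class, the enum constants, and the two inputs (the host 
taken from the log line, if the line appeared, and whether the client managed 
to connect) are hypothetical and not part of Flink.

```java
import java.util.Optional;

/** Hypothetical triage helper for the three cases described above. */
final class RestAddressTriageSketch {

    enum Finding {
        YARN_REPORT_ADDRESS_SUSPECT, // the log line shows "localhost"
        PROPERTY_SET_ELSEWHERE,      // external address logged, client still cannot connect
        OTHER_CODE_PATH,             // the log line never appeared
        NO_FINDING                   // external address logged and the client connected
    }

    /**
     * @param loggedHost host from the log line, or empty if the line was not printed
     * @param clientConnected whether the REST client was able to connect
     */
    static Finding triage(Optional<String> loggedHost, boolean clientConnected) {
        if (!loggedHost.isPresent()) {
            return Finding.OTHER_CODE_PATH;
        }
        if (loggedHost.get().trim().equalsIgnoreCase("localhost")) {
            return Finding.YARN_REPORT_ADDRESS_SUSPECT;
        }
        return clientConnected ? Finding.NO_FINDING : Finding.PROPERTY_SET_ELSEWHERE;
    }

    public static void main(String[] args) {
        // Example inputs are made up.
        System.out.println(triage(Optional.of("localhost"), false));
        System.out.println(triage(Optional.of("example-host"), false));
        System.out.println(triage(Optional.empty(), false));
    }
}
```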

> YARNSessionCapacitySchedulerITCase.testVCoresAreSetCorrectlyAndJobManagerHostnameAreShownInWebInterfaceAndDynamicPropertiesAndYarnApplicationNameAndTaskManagerSlots
>  hangs on azure
> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-24960
>                 URL: https://issues.apache.org/jira/browse/FLINK-24960
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / YARN
>    Affects Versions: 1.15.0, 1.14.3
>            Reporter: Yun Gao
>            Assignee: Niklas Semmler
>            Priority: Critical
>              Labels: pull-request-available, test-stability
>             Fix For: 1.15.0
>
>
> {code:java}
> Nov 18 22:37:08 
> ================================================================================
> Nov 18 22:37:08 Test 
> testVCoresAreSetCorrectlyAndJobManagerHostnameAreShownInWebInterfaceAndDynamicPropertiesAndYarnApplicationNameAndTaskManagerSlots(org.apache.flink.yarn.YARNSessionCapacitySchedulerITCase)
>  is running.
> Nov 18 22:37:08 
> --------------------------------------------------------------------------------
> Nov 18 22:37:25 22:37:25,470 [                main] INFO  
> org.apache.flink.yarn.YARNSessionCapacitySchedulerITCase     [] - Extracted 
> hostname:port: 5718b812c7ab:38622
> Nov 18 22:52:36 
> ==============================================================================
> Nov 18 22:52:36 Process produced no output for 900 seconds.
> Nov 18 22:52:36 
> ==============================================================================
>  {code}
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=26722&view=logs&j=f450c1a5-64b1-5955-e215-49cb1ad5ec88&t=cc452273-9efa-565d-9db8-ef62a38a0c10&l=36395



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
