[ 
https://issues.apache.org/jira/browse/FLINK-24960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17506281#comment-17506281
 ] 

Niklas Semmler commented on FLINK-24960:
----------------------------------------

Anyhow, the failure alone doesn't explain why the test gets stuck.

We have three threads here:
# The outer thread running the test.
# A Flink cluster running the JobManager and TaskManager.
# A REST client thread that submits the job.

The REST client thread fails in the following steps:
# Contacting the REST server fails (it retries for about 1 minute)
# The job submission fails
# The Session interface returns a nonzero exit code
# An assertion failure is raised
# The assertion failure is caught and stored on the Runner class
# The inner (REST client) thread exits

The outer thread should in principle either re-throw the exception or fail due 
to the timeout (also 1 minute). There should be no way out of this thread that 
doesn't lead to an exception. Yet we see no indication that the outer thread 
acknowledges the failure of the inner thread. Instead, the outer thread hangs in 
the finally block of its try-finally, where it is unable to close the 
Flink cluster.
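
To make the expected control flow concrete, here is a minimal sketch of the pattern as I understand it. This is not the actual test code: RunnerSketch, runAndVerify and getFailure are names made up for illustration, and the Flink cluster is reduced to a plain AutoCloseable.

```java
import java.util.concurrent.TimeoutException;

public class RunnerSketch {

    /** Inner thread: catches any failure of its task and stores it for the outer thread. */
    static final class Runner extends Thread {
        private final Runnable task;
        private volatile Throwable failure;

        Runner(Runnable task) {
            this.task = task;
        }

        @Override
        public void run() {
            try {
                task.run();
            } catch (Throwable t) {
                // Stored, not propagated: the outer thread has to ask for it.
                failure = t;
            }
        }

        Throwable getFailure() {
            return failure;
        }
    }

    /** Outer thread: waits for the inner thread, then re-throws or times out. */
    static void runAndVerify(Runnable task, long timeoutMillis, AutoCloseable cluster)
            throws Exception {
        Runner runner = new Runner(task);
        try {
            runner.start();
            runner.join(timeoutMillis);
            if (runner.isAlive()) {
                runner.interrupt();
                throw new TimeoutException(
                        "inner thread did not finish within " + timeoutMillis + " ms");
            }
            Throwable failure = runner.getFailure();
            if (failure != null) {
                throw new AssertionError("inner thread failed", failure);
            }
        } finally {
            // The exception thrown above only reaches the caller once close() returns.
            // If close() blocks forever, the failure is never reported.
            cluster.close();
        }
    }
}
```

In this sketch the only way for the outer thread to avoid throwing is to never return from cluster.close() in the finally block, which matches the hang we observe.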

The JobManager logs, however, show that it never receives a stop signal. 
So the connection somehow seems to be severed?

> YARNSessionCapacitySchedulerITCase.testVCoresAreSetCorrectlyAndJobManagerHostnameAreShownInWebInterfaceAndDynamicPropertiesAndYarnApplicationNameAndTaskManagerSlots
>  hangs on azure
> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-24960
>                 URL: https://issues.apache.org/jira/browse/FLINK-24960
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / YARN
>    Affects Versions: 1.15.0, 1.14.3
>            Reporter: Yun Gao
>            Assignee: Niklas Semmler
>            Priority: Critical
>              Labels: test-stability
>             Fix For: 1.15.0
>
>
> {code:java}
> Nov 18 22:37:08 
> ================================================================================
> Nov 18 22:37:08 Test 
> testVCoresAreSetCorrectlyAndJobManagerHostnameAreShownInWebInterfaceAndDynamicPropertiesAndYarnApplicationNameAndTaskManagerSlots(org.apache.flink.yarn.YARNSessionCapacitySchedulerITCase)
>  is running.
> Nov 18 22:37:08 
> --------------------------------------------------------------------------------
> Nov 18 22:37:25 22:37:25,470 [                main] INFO  
> org.apache.flink.yarn.YARNSessionCapacitySchedulerITCase     [] - Extracted 
> hostname:port: 5718b812c7ab:38622
> Nov 18 22:52:36 
> ==============================================================================
> Nov 18 22:52:36 Process produced no output for 900 seconds.
> Nov 18 22:52:36 
> ==============================================================================
>  {code}
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=26722&view=logs&j=f450c1a5-64b1-5955-e215-49cb1ad5ec88&t=cc452273-9efa-565d-9db8-ef62a38a0c10&l=36395



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
