[ 
https://issues.apache.org/jira/browse/FLINK-17849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17113079#comment-17113079
 ] 

Yang Wang commented on FLINK-17849:
-----------------------------------

When I checked the jobmanager.log of the failed test 
{{org.apache.flink.yarn.YARNHighAvailabilityITCase#testJobRecoversAfterKillingTaskManager}},
I found that the JobManager failed over because of a ZooKeeper client session 
timeout. The timeout is configured to 1000ms. There may have been a network 
problem at that time. This unexpected failover caused 
{{restClusterClient.getJobDetails}} to fail with a timeout exception.

 
{code:java}
2020-05-20 16:29:30,141 WARN  org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - Client session timed out, have not heard from server in 4001ms for sessionid 0x17232eac2850000
2020-05-20 16:29:30,141 INFO  org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - Client session timed out, have not heard from server in 4001ms for sessionid 0x17232eac2850000, closing socket connection and attempting reconnect
{code}

> YARNHighAvailabilityITCase hangs in Azure Pipelines CI
> ------------------------------------------------------
>
>                 Key: FLINK-17849
>                 URL: https://issues.apache.org/jira/browse/FLINK-17849
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / YARN
>    Affects Versions: 1.11.0
>            Reporter: Stephan Ewen
>            Priority: Blocker
>             Fix For: 1.11.0
>
>
> The test seems to hang for 15 minutes, then gets killed.
> Full logs: 
> https://dev.azure.com/sewen0794/19b23adf-d190-4fb4-ae6e-2e92b08923a3/_apis/build/builds/25/logs/121



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
