[ https://issues.apache.org/jira/browse/FLINK-17849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17113079#comment-17113079 ]
Yang Wang commented on FLINK-17849:
-----------------------------------

When I checked the jobmanager.log of the failed test {{org.apache.flink.yarn.YARNHighAvailabilityITCase#testJobRecoversAfterKillingTaskManager}}, I found that the JobManager failed over because of a ZooKeeper client timeout. The timeout is configured to 1000ms. Maybe something was wrong with the network at that time. This unexpected failover caused {{restClusterClient.getJobDetails}} to fail with a timeout exception.

{code:java}
2020-05-20 16:29:30,141 WARN org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - Client session timed out, have not heard from server in 4001ms for sessionid 0x17232eac2850000
2020-05-20 16:29:30,141 INFO org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - Client session timed out, have not heard from server in 4001ms for sessionid 0x17232eac2850000, closing socket connection and attempting reconnect
{code}

> YARNHighAvailabilityITCase hangs in Azure Pipelines CI
> ------------------------------------------------------
>
>                 Key: FLINK-17849
>                 URL: https://issues.apache.org/jira/browse/FLINK-17849
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / YARN
>   Affects Versions: 1.11.0
>            Reporter: Stephan Ewen
>            Priority: Blocker
>             Fix For: 1.11.0
>
>
> The test seems to hang for 15 minutes, then gets killed.
> Full logs: https://dev.azure.com/sewen0794/19b23adf-d190-4fb4-ae6e-2e92b08923a3/_apis/build/builds/25/logs/121

--
This message was sent by Atlassian Jira
(v8.3.4#803005)