[ 
https://issues.apache.org/jira/browse/FLINK-11835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907117#comment-16907117
 ] 

Chesnay Schepler commented on FLINK-11835:
------------------------------------------

I'm still somewhat working on it. I checked out Florian's branch and could 
reproduce one instance of the problem.

What seems to have happened is that slot allocation failed on one of the 
first dispatchers, which caused the job to fail. Given that we shut down the 
ResourceManager in the test, it makes sense that this can happen. The job was 
then marked as done in ZK, so subsequent dispatchers never recovered it, and 
hence the job status cannot be queried later.
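
To make the sequence above concrete, here is a toy model of it. This is only 
a sketch: SharedStore and ToyDispatcher are hypothetical stand-ins invented 
for illustration, not Flink classes, and the real recovery logic is far more 
involved. It assumes a new leader recovers only jobs that are not marked as 
done in the shared store.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy model only: these classes are hypothetical and are not Flink's. */
public class LeaderChangeSketch {

    /** Stands in for the shared state kept in ZooKeeper. */
    static class SharedStore {
        final Set<String> submittedJobs = new HashSet<>();
        final Set<String> finishedJobs = new HashSet<>(); // jobs marked as done
    }

    /** Stands in for one dispatcher; it only knows jobs it recovered itself. */
    static class ToyDispatcher {
        final SharedStore store;
        final Map<String, String> knownJobs = new HashMap<>();

        ToyDispatcher(SharedStore store) {
            this.store = store;
        }

        /** On gaining leadership, recover every job that is not marked as done. */
        void grantLeadership() {
            for (String jobId : store.submittedJobs) {
                if (!store.finishedJobs.contains(jobId)) {
                    knownJobs.put(jobId, "RUNNING");
                }
            }
        }

        /** A job failure is terminal, so the job is marked as done in the store. */
        void failJob(String jobId) {
            knownJobs.put(jobId, "FAILED");
            store.finishedJobs.add(jobId);
        }

        /** Throws if this dispatcher never recovered the job. */
        String requestJobStatus(String jobId) {
            String status = knownJobs.get(jobId);
            if (status == null) {
                throw new IllegalStateException("Could not find job (" + jobId + ")");
            }
            return status;
        }
    }

    public static void main(String[] args) {
        SharedStore store = new SharedStore();
        store.submittedJobs.add("job-1");

        ToyDispatcher first = new ToyDispatcher(store);
        first.grantLeadership();
        first.failJob("job-1"); // e.g. because slot allocation failed

        ToyDispatcher second = new ToyDispatcher(store);
        second.grantLeadership(); // finds nothing to recover
        try {
            second.requestJobStatus("job-1");
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In this model the second dispatcher has no record of the job at all, which 
would match the FlinkJobNotFoundException the test sees when it queries the 
job status after the leader change.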

It is still unknown why the RM shutdown usually results in a JM failover but 
sometimes in a job failure.

> ZooKeeperLeaderElectionITCase#testJobExecutionOnClusterWithLeaderChange failed
> ------------------------------------------------------------------------------
>
>                 Key: FLINK-11835
>                 URL: https://issues.apache.org/jira/browse/FLINK-11835
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.8.0
>            Reporter: Gary Yao
>            Assignee: Chesnay Schepler
>            Priority: Critical
>              Labels: test-stability
>             Fix For: 1.10.0
>
>         Attachments: scratch_22.txt
>
>
> {noformat}
> 20:44:07.264 [ERROR] 
> testJobExecutionOnClusterWithLeaderChange(org.apache.flink.test.runtime.leaderelection.ZooKeeperLeaderElectionITCase)
>   Time elapsed: 4.625 s  <<< ERROR!
> java.util.concurrent.ExecutionException: 
> org.apache.flink.runtime.messages.FlinkJobNotFoundException: Could not find 
> Flink job (2e957dc4f49feaed042eb8b4a7932610)
>       at 
> org.apache.flink.test.runtime.leaderelection.ZooKeeperLeaderElectionITCase.testJobExecutionOnClusterWithLeaderChange(ZooKeeperLeaderElectionITCase.java:152)
> Caused by: org.apache.flink.runtime.messages.FlinkJobNotFoundException: Could 
> not find Flink job (2e957dc4f49feaed042eb8b4a7932610)
>       at 
> org.apache.flink.test.runtime.leaderelection.ZooKeeperLeaderElectionITCase.testJobExecutionOnClusterWithLeaderChange(ZooKeeperLeaderElectionITCase.java:149)
> {noformat}
> https://api.travis-ci.org/v3/job/502210892/log.txt



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
