[ 
https://issues.apache.org/jira/browse/FLINK-15150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17016965#comment-17016965
 ] 

Chesnay Schepler edited comment on FLINK-15150 at 1/16/20 2:01 PM:
-------------------------------------------------------------------

What seems to happen is that, while JobMaster is in the process of allocating 
slots, the test is calling {{Dispatcher#shutDownCluster}}. The dispatcher 
starts shutting down components, but the JobMaster is _not_ shut down first.
This results in a job failure since the slotmanager shuts down during the 
allocation, which fails the allocation, which fails the task, which fails the 
job.
Job restarts are forbidden by the restart strategy, so the job ends up in a 
terminal state, and subsequent dispatchers don't recover the job.

MInimized log with some additional log messages added:
{code}
...
18660 [dispatcher-5] INFO  o.a.f.r.dispatcher.StandaloneDispatcher  - 
Submitting job 411378b9f7181f9ba431960b08b701d7 (Blocking test job).
18660 [dispatcher-5] DEBUG o.a.f.r.jobmanager.ZooKeeperJobGraphStore  - Adding 
job graph 411378b9f7181f9ba431960b08b701d7 to 
flink/default/jobgraphs/411378b9f7181f9ba431960b08b701d7.
18670 [dispatcher-5] INFO  o.a.f.r.jobmanager.ZooKeeperJobGraphStore  - Added 
JobGraph(jobId: 411378b9f7181f9ba431960b08b701d7) to ZooKeeper.
18670 [dispatcher-2] INFO  o.a.f.r.rpc.akka.AkkaRpcService  - Starting RPC 
endpoint for o.a.f.r.jobmaster.JobMaster at akka://flink/user/jobmanager_4 .
...
18700 [main-EventThread] INFO  o.a.f.r.jobmaster.JobManagerRunnerImpl  - 
JobManager runner for job Blocking test job (411378b9f7181f9ba431960b08b701d7) 
was granted leadership with session id 3ac17f36-f3d9-4b63-ae1e-18572db0cb23 at 
akka://flink/user/jobmanager_4.
18700 [main-EventThread] DEBUG 
o.a.f.r.highavailability.zookeeper.ZooKeeperRunningJobsRegistry  - Setting 
scheduling state to RUNNING for job 411378b9f7181f9ba431960b08b701d7.
...
18720 [dispatcher-4] INFO  o.a.f.r.executiongraph.ExecutionGraph  - Job 
Blocking test job (411378b9f7181f9ba431960b08b701d7) switched from state 
CREATED to RUNNING.
18720 [main] INFO  
org.apache.flink.test.runtime.leaderelection.ZooKeeperLeaderElectionITCase  - 
Initiated cluster shutdown.
...
18736 [dispatcher-2] INFO  o.a.f.r.jobmaster.JobMaster  - JobManager 
successfully registered at ResourceManager, leader id: 
a64a7f6db3146cf378a7d898cf6044d1.
18736 [dispatcher-2] INFO  o.a.f.r.jobmaster.slotpool.SlotPoolImpl  - 
Requesting new slot [SlotRequestId{7a9512bdaf255cc687212211142abd4b}] and 
profile ResourceProfile{UNKNOWN} from resource manager.
18736 [dispatcher-3] INFO  o.a.f.r.resourcemanager.StandaloneResourceManager  - 
Request slot with profile ResourceProfile{UNKNOWN} for job 
411378b9f7181f9ba431960b08b701d7 with allocation id 
0e5171488810f0cc1c3265e9b278bb17.
...
18740 [dispatcher-3] INFO  o.a.f.r.resourcemanager.slotmanager.SlotManagerImpl  
- Suspending the SlotManager.
18740 [dispatcher-5] INFO  o.a.f.r.jobmaster.slotpool.SlotPoolImpl  - Failing 
pending slot request [SlotRequestId{7a9512bdaf255cc687212211142abd4b}]: The 
slot manager is being suspended.
...
18750 [dispatcher-5] INFO  o.a.f.r.executiongraph.ExecutionGraph  - blocking 
operator (1/1) (927b48379633675aad810f12af2470a6) switched from SCHEDULED to 
FAILED.
java.util.concurrent.CompletionException: 
o.a.f.r.resourcemanager.slotmanager.SlotManagerException: The slot manager is 
being suspended.
...
18750 [dispatcher-5] INFO  o.a.f.r.executiongraph.ExecutionGraph  - Job 
Blocking test job (411378b9f7181f9ba431960b08b701d7) switched from state 
RUNNING to FAILING.
o.a.f.r.JobException: Recovery is suppressed by NoRestartBackoffTimeStrategy
Caused by: java.util.concurrent.CompletionException: 
o.a.f.r.resourcemanager.slotmanager.SlotManagerException: The slot manager is 
being suspended.
...
18760 [dispatcher-5] INFO  o.a.f.r.executiongraph.ExecutionGraph  - Job 
Blocking test job (411378b9f7181f9ba431960b08b701d7) switched from state 
FAILING to FAILED.
o.a.f.r.JobException: Recovery is suppressed by NoRestartBackoffTimeStrategy
Caused by: java.util.concurrent.CompletionException: 
o.a.f.r.resourcemanager.slotmanager.SlotManagerException: The slot manager is 
being suspended.
...
18765 [jobmanager-future-thread-2] DEBUG o.a.f.r.jobmaster.JobManagerRunnerImpl 
 - Job 411378b9f7181f9ba431960b08b701d7 reached globally terminal state FAILED.
18765 [jobmanager-future-thread-2] DEBUG o.a.f.r.jobmaster.JobManagerRunnerImpl 
 - Unregistering job from 411378b9f7181f9ba431960b08b701d7 from 
high-availability.
18765 [jobmanager-future-thread-2] DEBUG 
o.a.f.r.highavailability.zookeeper.ZooKeeperRunningJobsRegistry  - Setting 
scheduling state to DONE for job 411378b9f7181f9ba431960b08b701d7.
...
18770 [mini-cluster-io-thread-2] INFO  
o.a.f.r.jobmanager.ZooKeeperJobGraphStore  - Recovered JobGraph(jobId: 
411378b9f7181f9ba431960b08b701d7).
18770 [mini-cluster-io-thread-2] INFO  
o.a.f.r.dispatcher.runner.SessionDispatcherLeaderProcess  - Successfully 
recovered 1 persisted job graphs.
...
23748 [main-EventThread] INFO  o.a.f.r.jobmaster.JobManagerRunnerImpl  - 
Granted leader ship but job 411378b9f7181f9ba431960b08b701d7 has been finished.
{code}

>From what I can tell the shutdown logic should be adjusted to first shutdown 
>the JobMaster before shutting down any service the JM may be relying on.
Unfortunately the shutdown logic is one of the most obfuscated things I have 
seen in a while, and I can't make heads-or-tails of how it is even supposed to 
work.

[~trohrmann] WDYT?


was (Author: zentol):
What seems to happen is that, while JobMaster is in the process of allocating 
slots, the test is calling {{Dispatcher#shutDownCluster}}. The dispatcher 
starts shutting down components, but the JobMaster is _not_ shut down first.
This results in a job failure since the slotpool is shutdown during the 
allocation, which fails the allocation, which fails the task, which fails the 
job.
Job restarts are forbidden by the restart strategy, so the job ends up in a 
terminal state, and subsequent dispatchers don't recover the job.

MInimized log with some additional log messages added:
{code}
...
18660 [dispatcher-5] INFO  o.a.f.r.dispatcher.StandaloneDispatcher  - 
Submitting job 411378b9f7181f9ba431960b08b701d7 (Blocking test job).
18660 [dispatcher-5] DEBUG o.a.f.r.jobmanager.ZooKeeperJobGraphStore  - Adding 
job graph 411378b9f7181f9ba431960b08b701d7 to 
flink/default/jobgraphs/411378b9f7181f9ba431960b08b701d7.
18670 [dispatcher-5] INFO  o.a.f.r.jobmanager.ZooKeeperJobGraphStore  - Added 
JobGraph(jobId: 411378b9f7181f9ba431960b08b701d7) to ZooKeeper.
18670 [dispatcher-2] INFO  o.a.f.r.rpc.akka.AkkaRpcService  - Starting RPC 
endpoint for o.a.f.r.jobmaster.JobMaster at akka://flink/user/jobmanager_4 .
...
18700 [main-EventThread] INFO  o.a.f.r.jobmaster.JobManagerRunnerImpl  - 
JobManager runner for job Blocking test job (411378b9f7181f9ba431960b08b701d7) 
was granted leadership with session id 3ac17f36-f3d9-4b63-ae1e-18572db0cb23 at 
akka://flink/user/jobmanager_4.
18700 [main-EventThread] DEBUG 
o.a.f.r.highavailability.zookeeper.ZooKeeperRunningJobsRegistry  - Setting 
scheduling state to RUNNING for job 411378b9f7181f9ba431960b08b701d7.
...
18720 [dispatcher-4] INFO  o.a.f.r.executiongraph.ExecutionGraph  - Job 
Blocking test job (411378b9f7181f9ba431960b08b701d7) switched from state 
CREATED to RUNNING.
18720 [main] INFO  
org.apache.flink.test.runtime.leaderelection.ZooKeeperLeaderElectionITCase  - 
Initiated cluster shutdown.
...
18736 [dispatcher-2] INFO  o.a.f.r.jobmaster.JobMaster  - JobManager 
successfully registered at ResourceManager, leader id: 
a64a7f6db3146cf378a7d898cf6044d1.
18736 [dispatcher-2] INFO  o.a.f.r.jobmaster.slotpool.SlotPoolImpl  - 
Requesting new slot [SlotRequestId{7a9512bdaf255cc687212211142abd4b}] and 
profile ResourceProfile{UNKNOWN} from resource manager.
18736 [dispatcher-3] INFO  o.a.f.r.resourcemanager.StandaloneResourceManager  - 
Request slot with profile ResourceProfile{UNKNOWN} for job 
411378b9f7181f9ba431960b08b701d7 with allocation id 
0e5171488810f0cc1c3265e9b278bb17.
...
18740 [dispatcher-3] INFO  o.a.f.r.resourcemanager.slotmanager.SlotManagerImpl  
- Suspending the SlotManager.
18740 [dispatcher-5] INFO  o.a.f.r.jobmaster.slotpool.SlotPoolImpl  - Failing 
pending slot request [SlotRequestId{7a9512bdaf255cc687212211142abd4b}]: The 
slot manager is being suspended.
...
18750 [dispatcher-5] INFO  o.a.f.r.executiongraph.ExecutionGraph  - blocking 
operator (1/1) (927b48379633675aad810f12af2470a6) switched from SCHEDULED to 
FAILED.
java.util.concurrent.CompletionException: 
o.a.f.r.resourcemanager.slotmanager.SlotManagerException: The slot manager is 
being suspended.
...
18750 [dispatcher-5] INFO  o.a.f.r.executiongraph.ExecutionGraph  - Job 
Blocking test job (411378b9f7181f9ba431960b08b701d7) switched from state 
RUNNING to FAILING.
o.a.f.r.JobException: Recovery is suppressed by NoRestartBackoffTimeStrategy
Caused by: java.util.concurrent.CompletionException: 
o.a.f.r.resourcemanager.slotmanager.SlotManagerException: The slot manager is 
being suspended.
...
18760 [dispatcher-5] INFO  o.a.f.r.executiongraph.ExecutionGraph  - Job 
Blocking test job (411378b9f7181f9ba431960b08b701d7) switched from state 
FAILING to FAILED.
o.a.f.r.JobException: Recovery is suppressed by NoRestartBackoffTimeStrategy
Caused by: java.util.concurrent.CompletionException: 
o.a.f.r.resourcemanager.slotmanager.SlotManagerException: The slot manager is 
being suspended.
...
18765 [jobmanager-future-thread-2] DEBUG o.a.f.r.jobmaster.JobManagerRunnerImpl 
 - Job 411378b9f7181f9ba431960b08b701d7 reached globally terminal state FAILED.
18765 [jobmanager-future-thread-2] DEBUG o.a.f.r.jobmaster.JobManagerRunnerImpl 
 - Unregistering job from 411378b9f7181f9ba431960b08b701d7 from 
high-availability.
18765 [jobmanager-future-thread-2] DEBUG 
o.a.f.r.highavailability.zookeeper.ZooKeeperRunningJobsRegistry  - Setting 
scheduling state to DONE for job 411378b9f7181f9ba431960b08b701d7.
...
18770 [mini-cluster-io-thread-2] INFO  
o.a.f.r.jobmanager.ZooKeeperJobGraphStore  - Recovered JobGraph(jobId: 
411378b9f7181f9ba431960b08b701d7).
18770 [mini-cluster-io-thread-2] INFO  
o.a.f.r.dispatcher.runner.SessionDispatcherLeaderProcess  - Successfully 
recovered 1 persisted job graphs.
...
23748 [main-EventThread] INFO  o.a.f.r.jobmaster.JobManagerRunnerImpl  - 
Granted leader ship but job 411378b9f7181f9ba431960b08b701d7 has been finished.
{code}

>From what I can tell the shutdown logic should be adjusted to first shutdown 
>the JobMaster before shutting down any service the JM may be relying on.
Unfortunately the shutdown logic is one of the most obfuscated things I have 
seen in a while, and I can't make heads-or-tails of how it is even supposed to 
work.

[~trohrmann] WDYT?

> ZooKeeperLeaderElectionITCase.testJobExecutionOnClusterWithLeaderChange 
> failed on Travis
> ----------------------------------------------------------------------------------------
>
>                 Key: FLINK-15150
>                 URL: https://issues.apache.org/jira/browse/FLINK-15150
>             Project: Flink
>          Issue Type: Bug
>          Components: Tests
>    Affects Versions: 1.10.0
>            Reporter: Congxian Qiu(klion26)
>            Assignee: Chesnay Schepler
>            Priority: Critical
>             Fix For: 1.10.0
>
>
>  
> 06:37:18.423 [ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time 
> elapsed: 14.014 s <<< FAILURE! - in 
> org.apache.flink.test.runtime.leaderelection.ZooKeeperLeaderElectionITCase
> 375406:37:18.423 [ERROR] 
> testJobExecutionOnClusterWithLeaderChange(org.apache.flink.test.runtime.leaderelection.ZooKeeperLeaderElectionITCase)
>  Time elapsed: 14.001 s <<< ERROR!
> 3755java.util.concurrent.ExecutionException: 
> org.apache.flink.util.FlinkException: JobMaster has been shut down.
> 3756 at 
> org.apache.flink.test.runtime.leaderelection.ZooKeeperLeaderElectionITCase.lambda$testJobExecutionOnClusterWithLeaderChange$1(ZooKeeperLeaderElectionITCase.java:131)
> 3757 at 
> org.apache.flink.test.runtime.leaderelection.ZooKeeperLeaderElectionITCase.testJobExecutionOnClusterWithLeaderChange(ZooKeeperLeaderElectionITCase.java:131)
> 3758Caused by: org.apache.flink.util.FlinkException: JobMaster has been shut 
> down.
> 3759
>  
> [https://travis-ci.com/flink-ci/flink/jobs/264972218]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to