[
https://issues.apache.org/jira/browse/FLINK-15150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17016965#comment-17016965
]
Chesnay Schepler commented on FLINK-15150:
------------------------------------------
What seems to happen is that, while JobMaster is in the process of allocating
slots, the test is calling {{Dispatcher#shutDownCluster}}. The dispatcher
starts shutting down components, but the JobMaster is _not_ shut down first.
This results in a job failure since the slotpool is shutdown during the
allocation, which fails the allocation, which fails the task, which fails the
job.
Job restarts are forbidden by the restart strategy, so the job ends up in a
terminal state, and subsequent dispatchers don't recover the job.
MInimized log with some additional log messages added:
{code}
...
18660 [dispatcher-5] INFO o.a.f.r.dispatcher.StandaloneDispatcher -
Submitting job 411378b9f7181f9ba431960b08b701d7 (Blocking test job).
18660 [dispatcher-5] DEBUG o.a.f.r.jobmanager.ZooKeeperJobGraphStore - Adding
job graph 411378b9f7181f9ba431960b08b701d7 to
flink/default/jobgraphs/411378b9f7181f9ba431960b08b701d7.
18670 [dispatcher-5] INFO o.a.f.r.jobmanager.ZooKeeperJobGraphStore - Added
JobGraph(jobId: 411378b9f7181f9ba431960b08b701d7) to ZooKeeper.
18670 [dispatcher-2] INFO o.a.f.r.rpc.akka.AkkaRpcService - Starting RPC
endpoint for o.a.f.r.jobmaster.JobMaster at akka://flink/user/jobmanager_4 .
...
18700 [main-EventThread] INFO o.a.f.r.jobmaster.JobManagerRunnerImpl -
JobManager runner for job Blocking test job (411378b9f7181f9ba431960b08b701d7)
was granted leadership with session id 3ac17f36-f3d9-4b63-ae1e-18572db0cb23 at
akka://flink/user/jobmanager_4.
18700 [main-EventThread] DEBUG
o.a.f.r.highavailability.zookeeper.ZooKeeperRunningJobsRegistry - Setting
scheduling state to RUNNING for job 411378b9f7181f9ba431960b08b701d7.
...
18720 [dispatcher-4] INFO o.a.f.r.executiongraph.ExecutionGraph - Job
Blocking test job (411378b9f7181f9ba431960b08b701d7) switched from state
CREATED to RUNNING.
18720 [main] INFO
org.apache.flink.test.runtime.leaderelection.ZooKeeperLeaderElectionITCase -
Initiated cluster shutdown.
...
18736 [dispatcher-2] INFO o.a.f.r.jobmaster.JobMaster - JobManager
successfully registered at ResourceManager, leader id:
a64a7f6db3146cf378a7d898cf6044d1.
18736 [dispatcher-2] INFO o.a.f.r.jobmaster.slotpool.SlotPoolImpl -
Requesting new slot [SlotRequestId{7a9512bdaf255cc687212211142abd4b}] and
profile ResourceProfile{UNKNOWN} from resource manager.
18736 [dispatcher-3] INFO o.a.f.r.resourcemanager.StandaloneResourceManager -
Request slot with profile ResourceProfile{UNKNOWN} for job
411378b9f7181f9ba431960b08b701d7 with allocation id
0e5171488810f0cc1c3265e9b278bb17.
...
18740 [dispatcher-3] INFO o.a.f.r.resourcemanager.slotmanager.SlotManagerImpl
- Suspending the SlotManager.
18740 [dispatcher-5] INFO o.a.f.r.jobmaster.slotpool.SlotPoolImpl - Failing
pending slot request [SlotRequestId{7a9512bdaf255cc687212211142abd4b}]: The
slot manager is being suspended.
...
18750 [dispatcher-5] INFO o.a.f.r.executiongraph.ExecutionGraph - blocking
operator (1/1) (927b48379633675aad810f12af2470a6) switched from SCHEDULED to
FAILED.
java.util.concurrent.CompletionException:
o.a.f.r.resourcemanager.slotmanager.SlotManagerException: The slot manager is
being suspended.
...
18750 [dispatcher-5] INFO o.a.f.r.executiongraph.ExecutionGraph - Job
Blocking test job (411378b9f7181f9ba431960b08b701d7) switched from state
RUNNING to FAILING.
o.a.f.r.JobException: Recovery is suppressed by NoRestartBackoffTimeStrategy
Caused by: java.util.concurrent.CompletionException:
o.a.f.r.resourcemanager.slotmanager.SlotManagerException: The slot manager is
being suspended.
...
18760 [dispatcher-5] INFO o.a.f.r.executiongraph.ExecutionGraph - Job
Blocking test job (411378b9f7181f9ba431960b08b701d7) switched from state
FAILING to FAILED.
o.a.f.r.JobException: Recovery is suppressed by NoRestartBackoffTimeStrategy
Caused by: java.util.concurrent.CompletionException:
o.a.f.r.resourcemanager.slotmanager.SlotManagerException: The slot manager is
being suspended.
...
18765 [jobmanager-future-thread-2] DEBUG o.a.f.r.jobmaster.JobManagerRunnerImpl
- Job 411378b9f7181f9ba431960b08b701d7 reached globally terminal state FAILED.
18765 [jobmanager-future-thread-2] DEBUG o.a.f.r.jobmaster.JobManagerRunnerImpl
- Unregistering job from 411378b9f7181f9ba431960b08b701d7 from
high-availability.
18765 [jobmanager-future-thread-2] DEBUG
o.a.f.r.highavailability.zookeeper.ZooKeeperRunningJobsRegistry - Setting
scheduling state to DONE for job 411378b9f7181f9ba431960b08b701d7.
...
18770 [mini-cluster-io-thread-2] INFO
o.a.f.r.jobmanager.ZooKeeperJobGraphStore - Recovered JobGraph(jobId:
411378b9f7181f9ba431960b08b701d7).
18770 [mini-cluster-io-thread-2] INFO
o.a.f.r.dispatcher.runner.SessionDispatcherLeaderProcess - Successfully
recovered 1 persisted job graphs.
...
23748 [main-EventThread] INFO o.a.f.r.jobmaster.JobManagerRunnerImpl -
Granted leader ship but job 411378b9f7181f9ba431960b08b701d7 has been finished.
{code}
>From what I can tell the shutdown logic should be adjusted to first shutdown
>the JobMaster before shutting down any service the JM may be relying on.
Unfortunately the shutdown logic is one of the most obfuscated things I have
seen in a while, and I can't make heads-or-tails of how it is even supposed to
work.
[~trohrmann] WDYT?
> ZooKeeperLeaderElectionITCase.testJobExecutionOnClusterWithLeaderChange
> failed on Travis
> ----------------------------------------------------------------------------------------
>
> Key: FLINK-15150
> URL: https://issues.apache.org/jira/browse/FLINK-15150
> Project: Flink
> Issue Type: Bug
> Components: Tests
> Affects Versions: 1.10.0
> Reporter: Congxian Qiu(klion26)
> Assignee: Chesnay Schepler
> Priority: Critical
> Fix For: 1.10.0
>
>
>
> 06:37:18.423 [ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time
> elapsed: 14.014 s <<< FAILURE! - in
> org.apache.flink.test.runtime.leaderelection.ZooKeeperLeaderElectionITCase
> 375406:37:18.423 [ERROR]
> testJobExecutionOnClusterWithLeaderChange(org.apache.flink.test.runtime.leaderelection.ZooKeeperLeaderElectionITCase)
> Time elapsed: 14.001 s <<< ERROR!
> 3755java.util.concurrent.ExecutionException:
> org.apache.flink.util.FlinkException: JobMaster has been shut down.
> 3756 at
> org.apache.flink.test.runtime.leaderelection.ZooKeeperLeaderElectionITCase.lambda$testJobExecutionOnClusterWithLeaderChange$1(ZooKeeperLeaderElectionITCase.java:131)
> 3757 at
> org.apache.flink.test.runtime.leaderelection.ZooKeeperLeaderElectionITCase.testJobExecutionOnClusterWithLeaderChange(ZooKeeperLeaderElectionITCase.java:131)
> 3758Caused by: org.apache.flink.util.FlinkException: JobMaster has been shut
> down.
> 3759
>
> [https://travis-ci.com/flink-ci/flink/jobs/264972218]
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)