zyh created FLINK-36570:
---------------------------
Summary: Fatal error occurred in the cluster entrypoint
Key: FLINK-36570
URL: https://issues.apache.org/jira/browse/FLINK-36570
Project: Flink
Issue Type: Bug
Components: Deployment / Kubernetes
Affects Versions: 1.17.2
Environment: * flink 1.17.2
* native kubernetes session cluster HA(3 jobmanager replica)
Reporter: zyh
I commit batch jobs to my session cluster with rest api. The jobmanager pod
would restart when occurred the error.
Seems to be because the new leader elected and exist running job at the same
time. Then the job send to the new leader and error.
{code:java}
2024-10-18 03:07:22,107 INFO
org.apache.flink.client.deployment.application.executors.EmbeddedExecutor [] -
Submitting Job with JobId=a9d339b6ba26ab51746514cc7aea0537.2024-10-18
03:07:22,546 INFO
org.apache.flink.kubernetes.kubeclient.resources.KubernetesLeaderElector [] -
New leader elected 57c4be1d-58f0-4c2c-89d8-11aefe1ec273 for
flink-cluster-cluster-config-map.2024-10-18 03:07:22,549 ERROR
org.apache.flink.runtime.entrypoint.ClusterEntrypoint [] - Fatal error
occurred in the cluster entrypoint.org.apache.flink.util.FlinkException:
JobMaster for job 99c02051a54c77499f53f09cd4b7a0d9 failed.at
org.apache.flink.runtime.dispatcher.Dispatcher.jobMasterFailed(Dispatcher.java:1360)
~[flink-dist-1.17.2.jar:1.17.2]
at
org.apache.flink.runtime.dispatcher.Dispatcher.jobManagerRunnerFailed(Dispatcher.java:772)
~[flink-dist-1.17.2.jar:1.17.2]
at
org.apache.flink.runtime.dispatcher.Dispatcher.lambda$runJob$6(Dispatcher.java:694)
~[flink-dist-1.17.2.jar:1.17.2]
at java.util.concurrent.CompletableFuture.uniHandle(Unknown Source) ~[?:?]
at java.util.concurrent.CompletableFuture$UniHandle.tryFire(Unknown Source)
~[?:?]
at java.util.concurrent.CompletableFuture$Completion.run(Unknown Source) ~[?:?]
at
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.lambda$handleRunAsync$4(AkkaRpcActor.java:453)
~[flink-rpc-akka_27725420-d3ff-407e-864b-d8e6936565db.jar:1.17.2]
at
org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:68)
~[flink-rpc-akka_27725420-d3ff-407e-864b-d8e6936565db.jar:1.17.2]
at
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:453)
~[flink-rpc-akka_27725420-d3ff-407e-864b-d8e6936565db.jar:1.17.2]
at
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:218)
~[flink-rpc-akka_27725420-d3ff-407e-864b-d8e6936565db.jar:1.17.2]
at
org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:84)
~[flink-rpc-akka_27725420-d3ff-407e-864b-d8e6936565db.jar:1.17.2]
at
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:168)
~[flink-rpc-akka_27725420-d3ff-407e-864b-d8e6936565db.jar:1.17.2]
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:24)
[flink-rpc-akka_27725420-d3ff-407e-864b-d8e6936565db.jar:1.17.2]
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:20)
[flink-rpc-akka_27725420-d3ff-407e-864b-d8e6936565db.jar:1.17.2]
at scala.PartialFunction.applyOrElse(PartialFunction.scala:127)
[flink-rpc-akka_27725420-d3ff-407e-864b-d8e6936565db.jar:1.17.2]
at scala.PartialFunction.applyOrElse$(PartialFunction.scala:126)
[flink-rpc-akka_27725420-d3ff-407e-864b-d8e6936565db.jar:1.17.2]
at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:20)
[flink-rpc-akka_27725420-d3ff-407e-864b-d8e6936565db.jar:1.17.2]
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:175)
[flink-rpc-akka_27725420-d3ff-407e-864b-d8e6936565db.jar:1.17.2]
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:176)
[flink-rpc-akka_27725420-d3ff-407e-864b-d8e6936565db.jar:1.17.2]
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:176)
[flink-rpc-akka_27725420-d3ff-407e-864b-d8e6936565db.jar:1.17.2]
at akka.actor.Actor.aroundReceive(Actor.scala:537)
[flink-rpc-akka_27725420-d3ff-407e-864b-d8e6936565db.jar:1.17.2]
at akka.actor.Actor.aroundReceive$(Actor.scala:535)
[flink-rpc-akka_27725420-d3ff-407e-864b-d8e6936565db.jar:1.17.2]
at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:220)
[flink-rpc-akka_27725420-d3ff-407e-864b-d8e6936565db.jar:1.17.2]
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:579)
[flink-rpc-akka_27725420-d3ff-407e-864b-d8e6936565db.jar:1.17.2]
at akka.actor.ActorCell.invoke(ActorCell.scala:547)
[flink-rpc-akka_27725420-d3ff-407e-864b-d8e6936565db.jar:1.17.2]
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:270)
[flink-rpc-akka_27725420-d3ff-407e-864b-d8e6936565db.jar:1.17.2]
at akka.dispatch.Mailbox.run(Mailbox.scala:231)
[flink-rpc-akka_27725420-d3ff-407e-864b-d8e6936565db.jar:1.17.2]
at akka.dispatch.Mailbox.exec(Mailbox.scala:243)
[flink-rpc-akka_27725420-d3ff-407e-864b-d8e6936565db.jar:1.17.2]
at java.util.concurrent.ForkJoinTask.doExec(Unknown Source) [?:?]
at java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(Unknown Source)
[?:?]
at java.util.concurrent.ForkJoinPool.scan(Unknown Source) [?:?]
at java.util.concurrent.ForkJoinPool.runWorker(Unknown Source) [?:?]
at java.util.concurrent.ForkJoinWorkerThread.run(Unknown Source) [?:?]
ed by: org.apache.flink.util.FlinkException: Could not suspend the job manager.
at
org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.lambda$null$13(JobMasterServiceLeadershipRunner.java:438)
~[flink-dist-1.17.2.jar:1.17.2]
at
org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.runIfStateRunning(JobMasterServiceLeadershipRunner.java:456)
~[flink-dist-1.17.2.jar:1.17.2]
at
org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.lambda$handleAsyncOperationError$14(JobMasterServiceLeadershipRunner.java:436)
~[flink-dist-1.17.2.jar:1.17.2]
at java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown Source) ~[?:?]
at java.util.concurrent.CompletableFuture.uniWhenCompleteStage(Unknown Source)
~[?:?]
at java.util.concurrent.CompletableFuture.whenComplete(Unknown Source) ~[?:?]
at
org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.handleAsyncOperationError(JobMasterServiceLeadershipRunner.java:433)
~[flink-dist-1.17.2.jar:1.17.2]
at
org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.stopJobMasterServiceProcessAsync(JobMasterServiceLeadershipRunner.java:405)
~[flink-dist-1.17.2.jar:1.17.2]
at
org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.runIfStateRunning(JobMasterServiceLeadershipRunner.java:456)
~[flink-dist-1.17.2.jar:1.17.2]
at
org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.revokeLeadership(JobMasterServiceLeadershipRunner.java:390)
~[flink-dist-1.17.2.jar:1.17.2]
at
org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService.onRevokeLeadership(DefaultLeaderElectionService.java:236)
~[flink-dist-1.17.2.jar:1.17.2]
at
org.apache.flink.runtime.leaderelection.DefaultMultipleComponentLeaderElectionService.lambda$forEachLeaderElectionEventHandler$2(DefaultMultipleComponentLeaderElectionService.java:225)
~[flink-dist-1.17.2.jar:1.17.2]
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) ~[?:?]
at java.lang.Thread.run(Unknown Source) ~[?:?]
ed by: java.util.concurrent.CompletionException:
java.lang.UnsupportedOperationException: Still waiting for the leadership.
at java.util.concurrent.CompletableFuture.encodeThrowable(Unknown Source) ~[?:?]
at java.util.concurrent.CompletableFuture.uniComposeStage(Unknown Source) ~[?:?]
at java.util.concurrent.CompletableFuture.thenCompose(Unknown Source) ~[?:?]
at
org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.stopJobMasterServiceProcessAsync(JobMasterServiceLeadershipRunner.java:398)
~[flink-dist-1.17.2.jar:1.17.2]
at
org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.runIfStateRunning(JobMasterServiceLeadershipRunner.java:456)
~[flink-dist-1.17.2.jar:1.17.2]
at
org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.revokeLeadership(JobMasterServiceLeadershipRunner.java:390)
~[flink-dist-1.17.2.jar:1.17.2]
at
org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService.onRevokeLeadership(DefaultLeaderElectionService.java:236)
~[flink-dist-1.17.2.jar:1.17.2]
at
org.apache.flink.runtime.leaderelection.DefaultMultipleComponentLeaderElectionService.lambda$forEachLeaderElectionEventHandler$2(DefaultMultipleComponentLeaderElectionService.java:225)
~[flink-dist-1.17.2.jar:1.17.2]
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) ~[?:?]
at java.lang.Thread.run(Unknown Source) ~[?:?]
ed by: java.lang.UnsupportedOperationException: Still waiting for the
leadership.
at
org.apache.flink.runtime.jobmaster.JobMasterServiceProcess$WaitingForLeadership.getLeaderSessionId(JobMasterServiceProcess.java:71)
~[flink-dist-1.17.2.jar:1.17.2]
at
org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.stopJobMasterServiceProcess(JobMasterServiceLeadershipRunner.java:414)
~[flink-dist-1.17.2.jar:1.17.2]
at
org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.callIfRunning(JobMasterServiceLeadershipRunner.java:469)
~[flink-dist-1.17.2.jar:1.17.2]
at
org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.lambda$stopJobMasterServiceProcessAsync$12(JobMasterServiceLeadershipRunner.java:400)
~[flink-dist-1.17.2.jar:1.17.2]
at java.util.concurrent.CompletableFuture.uniComposeStage(Unknown Source) ~[?:?]
at java.util.concurrent.CompletableFuture.thenCompose(Unknown Source) ~[?:?]
at
org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.stopJobMasterServiceProcessAsync(JobMasterServiceLeadershipRunner.java:398)
~[flink-dist-1.17.2.jar:1.17.2]
at
org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.runIfStateRunning(JobMasterServiceLeadershipRunner.java:456)
~[flink-dist-1.17.2.jar:1.17.2]
at
org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.revokeLeadership(JobMasterServiceLeadershipRunner.java:390)
~[flink-dist-1.17.2.jar:1.17.2]
at
org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService.onRevokeLeadership(DefaultLeaderElectionService.java:236)
~[flink-dist-1.17.2.jar:1.17.2]
at
org.apache.flink.runtime.leaderelection.DefaultMultipleComponentLeaderElectionService.lambda$forEachLeaderElectionEventHandler$2(DefaultMultipleComponentLeaderElectionService.java:225)
~[flink-dist-1.17.2.jar:1.17.2]
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) ~[?:?]
at java.lang.Thread.run(Unknown Source) ~[?:?]
INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint [] -
Shutting KubernetesSessionClusterEntrypoint down with application status
UNKNOWN. Diagnostics Cluster entrypoint has been closed externally..INFO
org.apache.flink.runtime.blob.BlobServer [] - Stopped BLOB
server at 0.0.0.0:6124
{code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)