[
https://issues.apache.org/jira/browse/FLINK-34007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Martijn Visser updated FLINK-34007:
-----------------------------------
Priority: Major (was: Blocker)
> Flink Job stuck in suspend state after recovery from failure in HA Mode
> -----------------------------------------------------------------------
>
> Key: FLINK-34007
> URL: https://issues.apache.org/jira/browse/FLINK-34007
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.18.1, 1.18.2
> Reporter: Zhenqiu Huang
> Priority: Major
>
> We observe that the JobManager goes into the suspended state, while a
> container fails because it cannot register itself with the ResourceManager
> before the registration timeout.
> JM Log:
> 2024-01-04 02:58:39,210 INFO
> org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner [] -
> JobMasterServiceLeadershipRunner for job 217cee964b2cfdc3115fb74cac0ec550 was
> revoked leadership with leader id eda6fee6-ce02-4076-9a99-8c43a92629f7.
> Stopping current JobMasterServiceProcess.
> 2024-01-04 02:58:58,347 INFO
> org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint [] -
> http://172.16.71.11:8081 lost leadership
> 2024-01-04 02:58:58,347 INFO
> org.apache.flink.runtime.resourcemanager.ResourceManagerServiceImpl [] -
> Resource manager service is revoked leadership with session id
> eda6fee6-ce02-4076-9a99-8c43a92629f7.
> 2024-01-04 02:58:58,348 INFO
> org.apache.flink.runtime.dispatcher.runner.DefaultDispatcherRunner [] -
> DefaultDispatcherRunner was revoked the leadership with leader id
> eda6fee6-ce02-4076-9a99-8c43a92629f7. Stopping the DispatcherLeaderProcess.
> 2024-01-04 02:58:58,348 INFO
> org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess []
> - Stopping SessionDispatcherLeaderProcess.
> 2024-01-04 02:58:58,349 INFO
> org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Stopping
> dispatcher pekko.tcp://[email protected]:6123/user/rpc/dispatcher_1.
> 2024-01-04 02:58:58,349 INFO org.apache.flink.runtime.jobmaster.JobMaster
> [] - Stopping the JobMaster for job
> 'amp-ade-fitness-clickstream-projection-uat'
> (217cee964b2cfdc3115fb74cac0ec550).
> 2024-01-04 02:58:58,349 INFO
> org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Stopping
> all currently running jobs of dispatcher
> pekko.tcp://[email protected]:6123/user/rpc/dispatcher_1.
> 2024-01-04 02:58:58,351 INFO
> org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Job
> 217cee964b2cfdc3115fb74cac0ec550 reached terminal state SUSPENDED.
> 2024-01-04 02:58:58,352 INFO
> org.apache.flink.runtime.security.token.DefaultDelegationTokenManager [] -
> Stopping credential renewal
> 2024-01-04 02:58:58,352 INFO
> org.apache.flink.runtime.security.token.DefaultDelegationTokenManager [] -
> Stopped credential renewal
> 2024-01-04 02:58:58,352 INFO
> org.apache.flink.runtime.resourcemanager.slotmanager.FineGrainedSlotManager
> [] - Closing the slot manager.
> 2024-01-04 02:58:58,351 INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Job
> amp-ade-fitness-clickstream-projection-uat (217cee964b2cfdc3115fb74cac0ec550)
> switched from state RUNNING to SUSPENDED.
> org.apache.flink.util.FlinkException: AdaptiveScheduler is being stopped.
> at
> org.apache.flink.runtime.scheduler.adaptive.AdaptiveScheduler.closeAsync(AdaptiveScheduler.java:474)
> ~[flink-dist-1.18.1.6-ase.jar:1.18.1.6-ase]
> at
> org.apache.flink.runtime.jobmaster.JobMaster.stopScheduling(JobMaster.java:1093)
> ~[flink-dist-1.18.1.6-ase.jar:1.18.1.6-ase]
> at
> org.apache.flink.runtime.jobmaster.JobMaster.stopJobExecution(JobMaster.java:1056)
> ~[flink-dist-1.18.1.6-ase.jar:1.18.1.6-ase]
> at
> org.apache.flink.runtime.jobmaster.JobMaster.onStop(JobMaster.java:454)
> ~[flink-dist-1.18.1.6-ase.jar:1.18.1.6-ase]
> at
> org.apache.flink.runtime.rpc.RpcEndpoint.internalCallOnStop(RpcEndpoint.java:239)
> ~[flink-dist-1.18.1.6-ase.jar:1.18.1.6-ase]
> at
> org.apache.flink.runtime.rpc.pekko.PekkoRpcActor$StartedState.lambda$terminate$0(PekkoRpcActor.java:574)
> ~[flink-rpc-akkadb952114-fa83-4aba-b20a-b7e5771ce59c.jar:1.18.1.6-ase]
> at
> org.apache.flink.runtime.concurrent.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:83)
> ~[flink-dist-1.18.1.6-ase.jar:1.18.1.6-ase]
> at
> org.apache.flink.runtime.rpc.pekko.PekkoRpcActor$StartedState.terminate(PekkoRpcActor.java:573)
> ~[flink-rpc-akkadb952114-fa83-4aba-b20a-b7e5771ce59c.jar:1.18.1.6-ase]
> at
> org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.handleControlMessage(PekkoRpcActor.java:196)
> ~[flink-rpc-akkadb952114-fa83-4aba-b20a-b7e5771ce59c.jar:1.18.1.6-ase]
> at
> org.apache.pekko.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:33)
> [flink-rpc-akkadb952114-fa83-4aba-b20a-b7e5771ce59c.jar:1.18.1.6-ase]
> at
> org.apache.pekko.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:29)
> [flink-rpc-akkadb952114-fa83-4aba-b20a-b7e5771ce59c.jar:1.18.1.6-ase]
> at scala.PartialFunction.applyOrElse(PartialFunction.scala:127)
> [flink-rpc-akkadb952114-fa83-4aba-b20a-b7e5771ce59c.jar:1.18.1.6-ase]
> at scala.PartialFunction.applyOrElse$(PartialFunction.scala:126)
> [flink-rpc-akkadb952114-fa83-4aba-b20a-b7e5771ce59c.jar:1.18.1.6-ase]
> TM Error Log:
> 2024-01-04 11:23:01,334 ERROR
> org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Fatal error
> occurred in TaskExecutor
> pekko.tcp://[email protected]:6122/user/rpc/taskmanager_0.
> org.apache.flink.runtime.taskexecutor.exceptions.RegistrationTimeoutException:
> Could not register at the ResourceManager within the specified maximum
> registration duration PT5M. This indicates a p
> at
> org.apache.flink.runtime.taskexecutor.TaskExecutor.registrationTimeout(TaskExecutor.java:1558)
> ~[flink-dist-1.18.1.6-ase.jar:1.18.1.6-ase]
> at
> org.apache.flink.runtime.taskexecutor.TaskExecutor.lambda$startRegistrationTimeout$18(TaskExecutor.java:1543)
> ~[flink-dist-1.18.1.6-ase.jar:1.18.1.6-ase]
> at
> org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.lambda$handleRunAsync$4(PekkoRpcActor.java:451)
> ~[flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
> at
> org.apache.flink.runtime.concurrent.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:68)
> ~[flink-dist-1.18.1.6-ase.jar:1.18.1.6-ase]
> at
> org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.handleRunAsync(PekkoRpcActor.java:451)
> ~[flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
> at
> org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.handleRpcMessage(PekkoRpcActor.java:218)
> ~[flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
> at
> org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.handleMessage(PekkoRpcActor.java:168)
> ~[flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
> at
> org.apache.pekko.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:33)
> [flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
> at
> org.apache.pekko.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:29)
> [flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
> at scala.PartialFunction.applyOrElse(PartialFunction.scala:127)
> [flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
> at scala.PartialFunction.applyOrElse$(PartialFunction.scala:126)
> [flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
> at
> org.apache.pekko.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:29)
> [flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
> at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:175)
> [flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
> at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:176)
> [flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
> at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:176)
> [flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
> at org.apache.pekko.actor.Actor.aroundReceive(Actor.scala:547)
> [flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
> at org.apache.pekko.actor.Actor.aroundReceive$(Actor.scala:545)
> [flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
> at
> org.apache.pekko.actor.AbstractActor.aroundReceive(AbstractActor.scala:229)
> [flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
> at org.apache.pekko.actor.ActorCell.receiveMessage(ActorCell.scala:590)
> [flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
> at org.apache.pekko.actor.ActorCell.invoke(ActorCell.scala:557)
> [flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
> at org.apache.pekko.dispatch.Mailbox.processMailbox(Mailbox.scala:280)
> [flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
> at org.apache.pekko.dispatch.Mailbox.run(Mailbox.scala:241)
> [flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
> at org.apache.pekko.dispatch.Mailbox.exec(Mailbox.scala:253)
> [flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
> at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290)
> [?:?]
> at
> java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020)
> [?:?]
> at java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656) [?:?]
> at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594)
> [?:?]
> at
> java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:183)
> [?:?]
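
When triaging a report like this, it can help to turn the quoted JM log lines into timestamps and measure the gap between two events, for example from the first leadership revocation to the job reaching SUSPENDED. The following is only a rough sketch: `parse_line` and `seconds_between` are hypothetical helper names, the regular expression assumes the line layout seen in the JM log above, and the two sample lines were re-joined by hand from the wrapped email text.

```python
import re
from datetime import datetime

# Assumed layout, taken from the quoted JM log:
#   "<date> <time>,<millis> <LEVEL> <logger> [] - <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+"
    r"(?P<level>[A-Z]+)\s+(?P<logger>\S+)\s+\[\]\s+-\s+(?P<msg>.*)$"
)


def parse_line(line):
    """Return (timestamp, level, logger, message), or None if the line does not match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S,%f")
    return ts, m.group("level"), m.group("logger"), m.group("msg")


def seconds_between(lines, start_marker, end_marker):
    """Seconds from the first message containing start_marker to the first containing end_marker."""
    start = end = None
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue  # skip stack-trace frames and other non-matching lines
        ts, _level, _logger, msg = parsed
        if start is None and start_marker in msg:
            start = ts
        if end is None and end_marker in msg:
            end = ts
    if start is None or end is None:
        return None
    return (end - start).total_seconds()


# Two lines from the JM log in this issue, re-joined from the email wrapping.
JM_LINES = [
    "2024-01-04 02:58:39,210 INFO "
    "org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner [] - "
    "JobMasterServiceLeadershipRunner for job 217cee964b2cfdc3115fb74cac0ec550 was "
    "revoked leadership with leader id eda6fee6-ce02-4076-9a99-8c43a92629f7. "
    "Stopping current JobMasterServiceProcess.",
    "2024-01-04 02:58:58,351 INFO "
    "org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Job "
    "217cee964b2cfdc3115fb74cac0ec550 reached terminal state SUSPENDED.",
]

if __name__ == "__main__":
    print(seconds_between(JM_LINES, "revoked leadership", "reached terminal state SUSPENDED"))
```

On these two lines the helper reports roughly 19 seconds between the JobMaster leadership revocation and the job reaching SUSPENDED. The TM log timestamps are not compared here, since the excerpt does not say whether both logs use the same clock or time zone.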