[ https://issues.apache.org/jira/browse/FLINK-34007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Gyula Fora updated FLINK-34007:
-------------------------------
Description:
The observation is that the JobManager goes into the SUSPENDED state, while a
failed container is unable to register itself with the ResourceManager and
times out.
JM log: see attached.
was:
The observation is that the JobManager goes into the SUSPENDED state, while a
failed container is unable to register itself with the ResourceManager and
times out.
JM Log:
2024-01-04 02:58:39,210 INFO org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner [] - JobMasterServiceLeadershipRunner for job 217cee964b2cfdc3115fb74cac0ec550 was revoked leadership with leader id eda6fee6-ce02-4076-9a99-8c43a92629f7. Stopping current JobMasterServiceProcess.
2024-01-04 02:58:58,347 INFO org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint [] - http://172.16.71.11:8081 lost leadership
2024-01-04 02:58:58,347 INFO org.apache.flink.runtime.resourcemanager.ResourceManagerServiceImpl [] - Resource manager service is revoked leadership with session id eda6fee6-ce02-4076-9a99-8c43a92629f7.
2024-01-04 02:58:58,348 INFO org.apache.flink.runtime.dispatcher.runner.DefaultDispatcherRunner [] - DefaultDispatcherRunner was revoked the leadership with leader id eda6fee6-ce02-4076-9a99-8c43a92629f7. Stopping the DispatcherLeaderProcess.
2024-01-04 02:58:58,348 INFO org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess [] - Stopping SessionDispatcherLeaderProcess.
2024-01-04 02:58:58,349 INFO org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Stopping dispatcher pekko.tcp://[email protected]:6123/user/rpc/dispatcher_1.
2024-01-04 02:58:58,349 INFO org.apache.flink.runtime.jobmaster.JobMaster [] - Stopping the JobMaster for job 'amp-ade-fitness-clickstream-projection-uat' (217cee964b2cfdc3115fb74cac0ec550).
2024-01-04 02:58:58,349 INFO org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Stopping all currently running jobs of dispatcher pekko.tcp://[email protected]:6123/user/rpc/dispatcher_1.
2024-01-04 02:58:58,351 INFO org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Job 217cee964b2cfdc3115fb74cac0ec550 reached terminal state SUSPENDED.
2024-01-04 02:58:58,352 INFO org.apache.flink.runtime.security.token.DefaultDelegationTokenManager [] - Stopping credential renewal
2024-01-04 02:58:58,352 INFO org.apache.flink.runtime.security.token.DefaultDelegationTokenManager [] - Stopped credential renewal
2024-01-04 02:58:58,352 INFO org.apache.flink.runtime.resourcemanager.slotmanager.FineGrainedSlotManager [] - Closing the slot manager.
2024-01-04 02:58:58,351 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Job amp-ade-fitness-clickstream-projection-uat (217cee964b2cfdc3115fb74cac0ec550) switched from state RUNNING to SUSPENDED.
org.apache.flink.util.FlinkException: AdaptiveScheduler is being stopped.
    at org.apache.flink.runtime.scheduler.adaptive.AdaptiveScheduler.closeAsync(AdaptiveScheduler.java:474) ~[flink-dist-1.18.1.6-ase.jar:1.18.1.6-ase]
    at org.apache.flink.runtime.jobmaster.JobMaster.stopScheduling(JobMaster.java:1093) ~[flink-dist-1.18.1.6-ase.jar:1.18.1.6-ase]
    at org.apache.flink.runtime.jobmaster.JobMaster.stopJobExecution(JobMaster.java:1056) ~[flink-dist-1.18.1.6-ase.jar:1.18.1.6-ase]
    at org.apache.flink.runtime.jobmaster.JobMaster.onStop(JobMaster.java:454) ~[flink-dist-1.18.1.6-ase.jar:1.18.1.6-ase]
    at org.apache.flink.runtime.rpc.RpcEndpoint.internalCallOnStop(RpcEndpoint.java:239) ~[flink-dist-1.18.1.6-ase.jar:1.18.1.6-ase]
    at org.apache.flink.runtime.rpc.pekko.PekkoRpcActor$StartedState.lambda$terminate$0(PekkoRpcActor.java:574) ~[flink-rpc-akkadb952114-fa83-4aba-b20a-b7e5771ce59c.jar:1.18.1.6-ase]
    at org.apache.flink.runtime.concurrent.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:83) ~[flink-dist-1.18.1.6-ase.jar:1.18.1.6-ase]
    at org.apache.flink.runtime.rpc.pekko.PekkoRpcActor$StartedState.terminate(PekkoRpcActor.java:573) ~[flink-rpc-akkadb952114-fa83-4aba-b20a-b7e5771ce59c.jar:1.18.1.6-ase]
    at org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.handleControlMessage(PekkoRpcActor.java:196) ~[flink-rpc-akkadb952114-fa83-4aba-b20a-b7e5771ce59c.jar:1.18.1.6-ase]
    at org.apache.pekko.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:33) [flink-rpc-akkadb952114-fa83-4aba-b20a-b7e5771ce59c.jar:1.18.1.6-ase]
    at org.apache.pekko.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:29) [flink-rpc-akkadb952114-fa83-4aba-b20a-b7e5771ce59c.jar:1.18.1.6-ase]
    at scala.PartialFunction.applyOrElse(PartialFunction.scala:127) [flink-rpc-akkadb952114-fa83-4aba-b20a-b7e5771ce59c.jar:1.18.1.6-ase]
    at scala.PartialFunction.applyOrElse$(PartialFunction.scala:126) [flink-rpc-akkadb952114-fa83-4aba-b20a-b7e5771ce59c.jar:1.18.1.6-ase]
TM Error Log:
2024-01-04 11:23:01,334 ERROR org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Fatal error occurred in TaskExecutor pekko.tcp://[email protected]:6122/user/rpc/taskmanager_0.
org.apache.flink.runtime.taskexecutor.exceptions.RegistrationTimeoutException: Could not register at the ResourceManager within the specified maximum registration duration PT5M. This indicates a p
    at org.apache.flink.runtime.taskexecutor.TaskExecutor.registrationTimeout(TaskExecutor.java:1558) ~[flink-dist-1.18.1.6-ase.jar:1.18.1.6-ase]
    at org.apache.flink.runtime.taskexecutor.TaskExecutor.lambda$startRegistrationTimeout$18(TaskExecutor.java:1543) ~[flink-dist-1.18.1.6-ase.jar:1.18.1.6-ase]
    at org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.lambda$handleRunAsync$4(PekkoRpcActor.java:451) ~[flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
    at org.apache.flink.runtime.concurrent.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:68) ~[flink-dist-1.18.1.6-ase.jar:1.18.1.6-ase]
    at org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.handleRunAsync(PekkoRpcActor.java:451) ~[flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
    at org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.handleRpcMessage(PekkoRpcActor.java:218) ~[flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
    at org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.handleMessage(PekkoRpcActor.java:168) ~[flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
    at org.apache.pekko.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:33) [flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
    at org.apache.pekko.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:29) [flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
    at scala.PartialFunction.applyOrElse(PartialFunction.scala:127) [flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
    at scala.PartialFunction.applyOrElse$(PartialFunction.scala:126) [flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
    at org.apache.pekko.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:29) [flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:175) [flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:176) [flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:176) [flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
    at org.apache.pekko.actor.Actor.aroundReceive(Actor.scala:547) [flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
    at org.apache.pekko.actor.Actor.aroundReceive$(Actor.scala:545) [flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
    at org.apache.pekko.actor.AbstractActor.aroundReceive(AbstractActor.scala:229) [flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
    at org.apache.pekko.actor.ActorCell.receiveMessage(ActorCell.scala:590) [flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
    at org.apache.pekko.actor.ActorCell.invoke(ActorCell.scala:557) [flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
    at org.apache.pekko.dispatch.Mailbox.processMailbox(Mailbox.scala:280) [flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
    at org.apache.pekko.dispatch.Mailbox.run(Mailbox.scala:241) [flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
    at org.apache.pekko.dispatch.Mailbox.exec(Mailbox.scala:253) [flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
    at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290) [?:?]
    at java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020) [?:?]
    at java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656) [?:?]
    at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594) [?:?]
    at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:183) [?:?]
> Flink Job stuck in suspend state after recovery from failure in HA Mode
> -----------------------------------------------------------------------
>
> Key: FLINK-34007
> URL: https://issues.apache.org/jira/browse/FLINK-34007
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.18.1, 1.18.2
> Reporter: Zhenqiu Huang
> Priority: Major
>
> The observation is that the JobManager goes into the SUSPENDED state, while a
> failed container is unable to register itself with the ResourceManager and
> times out.
> JM log: see attached.
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)