[
https://issues.apache.org/jira/browse/FLINK-22516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Flink Jira Bot updated FLINK-22516:
-----------------------------------
Labels: auto-deprioritized-major (was: stale-major)
Priority: Minor (was: Major)
This issue was labeled "stale-major" 7 days ago and has not received any
updates so it is being deprioritized. If this ticket is actually Major, please
raise the priority and ask a committer to assign you the issue or revive the
public discussion.
> ResourceManager cannot establish leadership
> -------------------------------------------
>
> Key: FLINK-22516
> URL: https://issues.apache.org/jira/browse/FLINK-22516
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.9.3
> Environment: 1.9.3 Flink version, on kubernetes.
> Reporter: Ricky Burnett
> Priority: Minor
> Labels: auto-deprioritized-major
> Attachments: jm1.log, jm2.log, jobmanager_leadership.log
>
>
> We are running Flink clusters with 2 JobManagers in HA mode. After a
> ZooKeeper restart, the two JMs begin leadership election and end up in a
> state where they are both trying to start their ResourceManager until one of
> them writes to `leader/<jobid>/resource_manager_lock`, at which point the
> other JobManager's JobMaster executes `notifyOfNewResourceManagerLeader`,
> which restarts the ResourceManager. This in turn writes to
> `leader/<jobid>/resource_manager_lock`, which triggers the first JobMaster to
> restart its ResourceManager. We can see this in the logs from the
> "ResourceManager leader changed to new address" message, which goes back and
> forth between the two JMs and the two IP addresses. This cycle appears to
> continue indefinitely without outside interruption.
> I've attached combined logs from two JMs in our environment that got into
> this state. The logs start with the loss of connection and end with a couple
> of cycles of back and forth. The two relevant hosts are
> "flink-jm-828d4aa2-d4d4-457b-995d-feb56d08c1fb-784cdb9c57-tsxb7" and
> "flink-jm-828d4aa2-d4d4-457b-995d-feb56d08c1fb-784cdb9c57-mpf9x".
> *-tsxb7 appears to be the last host that was granted leadership.
> {code:java}
> {"thread":"Curator-Framework-0-EventThread","level":"INFO","loggerName":"org.apache.flink.runtime.jobmaster.JobManagerRunner","message":"JobManager
> runner for job tenant: ssademo, pipeline:
> 828d4aa2-d4d4-457b-995d-feb56d08c1fb, name: integration-test-detection
> (33e12948df69077ab3b33316eacbb5e4) was granted leadership with session id
> 97992805-9c60-40ba-8260-aaf036694cde at
> akka.tcp://[email protected]:6123/user/jobmanager_3.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","instant":{"epochSecond":1617129712,"nanoOfSecond":447000000},"contextMap":{},"threadId":152,"threadPriority":5,"source":{"class":"org.apache.flink.runtime.jobmaster.JobManagerRunner","method":"startJobMaster","file":"JobManagerRunner.java","line":313},"service":"streams","time":"2021-03-30T18:41:52.447UTC","hostname":"flink-jm-828d4aa2-d4d4-457b-995d-feb56d08c1fb-784cdb9c57-tsxb7"}
> {code}
> But *-mpf9x continues to try to wrest control back.
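One way to confirm the back-and-forth described above is to count how often the address announced in the "ResourceManager leader changed to new address" messages differs from the previously announced one. The following is a minimal sketch, not part of the original report. It assumes the address is the text that follows that phrase in the log message, and the class name `LeaderFlapDetector` and the sample addresses `addr-A` and `addr-B` are hypothetical placeholders.

```java
import java.util.Arrays;
import java.util.List;

class LeaderFlapDetector {
    // Phrase quoted in the issue description.
    static final String MARKER = "ResourceManager leader changed to new address";

    /** Returns the text after the marker (assumed to be the address), or null if the marker is absent. */
    static String extractAddress(String message) {
        int idx = message.indexOf(MARKER);
        if (idx < 0) {
            return null;
        }
        return message.substring(idx + MARKER.length()).trim();
    }

    /** Counts how many times the announced address differs from the previously announced one. */
    static int countFlips(List<String> messages) {
        String previous = null;
        int flips = 0;
        for (String message : messages) {
            String address = extractAddress(message);
            if (address == null) {
                continue; // unrelated log line
            }
            if (previous != null && !previous.equals(address)) {
                flips++;
            }
            previous = address;
        }
        return flips;
    }

    public static void main(String[] args) {
        // Hypothetical messages; real input would be the "message" field of each JSON log line.
        List<String> sample = Arrays.asList(
            MARKER + " addr-A",
            "some unrelated message",
            MARKER + " addr-B",
            MARKER + " addr-A",
            MARKER + " addr-B");
        System.out.println("flips: " + countFlips(sample));
    }
}
```

A flip count that keeps growing over time on both JMs would be consistent with the cycle described in the issue. A healthy election should settle after a small number of changes.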
--
This message was sent by Atlassian Jira
(v8.3.4#803005)