[ 
https://issues.apache.org/jira/browse/FLINK-22516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ricky Burnett updated FLINK-22516:
----------------------------------
    Description: 
We are running Flink clusters with 2 JobManagers in HA mode.  After a ZooKeeper 
restart, the two JMs begin leadership election and end up in a state where they 
are both trying to start their ResourceManager.  One of them writes to 
`leader/<jobid>/resource_manager_lock`, and the other JobManager's JobMaster 
then executes `notifyOfNewResourceManagerLeader`, which restarts the 
ResourceManager.  This in turn writes to `leader/<jobid>/resource_manager_lock`, 
which triggers the first JobMaster to restart its ResourceManager.  We can see 
this in the "ResourceManager leader changed to new address" log messages, which 
go back and forth between the two JMs and the two IP addresses.  This cycle 
appears to continue indefinitely without outside interruption.
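
As a rough way to quantify the back and forth in the attached logs, the sketch 
below counts the "ResourceManager leader changed to new address" records per 
host and how often consecutive records come from different hosts.  It assumes 
that each record is a single-line JSON object with `message`, `hostname` and 
`time` fields, as in the sample log line in this report, and that the combined 
log is in chronological order.  The names `MARKER`, `leader_change_events` and 
`summarize` are illustrative and not part of Flink.

```python
import json
from collections import Counter

# Substring quoted in this report; the full message text is an assumption.
MARKER = "ResourceManager leader changed to new address"


def leader_change_events(lines):
    """Yield (time, hostname) for each JSON log record containing MARKER."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a standalone JSON object
        if not isinstance(record, dict):
            continue
        if MARKER in str(record.get("message", "")):
            yield record.get("time"), record.get("hostname")


def summarize(lines):
    """Return (events per host, number of host-to-host alternations)."""
    events = list(leader_change_events(lines))
    per_host = Counter(host for _, host in events)
    # An alternation is a pair of consecutive events logged by different hosts.
    alternations = sum(
        1 for (_, a), (_, b) in zip(events, events[1:]) if a != b
    )
    return per_host, alternations
```

Passing the lines of `jobmanager_leadership.log` to `summarize` should then 
show roughly equal counts for the two hosts and an alternation count close to 
the total number of events if the JMs really are trading the lock back and 
forth.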

I've attached combined logs from two JMs in our environment that got into this 
state.  The logs start with the loss of connection and end with a couple of 
cycles of back and forth.   The two relevant hosts are 
"flink-jm-828d4aa2-d4d4-457b-995d-feb56d08c1fb-784cdb9c57-tsxb7" and 
"flink-jm-828d4aa2-d4d4-457b-995d-feb56d08c1fb-784cdb9c57-mpf9x".

*-tsxb7 appears to be the last host that was granted leadership. 
{code:java}
{"thread":"Curator-Framework-0-EventThread","level":"INFO","loggerName":"org.apache.flink.runtime.jobmaster.JobManagerRunner","message":"JobManager
 runner for job tenant: ssademo, pipeline: 
828d4aa2-d4d4-457b-995d-feb56d08c1fb, name: integration-test-detection 
(33e12948df69077ab3b33316eacbb5e4) was granted leadership with session id 
97992805-9c60-40ba-8260-aaf036694cde at 
akka.tcp://[email protected]:6123/user/jobmanager_3.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","instant":{"epochSecond":1617129712,"nanoOfSecond":447000000},"contextMap":{},"threadId":152,"threadPriority":5,"source":{"class":"org.apache.flink.runtime.jobmaster.JobManagerRunner","method":"startJobMaster","file":"JobManagerRunner.java","line":313},"service":"streams","time":"2021-03-30T18:41:52.447UTC","hostname":"flink-jm-828d4aa2-d4d4-457b-995d-feb56d08c1fb-784cdb9c57-tsxb7"}
{code}
But *-mpf9x continues to try to wrest control back.

  was:
We are running Flink clusters with 2 JobManagers in HA mode.  After a ZooKeeper 
restart, the two JMs begin leadership election and end up in a state where they 
are both trying to start their ResourceManager.  One of them writes to 
`leader/<jobid>/resource_manager_lock`, and the JobMaster then executes 
`notifyOfNewResourceManagerLeader`, which restarts the ResourceManager.  This in 
turn writes to `leader/<jobid>/resource_manager_lock`, which triggers the other 
JobMaster to restart its ResourceManager.  We can see this in the 
"ResourceManager leader changed to new address" log messages, which go back and 
forth between the two JMs and the two IP addresses.  This cycle appears to 
continue indefinitely without outside interruption.

I've attached combined logs from two JMs in our environment that got into this 
state.  The logs start with the loss of connection and end with a couple of 
cycles of back and forth.   The two relevant hosts are 
"flink-jm-828d4aa2-d4d4-457b-995d-feb56d08c1fb-784cdb9c57-tsxb7" and 
"flink-jm-828d4aa2-d4d4-457b-995d-feb56d08c1fb-784cdb9c57-mpf9x".

*-tsxb7 appears to be the last host that was granted leadership. 
{code:java}
{"thread":"Curator-Framework-0-EventThread","level":"INFO","loggerName":"org.apache.flink.runtime.jobmaster.JobManagerRunner","message":"JobManager
 runner for job tenant: ssademo, pipeline: 
828d4aa2-d4d4-457b-995d-feb56d08c1fb, name: integration-test-detection 
(33e12948df69077ab3b33316eacbb5e4) was granted leadership with session id 
97992805-9c60-40ba-8260-aaf036694cde at 
akka.tcp://[email protected]:6123/user/jobmanager_3.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","instant":{"epochSecond":1617129712,"nanoOfSecond":447000000},"contextMap":{},"threadId":152,"threadPriority":5,"source":{"class":"org.apache.flink.runtime.jobmaster.JobManagerRunner","method":"startJobMaster","file":"JobManagerRunner.java","line":313},"service":"streams","time":"2021-03-30T18:41:52.447UTC","hostname":"flink-jm-828d4aa2-d4d4-457b-995d-feb56d08c1fb-784cdb9c57-tsxb7"}
{code}
But *-mpf9x continues to try to wrest control back.


> ResourceManager cannot establish leadership
> -------------------------------------------
>
>                 Key: FLINK-22516
>                 URL: https://issues.apache.org/jira/browse/FLINK-22516
>             Project: Flink
>          Issue Type: Bug
>            Reporter: Ricky Burnett
>            Priority: Major
>         Attachments: jobmanager_leadership.log
>
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
