Jason Kania created FLINK-9371:
----------------------------------
Summary: High Availability JobManager Registration Failure
Key: FLINK-9371
URL: https://issues.apache.org/jira/browse/FLINK-9371
Project: Flink
Issue Type: Bug
Components: Core
Affects Versions: 1.4.2
Reporter: Jason Kania
The following error occurs intermittently on a 3-node cluster with 2
JobManagers configured in HA mode. When this happens, the two JobManager
instances are associated with one another.
2018-05-15 19:00:06,400 INFO org.apache.flink.runtime.clusterframework.standalone.StandaloneResourceManager - Trying to associate with JobManager leader akka.tcp://flink@aaa-1:50000/user/jobmanager
2018-05-15 19:00:06,404 WARN org.apache.flink.runtime.jobmanager.JobManager - Discard message LeaderSessionMessage(0bbe70c4-2642-4a08-912f-6cc09646281f,RegisterResourceManager akka://flink/user/resourcemanager-d6567c5d-85f4-4b18-8eac-cf9725d076a5) because there is currently no valid leader id known.
2018-05-15 19:00:16,418 ERROR org.apache.flink.runtime.clusterframework.standalone.StandaloneResourceManager - Resource manager could not register at JobManager
akka.pattern.AskTimeoutException: Ask timed out on [ActorSelection[Anchor(akka://flink/), Path(/user/jobmanager)]] after [10000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.messages.JobManagerMessages$LeaderSessionMessage".
    at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)
    at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)
    at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
    at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
    at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
    at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)
    at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)
    at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)
    at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)
    at java.lang.Thread.run(Thread.java:748)
Sometimes the following log entry also appears after the one above:
2018-05-15 19:13:47,525 WARN org.apache.flink.runtime.clusterframework.standalone.StandaloneResourceManager - Discard message LeaderSessionMessage(5cab29b9-10d3-4b25-b934-f06b82be15b5,TriggerRegistrationAtJobManager akka.tcp://flink@aaa-1:50000/user/jobmanager) because the expected leader session ID 61075587-51da-4e58-ac4f-9ea118ccdde9 did not equal the received leader session ID 5cab29b9-10d3-4b25-b934-f06b82be15b5.
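
For anyone trying to spot these mismatches in a larger log file, here is a rough sketch of how the session IDs could be pulled out of a "Discard message" entry like the one above. It is only an illustration based on the message text quoted in this report: the names DISCARD_RE and parse_discard are made up for this example and are not part of Flink, and the pattern may need adjusting for other Flink versions or log layouts.

```python
import re

# Hypothetical helper, not part of Flink: matches the "Discard message"
# warning quoted in this report. Whitespace is matched loosely so the
# pattern works whether or not the log line has been wrapped.
DISCARD_RE = re.compile(
    r"LeaderSessionMessage\((?P<msg_id>[0-9a-f-]{36}),(?P<payload>\w+)"
    r".*?expected\s+leader\s+session\s+ID\s+(?P<expected>[0-9a-f-]{36})"
    r".*?received\s+leader\s+session\s+ID\s+(?P<received>[0-9a-f-]{36})",
    re.DOTALL,
)


def parse_discard(log_text):
    """Return the IDs from the first matching entry, or None if none match."""
    match = DISCARD_RE.search(log_text)
    if match is None:
        return None
    return match.groupdict()


# The warning from this report, copied verbatim.
SAMPLE = (
    "Discard message "
    "LeaderSessionMessage(5cab29b9-10d3-4b25-b934-f06b82be15b5,"
    "TriggerRegistrationAtJobManager "
    "akka.tcp://flink@aaa-1:50000/user/jobmanager) because the expected "
    "leader session ID 61075587-51da-4e58-ac4f-9ea118ccdde9 did not equal "
    "the received leader session ID 5cab29b9-10d3-4b25-b934-f06b82be15b5."
)

result = parse_discard(SAMPLE)
if result is not None:
    print("payload: ", result["payload"])
    print("expected:", result["expected"])
    print("received:", result["received"])
    # True when the discarded message was stamped with the received ID.
    print("message carries received ID:", result["msg_id"] == result["received"])
```

In the entry quoted above, the ID inside LeaderSessionMessage is the same as the received leader session ID, so the discarded message was stamped with a session ID other than the one the ResourceManager expected at that time.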