[ https://issues.apache.org/jira/browse/FLINK-14316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17071999#comment-17071999 ]

Till Rohrmann commented on FLINK-14316:
---------------------------------------

Hi [~stevenz3wu] and [~pgoyal], sorry for my late reply. 

The logs did not show any specific problems, and it looked as if some network 
issues caused the TM to think that the JM had lost its leadership. 
However, what Piyush wrote and the patch itself pointed me in the right 
direction. The problem is that we don't reset the {{rpcConnection}} in the 
{{JobManagerLeaderListener}} if the JM loses its leadership. Because of this, 
Flink will try to reuse the {{rpcConnection}} if a TM learns about the new old 
leader (a JM with the same leader session id). This, however, fails at the 
{{checkState}} in the {{RegisteredRpcConnection.start}} method, because the 
connection already contains a {{pendingRegistration}}.
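
To make the failure mode more concrete, here is a heavily simplified sketch. The 
class and method names mirror {{JobManagerLeaderListener}} and 
{{RegisteredRpcConnection}}, but the bodies are only an illustration of the 
mechanism, not the actual 1.9/1.10 code:

{code:java}
// Simplified illustration of the stale-connection problem; not the real Flink classes.
public class StaleRpcConnectionDemo {

    static final class RegisteredRpcConnection {
        private Object pendingRegistration; // stands in for the async registration

        void start() {
            // The checkState that trips when a connection object is reused: the
            // pendingRegistration from the previous leadership epoch is still set.
            if (pendingRegistration != null) {
                throw new IllegalStateException("The RPC connection is already started");
            }
            pendingRegistration = new Object();
        }
    }

    static final class JobManagerLeaderListener {
        private RegisteredRpcConnection rpcConnection;
        private String currentLeaderSessionId;

        void notifyLeaderAddress(String leaderAddress, String leaderSessionId) {
            if (rpcConnection != null && leaderSessionId.equals(currentLeaderSessionId)) {
                // "New old leader": same session id, so the stale connection is
                // reused and start() fails on the leftover pendingRegistration.
                rpcConnection.start();
            } else {
                currentLeaderSessionId = leaderSessionId;
                rpcConnection = new RegisteredRpcConnection();
                rpcConnection.start();
            }
        }

        void jobManagerLostLeadership() {
            // Pre-fix behaviour: rpcConnection is NOT cleared here (the root cause).
        }
    }

    public static void main(String[] args) {
        JobManagerLeaderListener listener = new JobManagerLeaderListener();
        listener.notifyLeaderAddress("akka.tcp://flink@jm:6123", "session-1"); // works
        listener.jobManagerLostLeadership();
        // Same JM regains leadership with the same session id:
        // throws IllegalStateException("The RPC connection is already started").
        listener.notifyLeaderAddress("akka.tcp://flink@jm:6123", "session-1");
    }
}
{code}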

The patch works because it always creates a new {{RpcConnection}}. 
Alternatively, one could null the {{rpcConnection}} when the JM loses its 
leadership. I actually fixed this problem in 1.11 (FLINK-16836) while working on 
another problem. I will backport this fix to 1.10 and 1.9.
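
Sketched against the simplified classes above, the alternative would look 
roughly like this (again just an illustration, not the actual patch or the 
FLINK-16836 fix):

{code:java}
// Alternative fix, applied to the simplified JobManagerLeaderListener above:
// clear the stale connection as soon as the JM loses leadership, so that a later
// grant with the same session id always starts from a fresh RegisteredRpcConnection.
void jobManagerLostLeadership() {
    if (rpcConnection != null) {
        // In real code the old connection would also be closed/cancelled here.
        rpcConnection = null;
    }
    currentLeaderSessionId = null;
}
{code}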

Last but not least, you might ask why the problem did not occur in the logs 
you've attached. I think the reason is that it took ZooKeeper long enough to 
realize that the old JobManager was still the leader, so the slots on the 
{{TaskExecutor}} could time out. Once all slots belonging to a job have timed 
out, the {{JobManagerLeaderListener}} will be closed.

Thanks again for reporting this issue and sorry for the long waiting time.



> stuck in "Job leader ... lost leadership" error
> -----------------------------------------------
>
>                 Key: FLINK-14316
>                 URL: https://issues.apache.org/jira/browse/FLINK-14316
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.7.2
>            Reporter: Steven Zhen Wu
>            Priority: Major
>         Attachments: FLINK-14316.tgz, RpcConnection.patch
>
>
> This is the first exception that caused the restart loop. Later exceptions are 
> the same. The job seems to be stuck in this permanent failure state.
> {code}
> 2019-10-03 21:42:46,159 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Source: clpevents -> device_filter -> processed_imps -> ios_processed_impression -> imps_ts_assigner (449/1360) (d237f5e99b6a4a580498821473763edb) switched from SCHEDULED to FAILED.
> java.lang.Exception: Job leader for job id ecb9ad9be934edf7b1a4f7b9dd6df365 lost leadership.
>         at org.apache.flink.runtime.taskexecutor.TaskExecutor$JobLeaderListenerImpl.lambda$jobManagerLostLeadership$1(TaskExecutor.java:1526)
>         at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
>         at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
>         at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
>         at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
>         at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
>         at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
>         at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
>         at akka.actor.ActorCell.invoke(ActorCell.scala:495)
>         at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
>         at akka.dispatch.Mailbox.run(Mailbox.scala:224)
>         at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
>         at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>         at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>         at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>         at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
