[jira] [Comment Edited] (FLINK-34672) HA deadlock between JobMasterServiceLeadershipRunner and DefaultLeaderElectionService

Matthias Pohl (Jira) Tue, 19 Mar 2024 02:21:11 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-34672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17828221#comment-17828221
 ]


Matthias Pohl edited comment on FLINK-34672 at 3/19/24 9:20 AM:
----------------------------------------------------------------

The two locks shouldn't be nested because any leadership-related operation 
should run sequentially in the leadership operation executor.

I also verified that this is not something that was introduced in Flink 1.18 
with the 
[FLIP-285|https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+Refactoring+LeaderElection+to+make+Flink+support+multi-component+leader+election+out-of-the-box]
 changes. AFAIS, it can also happen in 1.17- (I didn't check the 
pre-FLINK-24038 code but only looked into {{release-1.17}}).

One solution would be to move the async callback of 
[JobMasterServiceLeadershipRunner#forwardIfValidLeader|https://github.com/apache/flink/blob/c9fcb0c74b1354f4f0f1b7c7f62191b8cc6b5725/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMasterServiceLeadershipRunner.java#L547]
 into the leader-operation executor of the {{DefaultLeaderElectionService}} to 
force sequential execution of leadership-related operations.

This is only an issue in the {{JobMasterServiceLeadershipRunner}} because we're 
executing the creation asynchronously in an io thread. The other place where we 
check within the contender whether leadership is acquired is the 
{{DefaultDispatcherRunner}}. But we're not doing any async calls there during 
leadership handling (the {{DefaultDispatcherRunner}} is created directly in the 
leader-operation executor while handling the leadership acquired event).

The {{LeaderElection}}/{{LeaderElectionService}} interface is flawed in this 
sense. We shouldn't expose the {{hasLeadership}} but have a 
{{runLeaderOperation}} method that abstracts away the leadership check. In 
{{DefaultLeaderElectionService}} all these operations would be handled by the 
leader-operation executor.


was (Author: mapohl):
The two locks shouldn't be nested becaus any leadership-related operation 
should run sequentially in the leadership operation executor.

I also verified that this is not something that was introduced in Flink 1.18 
with the 
[FLIP-285|https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+Refactoring+LeaderElection+to+make+Flink+support+multi-component+leader+election+out-of-the-box]
 changes. AFAIS, it can also happen in 1.17- (I didn't check the 
pre-FLINK-24038 code but only looked into {{release-1.17}}).

One solution would be to move the async callback of 
[JobMasterServiceLeadershipRunner#forwardIfValidLeader|https://github.com/apache/flink/blob/c9fcb0c74b1354f4f0f1b7c7f62191b8cc6b5725/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMasterServiceLeadershipRunner.java#L547]
 into the leader-operation executor of the {{DefaultLeaderElectionService}} to 
force sequential execution of leadership-related operations.

This is only an issue in the {{JobMasterServiceLeadershipRunner}} because we're 
executing the creation asynchronously in an io thread. The other place where we 
check within the contender whether leadership is acquired is the 
{{DefaultDispatcherRunner}}. But we're not doing any async calls there during 
leadership handling (the {{DefaultDispatcherRunner}} is created directly in the 
leader-operation executor while handling the leadership acquired event).

The {{LeaderElection}}/{{LeaderElectionService}} interface is flawed in this 
sense. We shouldn't expose the {{hasLeadership}} but have a 
{{runLeaderOperation}} method that abstracts away the leadership check. In 
{{DefaultLeaderElectionService}} all these operations would be handled by the 
leader-operation executor.

> HA deadlock between JobMasterServiceLeadershipRunner and 
> DefaultLeaderElectionService
> -------------------------------------------------------------------------------------
>
>                 Key: FLINK-34672
>                 URL: https://issues.apache.org/jira/browse/FLINK-34672
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.18.1
>            Reporter: Chesnay Schepler
>            Priority: Major
>             Fix For: 1.18.2, 1.20.0, 1.19.1
>
>
> We recently observed a deadlock in the JM within the HA system.
> (see below for the thread dump)
> [~mapohl] and I looked a bit into it and there appears to be a race condition 
> when leadership is revoked while a JobMaster is being started.
> It appears to be caused by 
> {{JobMasterServiceLeadershipRunner#createNewJobMasterServiceProcess}} 
> forwarding futures while holding a lock; depending on whether the forwarded 
> future is already complete the next stage may or may not run while holding 
> that same lock.
> We haven't determined yet whether we should be holding that lock or not.
> {code}
> "DefaultLeaderElectionService-leadershipOperationExecutor-thread-1" #131 
> daemon prio=5 os_prio=0 cpu=157.44ms elapsed=78749.65s tid=0x00007f531f43d000 
> nid=0x19d waiting for monitor entry  [0x00007f53084fd000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>         at 
> org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.runIfStateRunning(JobMasterServiceLeadershipRunner.java:462)
>         - waiting to lock <0x00000000f1c0e088> (a java.lang.Object)
>         at 
> org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.revokeLeadership(JobMasterServiceLeadershipRunner.java:397)
>         at 
> org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService.notifyLeaderContenderOfLeadershipLoss(DefaultLeaderElectionService.java:484)
>         at 
> org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService$$Lambda$1252/0x0000000840ddec40.accept(Unknown
>  Source)
>         at java.util.HashMap.forEach([email protected]/HashMap.java:1337)
>         at 
> org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService.onRevokeLeadershipInternal(DefaultLeaderElectionService.java:452)
>         at 
> org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService$$Lambda$1251/0x0000000840dcf840.run(Unknown
>  Source)
>         at 
> org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService.lambda$runInLeaderEventThread$3(DefaultLeaderElectionService.java:549)
>         - locked <0x00000000f0e3f4d8> (a java.lang.Object)
>         at 
> org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService$$Lambda$1075/0x0000000840c23040.run(Unknown
>  Source)
>         at 
> java.util.concurrent.CompletableFuture$AsyncRun.run([email protected]/CompletableFuture.java:1736)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker([email protected]/ThreadPoolExecutor.java:1128)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run([email protected]/ThreadPoolExecutor.java:628)
>         at java.lang.Thread.run([email protected]/Thread.java:829)
> {code}
> {code}
> "jobmanager-io-thread-1" #636 daemon prio=5 os_prio=0 cpu=125.56ms 
> elapsed=78699.01s tid=0x00007f5321c6e800 nid=0x396 waiting for monitor entry  
> [0x00007f530567d000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>         at 
> org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService.hasLeadership(DefaultLeaderElectionService.java:366)
>         - waiting to lock <0x00000000f0e3f4d8> (a java.lang.Object)
>         at 
> org.apache.flink.runtime.leaderelection.DefaultLeaderElection.hasLeadership(DefaultLeaderElection.java:52)
>         at 
> org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.isValidLeader(JobMasterServiceLeadershipRunner.java:509)
>         at 
> org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.lambda$forwardIfValidLeader$15(JobMasterServiceLeadershipRunner.java:520)
>         - locked <0x00000000f1c0e088> (a java.lang.Object)
>         at 
> org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner$$Lambda$1320/0x0000000840e1a840.accept(Unknown
>  Source)
>         at 
> java.util.concurrent.CompletableFuture.uniWhenComplete([email protected]/CompletableFuture.java:859)
>         at 
> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire([email protected]/CompletableFuture.java:837)
>         at 
> java.util.concurrent.CompletableFuture.postComplete([email protected]/CompletableFuture.java:506)
>         at 
> java.util.concurrent.CompletableFuture.complete([email protected]/CompletableFuture.java:2079)
>         at 
> org.apache.flink.runtime.jobmaster.DefaultJobMasterServiceProcess.registerJobMasterServiceFutures(DefaultJobMasterServiceProcess.java:124)
>         at 
> org.apache.flink.runtime.jobmaster.DefaultJobMasterServiceProcess.lambda$new$0(DefaultJobMasterServiceProcess.java:114)
>         at 
> org.apache.flink.runtime.jobmaster.DefaultJobMasterServiceProcess$$Lambda$1319/0x0000000840e1a440.accept(Unknown
>  Source)
>         at 
> java.util.concurrent.CompletableFuture.uniWhenComplete([email protected]/CompletableFuture.java:859)
>         at 
> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire([email protected]/CompletableFuture.java:837)
>         at 
> java.util.concurrent.CompletableFuture.postComplete([email protected]/CompletableFuture.java:506)
>         at 
> java.util.concurrent.CompletableFuture$AsyncSupply.run([email protected]/CompletableFuture.java:1705)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker([email protected]/ThreadPoolExecutor.java:1128)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run([email protected]/ThreadPoolExecutor.java:628)
>         at java.lang.Thread.run([email protected]/Thread.java:829)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Comment Edited] (FLINK-34672) HA deadlock between JobMasterServiceLeadershipRunner and DefaultLeaderElectionService

Reply via email to