XComp commented on PR #21137:
URL: https://github.com/apache/flink/pull/21137#issuecomment-1292206053

   Thanks @reswqa for this PR. I'm wondering how moving the leadership 
granting/revocation calls to another thread would help fix the issue. The 
locks might still be acquired concurrently in opposite orders, leading to the 
same deadlock.
   
   The scenario described in FLINK-29234 essentially happens because the 
Dispatcher is stopped (which, as a consequence, stops the 
`JobMasterServiceLeadershipRunner`) while the 
`JobMasterServiceLeadershipRunner` is being granted leadership, so the two 
code paths acquire the locks in opposite orders.
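
   To make the interleaving concrete, here is a minimal standalone sketch (not 
Flink code). The names `runnerLock`, `otherLock`, `stopPath` and `grantPath` 
are hypothetical stand-ins for the two locks and the two code paths. It uses 
`tryLock` with a timeout instead of a plain `lock()` so that the demo 
terminates instead of actually deadlocking.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

public class OppositeLockOrderSketch {

    /** Returns how many of the two threads could not get their second lock. */
    public static int run() {
        // Hypothetical stand-ins for the two locks involved in the deadlock.
        ReentrantLock runnerLock = new ReentrantLock();
        ReentrantLock otherLock = new ReentrantLock();
        CountDownLatch bothHoldFirst = new CountDownLatch(2);
        CountDownLatch bothTried = new CountDownLatch(2);
        AtomicInteger blocked = new AtomicInteger();

        // "Stop" path: otherLock first, then runnerLock.
        Thread stopPath = new Thread(() ->
                acquireInOrder(otherLock, runnerLock, bothHoldFirst, bothTried, blocked));
        // "Grant leadership" path: runnerLock first, then otherLock.
        Thread grantPath = new Thread(() ->
                acquireInOrder(runnerLock, otherLock, bothHoldFirst, bothTried, blocked));

        stopPath.start();
        grantPath.start();
        try {
            stopPath.join();
            grantPath.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return blocked.get();
    }

    private static void acquireInOrder(
            ReentrantLock first,
            ReentrantLock second,
            CountDownLatch bothHoldFirst,
            CountDownLatch bothTried,
            AtomicInteger blocked) {
        first.lock();
        try {
            // Wait until both threads hold their first lock.
            bothHoldFirst.countDown();
            bothHoldFirst.await();
            // With a plain lock() this call would never return.
            if (second.tryLock(100, TimeUnit.MILLISECONDS)) {
                second.unlock();
            } else {
                blocked.incrementAndGet();
            }
            // Keep holding the first lock until the other thread has tried too.
            bothTried.countDown();
            bothTried.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            first.unlock();
        }
    }

    public static void main(String[] args) {
        System.out.println("threads blocked on their second lock: " + run());
    }
}
```

   Both threads end up blocked no matter which thread runs which path, which 
is why moving the call to another thread alone does not seem to change the 
picture.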
   
   I think the problem is that we still try to acquire the lock in 
[JobMasterServiceLeadershipRunner#runIfStateRunning:453](https://github.com/apache/flink/blob/bfe4f9cc3d67d37a2258ab4226d70b6a7d24f22c/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMasterServiceLeadershipRunner.java#L453)
 even though the `JobMasterServiceLeadershipRunner` has already switched to the 
`STOPPED` state. I'm wondering whether we could make 
`JobMasterServiceLeadershipRunner#state` volatile and check that the instance 
is in the `RUNNING` state outside of the lock. But this wouldn't solve the issue 
entirely, because there's still a slight chance that the state changes after 
the check but before the lock is entered... :thinking: 
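
   A rough sketch of that idea, as a standalone class rather than the actual 
`JobMasterServiceLeadershipRunner` code. The `betweenCheckAndLock` hook and 
the `lockAcquisitions` counter are hypothetical and exist only to make the 
remaining race window observable in a single thread.

```java
public class VolatileStateCheckSketch {

    public enum State { RUNNING, STOPPED }

    private final Object lock = new Object();
    private volatile State state = State.RUNNING;
    private int lockAcquisitions; // guarded by lock, for illustration only

    /**
     * Runs the action only while RUNNING. The hook simulates whatever another
     * thread might do between the lock-free check and entering the lock.
     */
    public boolean runIfStateRunning(Runnable action, Runnable betweenCheckAndLock) {
        // Fast path: skip the lock entirely once the runner is stopped.
        if (state != State.RUNNING) {
            return false;
        }
        betweenCheckAndLock.run();
        synchronized (lock) {
            lockAcquisitions++;
            // Re-check under the lock: the state may have changed in between.
            if (state != State.RUNNING) {
                return false;
            }
            action.run();
            return true;
        }
    }

    public void stop() {
        synchronized (lock) {
            state = State.STOPPED;
        }
    }

    public int getLockAcquisitions() {
        synchronized (lock) {
            return lockAcquisitions;
        }
    }

    public static void main(String[] args) {
        VolatileStateCheckSketch runner = new VolatileStateCheckSketch();
        // The stop happens inside the window: the lock is still entered once.
        boolean ran = runner.runIfStateRunning(() -> {}, runner::stop);
        System.out.println("ran=" + ran + ", lockAcquisitions=" + runner.getLockAcquisitions());
        // Already stopped: the fast path returns without touching the lock.
        ran = runner.runIfStateRunning(() -> {}, () -> {});
        System.out.println("ran=" + ran + ", lockAcquisitions=" + runner.getLockAcquisitions());
    }
}
```

   The second call shows the benefit (no lock acquisition once stopped), while 
the first call shows the window that remains: the check passes, the state 
flips, and the thread still goes for the lock.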


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.