XComp commented on PR #21137: URL: https://github.com/apache/flink/pull/21137#issuecomment-1292206053
Thanks @reswqa for this PR. I'm wondering how running the leadership grant/revocation in a separate thread would help fix the issue. The locks might still be acquired concurrently in opposite orders, which leads to the deadlock.

The scenario described in FLINK-29234 happens because the Dispatcher is stopped (which, as a consequence, stops the `JobMasterServiceLeadershipRunner`) while the `JobMasterServiceLeadershipRunner` is being granted leadership, so the locks are acquired in opposite orders. I think the problem is that we still try to acquire the lock in [JobMasterServiceLeadershipRunner#runIfStateRunning:453](https://github.com/apache/flink/blob/bfe4f9cc3d67d37a2258ab4226d70b6a7d24f22c/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMasterServiceLeadershipRunner.java#L453) even though the `JobMasterServiceLeadershipRunner` has already switched to the `STOPPED` state.

I'm wondering whether we could make `JobMasterServiceLeadershipRunner#state` volatile and check that the instance is in the `RUNNING` state outside of the lock. But this wouldn't solve the issue entirely, because there's still a slight chance that the state changes after the check but before the lock is entered... :thinking:
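To make the volatile-check idea concrete, here is a minimal sketch. It is not the actual Flink code: `RunnerSketch`, its `State` enum, `close()`, and `executedCount()` are hypothetical stand-ins, and only the method name `runIfStateRunning` and the `RUNNING`/`STOPPED` states are borrowed from the discussion above. It shows the pre-check outside the lock and marks the window in which the state can still change.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the runner; NOT the real JobMasterServiceLeadershipRunner.
final class RunnerSketch {
    enum State { RUNNING, STOPPED }

    private final Object lock = new Object();

    // volatile so the pre-check below can read it without holding the lock
    private volatile State state = State.RUNNING;

    private final AtomicInteger executed = new AtomicInteger();

    /** Runs the action only while RUNNING; returns whether it ran. */
    boolean runIfStateRunning(Runnable action) {
        // Pre-check outside the lock: once stopped, never touch the lock again.
        if (state != State.RUNNING) {
            return false;
        }
        // Race window: close() may switch the state to STOPPED right here,
        // so this thread can still end up waiting for the lock.
        synchronized (lock) {
            // Re-check under the lock, because the pre-check may be stale.
            if (state != State.RUNNING) {
                return false;
            }
            action.run();
            executed.incrementAndGet();
            return true;
        }
    }

    /** Switches to STOPPED while holding the lock. */
    void close() {
        synchronized (lock) {
            state = State.STOPPED;
        }
    }

    int executedCount() {
        return executed.get();
    }
}
```

In this sketch the pre-check only narrows the window. A thread that has passed the pre-check still blocks on `lock` if another thread holds it inside `close()`, so the opposite-order acquisition remains possible, just less likely.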
