xintongsong commented on PR #21137: URL: https://github.com/apache/flink/pull/21137#issuecomment-1303003420
This is indeed a fascinating discussion. I see the point that there are some calls in `cloaseAsync` that may or may not run under the lock. I also agree that it would be nice the lock only covers the calls that need to be guarded and leave the unnecessary calls outside. I think my focus is more about preventing (or at least reducing the chances of) similar problems. In that sense, it's probably better that the runner does not need to (or as least as possible) worry about another lock from the leader election service. To that end, I think @XComp's proposal makes more sense than mine. Ideally, leader election service should not call `contender.grant/revokeLeadership()` under a lock while the same lock can be accessed by the contender via `stop()`. To clarify, fixing the problem on the service side does not justify that calls in `cloaseAsync` may or may not run under the lock. Ideally, `A` should not call `B.b()` under a lock while another method `A.a()` called by `B` can access the same lock. That applies to both the service and the runner, and maybe we should fix that for both. All in all, I'd suggest: - Fix the runner side issue in `closeAsync` in this PR. For this purpose solely, we probably don't need a dedicated executor. - Fix the leader election service side issue with a dedicated executor, in a follow up ticket. That would affect all contenders and we need to carefully check that no existing contenders are relying on the current behavior that `grant/revokeLeadership` are called under lock. We should also clean up things like `ResourceManagerServiceImpl.handleLeaderEventExecutor`. WDYT? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
