georgew5656 opened a new pull request, #17546: URL: https://github.com/apache/druid/pull/17546
It's possible for the stopAndCreateNewLeaderLatch to behave poorly for CuratorDruidLeaderSelector.notLeader if listener.stopBeingLeader() takes too long. From the comments in the code it seems like the intent of the code is to stop the overlord/coordinator that is going through the CuratorDruidLeaderSelector.notLeader process from regaining leadership, but the current order of the code is. 1. listener.stopBeingLeader(); 2. close the old leader latch and create a new one (but don't start it) 3. wait a period 4. start the new leader latch Before step 2, the overlord is able to be reassigned leadership in ZK (since its latch hasn't been closed yet). this can be a problem if listener.stopBeingLeader() takes too long (this can take a while on the overlord, specifically the process of shutting down the SupervisorManager (see: https://github.com/apache/druid/pull/17535)), since the overlord can regain leadership as it is shutting down. This PR changes the order of the logic to prevent this issue. 1. close the old leader latch and create a new one (but don't start it) 2. listener.stopBeingLeader(); 3. wait a period 4. start the new leader latch I have tested this in a test druid cluster by killing ZK leaders and observing faster recovery than without this change. ### Description - Separate recreateLeaderLatch into stopAndCreateNewLeaderLatch/startLeaderLatch. - During a graceful leadership shutdown, call stopAndCreateNewLeaderLatch before notifying listeners to prevent the overlord/coordinator from being reassigned leadership. #### Release note Improve resilience of druid service leadership to zk outages ##### Key changed/added classes in this PR * `CuratorDruidLeaderSelector` This PR has: - [X] been self-reviewed. - [ ] using the [concurrency checklist](https://github.com/apache/druid/blob/master/dev/code-review/concurrency.md) (Remove this item if the PR doesn't have any relation to concurrency.) - [ ] added documentation for new or modified features or behaviors. - [ ] a release note entry in the PR description. - [ ] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links. - [ ] added or updated version, license, or notice information in [licenses.yaml](https://github.com/apache/druid/blob/master/dev/license.md) - [ ] added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader. - [ ] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for [code coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md) is met. - [ ] added integration tests. - [X] been tested in a test Druid cluster. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
