georgew5656 opened a new pull request, #17546:
URL: https://github.com/apache/druid/pull/17546

   It's possible for the stopAndCreateNewLeaderLatch to behave poorly for 
CuratorDruidLeaderSelector.notLeader if listener.stopBeingLeader() takes too 
long. From the comments in the code it seems like the intent of the code is to 
stop the overlord/coordinator that is going through the 
CuratorDruidLeaderSelector.notLeader process from regaining leadership, but the 
current order of the code is.
   
   1. listener.stopBeingLeader();
   2. close the old leader latch and create a new one (but don't start it)
   3. wait a period
   4. start the new leader latch
   
   Before step 2, the overlord is able to be reassigned leadership in ZK (since 
its latch hasn't been closed yet). this can be a problem if 
listener.stopBeingLeader() takes too long (this can take a while on the 
overlord, specifically the process of shutting down the SupervisorManager (see: 
https://github.com/apache/druid/pull/17535)), since the overlord can regain 
leadership as it is shutting down.
   
   This PR changes the order of the logic to prevent this issue.
   1. close the old leader latch and create a new one (but don't start it)
   2. listener.stopBeingLeader();
   3. wait a period
   4. start the new leader latch
   
   I have tested this in a test druid cluster by killing ZK leaders and 
observing faster recovery than without this change.
   ### Description
   - Separate recreateLeaderLatch into 
stopAndCreateNewLeaderLatch/startLeaderLatch.
   - During a graceful leadership shutdown, call stopAndCreateNewLeaderLatch 
before notifying listeners to prevent the overlord/coordinator from being 
reassigned leadership.
   
   
   #### Release note
   Improve resilience of druid service leadership to zk outages
   
   ##### Key changed/added classes in this PR
    * `CuratorDruidLeaderSelector`
   
   
   This PR has:
   
   - [X] been self-reviewed.
      - [ ] using the [concurrency 
checklist](https://github.com/apache/druid/blob/master/dev/code-review/concurrency.md)
 (Remove this item if the PR doesn't have any relation to concurrency.)
   - [ ] added documentation for new or modified features or behaviors.
   - [ ] a release note entry in the PR description.
   - [ ] added Javadocs for most classes and all non-trivial methods. Linked 
related entities via Javadoc links.
   - [ ] added or updated version, license, or notice information in 
[licenses.yaml](https://github.com/apache/druid/blob/master/dev/license.md)
   - [ ] added comments explaining the "why" and the intent of the code 
wherever would not be obvious for an unfamiliar reader.
   - [ ] added unit tests or modified existing tests to cover new code paths, 
ensuring the threshold for [code 
coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md)
 is met.
   - [ ] added integration tests.
   - [X] been tested in a test Druid cluster.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to