kfaraz opened a new pull request, #14385:
URL: https://github.com/apache/druid/pull/14385

   ### Description
   
   If the current leader coordinator is asked to stop being leader, the 
following happens:
   - The `DruidCoordinator.balancerExec` (used for strategy cost computations) 
is shut down
   - The currently running duty finishes execution normally and no more duties 
are executed
   - An exception to this is the `BalanceSegments` duty, which can exit 
abnormally or even get stuck in the scenarios explained below.
   
   #### ✅ Case 1: `BalanceSegments` duty throws exception
   
   Typical sequence of events:
   - Current coordinator stops being leader and `balancerExec` is shut down
   - `CostBalancerStrategy.findNewSegmentHomeBalancer()` or any other method is 
invoked
   - `computeCost()` tasks are submitted to the executor
   - Since the executor has already been shut down, submission of new tasks 
throws an exception and ends the coordinator run as desired
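   
The rejection in Case 1 can be reproduced with a plain JDK executor. This is only a minimal sketch: it assumes a standard `java.util.concurrent` thread pool with the default rejection policy rather than the actual `balancerExec`, and the class and method names are hypothetical.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;

// Hypothetical stand-in for the coordinator's balancer executor.
class RejectedSubmitSketch
{
  /** Returns true if submitting a task to a shut-down executor is rejected. */
  static boolean submitAfterShutdownIsRejected()
  {
    ExecutorService exec = Executors.newFixedThreadPool(2);
    exec.shutdown(); // simulates the coordinator losing leadership

    try {
      // Stand-in for a computeCost() task; the value is arbitrary.
      exec.submit(() -> 42.0);
      return false;
    }
    catch (RejectedExecutionException e) {
      // With the default rejection policy, submission fails immediately,
      // so the caller never waits on a future.
      return true;
    }
  }

  public static void main(String[] args)
  {
    System.out.println("rejected = " + submitAfterShutdownIsRejected());
  }
}
```

Because the failure happens at submission time, the duty exits with an exception instead of blocking.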
   
   #### ❌ Case 2: `BalanceSegments` duty gets stuck
   
   Typical sequence of events:
   - `BalanceSegments` duty is in progress
   - `CostBalancerStrategy.findNewSegmentHomeBalancer()` is invoked for some 
segment
   - `computeCost()` tasks are submitted to the executor
   - Current coordinator stops being leader and `balancerExec` is shut down
   - Since the `computeCost()` tasks do not handle interrupts, the method waits 
indefinitely on the task futures
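   
The sequence in Case 2 can be sketched as follows. The example assumes the executor is stopped with `shutdownNow()` (which interrupts running workers) and uses a hypothetical task that swallows interrupts in place of `computeCost()`. A bounded `get()` is used only so that the demonstration terminates; an unbounded `get()` would block until the task is released.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration; not the actual Druid classes.
class StuckFutureSketch
{
  /** Returns true if the future is still incomplete after the executor is shut down. */
  static boolean futureStaysIncompleteAfterShutdown()
  {
    ExecutorService exec = Executors.newSingleThreadExecutor();
    CountDownLatch started = new CountDownLatch(1);
    CountDownLatch release = new CountDownLatch(1);

    // Stand-in for a cost computation that does not handle interrupts.
    Future<Double> resultFuture = exec.submit(() -> {
      started.countDown();
      boolean done = false;
      while (!done) {
        try {
          release.await();
          done = true;
        }
        catch (InterruptedException e) {
          // Interrupt is swallowed and the task keeps going.
        }
      }
      return 1.0;
    });

    try {
      started.await();      // task is now in progress
      exec.shutdownNow();   // leadership lost: worker thread is interrupted
      resultFuture.get(200, TimeUnit.MILLISECONDS);
      return false;         // task finished, so the caller was not stuck
    }
    catch (TimeoutException e) {
      return true;          // the caller would still be waiting here
    }
    catch (InterruptedException | ExecutionException e) {
      throw new RuntimeException(e);
    }
    finally {
      release.countDown();  // let the task finish so the JVM can exit
    }
  }

  public static void main(String[] args)
  {
    System.out.println("still incomplete = " + futureStaysIncompleteAfterShutdown());
  }
}
```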
   
   #### ✅ Case 3: Change in `balancerComputeThreads` dynamic config
   
   A change in this config also results in a shutdown of the `balancerExec`. 
But this shutdown is never done concurrently with the coordinator duties and 
thus doesn't cause the coordinator to get stuck.
   
   ### Changes
   - Add a timeout of 1 minute to `resultFuture.get()`. 1 minute is the 
typical time for a full coordinator run and is more than enough time for the 
cost computations of a single segment.
   - Raise an alert if an exception is encountered while computing costs and 
the executor has not been shut down. An alert is skipped after a shutdown 
because the shutdown is intentional.
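   
A rough sketch of the two changes, using only JDK types. The helper `awaitCost`, the `alerts` list (standing in for the alerting mechanism), and the short timeout in `main` are assumptions for illustration; the actual change lives in `CostBalancerStrategy` and uses a 1 minute timeout.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.OptionalDouble;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical helper; names are not taken from the Druid code base.
class CostTimeoutSketch
{
  /**
   * Waits for the cost result for at most timeoutMillis. On any failure,
   * records an alert only if the executor has not been shut down.
   */
  static OptionalDouble awaitCost(
      Future<Double> resultFuture,
      ExecutorService exec,
      long timeoutMillis,
      List<String> alerts
  )
  {
    try {
      return OptionalDouble.of(resultFuture.get(timeoutMillis, TimeUnit.MILLISECONDS));
    }
    catch (Exception e) {
      if (e instanceof InterruptedException) {
        Thread.currentThread().interrupt(); // preserve the interrupt status
      }
      // A shutdown is intentional, so it does not warrant an alert.
      if (!exec.isShutdown()) {
        alerts.add("Cost computation failed: " + e);
      }
      return OptionalDouble.empty();
    }
  }

  public static void main(String[] args)
  {
    List<String> alerts = new ArrayList<>();
    ExecutorService exec = Executors.newSingleThreadExecutor();

    // A future that never completes, simulating a stuck computation.
    Future<Double> stuck = new CompletableFuture<>();

    awaitCost(stuck, exec, 100, alerts);  // executor running: alert recorded
    exec.shutdown();
    awaitCost(stuck, exec, 100, alerts);  // executor shut down: no new alert

    System.out.println("alerts recorded = " + alerts.size());
  }
}
```

With the timeout in place, the caller gets control back even if the future never completes, and the `isShutdown()` check keeps intentional shutdowns from producing alerts.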
   



