LZD-PratyushBhatt opened a new issue, #3033:
URL: https://github.com/apache/helix/issues/3033

   When a cluster enters maintenance mode, partition movement is paused and 
only state maintenance is performed. Maintenance mode can be triggered manually 
(via user config) or automatically by the Helix controller if the number of 
offline nodes exceeds a threshold.
   
   Maintenance mode is frequently used to batch multiple config changes, 
ensuring the Helix controller only recalculates assignments after all updates 
are complete. The workflow is:
   
   1. Enter maintenance mode
   2. Apply config changes
   3. Exit maintenance mode
   
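   However, if the operation script fails between steps 1 and 3, nothing exits 
maintenance mode, and the cluster stays stuck in it until someone intervenes 
manually. While it is stuck, partition movement remains paused, which impacts 
availability.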
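
The failure case can be illustrated with a small, self-contained sketch. It does not use any Helix API: `FakeCluster`, `apply_changes`, and the config keys are hypothetical names made up for illustration, and the "cluster" is just an in-memory object.

```python
class FakeCluster:
    """In-memory stand-in for a cluster; hypothetical, not a Helix API."""

    def __init__(self):
        self.in_maintenance = False
        self.config = {}

    def enter_maintenance(self):
        self.in_maintenance = True

    def exit_maintenance(self):
        self.in_maintenance = False


def apply_changes(cluster, changes):
    """Run the three-step workflow with no error handling."""
    cluster.enter_maintenance()           # step 1
    for key, value in changes:            # step 2
        if value is None:
            raise ValueError(f"invalid value for {key}")
        cluster.config[key] = value
    cluster.exit_maintenance()            # step 3, skipped if step 2 raises


cluster = FakeCluster()
try:
    # The second change is invalid, so the script fails mid-process.
    apply_changes(cluster, [("replicas", 3), ("rebalance_delay", None)])
except ValueError:
    pass

stuck = cluster.in_maintenance  # True: step 3 never ran
```

After the failure, the first change has been applied but the cluster is still flagged as being in maintenance mode, which is the situation this issue is about.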
   
   ### Proposed Solution
   
   - Introduce an optional timeout parameter that users can specify when 
entering maintenance mode manually.
   - If the cluster is still in maintenance mode when the timeout expires, Helix 
automatically exits maintenance mode.
   - If no timeout is specified, auto-exit does not occur.
   - For controller-triggered maintenance mode (due to offline nodes), the 
existing behavior remains: the cluster exits maintenance mode only when the 
offline node count drops below the threshold.
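
The rules above could be modeled roughly as follows. This is only a sketch of the proposed semantics, not Helix code: `MaintenanceState`, `enter_manual`, `enter_auto`, and `tick` are hypothetical names, times are plain seconds passed in by the caller, and the exit condition for controller-triggered mode is assumed to be a strict "offline nodes < threshold" comparison.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MaintenanceState:
    """Toy model of the proposed behavior; all names are hypothetical."""
    active: bool = False
    manual: bool = False
    deadline: Optional[float] = None  # absolute time in seconds; None = no auto-exit

    def enter_manual(self, now: float, timeout: Optional[float] = None) -> None:
        # Manual entry may carry a timeout; without one, there is no deadline.
        self.active = True
        self.manual = True
        self.deadline = None if timeout is None else now + timeout

    def enter_auto(self) -> None:
        # Controller-triggered entry never carries a timeout.
        self.active = True
        self.manual = False
        self.deadline = None

    def tick(self, now: float, offline_nodes: int, threshold: int) -> None:
        """One controller pass: decide whether to exit maintenance mode."""
        if not self.active:
            return
        if self.manual:
            # Manual mode exits only when a deadline exists and has passed.
            if self.deadline is not None and now >= self.deadline:
                self.active = False
        elif offline_nodes < threshold:
            # Existing behavior for controller-triggered mode (assumed comparison).
            self.active = False


state = MaintenanceState()
state.enter_manual(now=0.0, timeout=600.0)
state.tick(now=599.0, offline_nodes=0, threshold=2)
active_before_deadline = state.active   # True
state.tick(now=600.0, offline_nodes=0, threshold=2)
active_after_deadline = state.active    # False
```

Passing the current time into `tick` instead of reading a clock inside it keeps the sketch deterministic and easy to test; a real controller would use its own notion of time.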


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

