LZD-PratyushBhatt opened a new issue, #3033: URL: https://github.com/apache/helix/issues/3033
When a cluster enters maintenance mode, partition movement is paused and only state maintenance is performed. Maintenance mode can be triggered manually (via user config) or automatically by the Helix controller if the number of offline nodes exceeds a threshold.

Maintenance mode is frequently used to batch multiple config changes, so that the Helix controller recalculates assignments only after all updates are complete. The workflow is:

1. Enter maintenance mode
2. Apply config changes
3. Exit maintenance mode

However, if an operation script fails mid-process, the cluster may remain stuck in maintenance mode, impacting availability.

### Proposed Solution

- Introduce a timeout parameter for manual maintenance mode entry.
- When entering maintenance mode manually, users can specify a timeout.
- If the cluster is still in maintenance mode after the timeout, Helix will automatically exit maintenance mode.
- If no timeout is specified, auto-exit does not occur.
- For controller-triggered maintenance mode (due to offline nodes), the existing behavior remains: the cluster exits maintenance mode only when the offline node count drops below the threshold.
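To make the proposed semantics concrete, here is a minimal in-memory sketch in Java. It is not the Helix API: the class `MaintenanceModeSketch`, its methods, and the idea of passing the current time in as `nowMs` are all hypothetical, chosen only so the rules can be exercised deterministically. It also assumes that a controller-triggered entry does not override an active manual one, which the proposal does not specify.

```java
/** Hypothetical model of the proposed timeout rules; not part of Helix. */
final class MaintenanceModeSketch {
    enum Trigger { NONE, MANUAL, CONTROLLER }

    private static final long NO_TIMEOUT = -1L;

    private Trigger trigger = Trigger.NONE;
    private long deadlineMs = NO_TIMEOUT;

    /** Manual entry without a timeout: never auto-exits. */
    void enterManual() {
        trigger = Trigger.MANUAL;
        deadlineMs = NO_TIMEOUT;
    }

    /** Manual entry with a timeout, measured from nowMs. */
    void enterManual(long nowMs, long timeoutMs) {
        if (timeoutMs <= 0) {
            throw new IllegalArgumentException("timeoutMs must be positive");
        }
        trigger = Trigger.MANUAL;
        deadlineMs = nowMs + timeoutMs;
    }

    /** Explicit exit, e.g. the last step of an operation script. */
    void exit() {
        trigger = Trigger.NONE;
        deadlineMs = NO_TIMEOUT;
    }

    /**
     * Controller-side rule from the issue: enter when offline nodes exceed
     * the threshold, exit only when the count drops below it. The timeout
     * is never consulted here. Assumption: a manual entry is left untouched.
     */
    void onOfflineNodeCount(int offlineNodes, int threshold) {
        if (trigger == Trigger.NONE && offlineNodes > threshold) {
            trigger = Trigger.CONTROLLER;
        } else if (trigger == Trigger.CONTROLLER && offlineNodes < threshold) {
            exit();
        }
    }

    /**
     * Periodic check: auto-exits a manual entry whose deadline has passed,
     * then reports whether the cluster is still in maintenance mode.
     */
    boolean checkInMaintenance(long nowMs) {
        boolean expired = trigger == Trigger.MANUAL
                && deadlineMs != NO_TIMEOUT
                && nowMs >= deadlineMs;
        if (expired) {
            exit();
        }
        return trigger != Trigger.NONE;
    }
}
```

In this sketch a script that crashes after `enterManual(now, timeout)` leaves the cluster in maintenance mode only until the next `checkInMaintenance` call past the deadline, whereas `enterManual()` with no timeout keeps today's behavior.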
