Add a new section to document the new disarm-ha and arm-ha commands and their interaction with some other commands or situations.

Signed-off-by: Thomas Lamprecht <[email protected]>
---
 ha-manager.adoc | 127 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 127 insertions(+)

diff --git a/ha-manager.adoc b/ha-manager.adoc
index ee254be..5547f7c 100644
--- a/ha-manager.adoc
+++ b/ha-manager.adoc
@@ -1024,6 +1024,19 @@ when no HA resources are configured yet or the cluster just started. The CRM
 watchdog is not open. Fencing automatically transitions to `armed` once a CRM
 takes over as master.
 
+disarming::
+
+A `disarm-ha` command was issued. The CRM is freezing or removing services
+from tracking and waiting for all LRMs to release their watchdogs. The CRM
+watchdog is still active during this phase. Each LRM entry's watchdog status
+changes to `released` as it acknowledges the disarm.
+
+disarmed::
+
+All watchdogs have been released cluster-wide. No automatic fencing,
+failover, or recovery takes place. See
+xref:ha_manager_disarm[Disarming HA for Cluster Maintenance].
+
 NOTE: The `watchdog-mux` service keeps the underlying `/dev/watchdog` device
 open for its entire lifetime, even when no HA client is connected. This
 prevents other processes from claiming the device and ensures the HA stack can
@@ -1281,6 +1294,120 @@ NOTE: Please do not 'kill' services like `pve-ha-crm`, `pve-ha-lrm` or
 immediate node reboot or even reset.
 
 
+[[ha_manager_disarm]]
+Disarming HA for Cluster Maintenance
+-------------------------------------
+
+Certain cluster maintenance tasks, such as reconfiguring the network or the
+cluster communication stack (corosync), can cause temporary quorum loss or
+network partitions. Normally, HA would interpret this as a node failure and
+trigger self-fencing, disrupting services unnecessarily.
+
+The disarm mechanism releases all CRM and LRM watchdogs cluster-wide, allowing
+you to perform such maintenance safely without the risk of nodes being fenced.
+
+IMPORTANT: While disarmed, HA does not protect your services. Failures during
+this period are not automatically recovered. Keep the disarm window as short
+as possible.
+
+.Resource Modes
+
+When disarming HA, you must choose a resource mode that controls how HA
+managed resources are handled while disarmed. The current state of resources
+is not affected.
+
+freeze::
+
+New commands and state changes are not applied. Services stay in their current
+state, but the HA stack does not react to failures or process new requests.
+This is the safest choice when you expect all nodes to remain running.
+
+ignore::
+
+Resources are removed from HA tracking and can be managed as if they were not
+HA managed. This allows you to manually start, stop, or migrate services
+while HA is disarmed. Use this when you need to manually relocate services
+during maintenance.
+
+.Disarming and Re-Arming
+
+To disarm HA with the desired resource mode:
+
+----
+# ha-manager crm-command disarm-ha freeze
+----
+
+or:
+
+----
+# ha-manager crm-command disarm-ha ignore
+----
+
+To re-arm HA after maintenance is complete:
+
+----
+# ha-manager crm-command arm-ha
+----
+
+You can monitor the current state with:
+
+----
+# ha-manager status
+----
+
+The fencing status line shows the current state of the fencing mechanism (see
+xref:ha_manager_fencing_status[Fencing Status]), including the CRM and LRM
+watchdog states.
+
+.The Disarm Process
+
+After you request disarm, the following sequence happens:
+
+. The CRM freezes all services or removes them from tracking, depending on
+  the chosen resource mode.
+. Each LRM finishes its active workers, then releases its agent lock and
+  watchdog.
+. Once all online LRMs are idle, the CRM releases its own watchdog too.
+
+The CRM keeps the manager lock throughout this process, so it can accept and
+process the `arm-ha` command to reverse it.
+
+If any services are currently being fenced or recovered, the disarm is
+deferred until fencing completes. This ensures that partially fenced services
+do not end up in an inconsistent state.
+
+.Nodes Offline During Disarm
+
+If a node is offline when HA is disarmed, its LRM cannot process the disarm
+request. The CRM proceeds to the disarmed state once all *online* LRMs have
+completed their part. The offline node does not block this.
+
+When the offline node comes back online while HA is still disarmed, its LRM
+picks up the disarm state and releases its watchdog without attempting any
+service recovery.
+
+When you re-arm HA, any services that were on the offline node are handled
+according to normal HA recovery rules: they are fenced and recovered if the
+node is still unreachable, or restarted on the node if it has come back
+online.
+
+.Interaction with Maintenance Mode
+
+If a node is already in maintenance mode when disarm is requested, the
+maintenance migration continues until all services have been moved away. Once
+no active services and workers remain, the LRM releases its lock and watchdog
+as part of the disarm process.
+
+When HA is re-armed, the maintenance mode state is preserved. The node remains
+in maintenance and services are not moved back until maintenance mode is
+explicitly disabled.
+
+CAUTION: While the HA stack is disarmed, no automatic recovery, failover, or
+fencing takes place. A node failure during this window is not detected or
+handled by HA. Keep the disarm window as short as possible and ensure that the
+cluster is in a healthy state before re-arming.
+
+
 [[ha_manager_crs]]
 Cluster Resource Scheduling
 ---------------------------
-- 
2.47.3
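
For illustration only, the disarm and re-arm commands documented in this patch could be wrapped around a maintenance step roughly as sketched below. This is not part of the patch. The function name `with_ha_disarmed` and the `HA_MANAGER` override are hypothetical; only the `ha-manager crm-command disarm-ha <mode>` and `ha-manager crm-command arm-ha` invocations are taken from the documentation above. The sketch does not wait for the cluster to reach the `disarmed` state, so in practice you would still check `ha-manager status` before starting anything disruptive.

```shell
#!/bin/sh
# Sketch under stated assumptions: wrap one maintenance command between
# disarm-ha and arm-ha. HA_MANAGER and with_ha_disarmed are hypothetical
# names introduced for this example.
HA_MANAGER="${HA_MANAGER:-ha-manager}"

with_ha_disarmed() {
    mode="$1"
    shift

    # Only the two resource modes described in the documentation are accepted.
    case "$mode" in
        freeze|ignore) ;;
        *) echo "unknown resource mode: $mode" >&2; return 2 ;;
    esac

    # Request the disarm. This only queues the request; it does not wait
    # for the CRM and LRMs to actually release their watchdogs.
    "$HA_MANAGER" crm-command disarm-ha "$mode" || return 1

    # Run the maintenance command and remember its exit status.
    rc=0
    "$@" || rc=$?

    # Re-arm even if the maintenance command failed, to keep the
    # unprotected window short.
    "$HA_MANAGER" crm-command arm-ha || return 1
    return "$rc"
}
```

A call could look like `with_ha_disarmed freeze ./reconfigure-network.sh`, where `reconfigure-network.sh` stands in for whatever maintenance script you use. Re-arming unconditionally follows the advice above to keep the disarm window short, but it also means HA is re-armed after a failed maintenance step, so confirm the cluster is healthy afterwards.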
