andrijapanicsb opened a new pull request, #13377: URL: https://github.com/apache/cloudstack/pull/13377
### Description

When a KVM host with host-HA + out-of-band management (OOBM) enabled is hard powered off (forced chassis-off from the BMC, or a real power/cable failure), CloudStack never transitions the host to `Down` and therefore never restarts its VMs on other hosts; the host stays in `Alert`/`Disconnected` indefinitely.

**Root cause:** the host-HA state machine declares a host dead (`HAState.Fenced` → investigator `Status.Down`) only after a **successful** OOBM **power-off**. Against an already-off chassis the BMC rejects the power-off (the Redfish driver maps `OFF` to `GracefulShutdown`, which returns **HTTP 409** when the system is already off), so `KVMHAProvider.fence()` reports failure and the host stays stuck in the `Fencing` state, which `HAManagerImpl.getHostStatusFromHAConfig()` maps to `Status.Disconnected`, not `Status.Down`. VM-HA is therefore never invoked, and the VMs are only recovered once the original (dead) host is powered back on, at which point the pending power-off finally succeeds.

Observed in production with Redfish/iDRAC. Full root-cause analysis and management-server log evidence are in #13376.

### Fix

Fencing now succeeds based on the **actual chassis power state**, not the power-off command's return code:

- if the host is already powered off (`OOBM STATUS == Off`) → treat it as fenced (no power-off issued);
- otherwise issue a best-effort power-off and then **confirm via OOBM STATUS**;
- only a confirmed `Off` state counts as a successful fence; if the state cannot be confirmed (e.g. an unreachable BMC) the fence fails and is retried, to avoid split-brain.

This is OOBM-driver-agnostic (works for ipmitool, Redfish and nested-cloudstack drivers).

Additionally, the Redfish driver now maps `PowerOperation.OFF` to `ForceOff` (a hard power-off) instead of `GracefulShutdown`, consistent with the ipmitool driver and appropriate for fencing an unresponsive host; `SOFT` remains the graceful ACPI shutdown.
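For readers who want the fence decision in code form, here is a minimal, self-contained sketch of the three rules listed under "Fix". It is not the code in this PR: `FenceSketch`, `Oobm`, `PowerState` and `ScriptedOobm` are hypothetical names invented for illustration, and the real change lives in `KVMHAProvider.fence()` and the OOBM drivers.

```java
import java.util.Optional;

/** Illustrative sketch only; none of these types are CloudStack's real classes. */
final class FenceSketch {

    /** Hypothetical chassis power states. */
    enum PowerState { ON, OFF }

    /** Hypothetical stand-in for an OOBM driver (ipmitool, Redfish, ...). */
    interface Oobm {
        /** Empty when the BMC is unreachable or the state is unknown. */
        Optional<PowerState> status();

        /** False when the BMC rejects the command, e.g. HTTP 409 on an already-off system. */
        boolean powerOff();
    }

    /** Returns true only when the chassis is confirmed to be off. */
    static boolean fence(Oobm oobm) {
        if (oobm == null) {
            return false; // OOBM not enabled: nothing can be confirmed
        }
        if (isConfirmedOff(oobm)) {
            return true; // already off: fenced without issuing a power-off
        }
        try {
            oobm.powerOff(); // best effort; the return value is deliberately ignored
        } catch (RuntimeException e) {
            // A failed command is not decisive; the status check below is.
        }
        return isConfirmedOff(oobm);
    }

    private static boolean isConfirmedOff(Oobm oobm) {
        try {
            return oobm.status().filter(s -> s == PowerState.OFF).isPresent();
        } catch (RuntimeException e) {
            return false; // unreachable BMC: do not assume the host is dead
        }
    }

    /** Scripted test double: reports one state before powerOff() and another after. */
    static final class ScriptedOobm implements Oobm {
        private final PowerState before;
        private final PowerState after;
        private final boolean powerOffResult;
        int powerOffCalls = 0;

        ScriptedOobm(PowerState before, PowerState after, boolean powerOffResult) {
            this.before = before;
            this.after = after;
            this.powerOffResult = powerOffResult;
        }

        @Override
        public Optional<PowerState> status() {
            // A null state models an unreachable BMC.
            return Optional.ofNullable(powerOffCalls == 0 ? before : after);
        }

        @Override
        public boolean powerOff() {
            powerOffCalls++;
            return powerOffResult;
        }
    }
}
```

In this sketch, `FenceSketch.fence(new FenceSketch.ScriptedOobm(PowerState.ON, PowerState.OFF, false))` returns `true` even though the power-off command itself reported failure, which mirrors the regression case described in this PR. A double whose state is always `null` yields `false`, so an unreachable BMC never causes a host to be declared fenced.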
This change also fixes a latent `String.format` argument-count bug on the Redfish `STATUS` branch.

Fixes: #13376

### Types of changes

- [ ] Breaking change (fix or feature that would cause existing functionality to change)
- [x] New feature/enhancement (non-breaking change which adds functionality)
- [x] Bug fix (non-breaking change which fixes an issue)
- [ ] Enhancement (improves an existing feature and functionality)
- [ ] Cleanup (Code refactoring and cleanup, that may add test cases)
- [ ] build/CI

### Bug Severity

- [ ] BLOCKER
- [x] Critical
- [ ] Major
- [ ] Minor
- [ ] Trivial

### How Has This Been Tested?

Unit tests added to `KVMHostHATest` (all green) covering the fence behaviour:

- host already off → fenced without issuing a power-off;
- power-off succeeds, STATUS confirms `Off` → fenced;
- **power-off command fails (HTTP 409) but STATUS confirms `Off` → still fenced** (the regression for this issue);
- power state cannot be confirmed (unreachable BMC) → fence fails (no split-brain);
- OOBM not enabled → fence fails.

```
mvn -pl plugins/hypervisors/kvm -Dtest=KVMHostHATest test
=> Tests run: 9, Failures: 0, Errors: 0, Skipped: 0
```

Note on reproduction: the original symptom reproduces on real Redfish hardware (power-off-when-off → HTTP 409). Software/nested OOBM drivers whose power-off is idempotent (e.g. the nested-cloudstack driver's `stopVirtualMachine`, which is a no-op on an already-stopped VM) do not exhibit the bug, so the deterministic coverage is provided by the unit tests above.
