andrijapanicsb opened a new pull request, #13377:
URL: https://github.com/apache/cloudstack/pull/13377

   ### Description
   
   When a KVM host with host-HA + out-of-band management (OOBM) enabled is hard 
powered off (forced chassis-off from the BMC, or a real power/cable failure), 
CloudStack never transitions the host to `Down` and therefore never restarts 
its VMs on other hosts — the host stays in `Alert`/`Disconnected` indefinitely.
   
   **Root cause:** the host-HA state machine declares a host dead 
(`HAState.Fenced` → investigator `Status.Down`) only after a **successful** 
OOBM **power-off**. Against an already-off chassis the BMC rejects the 
power-off: the Redfish driver maps `OFF` to `GracefulShutdown`, which returns 
**HTTP 409** when the system is already off. `KVMHAProvider.fence()` therefore 
reports failure and the host stays stuck in the `Fencing` state, which 
`HAManagerImpl.getHostStatusFromHAConfig()` maps to `Status.Disconnected`, not 
`Status.Down`. As a result VM-HA is never invoked, and the VMs are recovered 
only once the original (dead) host is powered back on, at which point the 
pending power-off finally succeeds.
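
The state-to-status relationship described above can be illustrated with a small sketch. The types below (`SketchHaState`, `SketchHostStatus`, `HaStatusSketch`) are hypothetical stand-ins, not the real CloudStack classes, and only the `Fencing` and `Fenced` cases come from this description; the `default` branch is an assumption made to keep the example complete.

```java
// Hypothetical stand-ins for the HA state and host status enums (illustration only).
enum SketchHaState { AVAILABLE, FENCING, FENCED }

enum SketchHostStatus { UP, DISCONNECTED, DOWN }

final class HaStatusSketch {
    // Mirrors the mapping described above: only FENCED yields DOWN.
    static SketchHostStatus toHostStatus(SketchHaState state) {
        switch (state) {
            case FENCED:
                return SketchHostStatus.DOWN;
            case FENCING:
                return SketchHostStatus.DISCONNECTED;
            default:
                // Assumption: every other state is treated as UP in this sketch.
                return SketchHostStatus.UP;
        }
    }

    // VM-HA restarts VMs only for hosts reported as DOWN.
    static boolean vmHaTriggered(SketchHaState state) {
        return toHostStatus(state) == SketchHostStatus.DOWN;
    }
}
```

A host whose fence keeps failing never leaves `FENCING` in this model, so `vmHaTriggered` stays `false` for it.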
   
   Observed in production with Redfish/iDRAC. Full root-cause analysis and 
management-server log evidence are in #13376.
   
   ### Fix
   
   Fencing now succeeds based on the **actual chassis power state**, not the 
power-off command's return code:
   - if the host is already powered off (`OOBM STATUS == Off`) → treat it as 
fenced (no power-off issued);
   - otherwise issue a best-effort power-off and then **confirm via OOBM 
STATUS**;
   - only a confirmed `Off` state counts as a successful fence; if the state 
cannot be confirmed (e.g. an unreachable BMC) the fence fails and is retried, 
to avoid split-brain.
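
The decision logic in the list above can be sketched roughly as follows. This is a simplified illustration, not the code in this PR: `OobmClient`, `PowerState` and `FenceSketch` are hypothetical names, and an empty `Optional` stands in for "the BMC could not be queried".

```java
import java.util.Optional;

enum PowerState { ON, OFF }

// Hypothetical minimal view of an OOBM driver (not the CloudStack interface).
interface OobmClient {
    Optional<PowerState> status(); // empty when the power state cannot be read
    boolean powerOff();            // result is advisory only; may also throw
}

final class FenceSketch {
    private static boolean isOff(Optional<PowerState> state) {
        return state.isPresent() && state.get() == PowerState.OFF;
    }

    // Returns true only when the chassis is confirmed to be powered off.
    static boolean fence(OobmClient oobm) {
        // 1. Already off: treat as fenced without issuing a power-off.
        if (isOff(oobm.status())) {
            return true;
        }
        // 2. Best-effort power-off; a rejection (e.g. HTTP 409) is not fatal.
        try {
            oobm.powerOff();
        } catch (RuntimeException e) {
            // Ignored on purpose: the status check below is the source of truth.
        }
        // 3. Only a confirmed OFF counts; an unknown state fails the fence.
        return isOff(oobm.status());
    }
}
```

Returning `false` when the state is unknown is what keeps an unreachable BMC from being mistaken for a dead host; the caller is expected to retry.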
   
   This is OOBM-driver-agnostic (works for ipmitool, Redfish and 
nested-cloudstack drivers).
   
   Additionally, the Redfish driver now maps `PowerOperation.OFF` to `ForceOff` 
(a hard power-off) instead of `GracefulShutdown` — consistent with the ipmitool 
driver and appropriate for fencing an unresponsive host; `SOFT` remains the 
graceful ACPI shutdown. Also fixes a latent `String.format` argument-count bug 
on the Redfish `STATUS` branch.
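
As an illustration of the mapping change, the sketch below builds the body of a Redfish `ComputerSystem.Reset` action. `PowerOp` and `RedfishResetSketch` are hypothetical names rather than the driver's actual types, and the `ON` case is included only for completeness; `ForceOff`, `GracefulShutdown` and `On` are standard Redfish `ResetType` values.

```java
// Hypothetical subset of power operations (illustration only).
enum PowerOp { ON, OFF, SOFT }

final class RedfishResetSketch {
    // Maps a power operation to a Redfish ResetType value.
    static String resetType(PowerOp op) {
        switch (op) {
            case ON:
                return "On";
            case OFF:
                return "ForceOff";          // hard power-off, suitable for fencing
            case SOFT:
                return "GracefulShutdown";  // graceful ACPI shutdown
            default:
                throw new IllegalArgumentException("Unsupported operation: " + op);
        }
    }

    // Builds the JSON body for the Reset action.
    static String resetBody(PowerOp op) {
        return String.format("{\"ResetType\": \"%s\"}", resetType(op));
    }
}
```

With this mapping, `OFF` no longer depends on the guest operating system responding to a shutdown request.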
   
   Fixes: #13376
   
   ### Types of changes
   
   - [ ] Breaking change (fix or feature that would cause existing 
functionality to change)
   - [x] New feature/enhancement (non-breaking change which adds functionality)
   - [x] Bug fix (non-breaking change which fixes an issue)
   - [ ] Enhancement (improves an existing feature and functionality)
   - [ ] Cleanup (Code refactoring and cleanup, that may add test cases)
   - [ ] build/CI
   
   ### Bug Severity
   
   - [ ] BLOCKER
   - [x] Critical
   - [ ] Major
   - [ ] Minor
   - [ ] Trivial
   
   ### How Has This Been Tested?
   
   Unit tests added to `KVMHostHATest` (all green) covering the fence behaviour:
   - host already off → fenced without issuing a power-off;
   - power-off succeeds, STATUS confirms `Off` → fenced;
   - **power-off command fails (HTTP 409) but STATUS confirms `Off` → still 
fenced** (the regression for this issue);
   - power state cannot be confirmed (unreachable BMC) → fence fails (no 
split-brain);
   - OOBM not enabled → fence fails.
   
   ```
   mvn -pl plugins/hypervisors/kvm -Dtest=KVMHostHATest test
   => Tests run: 9, Failures: 0, Errors: 0, Skipped: 0
   ```
   
   Note on reproduction: the original symptom reproduces on real Redfish 
hardware (power-off-when-off → HTTP 409). Software/nested OOBM drivers whose 
power-off is idempotent (e.g. the nested-cloudstack driver's 
`stopVirtualMachine`, which is a no-op on an already-stopped VM) do not exhibit 
the bug, so the deterministic coverage is provided by the unit tests above.
   

