andrijapanicsb commented on PR #13377:
URL: https://github.com/apache/cloudstack/pull/13377#issuecomment-4676871111

   ## TL;DR --> hw/sw setup + what was tested + clock-measured timing results 
for VM HA to kick in:
   
   - KVM host: HPE DL360 / iLO5 (Ubuntu 24.04 for both mgmt and KVM host)
   - Driver: **ipmi** 2.0 (**yet to test the Redfish driver**, which is the main 
reason for this PR)
   - CloudStack: 4.22.1.0 alone vs. 4.22.1.0 + this PR (fat JAR replaced)
   - Primary storage: OCFS2 SharedMountPoint
   
   ### What was measured/tested:
   -  Functional testing (feature not broken + **yet to test Redfish, which was 
the reason for this patch**) 
      - NFS Primary storage not tested (and assuming NOT NFSv3 = no locking of 
qcow2 = not an important factor/variable)
   - Only partial tuning was done (see global config below), because the focus 
was on something else entirely (not on minimal VM downtime)
   - **PR also reduced VM downtime:**
      - **down from 8 minutes to 2.5 minutes (confirmed by running the test 
twice, not only once)**
   -  Clock-measured timing/results with AND without this patch/PR
   
   ### KVM.ha global configs changed:
   
   | Setting | Test Value | Default Value |
   |---|---|---|
   | `kvm.ha.health.check.timeout` | 15 | 10 |
   | `kvm.ha.activity.check.timeout` | 30 | 60 |
   | `kvm.ha.activity.check.interval` | 30 | 60 |
   | `kvm.ha.activity.check.max.attempts` | 5 | 10 |
   | `kvm.ha.activity.check.failure.ratio` | 0.6 | 0.7 |
   | `kvm.ha.degraded.max.period` | 180 | 300 |
   | `kvm.ha.recover.wait.period` | 180 | 600 |
   | `kvm.ha.fence.timeout` | 120 | 60 |
   | **`kvm.ha.recover.failure.threshold`** | 0 | 1 |
   
   The last setting ensures that CloudStack skips the one or more attempts to 
"recover" the host with the BMC POWER RESET command (i.e. it tries 0 times) 
and instead fences it immediately via the BMC POWER OFF command (since the host 
has already reached the "Degraded" state and needs help: kill or fix).
   
   - Testing premise: we don't care whether the host is recovered or stays 
powered off.
   - **We care about minimal VM downtime** when the host is messed up
     - i.e. once the host is declared "I'm messed up", STONITH/fence it 
immediately and ensure VM-HA kicks in, instead of retrying 1 or more times to 
reset the host and NOT triggering VM-HA. We can't guarantee that the host will 
be fine after the OS reboot, so don't risk long VM downtime during the recovery 
period.
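
   The effect of `kvm.ha.recover.failure.threshold` described above can be 
pictured with a very simplified model. This is only a sketch of the idea, not 
CloudStack's actual code: the function name `next_action` and its parameters 
are hypothetical, and the real state machine has more states and checks.

```python
def next_action(recover_attempts_made: int, recover_failure_threshold: int) -> str:
    """Simplified, hypothetical model of what happens to a Degraded host."""
    if recover_attempts_made < recover_failure_threshold:
        # Still allowed to try a recovery (BMC POWER RESET).
        return "recover"
    # No recovery attempts left: fence the host (BMC POWER OFF).
    return "fence"


print(next_action(0, 1))  # default threshold 1: prints "recover"
print(next_action(0, 0))  # test value 0: prints "fence"
```

   With the threshold at 0 the model never chooses a reset, which matches the 
"tries 0 times" behaviour wanted for this test.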
   
   # Host HA fencing improvement: handle already-powered-off hosts and reduce 
HA VM restart delay
   
   This PR addresses a Host HA fencing scenario observed during testing on a 
physical environment using HPE iLO5 / BMC-based out-of-band management with the 
IPMI driver (yet to test Redfish, which is the main reason for this PR).
   
   The test environment was based on Apache CloudStack 4.22.1 with KVM. Primary 
storage was configured as CloudStack shared mount point storage backed by an 
OCFS2 clustered filesystem, which is now supported for Host HA. Host HA was 
enabled only on a single selected host for this test.
   
   On that host, we placed two VMs:
   
   | VM type | HA setting | Expected behavior after host failure |
   |---|---:|---|
   | HA-enabled VM | Created from a compute offering with HA enabled | Should be restarted on another suitable host after fencing |
   | Non-HA VM | Created from a compute offering without HA | Should remain stopped and not be restarted automatically |
   
   A fat jar was produced from a branch based directly on the CloudStack 4.22.1 
tag. The jar was extracted from the built RPM package and used for testing.
   
   ## Scenario being tested
   
   The test intentionally simulated a somewhat unusual but important failure 
scenario: the host was manually powered off through the BMC / IPMI / iLO 
interface before CloudStack completed its Host HA fencing flow.
   
   This scenario matters because, depending on the out-of-band driver 
implementation, sending a power-off command to a chassis that is already 
powered off may return an error (Redfish does this; IPMI is not affected) or 
otherwise be interpreted as a failed fencing operation.
   
   The important point is that CloudStack should not treat “the host is already 
powered off” as a fencing failure. If the final power state is off, the host is 
effectively fenced and VM HA can safely proceed.
   
   ## Logic introduced by the patch
   
   The patched logic changes the fencing flow to be state-driven instead of 
**relying only on the return status** of the BMC power-off command.
   
   The intended behavior (after the host reaches the Degraded state) is:
   
   1. Before sending a power-off command, query the current chassis power state.
   2. If the chassis is already powered off, treat the host as already fenced.
   3. If the chassis is still powered on, send the power-off command.
   4. Do not rely only on the raw command return code.
   5. **After the command completes, query the chassis power state again.**
   6. If the chassis is confirmed powered off, mark the host as fenced / down 
and allow VM HA to proceed.
   7. If the chassis is still powered on, fencing should not be considered 
successful.
   
   In short: the final observed power state is what matters. If the chassis is 
off, the host is fenced.
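
   The steps above can be sketched roughly as follows. This is a minimal 
illustration under stated assumptions, not the code in this PR: `fence_host`, 
`FakeBmc`, and the `"on"` / `"off"` state strings are hypothetical names used 
only to show the state-driven flow, and the error raised for an already-off 
chassis merely imitates the Redfish behaviour described earlier.

```python
from typing import Callable


def fence_host(get_power_state: Callable[[], str],
               power_off: Callable[[], bool]) -> bool:
    """Return True only when the chassis is confirmed powered off."""
    # Steps 1-2: an already-off chassis counts as already fenced.
    if get_power_state() == "off":
        return True
    # Steps 3-4: send power-off, but do not trust its result or errors.
    try:
        power_off()
    except Exception:
        pass
    # Steps 5-7: the final observed power state decides the outcome.
    return get_power_state() == "off"


class FakeBmc:
    """Hypothetical stand-in for an out-of-band driver, for illustration."""

    def __init__(self, state: str, off_works: bool = True):
        self.state = state
        self.off_works = off_works
        self.off_calls = 0

    def get_power_state(self) -> str:
        return self.state

    def power_off(self) -> bool:
        self.off_calls += 1
        if self.state == "off":
            # Imitates a driver that errors on an already-off chassis.
            raise RuntimeError("chassis already powered off")
        if self.off_works:
            self.state = "off"
        return True


already_off = FakeBmc("off")
running = FakeBmc("on")
stuck = FakeBmc("on", off_works=False)

print(fence_host(already_off.get_power_state, already_off.power_off))  # True
print(fence_host(running.get_power_state, running.power_off))          # True
print(fence_host(stuck.get_power_state, stuck.power_off))              # False
```

   Note that in the already-off case the power-off command is never sent, so a 
driver that rejects it cannot turn a fenced host into a "failed" fence.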
   
   ## Test results
   
   The test confirmed the expected VM HA behavior:
   
   | Test case | Manual chassis power-off time | Host reached Alert state | Host marked Down / fenced | HA-caused "VM.START" event | Approx. time until HA restart |
   |---|---:|---:|---:|---:|---:|
   | Before patch | 16:00:00 | 16:02:30 | 16:07:55 | 16:07:56 | ~7m 56s |
   | With patch | 16:16:00 | Not separately recorded | 16:18:39 | 16:18:40 | ~2m 40s |
   
   Before the patch, the host reached Alert state after approximately 2 minutes 
and 30 seconds, but it was not marked Down / fenced until 16:07:55. VM-HA then 
fired a VM start (for the HA-enabled VM only), i.e. the VM.START event was 
observed one second later, at 16:07:56. This means the HA-enabled VM 
experienced roughly 8 minutes of downtime before the restart began.
   VMs which are not HA-enabled were marked as down (it's debatable whether 
this is "OK" behaviour: if the underlying infra dies, the user still expects 
their VM to be running).
   
   With the patched logic (replacing the fat jar), the same type of test was 
repeated. The chassis was manually powered off at 16:16:00. The host was marked 
Down / fenced at 16:18:39, and the HA-enabled VM start event was observed one 
second later, at 16:18:40. This reduced the time before HA restart from roughly 
8 minutes to roughly 2 minutes and 40 seconds.
   
   The non-HA VM was not restarted in either case, which is the expected 
behavior.
   
   ## Result
   
   The patch reduced the observed HA VM restart delay by approximately 5 
minutes and 16 seconds in this test scenario.
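
   For reference, the figures above follow directly from the timestamps in the 
results table. The snippet below just redoes that arithmetic; the helper names 
`delay_seconds` and `as_min_sec` are made up for this illustration.

```python
from datetime import datetime


def delay_seconds(power_off: str, vm_start: str) -> int:
    """Seconds between manual chassis power-off and the HA VM.START event."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(vm_start, fmt) - datetime.strptime(power_off, fmt)
    return int(delta.total_seconds())


def as_min_sec(seconds: int) -> str:
    minutes, secs = divmod(seconds, 60)
    return f"{minutes}m {secs}s"


before = delay_seconds("16:00:00", "16:07:56")  # before patch
after = delay_seconds("16:16:00", "16:18:40")   # with patch

print(as_min_sec(before))          # 7m 56s
print(as_min_sec(after))           # 2m 40s
print(as_min_sec(before - after))  # 5m 16s
```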
   
   More importantly, it makes the fencing logic safer and more deterministic: 
if the host is already powered off, CloudStack should recognize that condition 
as a successful fencing state rather than waiting longer or treating the 
operation as failed because the power-off command itself did not behave as 
expected (as can happen with the Redfish protocol).
   
   This allows Host HA to proceed much sooner while still preserving the 
important safety rule: VM HA should only be triggered after the host has been 
confirmed powered off / fenced.
   
   

