jmsperu opened a new issue, #12821:
URL: https://github.com/apache/cloudstack/issues/12821
## Description
When using the NAS backup plugin on KVM, if a backup job fails (e.g. due to
backup storage being full or I/O errors on the NFS target), the VM remains
**indefinitely paused** at the hypervisor level. CloudStack marks the backup as
`Error` but does not resume the VM, leaving it unresponsive until manually
resumed via `virsh resume`.
## Steps to Reproduce
1. Configure NAS backup with NFS storage for a running KVM VM
2. Fill up the NFS backup storage to 100% capacity
3. Wait for the scheduled backup to trigger
4. Observe the VM becomes paused and never resumes
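Step 4 can be confirmed on the KVM host itself; a small sketch (the domain name `i-2-10-VM` is a placeholder, and `virsh domstate` reports the libvirt-level state, which disagrees with the CloudStack UI here):

```shell
# Sketch of a host-side check for step 4; the domain name is a placeholder.
# Succeeds when libvirt reports the domain as paused.
is_paused() {
  [ "$(virsh domstate "$1" 2>/dev/null)" = "paused" ]
}

if is_paused "i-2-10-VM"; then
  echo "domain is stuck in paused state"
fi
```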
## Expected Behavior
The VM should be automatically resumed after a backup failure. The backup
should be marked as failed, but the VM should continue running normally.
## Actual Behavior
The VM remains in a `paused` state indefinitely. The backup monitoring loop
in `nasbackup.sh` enters an infinite cycle:
1. `virsh backup-begin` pauses the QEMU domain for a consistent snapshot
2. Backup write fails (storage full)
3. `domjobinfo` reports `Failed` status
4. `cleanup()` is called but **does not resume the VM**
5. No `exit` after `cleanup()`, so the loop continues and repeatedly
detects the same failed job
## Root Cause Analysis
Three bugs in `scripts/vm/hypervisor/kvm/nasbackup.sh`:
### Bug 1: Missing exit after failed backup cleanup (line 144)
```bash
case "$status" in
Failed)
echo "Virsh backup job failed"
cleanup ;; # <-- no exit, falls through to sleep and loops forever
esac
```
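A minimal sketch of the corrected arm, wrapped in a helper function here purely for illustration (in the script itself the fix is simply an `exit 1` after `cleanup`; `cleanup` below is a stand-in for the script's existing function):

```shell
# Illustration only: the Failed arm signals termination instead of
# falling through to the sleep. cleanup is assumed to be the script's
# existing cleanup function.
handle_status() {
  local status="$1"
  case "$status" in
    Failed)
      echo "Virsh backup job failed"
      cleanup
      return 1 ;;   # caller exits; no further loop iterations
  esac
  return 0
}
```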
### Bug 2: cleanup() never resumes the VM (line 222)
The `cleanup()` function only removes files and unmounts storage. It never
checks if the VM is paused or attempts to resume it, even though `virsh
backup-begin` may have paused the domain.
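A hedged sketch of the missing resume step (the function name, `vm_name` handling, and its placement inside `cleanup()` are assumptions, not the script's actual identifiers):

```shell
# Sketch: resume the domain if virsh backup-begin left it paused.
# Intended to be called from cleanup(); naming is hypothetical.
resume_if_paused() {
  local vm_name="$1"
  local state
  state=$(virsh domstate "$vm_name" 2>/dev/null)
  if [ "$state" = "paused" ]; then
    echo "Resuming VM $vm_name left paused by failed backup"
    virsh resume "$vm_name"
  fi
}
```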
### Bug 3: Missing exit in backup_stopped_vm() (line 181)
Similar to Bug 1, `backup_stopped_vm()` calls `cleanup()` on `qemu-img
convert` failure but does not exit, allowing the loop to continue processing
subsequent disks.
## Impact
- **Production outage**: All services on the affected VM become unresponsive
- **Cascading failures**: When backup storage fills up, ALL VMs being backed
up get paused simultaneously
- **Silent failure**: CloudStack UI shows the VM as "Running" while it is
actually paused at the KVM level
- **No automatic recovery**: Manual intervention (`virsh resume`) is
required per VM
In our environment, NFS backup storage filling to 100% caused **8 production
VMs** to become paused simultaneously across 3 KVM hosts, with some VMs
remaining paused for over 6 hours before detection.
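Until a fix lands, affected hosts can at least be monitored for this condition; a sketch using standard `virsh list` state filters:

```shell
# List paused domains on the host. Any output here while CloudStack shows
# the same VMs as "Running" is a symptom of this bug.
find_paused_domains() {
  virsh list --state-paused --name | sed '/^$/d'
}
```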
## Environment
- CloudStack 4.19 / 4.20 / main (the affected script is unchanged across these versions)
- KVM hypervisor
- NAS backup plugin with NFS storage
- File: `scripts/vm/hypervisor/kvm/nasbackup.sh`
## Proposed Fix
PR forthcoming with the following changes:
1. Add VM state check and `virsh resume` to `cleanup()` function
2. Add missing `exit 1` after `cleanup()` in the `Failed` backup job case
3. Add missing `exit 1` after `cleanup()` in `backup_stopped_vm()` on
`qemu-img convert` failure
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]