jmsperu opened a new issue, #12821:
URL: https://github.com/apache/cloudstack/issues/12821
## Description
When using the NAS backup plugin on KVM, if a backup job fails (e.g. due to
backup storage being full or I/O errors on the NFS target), the VM remains
**indefinitely paused** at the hypervisor level. CloudStack marks the backup as
`Error` but does not resume the VM, leaving it unresponsive until manually
resumed via `virsh resume`.
## Steps to Reproduce
1. Configure NAS backup with NFS storage for a running KVM VM
2. Fill up the NFS backup storage to 100% capacity
3. Wait for the scheduled backup to trigger
4. Observe the VM becomes paused and never resumes
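Step 4 can be confirmed on the KVM host itself; a small sketch (the domain name `i-2-10-VM` is a placeholder, and `virsh domstate` reports the libvirt-level state, which disagrees with the CloudStack UI here):

```shell
# Sketch of a host-side check for step 4; the domain name is a placeholder.
# Succeeds when libvirt reports the domain as paused.
is_paused() {
  [ "$(virsh domstate "$1" 2>/dev/null)" = "paused" ]
}

if is_paused "i-2-10-VM"; then
  echo "domain is stuck in paused state"
fi
```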
## Expected Behavior
The VM should be automatically resumed after a backup failure. The backup
should be marked as failed, but the VM should continue running normally.
## Actual Behavior
The VM remains in a `paused` state indefinitely. The backup monitoring loop
in `nasbackup.sh` enters an infinite cycle:
1. `virsh backup-begin` pauses the QEMU domain for a consistent snapshot
2. Backup write fails (storage full)
3. `domjobinfo` reports `Failed` status
4. `cleanup()` is called but **does not resume the VM**
5. No `exit` after `cleanup()`, so the loop continues and repeatedly
detects the same failed job
## Root Cause Analysis
Three bugs in `scripts/vm/hypervisor/kvm/nasbackup.sh`:
### Bug 1: Missing exit after failed backup cleanup (line 144)
```bash
case "$status" in
Failed)
echo "Virsh backup job failed"
cleanup ;; # <-- no exit, falls through to sleep and loops forever
esac
```
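A minimal sketch of the corrected arm, wrapped in a helper function here purely for illustration (in the script itself the fix is simply an `exit 1` after `cleanup`; `cleanup` below is a stand-in for the script's existing function):

```shell
# Illustration only: the Failed arm signals termination instead of
# falling through to the sleep. cleanup is assumed to be the script's
# existing cleanup function.
handle_status() {
  local status="$1"
  case "$status" in
    Failed)
      echo "Virsh backup job failed"
      cleanup
      return 1 ;;   # caller exits; no further loop iterations
  esac
  return 0
}
```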
### Bug 2: cleanup() never resumes the VM (line 222)
The `cleanup()` function only removes files and unmounts storage. It never
checks if the VM is paused or attempts to resume it, even though `virsh
backup-begin` may have paused the domain.
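A hedged sketch of the missing resume step (the function name, `vm_name` handling, and its placement inside `cleanup()` are assumptions, not the script's actual identifiers):

```shell
# Sketch: resume the domain if virsh backup-begin left it paused.
# Intended to be called from cleanup(); naming is hypothetical.
resume_if_paused() {
  local vm_name="$1"
  local state
  state=$(virsh domstate "$vm_name" 2>/dev/null)
  if [ "$state" = "paused" ]; then
    echo "Resuming VM $vm_name left paused by failed backup"
    virsh resume "$vm_name"
  fi
}
```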
### Bug 3: Missing exit in backup_stopped_vm() (line 181)
Similar to Bug 1, `backup_stopped_vm()` calls `cleanup()` on `qemu-img
convert` failure but does not exit, allowing the loop to continue processing
subsequent disks.
## Impact
- **Production outage**: All services on the affected VM become unresponsive
- **Cascading failures**: When backup storage fills up, ALL VMs being backed
up get paused simultaneously
- **Silent failure**: CloudStack UI shows the VM as "Running" while it is
actually paused at the KVM level
- **No automatic recovery**: Manual intervention (`virsh resume`) is
required per VM
In our environment, NFS backup storage filling to 100% caused **8 production
VMs** to become paused simultaneously across 3 KVM hosts, with some VMs
remaining paused for over 6 hours before detection.
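Until a fix lands, affected hosts can at least be monitored for this condition; a sketch using standard `virsh list` state filters:

```shell
# List paused domains on the host. Any output here while CloudStack shows
# the same VMs as "Running" is a symptom of this bug.
find_paused_domains() {
  virsh list --state-paused --name | sed '/^$/d'
}
```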
## Environment
- CloudStack 4.19 / 4.20 / main (the affected script is unchanged across these versions)
- KVM hypervisor
- NAS backup plugin with NFS storage
- File: `scripts/vm/hypervisor/kvm/nasbackup.sh`
## Proposed Fix
PR forthcoming with the following changes:
1. Add VM state check and `virsh resume` to `cleanup()` function
2. Add missing `exit 1` after `cleanup()` in the `Failed` backup job case
3. Add missing `exit 1` after `cleanup()` in `backup_stopped_vm()` on
`qemu-img convert` failure
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]