jmsperu opened a new issue, #12829:
URL: https://github.com/apache/cloudstack/issues/12829

   ## Description
   
   When a NAS backup repository becomes temporarily unreachable, running VMs on 
the KVM host can be affected — including VMs that have no relationship to the 
backup operation. This is because the NFS mount used by `nasbackup.sh` defaults 
to `hard` mode, which blocks indefinitely when the NFS server is unresponsive.
   
   ## Root Cause
   
   The `backup_repository.mount_opts` column defaults to empty. When `nasbackup.sh` calls `mount_operation()`, it mounts the backup NFS share with no options at all, so the kernel's default NFS `hard` mode applies:
   
   ```bash
   # nasbackup.sh mount_operation()
   mount -t ${NAS_TYPE} ${NAS_ADDRESS} ${mount_point} $([[ ! -z "${MOUNT_OPTS}" ]] && echo -o ${MOUNT_OPTS})
   ```
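
   A defensive default in the script itself would close the empty-options hole regardless of what the database contains. A minimal sketch — `DEFAULT_OPTS` and `build_mount_cmd` are illustrative names, not existing `nasbackup.sh` code:

   ```bash
   # Illustrative sketch only: never pass an empty option set to mount,
   # so the kernel default of hard mode can never apply.
   DEFAULT_OPTS="soft,timeo=50,retrans=3"

   build_mount_cmd() {
       local nas_type="$1" nas_address="$2" mount_point="$3" mount_opts="$4"
       local opts="${mount_opts:-$DEFAULT_OPTS}"   # fall back to soft defaults
       echo "mount -t ${nas_type} ${nas_address} ${mount_point} -o ${opts}"
   }
   ```

   Administrator-supplied `mount_opts` still win; the soft defaults apply only when the column is empty.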
   
   With `hard` mode, any I/O operation on the NFS mount blocks indefinitely 
when the server is unreachable. This causes a cascade:
   
   1. `nasbackup.sh` hangs on NFS I/O (write, sync, or umount)
   2. The CloudStack agent is blocked because `nasbackup.sh` runs as a child 
process of the agent JVM
   3. The blocked agent cannot process **any** VM operations (PlugNic, Stop, 
Migrate, etc.) — all commands queue behind the stuck backup
   4. At the kernel level, processes blocked on a `hard` NFS mount sit in uninterruptible sleep (D state), driving up host I/O wait in ways that can affect other processes, including QEMU instances for unrelated VMs
   5. VMs experience I/O timeouts — Windows guests BSOD with 
`KERNEL_DATA_INPAGE_ERROR`, Linux guests may freeze
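   
   Step 1 of the cascade could also be cut short by a bounded pre-flight probe before any backup I/O touches the mount. A sketch assuming coreutils `timeout(1)`; `nfs_mount_alive` is an illustrative name, not an existing `nasbackup.sh` function:

   ```bash
   # Hypothetical pre-flight check (not in nasbackup.sh): probe the mount with
   # a bounded stat so a dead NFS server fails fast instead of blocking the
   # agent JVM behind an uninterruptible NFS wait.
   nfs_mount_alive() {
       local mount_point="$1" limit="${2:-5}"
       # timeout(1) aborts the probe if a hard mount would otherwise block forever
       timeout "${limit}" stat "${mount_point}" > /dev/null 2>&1
   }
   ```

   Usage: `nfs_mount_alive /tmp/nasbackup || { echo "backup repository unreachable, aborting" >&2; exit 1; }`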
   
   ## Evidence (from production CloudStack 4.20)
   
   NFS server 172.16.3.63 experienced intermittent connectivity issues. Host 
dmesg showed repeated cycles:
   ```
   nfs: server 172.16.3.63 not responding, still trying
   nfs: server 172.16.3.63 OK
   nfs: server 172.16.3.63 not responding, still trying
   nfs: server 172.16.3.63 OK
   ```
   
   Impact:
   - A Windows VM (`citytravelsacco`, i-2-1651-VM) on the same host **crashed 
with BSOD** (`KERNEL_DATA_INPAGE_ERROR`) even though its disk is on **local 
storage**, not NFS
   - The CloudStack agent was blocked for 3+ hours by a stuck `nasbackup.sh` 
process, preventing all VM management operations on the host
   - A NIC hot-plug operation queued for 30+ minutes waiting for the agent to 
become responsive
   
   ## Suggested Fix
   
   1. **Default `mount_opts` for NAS backup repositories to `soft,timeo=50,retrans=3`** — with `timeo` measured in tenths of a second, this causes NFS operations to fail after roughly 15 seconds instead of blocking forever. A failed backup is far preferable to crashing production VMs.
   
   2. **Add a timeout wrapper to `nasbackup.sh`** — if the entire backup 
operation exceeds a configurable duration, kill it cleanly (resume paused VM, 
unmount, exit with error).
   
   3. **Document the risk** — warn administrators that empty `mount_opts` on 
NAS backup repositories defaults to NFS `hard` mode, which can cause host-wide 
I/O stalls.
   
   Note: `soft` mount may cause backup data corruption if the NFS server 
recovers mid-write, but this only affects the backup copy — not the production 
VM. A corrupted backup can be retried; a crashed production VM cannot be 
un-crashed.
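   
   The timeout wrapper in fix 2 could be sketched as follows, again assuming coreutils `timeout(1)`; `run_backup_bounded` and `cleanup_backup` are illustrative names, and in `nasbackup.sh` the cleanup would resume the paused VM (`virsh resume`) and lazily unmount the share (`umount -l`), which succeeds even while the NFS server is dead:

   ```bash
   # Illustrative cleanup hook; the real version would resume the VM and
   # lazy-unmount the backup share.
   cleanup_backup() {
       echo "backup aborted after timeout, cleaning up" >&2
   }

   run_backup_bounded() {
       local limit="$1"; shift
       # TERM on timeout, escalate to KILL if the process ignores it for 10 s
       timeout --kill-after=10 "${limit}" "$@"
       local rc=$?
       # 124 = killed by TERM on timeout, 137 = escalated to KILL
       if [ "$rc" -eq 124 ] || [ "$rc" -eq 137 ]; then
           cleanup_backup
       fi
       return "$rc"
   }
   ```

   One caveat: a process stuck in uninterruptible NFS I/O on a `hard` mount may resist even SIGKILL until the server answers (modern kernels make these waits killable), which is one more argument for defaulting to `soft`.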
   
   ## Workaround
   
   Manually set mount options on the backup repository:
   ```sql
   UPDATE cloud.backup_repository SET mount_opts='soft,timeo=50,retrans=3' WHERE id=<repo_id>;
   ```
   
   And update `/etc/fstab` on KVM hosts if the NFS backup share is persistently 
mounted:
   ```
   172.16.3.63:/ACS /tmp/nasbackup nfs soft,timeo=50,retrans=3,_netdev 0 0
   ```
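   
   After applying the workaround it is worth verifying that the options actually took effect — switching `hard` to `soft` generally requires a full umount/remount, not `mount -o remount`. The live options of a mount can be read with `findmnt -n -o OPTIONS /tmp/nasbackup`; a small illustrative helper (`has_soft_opts` is an assumed name, not CloudStack code) for checking such an options string:

   ```bash
   # Illustrative helper: succeed only when the options string contains
   # soft as a whole, comma-delimited option (so "nosoft" does not match).
   has_soft_opts() {
       case ",$1," in
           *,soft,*) return 0 ;;
           *)        return 1 ;;
       esac
   }
   ```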
   
   ## Versions
   
   - CloudStack 4.20
   - NFS v4.1
   - KVM hosts: Debian/Ubuntu with kernel 5.x/6.x

