jmsperu opened a new issue, #12829:
URL: https://github.com/apache/cloudstack/issues/12829
## Description
When a NAS backup repository becomes temporarily unreachable, running VMs on
the KVM host can be affected — including VMs that have no relationship to the
backup operation. This is because the NFS mount used by `nasbackup.sh` defaults
to `hard` mode, which blocks indefinitely when the NFS server is unresponsive.
## Root Cause
The `backup_repository.mount_opts` column defaults to empty. When
`nasbackup.sh` calls `mount_operation()`, it mounts the backup NFS share without
any `-o` options, so the kernel's default NFS `hard` mode applies:
```bash
# nasbackup.sh mount_operation()
mount -t ${NAS_TYPE} ${NAS_ADDRESS} ${mount_point} \
  $([[ ! -z "${MOUNT_OPTS}" ]] && echo -o ${MOUNT_OPTS})
```
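For illustration, here is a hypothetical echo-only trace of that command
substitution (`demo` is an invented stand-in for this sketch; the address and
mount point are taken from the report below). With empty `MOUNT_OPTS`, no `-o`
flag is emitted at all, so the kernel default of `hard` applies:

```shell
# Hypothetical stand-in for mount_operation(): prints the mount command it
# would run instead of executing it.
demo() {
    MOUNT_OPTS="$1"
    echo mount -t nfs 172.16.3.63:/ACS /tmp/nasbackup \
        $([ -n "${MOUNT_OPTS}" ] && echo -o "${MOUNT_OPTS}")
}

demo ""                          # → mount -t nfs 172.16.3.63:/ACS /tmp/nasbackup
demo "soft,timeo=50,retrans=3"   # → ... /tmp/nasbackup -o soft,timeo=50,retrans=3
```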
With `hard` mode, any I/O operation on the NFS mount blocks indefinitely
when the server is unreachable. This causes a cascade:
1. `nasbackup.sh` hangs on NFS I/O (write, sync, or umount)
2. The CloudStack agent is blocked because `nasbackup.sh` runs as a child
process of the agent JVM
3. The blocked agent cannot process **any** VM operations (PlugNic, Stop,
Migrate, etc.) — all commands queue behind the stuck backup
4. At the kernel level, hard NFS mount stalls drive up I/O wait, which can
affect all processes on the host, including QEMU instances for unrelated VMs
5. VMs experience I/O timeouts — Windows guests BSOD with
`KERNEL_DATA_INPAGE_ERROR`, Linux guests may freeze
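The stall in steps 1-4 is observable from the host shell: processes blocked on
a hard NFS mount sit in uninterruptible sleep (state `D`), which no signal can
interrupt. A diagnostic sketch (assumes a Linux host with procps `ps`):

```shell
# Processes in uninterruptible sleep (state D) -- the signature of a hard NFS
# stall. A wedged nasbackup.sh or qemu process will show up here, with a
# wait channel (wchan) pointing at nfs/rpc kernel functions.
ps -eo pid,stat,wchan:20,comm | awk 'NR == 1 || $2 ~ /^D/'
```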
## Evidence (from production CloudStack 4.20)
NFS server 172.16.3.63 experienced intermittent connectivity issues. Host
dmesg showed repeated cycles:
```
nfs: server 172.16.3.63 not responding, still trying
nfs: server 172.16.3.63 OK
nfs: server 172.16.3.63 not responding, still trying
nfs: server 172.16.3.63 OK
```
Impact:
- A Windows VM (`citytravelsacco`, i-2-1651-VM) on the same host **crashed
with BSOD** (`KERNEL_DATA_INPAGE_ERROR`) even though its disk is on **local
storage**, not NFS
- The CloudStack agent was blocked for 3+ hours by a stuck `nasbackup.sh`
process, preventing all VM management operations on the host
- A NIC hot-plug operation queued for 30+ minutes waiting for the agent to
become responsive
## Suggested Fix
1. **Default `mount_opts` for NAS backup repositories to
`soft,timeo=50,retrans=3`** — this causes NFS operations to fail after ~15
seconds instead of blocking forever. A failed backup is far preferable to
crashing production VMs.
2. **Add a timeout wrapper to `nasbackup.sh`** — if the entire backup
operation exceeds a configurable duration, kill it cleanly (resume paused VM,
unmount, exit with error).
3. **Document the risk** — warn administrators that empty `mount_opts` on
NAS backup repositories defaults to NFS `hard` mode, which can cause host-wide
I/O stalls.
Note: `soft` mount may cause backup data corruption if the NFS server
recovers mid-write, but this only affects the backup copy — not the production
VM. A corrupted backup can be retried; a crashed production VM cannot be
un-crashed.
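As a sketch of fix 2, the invocation could be wrapped in coreutils
`timeout(1)`. `run_backup_with_deadline`, `BACKUP_TIMEOUT`, `MOUNT_POINT`, and
the script path here are invented names/assumptions for illustration, not
existing CloudStack code:

```shell
# Sketch: bound the whole backup with a hard deadline (all names are assumptions).
NASBACKUP="${NASBACKUP:-/usr/share/cloudstack-common/scripts/vm/hypervisor/kvm/nasbackup.sh}"
BACKUP_TIMEOUT="${BACKUP_TIMEOUT:-3600}"   # seconds; should be configurable

run_backup_with_deadline() {
    # SIGTERM at the deadline; SIGKILL 30s later if the script is wedged in NFS I/O
    if ! timeout --signal=TERM --kill-after=30 "${BACKUP_TIMEOUT}" "${NASBACKUP}" "$@"; then
        # Lazy unmount detaches the mount point even while the NFS server is
        # dead, so cleanup itself cannot hang.
        umount -l "${MOUNT_POINT:-/tmp/nasbackup}" 2>/dev/null || true
        return 1
    fi
}
```

Resuming a paused VM before exiting, as suggested above, would slot in next to
the unmount. `timeout` exits with 124 when the deadline fires, so the wrapper
can distinguish a timeout from an ordinary backup failure.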
## Workaround
Manually set mount options on the backup repository:
```sql
UPDATE cloud.backup_repository SET mount_opts='soft,timeo=50,retrans=3'
WHERE id=<repo_id>;
```
And update `/etc/fstab` on KVM hosts if the NFS backup share is persistently
mounted:
```
172.16.3.63:/ACS /tmp/nasbackup nfs soft,timeo=50,retrans=3,_netdev 0 0
```
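Whichever route is used, the options the kernel actually applied can be checked
on the host once the share is mounted, e.g. via `/proc/mounts` (the
`/tmp/nasbackup` path matches the fstab line above):

```shell
# Show the effective NFS mount options as negotiated by the kernel; "soft"
# and the timeo/retrans values should appear in the options field.
grep /tmp/nasbackup /proc/mounts || echo "backup share not currently mounted"

# With nfs-utils installed, "nfsstat -m" prints the same per-mount detail.
```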
## Versions
- CloudStack 4.20
- NFS v4.1
- KVM hosts: Debian/Ubuntu with kernel 5.x/6.x
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.