Hello,
we have ordinary (HP brand) file servers (not VMs) serving as NFS backing
store for eight ESXi hosts, which in turn run around 120 Linux and Windows VMs.
The servers have Xeon CPUs and 64 GiB of RAM. The infrastructure is 10GbE with
9k jumbo frames. Over the day, the load average usually stays below 10.
The backing store consists of SSDs and SAS disks, organized as two mdadm-based
RAID5 volumes: a fast one (SSD) and a slower one (SAS). Each md device carries
an ext4 filesystem and is mounted locally.
The NFS kernel server exports these mount points to the aforementioned ESXi hosts.
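To make the setup concrete, here is a rough sketch of the layout. Device
names, paths, the client subnet and export options are made up for
illustration; they are not our actual configuration:

```shell
# Illustrative sketch only -- device names, paths and options are invented.
# Two md RAID5 sets, one per storage tier, each carrying ext4:
mount | grep /srv/nfs
#   /dev/md0 on /srv/nfs/ssd type ext4 (rw,relatime)
#   /dev/md1 on /srv/nfs/sas type ext4 (rw,relatime)

# Both mount points exported to the ESXi hosts via the kernel nfsd:
cat /etc/exports
#   /srv/nfs/ssd  10.0.0.0/24(rw,sync,no_root_squash)
#   /srv/nfs/sas  10.0.0.0/24(rw,sync,no_root_squash)
```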
We're using VEEAM as the backup solution for the VMs themselves. VEEAM performs
massively parallel accesses to the vmdk resources to finish backups as fast as
possible. This leads to frequent messages like this in syslog:
rpc-srv/tcp: nfsd: sent only 68468 when sending 131204 bytes - shutting down
socket
Unfortunately, ESXi doesn't support NFS over UDP.
Extensive research and testing showed that requests to the slower SAS storage
block nfsd threads, preventing them from serving requests to the faster SSD
storage. When storage tiers are not mixed on the same server, 64 threads are
perfectly fine for both SSDs and SAS. When tiers are mixed, we need to go up to
1024 threads to suppress these messages completely. I'm not sure what
implications such a high thread count might have. Once, when the machine was
under heavy I/O load, raising the thread count via /proc made the ssh session
unresponsive for almost a minute.
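For reference, this is how we inspect and raise the thread count at runtime
(the value 1024 is the one mentioned above; the Debian-style persistence hint
assumes the nfs-kernel-server package):

```shell
# Show the current number of nfsd threads:
cat /proc/fs/nfsd/threads

# Raise it at runtime; both forms are equivalent:
rpc.nfsd 1024
# echo 1024 > /proc/fs/nfsd/threads

# To persist across reboots on Debian-style systems, set
#   RPCNFSDCOUNT=1024
# in /etc/default/nfs-kernel-server and restart the service.
```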
Of course, the NFS server can't know how fast the backing store can handle
incoming requests. The only thing we can try is to split the backup jobs, so
that SSD-only VMs are backed up in one batch, and VMs mixing SAS ("cheap bulk
storage") and SSD (so OS upgrades run fast) in another. Not very convenient,
and even less "suitable for everyday use".
As far as I understand, if a thread is already waiting for I/O, new requests
are dispatched to the next free thread. If all threads are exhausted, all
waiting for I/O to complete, what exactly happens when new requests come in? I
can't tell from the error message above.
I'm not entirely sure if this is the right place to ask, but I have to start
somewhere. :-)
Which other possibilities do you see to avoid this excessive number of
threads? I guess the most important point is to separate requests by storage
tier. Some (admittedly unrefined) ideas:
- A given thread could hand work back to the dispatcher after a certain timeout
and request the next batch of work. There's a chance that the next batch is for
SSDs and thus can be served within the timeout. Requests for SAS would stack up
in the dispatcher, but the whole thing would stay responsive for quickly
serviceable SSD requests.
- Make the nfs-kernel-server mountpoint-aware (a multi-queue approach)? While
parsing /etc/exports, entries for the same mountpoint would be assigned an
initial percentage of the available thread pool (threads thus being shared
between different exports of the same mountpoint, but not across mountpoints).
This prevents work for different storage tiers from being mixed. How to
distribute the threads? Statically, evenly? Dynamically, according to what?
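To illustrate that second idea, a hypothetical exports syntax might look like
the sketch below. The "threads=" option does not exist in any nfs-utils
release; it is purely an illustration of per-mountpoint thread shares:

```shell
# HYPOTHETICAL exports syntax -- "threads=" is NOT a real option,
# it only sketches how a per-mountpoint thread share could be declared:
#
#   /srv/nfs/ssd  10.0.0.0/24(rw,sync,threads=75%)
#   /srv/nfs/sas  10.0.0.0/24(rw,sync,threads=25%)
#
# nfsd would then never let slow SAS I/O occupy the SSD share of the pool.
```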
- Do nothing about it. The problem is rare in the real world, fixing it would
generate a lot of work for the kernel devs, it would introduce changes that
might affect other installations badly, and mechanical disks are already a
thing of the past anyway.
Thoughts, anyone?
:wq! PoC
PGP-Key: DDD3 4ABF 6413 38DE - https://www.pocnet.net/poc-key.asc