Hello,

we run ordinary (HP brand) physical file servers (not VMs) as NFS backing
store for eight ESXi servers, which in turn run around 120 Linux and Windows
VMs. The file servers have Xeon CPUs and 64 GiB of RAM. The infrastructure is
10 GbE with 9k jumbo frames. Over the day, the load average is usually below 10.

The backing store consists of SSDs and SAS disks, arranged as two mdadm-based
RAID5 volumes: a fast (SSD) one and a slower (SAS) one. Each md device carries
an ext4 filesystem and is mounted locally.
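
The layout is roughly like this (device names, mount paths and mount options
below are made up for illustration; the real ones differ):

```
# /etc/fstab (illustrative)
/dev/md0  /srv/nfs/ssd  ext4  defaults,noatime  0 2
/dev/md1  /srv/nfs/sas  ext4  defaults,noatime  0 2
```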

The NFS kernel server exports these mount points to said ESXi servers.
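
A minimal sketch of the exports (paths and subnet are hypothetical;
no_root_squash because ESXi mounts NFS datastores as root):

```
# /etc/exports (illustrative)
/srv/nfs/ssd  192.0.2.0/24(rw,sync,no_subtree_check,no_root_squash)
/srv/nfs/sas  192.0.2.0/24(rw,sync,no_subtree_check,no_root_squash)
```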

We're using Veeam as the backup solution for the VMs themselves. Veeam performs 
massively parallel accesses to the vmdk resources to get the backups done as 
fast as possible.

This frequently leads to messages like this in syslog:

rpc-srv/tcp: nfsd: sent only 68468 when sending 131204 bytes - shutting down 
socket

Unfortunately, ESXi doesn't support NFS over UDP.

Extensive research and testing showed that requests to the slower SAS storage 
block threads, preventing them from serving requests to the faster SSD 
storage. When not mixing storage tiers on the same server, 64 threads are 
perfectly fine, for both SSDs and SAS. When mixing tiers, we need to go up to 
1024 threads to suppress these messages completely. I'm not sure what 
implications such a high thread count might have. Once, when the machine was 
under heavy I/O load, raising the thread count via /proc made the ssh session 
unresponsive for almost a minute.
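
For completeness, this is how we adjust the thread count (1024 is the value
from our tests; the config file path is the Debian-style one and may differ on
other distributions):

```
# At runtime:
echo 1024 > /proc/fs/nfsd/threads
cat /proc/fs/nfsd/threads

# Persistently, in /etc/default/nfs-kernel-server:
RPCNFSDCOUNT=1024
```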

Of course, the NFS server can't know how fast the backing store can handle 
incoming requests. The only thing we can try is to split the backup jobs, so 
VM backups are done in one batch for SSD-only VMs, and in another for VMs 
mixing SAS ("cheap bulk storage") and SSD (so OS upgrades go fast). Not very 
convenient, and even less "suitable for everyday use".

As far as I understand it, if a thread is already waiting for I/O, requests are 
dispatched to the next free thread. But if all threads are exhausted, all 
waiting for I/O to complete, what exactly happens when new requests come in? I 
can't tell from the error message above.

I'm not entirely sure if this is the right place to ask. But then, I have to 
start somewhere. :-)

What other possibilities do you see to avoid this excessive number of 
threads? I guess the most important point is to separate requests between 
storage tiers. Some (not really refined!) ideas:

- A given thread could hand its work back to the dispatcher after a certain 
timeout and request the next batch of work. There's a chance that this next 
batch is for the SSDs and can thus be served within the timeout. Requests for 
SAS would stack up in the dispatcher, but the whole thing would stay 
responsive for fast-serviceable SSD requests.

- Make the nfs-kernel-server mountpoint-aware (a multi-queue approach)? While 
parsing /etc/exports, entries for the same mountpoint would be assigned an 
initial percentage of the available thread pool (threads thus being shared 
among exports of the same mountpoint, but partitioned between mountpoints). 
This prevents work for different storage tiers from being mixed. How to 
distribute the threads? Statically, evenly? Dynamically, according to what?

- Do nothing about it. The problem is rare in the real world, fixing it would 
generate a lot of work for the kernel devs, the changes would probably affect 
other installations in a bad way, and mechanical disks are a thing of the past 
anyway.
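
To make the second idea more concrete, here's a toy userspace sketch (not
kernel code; all names, paths and numbers are made up): one queue plus a
dedicated worker pool per mountpoint, so slow SAS requests can never occupy
the SSD pool's threads:

```python
import queue
import threading
import time

# Toy model: one queue and a fixed worker share per storage tier.
POOLS = {
    "/srv/nfs/ssd": {"threads": 4, "queue": queue.Queue()},
    "/srv/nfs/sas": {"threads": 4, "queue": queue.Queue()},
}

results = []
results_lock = threading.Lock()

def worker(tier):
    q = POOLS[tier]["queue"]
    while True:
        req = q.get()
        if req is None:          # shutdown sentinel
            q.task_done()
            return
        time.sleep(req["cost"])  # simulate backing-store latency
        with results_lock:
            results.append((tier, req["id"]))
        q.task_done()

def dispatch(path, req):
    # Route the request to the pool owning the longest matching mountpoint.
    tier = max((m for m in POOLS if path.startswith(m)), key=len)
    POOLS[tier]["queue"].put(req)

def main():
    threads = []
    for tier, cfg in POOLS.items():
        for _ in range(cfg["threads"]):
            t = threading.Thread(target=worker, args=(tier,))
            t.start()
            threads.append(t)

    # Slow SAS and fast SSD requests arrive interleaved, as during a backup.
    for i in range(8):
        dispatch("/srv/nfs/sas/vm%d.vmdk" % i, {"id": i, "cost": 0.2})
        dispatch("/srv/nfs/ssd/vm%d.vmdk" % i, {"id": i, "cost": 0.01})

    for cfg in POOLS.values():
        cfg["queue"].join()
        for _ in range(cfg["threads"]):
            cfg["queue"].put(None)  # stop the workers
    for t in threads:
        t.join()

main()
```

The SSD requests all complete long before the SAS queue drains, which is
exactly the isolation the per-mountpoint split is meant to provide. The open
question from above remains: how to size each pool's share.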

Thoughts, anyone?

:wq! PoC

PGP-Key: DDD3 4ABF 6413 38DE - https://www.pocnet.net/poc-key.asc

