Folks,

I'm getting a bit desperate here debugging a performance issue on one of our 
CephFS clusters. Ceph 19.2.3, standard replication with three replicas, the 
cluster is performance-optimized according to best practices, and CephFS runs 
with a standby-replay MDS. All storage devices are 15.6 TB NVMe drives in 
recent Dell servers. The content structure of the CephFS is a single top-level 
directory containing 65535 subdirectories with 2000-4000 files each. We're 
seeing massive performance issues with a certain application that does roughly 
the following:

* Read a file from the CephFS
* Modify it
* Write it back to CephFS under a new path with the changed content

The average file size is somewhere in the ballpark of 200-300 KB.

The effect we are now seeing is this: when the above-mentioned application 
writes to CephFS with four threads, there is already a small performance 
impact visible from the outside. When we increase the thread count to 40, all 
read and write operations on that filesystem become notably slower. The node 
interconnect is 10 Gbit/s, and we have done extensive tests with iperf, so we 
can rule out network bandwidth as a bottleneck at the physical layer. We have 
also tried to reproduce the problem with fio using numerous combinations of 
queue depths, job counts and block sizes, but to no avail: fio writes reliably 
to the cluster at 1200-1400 Mb/s with seemingly arbitrary levels of 
parallelism, even when running from the same systems that host the problematic 
application.

We have also conducted extensive tests with debug logging enabled in the MDS, 
and we see a notable delay for reads and particularly for writes between the 
"set_trace_dist added snap head" and "link_primary_inode" steps, with the 
delay varying between almost unnoticeable and 2-3 seconds, even when running 
with only four threads. Even then we see numerous processes accessing CephFS 
stuck in D state, waiting, according to their stacks in /proc, for an MDS 
operation to finish.

We've examined the application in question to rule out that it does something 
strange with regard to stat() or the like, but that is not the case. The 
application is Java-based and uses java.nio to open the files for reading and 
writing.
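
For reference, the access pattern reduced to a minimal sketch looks roughly 
like this (paths, the thread pool setup and the "modify" step are illustrative 
placeholders, not the actual application code):

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class RewriteWorker {

    // Hypothetical paths; the real workload touches one of the 65535
    // subdirectories with 2000-4000 files each.
    private static final Path SRC_DIR = Path.of("/mnt/cephfs/data/0001");
    private static final Path DST_DIR = Path.of("/mnt/cephfs/data/0001.new");

    public static void main(String[] args) throws Exception {
        int threads = args.length > 0 ? Integer.parseInt(args[0]) : 4; // 4 vs. 40
        ExecutorService pool = Executors.newFixedThreadPool(threads);

        Files.createDirectories(DST_DIR);
        try (var entries = Files.list(SRC_DIR)) {
            entries.forEach(src -> pool.submit(() -> {
                try {
                    // Read the whole file (~200-300 KB on average) ...
                    byte[] data = Files.readAllBytes(src);
                    // ... modify the content (placeholder transformation) ...
                    byte[] modified = new String(data, StandardCharsets.UTF_8)
                            .replace("foo", "bar")
                            .getBytes(StandardCharsets.UTF_8);
                    // ... and write it back under a new path.
                    Files.write(DST_DIR.resolve(src.getFileName()), modified);
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}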

We've ruled out the usual suspects: the MDS cache limit is 64 GB, the CPUs are 
AMD EPYCs, and we do not see any notable load on the servers hosting the MDSes 
while encountering the delays. Nor do we see the MDS process pinned at 100% 
CPU load on any core, or any network congestion while the problem appears.

I’m running out of ideas here on what to debug and what to look for. Any hints 
or tips would be greatly appreciated. Thank you very much in advance, and a 
happy new year to everyone!

Best regards
Martin

-- 
Martin Gerhard Loschwitz
Geschäftsführer / CEO, True West IT Services GmbH
Phone: +49 2433 5253130
Mobile: +49 176 61832178
Address: Schmiedegasse 24a, 41836 Hückelhoven, Germany
Legal: HRB 21985, Amtsgericht Mönchengladbach
VAT: DE363893844

True West IT Services GmbH is compliant with the GDPR regulation on data 
protection and privacy in the European Union and the European Economic Area. 
You can request the information on how we collect and process your private data 
according to the law by contacting the email sender.

_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
