Hi Martin,

How many active MDS processes are you running?  Given you have so many sub-directories and they appear to be static, you may want to set up a new parallel test directory where you have all 65535 sub-directories round-robin pinned across multiple active MDSes and see how that compares to the existing configuration.  You can achieve real gains this way, though multi-active MDSes may hit some bugs that aren't present in single-active MDS setups.
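
Something like this is what I have in mind for the pinning step (rough Python sketch; the mount point, test directory and number of active ranks are placeholders you'd adjust for your cluster).  It just walks the sub-directories and sets the standard ceph.dir.pin xattr round-robin across the ranks:

  # Round-robin pin sub-directories across the active MDS ranks.
  # Paths and counts below are placeholders -- adjust for your cluster.
  import os

  TEST_DIR = "/mnt/cephfs/testdir"   # hypothetical parallel test directory
  NUM_ACTIVE_MDS = 4                 # number of active MDS ranks

  subdirs = sorted(e.path for e in os.scandir(TEST_DIR) if e.is_dir())
  for i, path in enumerate(subdirs):
      rank = i % NUM_ACTIVE_MDS
      # ceph.dir.pin pins a directory subtree to a specific MDS rank
      os.setxattr(path, "ceph.dir.pin", str(rank).encode())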

Since you are not seeing obvious CPU load on any of the existing MDS processes, I would also be curious what a wallclock profile of the authoritative MDS for the parent directory would show.  You can try my wallclock profiler if you'd like; it's available here:

https://github.com/markhpc/uwpmp

You'll need debug symbols installed.  The libunwind backend is the most reliable, so I would stick with that even though it is slower than libdw.
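
The profiler attaches to a running process by PID, so you'll want the PID of the ceph-mds daemon on the host that is authoritative for that directory.  pidof ceph-mds works, or a quick /proc scan like the one below (nothing uwpmp-specific here, just a convenience sketch):

  # Find local ceph-mds PIDs by scanning /proc/<pid>/comm.
  import os

  def find_mds_pids():
      pids = []
      for pid in filter(str.isdigit, os.listdir("/proc")):
          try:
              with open(f"/proc/{pid}/comm") as f:
                  if f.read().strip() == "ceph-mds":
                      pids.append(int(pid))
          except OSError:
              continue  # process exited while we were scanning
      return pids

  print(find_mds_pids())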


Thanks,
Mark

On 12/31/25 2:50 AM, Martin Gerhard Loschwitz via ceph-users wrote:
Folks,

I’m getting a bit desperate here debugging a performance issue on one of our 
CephFS clusters. Ceph 19.2.3, normal replication with three replicas; the 
cluster is performance-optimized according to best practices, and CephFS runs 
with a standby-replay MDS. All storage devices are 15.6 TB NVMe drives in 
recent Dell servers. The content of the CephFS is a single directory with 65535 
sub-directories and 2000-4000 files in each of them. We’re seeing massive 
performance issues with a certain application that does roughly the following:

* Read a file from the CephFS
* Modify it
* Write it back to CephFS under a new path with the changed content

Average file size is somewhere in the ballpark of 200-300 KB.
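
To make the pattern concrete, it is essentially the following (simplified Python sketch with placeholder paths and thread count; the real application is the Java/NIO program described below):

  # Simplified version of the access pattern: read a file, modify it,
  # and write the result under a new path, with many worker threads.
  import os
  from concurrent.futures import ThreadPoolExecutor

  SRC_ROOT = "/mnt/cephfs/data"      # placeholder source tree
  DST_ROOT = "/mnt/cephfs/data-new"  # placeholder destination tree
  THREADS = 40                       # 4 vs. 40 is where behaviour changes

  def process(rel_path):
      with open(os.path.join(SRC_ROOT, rel_path), "rb") as f:
          data = f.read()                   # whole file, roughly 200-300 KB
      data = data.upper()                   # stand-in for the real modification
      dst = os.path.join(DST_ROOT, rel_path)
      os.makedirs(os.path.dirname(dst), exist_ok=True)
      with open(dst, "wb") as f:
          f.write(data)                     # write back under a new path

  def run(rel_paths):
      with ThreadPoolExecutor(max_workers=THREADS) as pool:
          list(pool.map(process, rel_paths))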

The effect we are now seeing is this: when the above-mentioned application 
writes to CephFS using four threads, there is already a small performance 
impact visible from the outside. When we increase the thread count to 40, all 
read and write operations on the filesystem become notably slower. The node 
interconnect is 10 Gbit/s; we have done extensive tests with iperf and can rule 
out network bandwidth as a bottleneck on the physical layer. We have also tried 
to reproduce the problem with fio, using numerous combinations of queue depths, 
job counts and block sizes, but to no avail. fio writes reliably to the cluster 
at rates of 1200-1400 Mb/s and with seemingly arbitrary degrees of parallelism, 
even when run from the same systems where the problematic application runs.

We have also conducted extensive tests with MDS debug logging enabled, and we 
see a notable delay for reads and particularly for writes between the 
„set_trace_dist added snap head“ and „link_primary_inode“ steps, with the delay 
varying between almost unnoticeable and 2-3 seconds, even when running with 
only four threads. Even then, numerous processes trying to access CephFS sit in 
D state and, according to their stacks in /proc, are waiting for an MDS 
operation to finish.

We’ve examined the app in question to rule out that it does anything strange 
with regard to stat() or the like, but that is not the case. The application is 
Java-based and uses NIO to open files for reading and writing.

We’ve ruled out all the usual suspects: the MDS cache limit is 64 GB, the CPUs 
are AMD EPYCs, and we do not see any notable load on the servers hosting the 
MDSes while the delays occur. We also do not see the MDS process running at 
100% CPU on any core, and we do not see network congestion while the problem 
appears.

I’m running out of ideas here on what to debug and what to look for. Any hints 
or tips would be greatly appreciated. Thank you very much in advance, and a 
happy new year to everyone!

Best regards
Martin

--
Best Regards,
Mark Nelson
Head of R&D (USA)

Clyso GmbH
p: +49 89 21552391 12
a: Loristraße 8 | 80335 München | Germany
w: https://clyso.com | e: [email protected]

We are hiring: https://www.clyso.com/jobs/
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
