Hi Martin,
How many active MDS daemons are you running? Given that you have so many
sub-directories and they appear to be static, you may want to set up a
new parallel test directory where all 65535 sub-directories are
round-robin pinned across multiple active MDSes and see how that
compares to the existing configuration (a sketch of the commands is
below). You can see real gains this way, though multi-active MDS setups
may hit bugs that aren't present with a single active MDS.
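Something along these lines should do it; treat it as a sketch, and note
that the fs name, mount point, and number of ranks below are
placeholders you'd need to adapt to your environment:

    # raise the number of active MDS ranks (fs name is a placeholder)
    ceph fs set <fs_name> max_mds 4

    # round-robin pin each immediate sub-directory to ranks 0-3
    i=0
    for d in /mnt/cephfs/testdir/*/; do
        setfattr -n ceph.dir.pin -v $((i % 4)) "$d"
        i=$((i + 1))
    done

Alternatively, ephemeral distributed pinning
(setfattr -n ceph.dir.pin.distributed -v 1 /mnt/cephfs/testdir) hashes
the immediate children across the active ranks without having to touch
every directory individually.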
Since you are not seeing obvious CPU load on any of the existing MDS
processes, I would also be curious what a wallclock profile of the MDS
that is authoritative for the parent directory would show. You can try
mine if you'd like; it's available here:
https://github.com/markhpc/uwpmp
You'll need debug symbols installed. The libunwind backend is the most
reliable, so I would stick with that even though it is slower than the
libdw backend.
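The invocation looks roughly like the following; this is only a sketch,
so check the README for the exact binary name and flags, and run it on
the host with the active MDS for the parent directory:

    # -p is the pid to profile, -n the number of samples to collect
    sudo ./unwindpmp -n 1000 -p $(pgrep ceph-mds) > mds_wallclock.txt

If more than one ceph-mds runs on that host, replace the pgrep with the
pid of the specific daemon you want to profile.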
Thanks,
Mark
On 12/31/25 2:50 AM, Martin Gerhard Loschwitz via ceph-users wrote:
Folks,
I’m getting a bit desperate here debugging a performance issue on one of our
CephFS clusters. Ceph 19.2.3, normal replication with three replicas, the
cluster is performance-optimized according to best practices, and CephFS is
running with a standby-replay MDS. All storage devices are 15.6TB NVMes in
recent Dell servers. The content structure of the CephFS is a single directory
with 65535 sub-directories and 2000-4000 files in each of them. We’re seeing
massive performance issues with a certain application that does roughly the
following:
* Read a file from the CephFS
* Modify it
* Write the modified content back to CephFS under a new path
Average file size is somewhere in the ballpark of 200-300 KB.
The effect we are now seeing is this: when the above-mentioned application
writes to CephFS using four threads, there is already a small performance
impact visible from the outside. When we increase that thread count to 40, all
read and write operations on the filesystem become notably slower. The node
interconnect is 10 Gbit/s; we have done extensive tests using iperf and can
rule out network bandwidth as a bottleneck at the physical layer. We have also
tried to reproduce the problem using fio with numerous combinations of queue
depths, numbers of jobs, and block sizes, but to no avail. fio writes reliably
to the cluster at 1200-1400 Mb/s and at seemingly arbitrary levels of
parallelism, even when running from the same systems where the problematic
application is running.
We have also conducted extensive tests with MDS debug logging enabled, and we
see a notable delay for reads and particularly for writes between the
"set_trace_dist added snap head" and "link_primary_inode" steps, with the
delay varying between almost unnoticeable and 2-3 seconds, even when running
with only four threads. Even then we see numerous processes trying to access
CephFS stuck in D state, their stacks in /proc showing them waiting for an MDS
operation to finish.
We’ve examined the app in question to rule out that it does something strange
with regard to stat() or the like, and that is not the case. The application
is Java-based; it uses nio to open the files for reading and writing.
We’ve ruled out all the usual suspects: the MDS cache limit is 64 GB, the
CPUs are AMD EPYCs, and we do not see any notable load on the servers hosting
the MDSes while encountering the delays. We also do not see the MDS process
itself running at 100% CPU on any core, and there is no network congestion
while the problem appears.
I’m running out of ideas here on what to debug and what to look for. Any hints
or tips would be greatly appreciated. Thank you very much in advance, and a
happy new year to everyone!
Best regards
Martin
--
Best Regards,
Mark Nelson
Head of R&D (USA)
Clyso GmbH
p: +49 89 21552391 12
a: Loristraße 8 | 80335 München | Germany
w: https://clyso.com | e: [email protected]
We are hiring: https://www.clyso.com/jobs/
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]