Folks,

I'm getting a bit desperate here debugging a performance issue on one of our CephFS clusters. Ceph 19.2.3, standard replication with three replicas, cluster performance-optimized according to best practices, CephFS running with a standby-replay MDS. All storage devices are 15.6 TB NVMes in recent Dell servers. The directory structure on CephFS is a single directory containing 65535 subdirectories, with 2000-4000 files in each of them.

We are seeing massive performance issues with a certain application that does roughly the following (a simplified sketch of the pattern follows the list):

* Read a file from CephFS
* Modify it in memory
* Write it back to CephFS under a new path, with changed content
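In Java terms, each worker thread does something like this. This is a minimal sketch of the pattern only; the class name, the mount point, the file naming, and the transform() step are placeholders, not our actual code:

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Simplified model of the application's per-file I/O pattern:
// read a whole file, modify it in memory, write it to a new path.
public class RewriteTask implements Runnable {

    private final Path src; // existing file, ~200-300 KB
    private final Path dst; // new path in another subdirectory

    RewriteTask(Path src, Path dst) {
        this.src = src;
        this.dst = dst;
    }

    @Override
    public void run() {
        try {
            byte[] in = Files.readAllBytes(src);  // open + read + close
            byte[] out = transform(in);           // pure in-memory work
            Files.write(dst, out);                // create + write + close
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    // Placeholder for the real modification logic.
    private static byte[] transform(byte[] in) {
        return new String(in, StandardCharsets.UTF_8)
                .replace("old", "new")
                .getBytes(StandardCharsets.UTF_8);
    }

    // Driver: a fixed-size pool; going from 4 to 40 threads is what
    // makes the whole filesystem slow down. Mount point and naming
    // scheme here are hypothetical.
    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(40);
        Path base = Path.of("/mnt/cephfs");
        for (int i = 0; i < 100_000; i++) {
            Path src = base.resolve("dir-" + (i % 65535)).resolve("file-" + i);
            Path dst = base.resolve("dir-" + ((i + 7) % 65535)).resolve("file-" + i + ".new");
            pool.submit(new RewriteTask(src, dst));
        }
        pool.shutdown();
    }
}

Note that every iteration creates a brand-new file, usually in a different directory, so the workload is at least as metadata-heavy as it is data-heavy; that may be why fio (see below) does not reproduce the problem.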
The average file size is somewhere in the ballpark of 200-300 KB.

The effect we are now seeing is this: when the above-mentioned application writes to CephFS using four threads, there is little performance impact visible from the outside. When we increase the thread count to 40, all read and write operations on that filesystem become notably slower.

The node interconnect is 10 Gbit/s; we have done extensive tests using iperf and can guarantee that network bandwidth is not a concern on the physical layer. We have also tried to reproduce the problem using fio, with numerous combinations of queue depths, numbers of jobs, and block sizes, but to no avail. fio writes reliably to the cluster at 1200-1400 Mb/s with seemingly arbitrary degrees of parallelism, even when run from the same systems hosting the problematic application.

We have also conducted extensive tests with debug logging enabled in the MDS. For reads, and particularly for writes, there is a notable delay between the "set_trace_dist added snap head" and "link_primary_inode" steps, varying between almost unnoticeable and 2-3 seconds, even when running with only four threads. The small helper we use to pull these gaps out of the MDS log follows below.
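This is roughly the extraction logic. It is a rough sketch: the class name is made up, it assumes the timestamp is the first token of each log line in the usual "2025-01-02T10:15:30.123+0000" form (adjust the SSS fractional part if your build logs more digits), and it pairs the two messages by order of appearance rather than by request id, which is imprecise under concurrency but good enough to spot the multi-second outliers:

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;

// Scan an MDS debug log and print the delay between consecutive
// "set_trace_dist added snap head" and "link_primary_inode" lines.
public class MdsLogGap {
    // Assumed timestamp format; adapt to your log output if needed.
    private static final DateTimeFormatter TS =
            DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSSZ");

    public static void main(String[] args) throws IOException {
        OffsetDateTime start = null;
        try (BufferedReader r = Files.newBufferedReader(Path.of(args[0]))) {
            String line;
            while ((line = r.readLine()) != null) {
                if (line.contains("set_trace_dist added snap head")) {
                    start = parseTs(line);
                } else if (start != null && line.contains("link_primary_inode")) {
                    Duration gap = Duration.between(start, parseTs(line));
                    if (gap.toMillis() > 100) { // only report notable delays
                        System.out.println(gap.toMillis() + " ms: " + line);
                    }
                    start = null;
                }
            }
        }
    }

    // Assumes the first whitespace-delimited token is the timestamp.
    private static OffsetDateTime parseTs(String line) {
        return OffsetDateTime.parse(line.split("\\s+")[0], TS);
    }
}

Invoked as "java MdsLogGap.java /var/log/ceph/ceph-mds.<name>.log", it prints every gap above 100 ms together with the closing log line.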
Even at four threads, we see numerous processes in the system that try to access CephFS sitting in D state; according to their stacks in /proc, they are waiting for an MDS operation to finish.

We have examined the application in question to rule out that it does something strange with regard to stat() or the like, but that is not the case. The application is Java-based and uses NIO to open files for reading and writing.

We have also ruled out the usual suspects: the MDS cache limit is 64 GB, the CPUs are AMD EPYCs, and we do not see any notable load on the servers hosting the MDSes while encountering the delays. We also never see the MDS process running at 100% CPU load on any core, and there is no network congestion while the problem appears.

I'm running out of ideas on what to debug and what to look for. Any hints or tips would be greatly appreciated. Thank you very much in advance, and a happy new year to everyone!

Best regards
Martin

--
Martin Gerhard Loschwitz
Geschäftsführer / CEO, True West IT Services GmbH

Phone: +49 2433 5253130
Mobile: +49 176 61832178
Address: Schmiedegasse 24a, 41836 Hückelhoven, Germany
Legal: HRB 21985, Amtsgericht Mönchengladbach
VAT: DE363893844

True West IT Services GmbH is compliant with the GDPR regulation on data protection and privacy in the European Union and the European Economic Area. You can request the information on how we collect and process your private data according to the law by contacting the email sender.

_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]