On 12/16/25 3:37 PM, Anthony D'Atri via ceph-users wrote:
I'm trying to debug low performance on an NVMe-based cluster.
Which specific drives?
INTEL SSDPF2KE032T1
Per the specs it should deliver about 300k IOPS.
I have 24 NVMe drives across 4 servers, plenty of CPU, the cluster is perfectly
balanced, no scrubbing or replication traffic at the moment, and 1024 PGs for 24 OSDs / 17 TB.
I expect to see reasonable performance (~250k IOPS total, 50% r/w, a few hundred
volumes, each with an IOPS cap, pre-warmed). I see about half of that.
I looked at drive utilization; it's about 70% (per atop).
That's meaningless for SSDs.
Yes, I looked deeper, and I found that most of the time there are 0 or 1
in-flight operations (very rarely 2), and all of them complete (from the
NVMe's point of view) within 50-70µs. At the same time (under constant
load from fio) I can see about 10-20 op_wip, so my current theory goes
like this:
Ceph accepts about 20 requests, processes them slowly, and then writes to the
devices (super fast), so the devices sit idle most of the time. The OSD daemons
are not CPU/memory constrained (~40 cores idle, plenty of memory), so
it's just a disparity between Ceph OSD speed and backend speed.
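In case it's useful, this is roughly how I sample the in-flight counts while fio
runs; a minimal sketch, assuming the kernel exposes /sys/block/<dev>/inflight
(device names and sampling parameters below are placeholders, not from this cluster):

    #!/usr/bin/env python3
    # Sample per-device in-flight request counts from sysfs while fio runs,
    # to check the "queue depth is basically 1" observation.
    import time

    DEVICES = ["nvme0n1", "nvme1n1"]   # assumption: your OSD block devices
    INTERVAL = 0.01                    # 10 ms between samples
    SAMPLES = 1000

    def inflight(dev):
        # /sys/block/<dev>/inflight holds two counters: reads and writes in flight
        with open(f"/sys/block/{dev}/inflight") as f:
            r, w = (int(x) for x in f.read().split())
        return r + w

    counts = {d: [] for d in DEVICES}
    for _ in range(SAMPLES):
        for d in DEVICES:
            counts[d].append(inflight(d))
        time.sleep(INTERVAL)

    for d in DEVICES:
        vals = counts[d]
        print(f"{d}: avg queue depth {sum(vals)/len(vals):.2f}, max {max(vals)}")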
I remember, a few years ago (on lesser hardware), I ran a benchmark for Ceph
using brd (the block RAM disk driver), and Ceph was able to churn up to 10k IOPS
per daemon. Nowadays it can do up to 20k, I think.
Actually, it's a good question: what is the maximum IOPS a single OSD
daemon can deliver with perfectly fast underlying storage and
negligible network?
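One way to get a rough per-daemon number is the built-in OSD bench; a sketch,
assuming a reasonably recent release and that osd.0 is a quiet OSD (the byte
counts are kept small so the default small-block bench limits aren't hit, and
the exact output fields may vary by release):

    #!/usr/bin/env python3
    # Rough sketch: drive the built-in OSD bench from Python and report IOPS.
    # Assumes the ceph CLI is available and osd.0 exists.
    import json, subprocess

    OSD_ID = 0                 # assumption: pick an idle OSD
    TOTAL_BYTES = 10 * 2**20   # 10 MiB total
    BLOCK_SIZE = 4096          # 4 KiB blocks

    out = subprocess.check_output(
        ["ceph", "tell", f"osd.{OSD_ID}", "bench",
         str(TOTAL_BYTES), str(BLOCK_SIZE), "--format", "json"])
    res = json.loads(out)

    # Derive IOPS from bytes_per_sec so this works even on releases that
    # do not report an explicit iops field.
    iops = res["bytes_per_sec"] / BLOCK_SIZE
    print(f"osd.{OSD_ID}: ~{iops:.0f} IOPS at {BLOCK_SIZE} B blocks")

This exercises only the local object store path (no replication, no client
network), which is close to the "perfectly fast network" case above.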
But I've noticed that the in-flight count for the drives is basically around 1. That
means that at any given time only one request is being processed. This is a match
for the OSD count / 3 / latency formula, and with one request in flight the NVMe is
showing about 10% of its spec (Intel, DC grade).
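To make that formula concrete, a back-of-envelope sketch (the service times
below are illustrative assumptions, not measurements from this cluster):

    #!/usr/bin/env python3
    # "OSD count / replicas / latency" estimate: with roughly one request in
    # flight per OSD, per-request service time sets the cluster write ceiling.
    OSDS = 24
    REPLICAS = 3

    for service_time_us in (100, 200, 500, 1000):   # assumed per-op OSD service times
        per_osd_iops = 1_000_000 / service_time_us
        cluster_iops = OSDS * per_osd_iops / REPLICAS
        print(f"{service_time_us:>5} us/op -> ~{per_osd_iops:>6.0f} IOPS per OSD, "
              f"~{cluster_iops:>7.0f} client write IOPS cluster-wide")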
Have you updated to the latest firmware with SST?
Yep. Certified and the most up-to-date firmware.