On 12/16/25 3:37 PM, Anthony D'Atri via ceph-users wrote:
I'm trying to debug low performance on an NVMe-based cluster.
Which specific drives?

INTEL SSDPF2KE032T1

Per specs it should give about 300k IOPS.

I have 24 NVMe drives in 4 servers, plenty of CPU, the cluster is perfectly balanced, no 
scrubbing or replication at the moment, and 1024 PGs for 24 OSDs / 17 TB.

I expect to see reasonable performance (~250k IOPS total, 50% r/w, a few hundred 
volumes capped by IOPS, pre-warmed). I see about half of that.

I looked at drive utilization; it's about 70% (per atop).
That's meaningless for SSDs.

Yes, I looked deeper, and I found that most of the time there are 0 or 1, very rarely 2, in-flight operations, and all of them complete (from the NVMe's point of view) within 50-70µs. At the same time (under constant load from fio) I see about 10-20 op_wip.
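In case anyone wants to reproduce the device-side part: the in-flight counts can be sampled straight from sysfs while fio runs. A rough sketch below; the device names are placeholders for the actual OSD data disks. The op_wip figure comes from the OSD admin socket perf dump.

    # Poll /sys/block/<dev>/inflight (reads and writes currently queued to the
    # device) at ~1 ms intervals and histogram the total queue depth.
    import time
    from collections import Counter

    DEVICES = ["nvme0n1", "nvme1n1"]        # placeholders, adjust to the OSD devices
    SAMPLES, INTERVAL = 10_000, 0.001       # ~10+ seconds of 1 ms samples

    hist = {dev: Counter() for dev in DEVICES}
    for _ in range(SAMPLES):
        for dev in DEVICES:
            with open(f"/sys/block/{dev}/inflight") as f:
                reads, writes = map(int, f.read().split())
            hist[dev][reads + writes] += 1
        time.sleep(INTERVAL)

    for dev, counts in hist.items():
        total = sum(counts.values())
        dist = ", ".join(f"{d}: {n / total:.1%}" for d, n in sorted(counts.items()))
        print(f"{dev}: queue depth distribution {dist}")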

So my current theory goes like this: Ceph accepts about 20 requests, processes them slowly, and then writes to the device (which is super fast), so the device sits idle most of the time. The OSD daemons are not CPU/memory constrained (~40 cores idle, plenty of memory), so it's just a disparity between Ceph OSD speed and backend speed.
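Back-of-the-envelope with Little's law, using the numbers above (treating op_wip as the per-OSD concurrency and "about half" of 250k as the achieved rate are my assumptions):

    # Little's law (L = lambda * W): with ~20 ops in flight per OSD and the
    # observed throughput, almost all of an op's lifetime is spent inside the
    # OSD rather than at the drive.
    achieved_iops = 250_000 / 2       # "about half" of the expected 250k
    osds          = 24
    iops_per_osd  = achieved_iops / osds      # ~5.2k ops/s per daemon
    op_wip        = 20                        # in-flight ops per OSD under load
    residence_s   = op_wip / iops_per_osd     # avg time an op spends in the OSD
    device_time_s = 60e-6                     # ~50-70 us measured at the NVMe

    print(f"per-OSD throughput     ~ {iops_per_osd:,.0f} ops/s")
    print(f"time inside the OSD    ~ {residence_s * 1e3:.1f} ms per op")
    print(f"of which at the device ~ {device_time_s * 1e6:.0f} us "
          f"({device_time_s / residence_s:.1%})")

If that's roughly right, each drive is busy only a percent or two of the time, which matches the 0-1 in-flight picture.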


I remember that a few years ago (on lesser hardware) I did a benchmark of Ceph using brd (block RAM disk), and Ceph was able to churn out up to 10k IOPS per daemon. Nowadays it can do up to 20k, I think.

Actually, it's a good question: what is the maximum IOPS a single OSD daemon can deliver with perfectly fast underlying storage and a negligible network?
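For scale, some rough arithmetic (the last part assumes 3x replication and 50% writes, which is not what's running right now):

    # What the target means per daemon: 250k client IOPS over 24 OSDs with no
    # replication is ~10.4k ops/s per OSD, i.e. roughly 96 us of OSD time per op.
    target_iops     = 250_000
    osds            = 24
    per_osd_needed  = target_iops / osds
    per_op_budget_s = 1 / per_osd_needed

    print(f"needed per OSD (no replication): {per_osd_needed:,.0f} ops/s")
    print(f"per-op OSD time budget:          {per_op_budget_s * 1e6:.0f} us")

    # If 3x replication comes back, each client write becomes three OSD-level
    # writes; at 50% writes the backend load roughly doubles.
    write_frac  = 0.5
    backend_ops = target_iops * ((1 - write_frac) + 3 * write_frac)
    print(f"with 3x replication: {backend_ops:,.0f} backend ops/s "
          f"(~{backend_ops / osds:,.0f} per OSD)")

So a ~20k per-daemon ceiling leaves some headroom without replication, but would be right at the limit once 3x replication is back.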

but I've noticed that the in-flight count for the drives is basically around 1. That 
means that at any given time only one request is being processed. This is a match 
for the OSD count / 3 / latency formula, and with one in-flight operation the NVMe delivers about 
10% of its spec (Intel, DC grade).
Have you updated to the latest firmware with SST?

Yep. Certified and the most up to date firmware.
