Hi Cephers,

I'm testing cluster throughput before moving to production. Ceph version is 13.2.1 (I'll update to 13.2.2).

I run rados bench in parallel from 10 cluster nodes and 10 clients. Right after I start the rados command, the HDDs behind three OSDs are 100% utilized while the others are < 40%. After a short while only one OSD stays at 100% utilization. I stopped this OSD to rule out a hardware issue, but then another OSD on another node started hitting 100% disk utilization during the next rados bench write. The same OSD is fully utilized on every bench run.

Device:  rrqm/s  wrqm/s  r/s   w/s     rMB/s  wMB/s   avgrq-sz  avgqu-sz  await   r_await  w_await  svctm  %util
sdd      0,00    0,00    0,00  518,00  0,00   129,50  512,00    87,99     155,12  0,00     155,12   1,93   100,00

The test pool is replicated, size 3. (Deep) scrubbing is temporarily off. Network, CPU and memory are underutilized during the test.

The exact rados command is:

rados bench --name client.rbd_test -p rbd_test 600 write --no-cleanup --run-name $(hostname)_bench

The same story with:

rados --name client.rbd_test -p rbd_test load-gen --min-object-size 4M --max-object-size 4M --min-op-len 4M --max-op-len 4M --max-ops 16 --read-percent 0 --target-throughput 1000 --run-length 600

Do you see the same behavior? It smells like a particular PG is involved. Is it an effect of running a number of rados bench tasks in parallel? Of course I don't deny it could simply be the cluster's limit, but I'm not sure why only one, and always the same, OSD keeps hitting 100% utilization. Tomorrow I'm going to test the cluster using rbd.

What do your clusters' limits look like? Saturated LACP? 100% utilized HDDs?

Thanks,
Jakub
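P.S. To chase the "particular PG" suspicion, this is the kind of check I have in mind: count how many PGs each OSD is primary for, from `ceph pg dump pgs_brief` output. A sketch only; the sample lines below are illustrative data (not from my cluster), and on a real cluster you would pipe the actual `ceph pg dump pgs_brief` into the awk instead.

```shell
# Illustrative sample in the pgs_brief column layout
# (PG_STAT STATE UP UP_PRIMARY ACTING ACTING_PRIMARY):
sample='PG_STAT STATE        UP      UP_PRIMARY ACTING  ACTING_PRIMARY
1.0     active+clean [3,1,2] 3          [3,1,2] 3
1.1     active+clean [0,3,2] 0          [0,3,2] 0
1.2     active+clean [3,2,1] 3          [3,2,1] 3'

# On a live cluster, replace `echo "$sample"` with: ceph pg dump pgs_brief
# Count PGs per up-primary OSD, hottest first:
echo "$sample" | awk 'NR>1 {count[$4]++} END {for (o in count) print "osd." o, count[o]}' | sort -k2 -rn
# → osd.3 2
#   osd.0 1
```

If one OSD turns out to be primary for far more PGs of the test pool than its peers, that would explain why the same disk saturates on every run.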
_______________________________________________ ceph-users mailing list [email protected] http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
