Hi Cephers,

I'm testing cluster throughput before moving to production. The Ceph
version is 13.2.1 (I'll update to 13.2.2).

I run rados bench from 10 cluster nodes and 10 clients in parallel.
Right after the rados command starts, the HDDs behind three OSDs are 100%
utilized while the others stay below 40%. After a short while only one OSD
stays at 100% utilization. I stopped that OSD to rule out a hardware issue,
but then another OSD on another node started hitting 100% disk utilization
during the next rados bench write. The same OSD is fully utilized on every
bench run.

Device:   rrqm/s  wrqm/s    r/s     w/s   rMB/s   wMB/s  avgrq-sz
sdd         0,00    0,00   0,00  518,00    0,00  129,50    512,00

          avgqu-sz   await  r_await  w_await  svctm   %util
             87,99  155,12     0,00   155,12   1,93  100,00
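
The stats above are iostat -x output for the HDD behind the hot OSD. To
cross-check from the Ceph side, something like this should show the same
OSD standing out (the OSD id is just a placeholder):

ceph osd perf
ceph osd find <id_of_hot_osd>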

The test pool is replicated with size 3. (Deep) scrubbing is temporarily off.
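
For anyone double-checking the setup, the pool and scrub flags can be
verified roughly like this (rbd_test is the test pool):

ceph osd pool get rbd_test size
ceph osd pool get rbd_test pg_num
ceph osd dump | grep flags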

Networking, CPU, and memory are underutilized during the test.

The exact rados command is:
rados bench --name client.rbd_test -p rbd_test 600 write --no-cleanup
--run-name $(hostname)_bench

It's the same story with:
rados --name client.rbd_test -p rbd_test load-gen --min-object-size 4M
--max-object-size 4M --min-op-len 4M --max-op-len 4M --max-ops 16
--read-percent 0 --target-throughput 1000 --run-length 600
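
To see where the bench objects actually land, I can map a few of them to
PGs/OSDs, roughly like this (the object name is whatever rados ls returns;
exact names depend on the bench defaults):

rados --name client.rbd_test -p rbd_test ls | head
ceph osd map rbd_test <object_name_from_ls>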

Do you see the same behavior? It smells like something PG-related. Is it
an effect of running a number of rados bench tasks in parallel?
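
To test that suspicion I plan to compare PG counts per OSD and list the
PGs sitting on the hot OSD, e.g. (the osd id is a placeholder):

ceph osd df tree
ceph pg ls-by-osd osd.<hot_osd_id>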

Of course, I don't deny this may simply be the cluster's limit, but I'm not
sure why only one, and always the same, OSD keeps hitting 100% utilization.
Tomorrow I'm going to test the cluster using rbd.
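
The rbd test will probably be something along these lines (image name and
sizes are just placeholders, not necessarily what I'll use):

rbd --name client.rbd_test create --size 100G rbd_test/bench_img
rbd --name client.rbd_test bench --io-type write --io-size 4M \
    --io-threads 16 --io-total 20G rbd_test/bench_img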

What does your cluster's limit look like? Saturated LACP? 100% utilized HDDs?

Thanks,
Jakub