On 10/09/2015 22:56, Robert LeBlanc wrote:
> We are trying to add some additional OSDs to our cluster, but the
> impact of the backfilling has been very disruptive to client I/O and
> we have been trying to figure out how to reduce the impact. We have
> seen some client I/O blocked for more than 60 seconds. There has been
> CPU and RAM head room on the OSD nodes, network has been fine, disks
> have been busy, but not terrible.

It seems you've already exhausted most of the options I know of. When
confronted with this situation, I used a simple script to throttle
backfills (freezing them, then re-enabling them). This helped our VMs at
the time, but you must be prepared for very long migrations and some
experimentation with different schedules. You simply pass it the
number of seconds backfills are allowed to proceed, then the number of
seconds during which they pause.

Here's the script, which should be self-explanatory:
http://pastebin.com/sy7h1VEy

Something like:

./throttler 10 120

limited the impact on our VMs (the idea being that during the 10s the
backfill won't be able to trigger filestore syncs and the 120s pause
will allow the filestore syncs to remove "dirty" data from the journals
without interfering too much with concurrent writes).
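The pastebin script is the authoritative version, but the run/pause loop it implements can be sketched roughly as follows. This is a hypothetical reconstruction, not the actual script: the `CEPH` override and the optional cycle count are assumptions added here so the loop can be dry-run and bounded; the real throttler presumably just loops forever against the live cluster.

```shell
#!/bin/sh
# Rough sketch of a backfill throttler: alternate between letting
# backfills run and freezing them by toggling the cluster-wide
# 'nobackfill' OSD flag.

CEPH=${CEPH:-ceph}   # set CEPH=echo for a dry run (assumption, added here)

throttle() {
    run=$1          # seconds backfills are allowed to proceed
    pause=$2        # seconds backfills stay frozen
    cycles=${3:-0}  # 0 = loop forever (testing aid, added here)
    i=0
    while [ "$cycles" -eq 0 ] || [ "$i" -lt "$cycles" ]; do
        $CEPH osd unset nobackfill   # let backfills proceed
        sleep "$run"
        $CEPH osd set nobackfill     # freeze backfills; journals can drain
        sleep "$pause"
        i=$((i + 1))
    done
}

# throttle 10 120    # 10s of backfill, then a 120s pause, forever
```

The short run window keeps each burst of backfill small, and the long pause gives the filestore syncs room to flush without competing with client writes.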
I believe you need a high filestore sync interval to benefit from this
(we use 30s).
At the very least, the long pause will eventually allow VMs to flush
data to disk regularly instead of being nearly frozen.
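Assuming the 30s above refers to the filestore sync interval, the corresponding ceph.conf fragment would look something like this (standard filestore option name; check the defaults for your release before copying it):

```ini
[osd]
# allow up to 30s between filestore syncs, so the pause in the
# throttling cycle can absorb a larger burst of journaled writes
filestore max sync interval = 30
```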

Note that your PGs are more than 10G each; if the OSDs can't stop a
backfill before finishing the transfer of the current PG, this won't
help (I assume backfills go through the journals, which probably won't
be able to act as write-back caches anymore, as a single PG will be
enough to fill them up).

Best regards,

Lionel
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com