Hi List,

We recently experienced a host failure in our 4-host cluster, which is used
to store RBD images. Unfortunately the recovery/backfill traffic from the 12
OSDs going offline caused all IO from QEMU/KVM to the cluster to come to a
standstill until recovery had completed, a 6-hour IO pause for the guests.

Our understanding was that backfill/recovery IO runs at a lower priority
than normal client IO, but in our case that does not appear to be happening.
We have the following tunables set in our ceph.conf:

         osd recovery op priority = 2
         osd max backfills = 1
         osd recovery max active = 1
         osd recovery threads = 1
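
For what it's worth, the same values can also be injected at runtime without
restarting the OSDs, along these lines (the admin socket path shown is the
default; adjust it to your deployment):

         # push the recovery throttles to all OSDs on a live cluster
         ceph tell 'osd.*' injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 2'

         # verify that a given OSD picked the values up
         ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep -E 'osd_max_backfills|osd_recovery'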


Each of our OSD hosts has 2x Intel S3700 SSDs for journals and 12x Seagate
Constellation ES.3 drives as OSDs, with 32GB of RAM and 1 core per OSD. Each
host has 2x 10-gigabit Ethernet links that are bonded and carry separate
front-end and back-end VLANs. It appears Ceph is thrashing the backing OSD
disks with read requests during recovery.
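
In case anyone wants to look for the same pattern, the per-disk read load is
easy to see during recovery with something like:

         # extended per-device stats every 5 seconds; watch r/s and %util
         # climb on the OSD data disks while recovery is running
         iostat -dxm 5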

We were running Ceph Dumpling 0.67.4 when we encountered this problem.

I recently noticed the following text on the wiki: "Tip: Newer versions of
Ceph provide better recovery handling by preventing recovering OSDs from
using up system resources so that up and in OSDs aren't available or are
otherwise slow." This seems to describe the slowness we are experiencing.
In which version of Ceph was this behavior resolved?



Regards,

Andrew Thrift