Hello ceph-users.

Short description: during snapshot removal, OSD utilisation goes up to 100%,
which leads to slow requests and VM failures due to IOPS stalls.

We're using OpenStack Cinder with a Ceph cluster as the volume backend. Ceph
version is 10.2.6.
We're also using cinder-backup to create backups of those volumes in Ceph,
which uses the snapshot and layering features, I guess.
The cluster consists of 5 OSD nodes with mixed SSD/HDD storage, separate SSDs
for HDD journals, separate 10 Gb/s public and private networks, and 3 MON
nodes. We also have a single "backup" node which is responsible for the
"backups" pool, handled by CRUSH map rules.
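For context, pinning the "backups" pool to the dedicated backup node looks roughly like this (a sketch, assuming Jewel-era CLI syntax; the bucket name "backup-root", host name "backup01", and rule name "backup_rule" are illustrative, not our actual names):

```shell
# Create a separate CRUSH root that contains only the backup node
# (names "backup-root" and "backup01" are placeholders).
ceph osd crush add-bucket backup-root root
ceph osd crush move backup01 root=backup-root

# A simple rule selecting OSDs only from that root, with "osd" as the
# failure domain (a single node, so no host-level spread is possible).
ceph osd crush rule create-simple backup_rule backup-root osd

# Point the "backups" pool at the new rule. Jewel still calls the pool
# property "crush_ruleset"; look up the id with
#   ceph osd crush rule dump backup_rule
ceph osd pool set backups crush_ruleset 1
```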

While creating a backup everything looks good. The backup node is overwhelmed
with load, but that's to be expected. The problem begins when we start
deleting old backups.
While an old backup is being deleted, utilisation of the main nodes' OSDs
skyrockets up to 100%. This leads to slow requests in the main storage pools,
which, given enough time, can lead to process hangs, or at least SCSI reset
attempts, and in the worst case - VM hangs.

I'm looking for a solution to avoid this issue.

So far I understand that I don't know how Ceph snapshot mechanics work at
all, because I can't figure out why deleting a backup leads to requests not
to the backup OSDs, where the backup data is actually stored, but rather to
the main OSDs, where the original objects reside. Is there any good doc on
this?

Googling shows that I'm not the first one to encounter this issue, but I
couldn't find an exact solution anywhere. Here's a short list of ideas:
 - use osd_snap_trim_priority = 1. This is reported as not very helpful,
since it is already lower than the client IO priority of 63;
 - use osd_snap_trim_sleep, but as far as I can see it's broken in Jewel and
will only be fixed in 10.2.8 - http://tracker.ceph.com/issues/19328;
 - disabling the fast-diff and object-map features seems to help, but I'm
not sure what the tradeoffs are for this scenario.
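For concreteness, the mitigations above would be applied roughly like this (a sketch, assuming Jewel-era syntax; "volumes/volume-XXXX" is a placeholder image name):

```shell
# Lower snapshot trim priority cluster-wide at runtime.
# (Client op priority defaults to 63, so 1 is already far below it.)
ceph tell osd.* injectargs '--osd_snap_trim_priority 1'

# Throttle trimming by sleeping between trim operations.
# NOTE: reported broken in Jewel until 10.2.8, see
# http://tracker.ceph.com/issues/19328
ceph tell osd.* injectargs '--osd_snap_trim_sleep 0.1'

# Disable fast-diff and object-map on a volume image.
# fast-diff depends on object-map, so it must be disabled first.
rbd feature disable volumes/volume-XXXX fast-diff
rbd feature disable volumes/volume-XXXX object-map
```

To make the injectargs settings survive OSD restarts they would also need to go into the [osd] section of ceph.conf. As I understand it, the tradeoff of dropping object-map/fast-diff is that diff calculations (e.g. for incremental backups or "rbd du") fall back to scanning objects instead of consulting the in-memory map, so backup creation gets slower.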

I'll appreciate any ideas on how to fix this.
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
