Did you find out anything about this? We are also getting PGs stuck in
"activating+remapped". To fix the problem I have to manually alter bucket
weights so that they are basically the same everywhere, even when the disks
aren't the same size, but it is a real hassle every time we add a new
node or disk.
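For reference, the manual workaround looks roughly like this (the OSD IDs and weights below are placeholders for our setup, not a general recommendation):

```shell
# Force near-equal CRUSH weights on OSDs even though the underlying
# disks differ in size (placeholder IDs; repeat for each OSD in the bucket)
ceph osd crush reweight osd.12 1.0
ceph osd crush reweight osd.13 1.0

# Verify the resulting weights in the CRUSH tree
ceph osd tree
```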
See my email with the subject "Weird issues related to (large/small) weights in
mixed nvme/hdd pool" from 2018-01-20 to see if there are similarities.
Regards,
Peter
On 2018-01-07 at 12:17, Tzachi Strul wrote:
Hi all,
We have a 5-node Ceph cluster (Luminous 12.2.1) installed via ceph-ansible.
All servers have 16 x 1.5TB SSD disks.
3 of these servers also act as MONs+MGRs.
We don't have separate cluster and public networks; each node has
4 NICs bonded together (40G) that serve both cluster and public traffic
(we know it's not ideal and are planning to change it).
Last week we added another node to the cluster (another 16 x 1.5TB SSDs).
We used the latest stable ceph-ansible release.
After OSD activation the cluster started rebalancing and the problems began:
1. Cluster entered HEALTH_ERROR state
2. 67 pgs stuck at activating+remapped
3. A lot of blocked slow requests.
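For anyone hitting the same state, the standard ceph CLI can show which PGs are stuck and why (a quick sketch; the PG id in the last command is a placeholder):

```shell
# Show health detail, including the stuck PG states and slow requests
ceph health detail

# List PGs stuck inactive (covers activating+remapped) or unclean
ceph pg dump_stuck inactive
ceph pg dump_stuck unclean

# Query one stuck PG for its peering/activation state
# (replace 2.1a with an actual PG id from the output above)
ceph pg 2.1a query
```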
This cluster serves OpenStack volumes; almost all OpenStack
instances hit 100% disk utilization and hung, and eventually
cinder-volume crashed.
After restarting several OSDs the problem resolved and the cluster
returned to HEALTH_OK.
Our configuration already has:
osd max backfills = 1
osd max scrubs = 1
osd recovery max active = 1
osd recovery op priority = 1
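If it helps anyone, these throttles can also be applied to running OSDs without a restart (standard injectargs usage; values match our config above):

```shell
# Push the recovery/backfill throttles to all running OSDs at runtime
ceph tell osd.* injectargs \
    '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'

# Confirm the running value on one OSD (osd.0 as an example)
ceph daemon osd.0 config get osd_max_backfills
```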
In addition, we see a lot of bad mappings, for example:
bad mapping rule 0 x 52 num_rep 8 result [32,5,78,25,96,59,80]
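These bad-mapping reports (7 OSDs returned for num_rep 8) can be reproduced offline with crushtool, which is safer than experimenting on the live map. A sketch, assuming rule 0 and num_rep 8 as in the message above; raising choose_total_tries is one common remedy when CRUSH gives up before finding enough OSDs, but test before injecting:

```shell
# Export the current CRUSH map and test rule 0 for bad mappings
ceph osd getcrushmap -o crushmap.bin
crushtool -i crushmap.bin --test --show-bad-mappings \
    --rule 0 --num-rep 8 --min-x 0 --max-x 1023

# Raise choose_total_tries (default 50) in a copy and re-test
crushtool -i crushmap.bin --set-choose-total-tries 100 -o crushmap.new
crushtool -i crushmap.new --test --show-bad-mappings \
    --rule 0 --num-rep 8 --min-x 0 --max-x 1023

# Only inject the new map once the test run is clean:
# ceph osd setcrushmap -i crushmap.new
```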
What can be the cause, and what can I do to avoid this
situation? We need to add another 9 OSD servers and can't afford downtime.
Any help would be appreciated. Thank you very much.
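One common way to add capacity without triggering a rebalancing storm (not a fix for the stuck-PG root cause, and the OSD id and step values here are placeholders) is to bring new OSDs in at CRUSH weight 0 and ramp them up gradually:

```shell
# In ceph.conf on the new nodes, before deploying the OSDs:
#   [osd]
#   osd crush initial weight = 0

# Then raise each new OSD's weight in small steps toward its
# target (~disk size in TiB; a 1.5TB disk is ~1.36), waiting for
# the cluster to return to HEALTH_OK between steps
for step in 0.2 0.4 0.8 1.36; do
    ceph osd crush reweight osd.96 "$step"
    # wait for rebalance to finish before the next step
done
```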
Our ceph configuration:
[mgr]
mgr_modules = dashboard zabbix
[global]
cluster network = *removed for security reasons*
fsid = *removed for security reasons*
mon host = *removed for security reasons*
mon initial members = *removed for security reasons*
mon osd down out interval = 900
osd pool default size = 3
public network = *removed for security reasons*
[client.libvirt]
admin socket = /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok #
must be writable by QEMU and allowed by SELinux or AppArmor
log file = /var/log/ceph/qemu-guest-$pid.log # must be writable by
QEMU and allowed by SELinux or AppArmor
[osd]
osd backfill scan max = 16
osd backfill scan min = 4
osd bluestore cache size = 104857600 # due to the 12.2.1 BlueStore memory
leak bug
osd max backfills = 1
osd max scrubs = 1
osd recovery max active = 1
osd recovery max single start = 1
osd recovery op priority = 1
osd recovery threads = 1
--
*Tzachi Strul*
*Storage DevOps *// *Kenshoo*
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com