Hey folks,

We have a 16.2.7 cephadm cluster that has been showing slow ops and a
constantly changing set of laggy PGs. The OSDs reporting slow ops seem to
change at random across all 6 OSD hosts in the cluster. All drives are
enterprise SATA SSDs from either Intel or Micron. We haven't ruled out a
network issue, but wanted to troubleshoot from the Ceph side in case
something broke there.

ceph -s:

 health: HEALTH_WARN
         3 slow ops, oldest one blocked for 246 sec, daemons [osd.124,osd.130,osd.141,osd.152,osd.27] have slow ops.

 mon: 5 daemons, quorum ceph-osd10,ceph-mon0,ceph-mon1,ceph-osd9,ceph-osd11 (age 28h)
 mgr: ceph-mon0.sckxhj(active, since 25m), standbys: ceph-osd10.xmdwfh, 
 osd: 143 osds: 143 up (since 92m), 143 in (since 2w)
 rgw: 3 daemons active (3 hosts, 1 zones)

 pools: 26 pools, 3936 pgs
 objects: 33.14M objects, 144 TiB
 usage: 338 TiB used, 162 TiB / 500 TiB avail
 pgs: 3916 active+clean
      19   active+clean+laggy
      1    active+clean+scrubbing+deep

 client: 59 MiB/s rd, 98 MiB/s wr, 1.66k op/s rd, 1.68k op/s wr

This is actually much faster than it has been for much of the past hour.
Throughput has been as low as 50 kB/s, with dozens of IOPS in both directions
(the cluster typically does 300 MB/s to a few GB/s, and ~4k IOPS).
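
To check whether the affected OSDs really are random, one option is to sample
the health message repeatedly and tally which daemons recur. Below is a rough
sketch of that idea, not something we have run against this cluster. It
assumes the slow-ops line keeps the same wording as in the ceph -s output
above; the function names slow_op_daemons and tally are made up for
illustration.

```python
import re
from collections import Counter

# Matches the daemon list in a health line such as:
#   "3 slow ops, oldest one blocked for 246 sec, daemons [osd.1,osd.2] have slow ops."
# \s* tolerates the line wrap that mail clients insert before the bracket.
SLOW_OPS_RE = re.compile(r"daemons\s*\[([^\]]*)\]\s*have slow ops")


def slow_op_daemons(health_text):
    """Return the daemon names listed in a slow-ops health message (hypothetical helper)."""
    match = SLOW_OPS_RE.search(health_text)
    if not match:
        return []
    return [name.strip() for name in match.group(1).split(",") if name.strip()]


def tally(samples):
    """Count how often each daemon appears across several health samples."""
    counts = Counter()
    for text in samples:
        counts.update(slow_op_daemons(text))
    return counts


if __name__ == "__main__":
    # The sample is the health line from this post, including its line wrap.
    sample = (
        "3 slow ops, oldest one blocked for 246 sec, daemons \n"
        "[osd.124,osd.130,osd.141,osd.152,osd.27] have slow ops."
    )
    for daemon, hits in tally([sample]).most_common():
        print(daemon, hits)
```

Feeding it the text of ceph health detail captured every minute or so should
show whether a few OSDs (or one host's OSDs) dominate, or whether the spread
really is even.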

The cluster has been running 16.2.7 without issue since a few days after its
release. The only recent change was an apt upgrade and reboot of the hosts
last Friday, which showed no signs of problems at the time.

Happy to provide logs; just let me know what would be useful. Thanks for
reading this wall :)


ceph-users mailing list -- ceph-users@ceph.io