[ceph-users] Slow ops and "stuck peering"

shehzaad . chakowree Tue, 10 Nov 2020 12:50:47 -0800

Hello all,

We're trying to debug a "slow ops" situation on our cluster running Nautilus 
(latest version). Things ran smoothly for a while, but then a few issues 
(possible clock skew, a faulty disk...) made things fall apart.


- We've checked NTP and everything seems fine: the whole cluster shows no 
clock skew. The network config seems fine too (we're using jumbo frames 
throughout the cluster).

- We have multiple PGs that are in a "stuck peering" or "stuck inactive" state.

ceph health detail
HEALTH_WARN Reduced data availability: 1020 pgs inactive, 1008 pgs peering; 
Degraded data redundancy: 208352/95157861 objects degraded (0.219%), 9 pgs 
degraded, 9 pgs undersized; 2 pgs not deep-scrubbed in time; 2 pgs not scrubbed 
in time; 3 daemons have recently crashed; 1184 slow ops, oldest one blocked for 
1792 sec, daemons 
[osd.100,osd.101,osd.102,osd.103,osd.104,osd.105,osd.106,osd.107,osd.108,osd.109]...
 have slow ops.
PG_AVAILABILITY Reduced data availability: 1020 pgs inactive, 1008 pgs peering
    pg 12.3cd is stuck inactive for 8939.938831, current state peering, last 
acting [111,75,53]
    pg 12.3ce is stuck peering for 350761.931800, current state peering, last 
acting [48,103,76]
    pg 12.3cf is stuck peering for 345518.349253, current state peering, last 
acting [80,46,116]
    pg 12.3d0 is stuck peering for 396432.771388, current state peering, last 
acting [114,95,42]
    pg 12.3d1 is stuck peering for 389771.820478, current state peering, last 
acting [33,99,122]
    pg 12.3d2 is stuck peering for 16385.796714, current state peering, last 
acting [48,75,105]
    pg 12.3d3 is stuck peering for 375090.876123, current state peering, last 
acting [53,118,90]
    pg 12.3d4 is stuck peering for 350665.788611, current state peering, last 
acting [59,81,40]
    pg 12.3d5 is stuck peering for 344195.934260, current state peering, last 
acting [104,73,87]
    pg 12.3d6 is stuck peering for 388515.338772, current state peering, last 
acting [57,79,60]
    pg 12.3d7 is stuck peering for 27320.368320, current state peering, last 
acting [35,56,109]
    pg 12.3d8 is stuck peering for 345470.520103, current state peering, last 
acting [91,41,74]
    pg 12.3d9 is stuck peering for 347582.613090, current state peering, last 
acting [85,66,103]
    pg 12.3da is stuck peering for 346518.712024, current state peering, last 
acting [87,63,56]
    pg 12.3db is stuck peering for 348804.986864, current state peering, last 
acting [100,122,46]
    pg 12.3dc is stuck peering for 343796.439591, current state peering, last 
acting [55,90,125]
    pg 12.3dd is stuck peering for 345621.663979, current state peering, last 
acting [83,38,125]
    pg 12.3de is stuck peering for 348026.449482, current state peering, last 
acting [38,113,82]
    pg 12.3df is stuck peering for 350263.925579, current state peering, last 
acting [41,104,87]
    pg 12.3e0 is stuck peering for 8738.645205, current state peering, last 
acting [57,86,108]
    pg 12.3e1 is stuck peering for 397082.568164, current state peering, last 
acting [124,46]
    pg 12.3e2 is stuck peering for 345232.402459, current state peering, last 
acting [80,114,65]
    pg 12.3e3 is stuck peering for 347014.276511, current state peering, last 
acting [63,102,83]
    pg 12.3e4 is stuck peering for 345470.524144, current state peering, last 
acting [91,38,71]
    pg 12.3e5 is stuck peering for 346636.837554, current state peering, last 
acting [64,85,118]
    pg 12.3e6 is stuck peering for 398952.293609, current state peering, last 
acting [92,36,75]
    pg 12.3e7 is stuck peering for 346973.264600, current state peering, last 
acting [31,94,53]
    pg 12.3e8 is stuck peering for 370098.248268, current state peering, last 
acting [119,90,72]
    pg 12.3e9 is stuck peering for 345134.069457, current state peering, last 
acting [96,105,36]
    pg 12.3ea is stuck peering for 346305.043394, current state peering, last 
acting [94,103,51]
    pg 12.3eb is stuck peering for 388515.112735, current state peering, last 
acting [57,116,59]
    pg 12.3ec is stuck peering for 348097.249845, current state peering, last 
acting [56,111,84]
    pg 12.3ed is stuck peering for 346636.835287, current state peering, last 
acting [64,106,101]
    pg 12.3ee is stuck peering for 398197.856231, current state peering, last 
acting [53,105,80]
    pg 12.3ef is stuck peering for 347061.858678, current state peering, last 
acting [47,64,80]
    pg 12.3f0 is stuck peering for 371495.723196, current state peering, last 
acting [77,115,81]
    pg 12.3f1 is stuck peering for 27539.717691, current state peering, last 
acting [123,69,48]
    pg 12.3f2 is stuck peering for 346973.596729, current state peering, last 
acting [31,80,45]
    pg 12.3f3 is stuck peering for 345419.834162, current state peering, last 
acting [108,89,40]
    pg 12.3f4 is stuck peering for 347400.170304, current state peering, last 
acting [82,67,104]
    pg 12.3f5 is stuck peering for 346793.349638, current state peering, last 
acting [116,51,68]
    pg 12.3f6 is stuck peering for 372361.763947, current state peering, last 
acting [114,46,93]
    pg 12.3f7 is stuck inactive for 346840.292765, current state activating, 
last acting [125,77,47]
    pg 12.3f8 is stuck peering for 347004.967439, current state peering, last 
acting [42,31,116]
    pg 12.3f9 is stuck peering for 346894.489185, current state peering, last 
acting [40,94,67]
    pg 12.3fa is stuck peering for 395041.494033, current state peering, last 
acting [58,97,112]
    pg 12.3fb is stuck peering for 346337.742759, current state peering, last 
acting [79,55,61]
    pg 12.3fc is stuck peering for 347634.039502, current state peering, last 
acting [66,54,101]
    pg 12.3fd is stuck peering for 345340.666831, current state peering, last 
acting [112,32,87]
    pg 12.3fe is stuck peering for 345777.554974, current state peering, last 
acting [98,30,44]
    pg 12.3ff is stuck peering for 18040.716533, current state peering, last 
acting [86,51,59]
PG_DEGRADED Degraded data redundancy: 208352/95157861 objects degraded 
(0.219%), 9 pgs degraded, 9 pgs undersized
    pg 7.3b is stuck undersized for 1119305.639931, current state 
active+undersized+degraded, last acting [29,19]
    pg 7.5a is stuck undersized for 351332.251298, current state 
active+undersized+degraded, last acting [29,11]
    pg 8.b9 is stuck undersized for 351332.246585, current state 
active+undersized+degraded, last acting [22,17]
    pg 8.c1 is stuck undersized for 351332.257178, current state 
active+undersized+degraded, last acting [24,14]
    pg 8.db is stuck undersized for 350987.698147, current state 
active+undersized+degraded, last acting [24,7]
    pg 8.1c2 is stuck undersized for 350999.603413, current state 
active+undersized+degraded, last acting [20,2]
    pg 9.32 is stuck undersized for 351332.258240, current state 
active+undersized+degraded, last acting [21,11]
    pg 9.a5 is stuck undersized for 351332.266130, current state 
active+undersized+degraded, last acting [25,14]
    pg 9.df is stuck undersized for 351333.298597, current state 
active+undersized+degraded, last acting [25,19]
PG_NOT_DEEP_SCRUBBED 2 pgs not deep-scrubbed in time
    pg 8.db not deep-scrubbed since 2020-10-24 07:09:12.599242
    pg 7.3b not deep-scrubbed since 2020-10-24 14:10:59.877193
PG_NOT_SCRUBBED 2 pgs not scrubbed in time
    pg 8.db not scrubbed since 2020-10-24 07:09:12.599242
    pg 7.3b not scrubbed since 2020-10-24 14:10:59.877193
RECENT_CRASH 3 daemons have recently crashed
    osd.70 crashed on host starfish-osd-05 at 2020-10-30 09:16:06.981832Z
    osd.57 crashed on host starfish-osd-04 at 2020-11-06 10:07:47.868835Z
    mds.starfish-mon-01 crashed on host starfish-mon-01 at 2020-11-02 
18:36:25.266426Z
SLOW_OPS 1184 slow ops, oldest one blocked for 1792 sec, daemons 
[osd.100,osd.101,osd.102,osd.103,osd.104,osd.105,osd.106,osd.107,osd.108,osd.109]...
 have slow ops.
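
With this many stuck PGs, it can help to tally which OSDs appear most often in the "last acting" sets, to see whether the peering problem concentrates on a few daemons or hosts. The following is only a minimal sketch, assuming the output of `ceph health detail` has been saved as text; the function names `parse_stuck_pgs` and `osd_frequency` are hypothetical helpers, not part of Ceph. The pattern tolerates the line wrapping seen above.

```python
import re
from collections import Counter

# Matches entries such as:
#   pg 12.3ce is stuck peering for 350761.931800, current state peering, last
#   acting [48,103,76]
# \s+ is used between tokens so that entries wrapped across lines still match.
PG_LINE = re.compile(
    r"pg\s+(?P<pgid>\S+)\s+is\s+stuck\s+(?P<stuck>\w+)\s+for\s+(?P<secs>[\d.]+),"
    r"\s+current\s+state\s+(?P<state>[^,\s]+),"
    r"\s+last\s+acting\s+\[(?P<acting>[\d,]+)\]"
)


def parse_stuck_pgs(text):
    """Return one dict per 'pg ... is stuck ...' entry found in text."""
    pgs = []
    for m in PG_LINE.finditer(text):
        pgs.append({
            "pgid": m.group("pgid"),
            "stuck": m.group("stuck"),
            "seconds": float(m.group("secs")),
            "state": m.group("state"),
            "acting": [int(x) for x in m.group("acting").split(",")],
        })
    return pgs


def osd_frequency(pgs, state="peering"):
    """Count how often each OSD id appears in the acting set of PGs in `state`."""
    counts = Counter()
    for pg in pgs:
        if pg["state"] == state:
            counts.update(pg["acting"])
    return counts


# A few entries copied from the output above, including the wrapped lines.
SAMPLE = """\
    pg 12.3cd is stuck inactive for 8939.938831, current state peering, last
acting [111,75,53]
    pg 12.3ce is stuck peering for 350761.931800, current state peering, last
acting [48,103,76]
    pg 12.3f7 is stuck inactive for 346840.292765, current state activating,
last acting [125,77,47]
    pg 7.3b is stuck undersized for 1119305.639931, current state
active+undersized+degraded, last acting [29,19]
"""

if __name__ == "__main__":
    pgs = parse_stuck_pgs(SAMPLE)
    for osd, n in osd_frequency(pgs).most_common(5):
        print(f"osd.{osd}: {n} peering PG(s)")
```

Run against the full output, the most frequent OSD ids could then be compared with the daemons listed under SLOW_OPS.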

- The 9 degraded / undersized PGs are in a different pool whose OSDs need to 
be reweighted. OSDs 1-29 are under another root in the CRUSH map.

- When querying one of the PGs that are in a "stuck peering" state, there are 
a lot of ".handle_connect_reply_2 connect got BADAUTHORIZER" replies.

- The OSD logs show the following messages (they disappear for a while if the 
OSD is restarted):

2020-11-10 11:58:58.671 7f90430da700  0 auth: could not find secret_id=14160
2020-11-10 11:58:58.671 7f90430da700  0 cephx: verify_authorizer could not get 
service secret for service osd secret_id=14160
2020-11-10 11:58:58.671 7f90430da700  0 --1- 
[v2:10.100.0.7:6851/483865,v1:10.100.0.7:6854/483865] >> 
v1:10.100.0.3:6815/8007973 conn(0x5632b1a64000 0x5632a64fb000 :6854 
s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2: got 
bad authorizer, auth_reply_len=0
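
To see whether the messages always refer to the same secret_id and when they start again after a restart, the log can be scanned for the "could not find secret_id" lines. This is only a rough sketch, assuming the OSD log has been read into a string; `missing_secret_ids` and `summarize` are hypothetical helper names, and the timestamp layout is taken from the lines quoted above.

```python
import re
from collections import Counter
from datetime import datetime

# Matches lines such as:
#   2020-11-10 11:58:58.671 7f90430da700  0 auth: could not find secret_id=14160
MISSING_SECRET = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\s+\S+\s+\d+\s+"
    r"auth: could not find secret_id=(?P<sid>\d+)",
    re.MULTILINE,
)


def missing_secret_ids(log_text):
    """Return (timestamp, secret_id) tuples for each matching log line."""
    events = []
    for m in MISSING_SECRET.finditer(log_text):
        ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S.%f")
        events.append((ts, int(m.group("sid"))))
    return events


def summarize(events):
    """Map each secret_id to (count, first_seen, last_seen)."""
    counts = Counter(sid for _, sid in events)
    summary = {}
    for sid, n in counts.items():
        stamps = [ts for ts, s in events if s == sid]
        summary[sid] = (n, min(stamps), max(stamps))
    return summary


# The first two log lines quoted above.
SAMPLE_LOG = """\
2020-11-10 11:58:58.671 7f90430da700  0 auth: could not find secret_id=14160
2020-11-10 11:58:58.671 7f90430da700  0 cephx: verify_authorizer could not get
service secret for service osd secret_id=14160
"""

if __name__ == "__main__":
    for sid, (n, first, last) in summarize(missing_secret_ids(SAMPLE_LOG)).items():
        print(f"secret_id={sid}: {n} occurrence(s), first {first}, last {last}")
```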

Cheers!

Regards,

Shehzaad
