Hello all,
We're trying to debug a "slow ops" situation on our cluster running Nautilus
(latest version). Things were running smoothly for a while, but we had a few
issues that made things fall apart (possible clock skew, faulty disk...)
- We've checked NTP and everything seems fine: the whole cluster shows no
clock skew. The network config seems fine too (we're using jumbo frames
throughout the cluster).
- We have multiple PGs that are in a "stuck peering" or "stuck inactive" state.
Output of "ceph health detail":
HEALTH_WARN Reduced data availability: 1020 pgs inactive, 1008 pgs peering;
Degraded data redundancy: 208352/95157861 objects degraded (0.219%), 9 pgs
degraded, 9 pgs undersized; 2 pgs not deep-scrubbed in time; 2 pgs not scrubbed
in time; 3 daemons have recently crashed; 1184 slow ops, oldest one blocked for
1792 sec, daemons
[osd.100,osd.101,osd.102,osd.103,osd.104,osd.105,osd.106,osd.107,osd.108,osd.109]...
have slow ops.
PG_AVAILABILITY Reduced data availability: 1020 pgs inactive, 1008 pgs peering
pg 12.3cd is stuck inactive for 8939.938831, current state peering, last
acting [111,75,53]
pg 12.3ce is stuck peering for 350761.931800, current state peering, last
acting [48,103,76]
pg 12.3cf is stuck peering for 345518.349253, current state peering, last
acting [80,46,116]
pg 12.3d0 is stuck peering for 396432.771388, current state peering, last
acting [114,95,42]
pg 12.3d1 is stuck peering for 389771.820478, current state peering, last
acting [33,99,122]
pg 12.3d2 is stuck peering for 16385.796714, current state peering, last
acting [48,75,105]
pg 12.3d3 is stuck peering for 375090.876123, current state peering, last
acting [53,118,90]
pg 12.3d4 is stuck peering for 350665.788611, current state peering, last
acting [59,81,40]
pg 12.3d5 is stuck peering for 344195.934260, current state peering, last
acting [104,73,87]
pg 12.3d6 is stuck peering for 388515.338772, current state peering, last
acting [57,79,60]
pg 12.3d7 is stuck peering for 27320.368320, current state peering, last
acting [35,56,109]
pg 12.3d8 is stuck peering for 345470.520103, current state peering, last
acting [91,41,74]
pg 12.3d9 is stuck peering for 347582.613090, current state peering, last
acting [85,66,103]
pg 12.3da is stuck peering for 346518.712024, current state peering, last
acting [87,63,56]
pg 12.3db is stuck peering for 348804.986864, current state peering, last
acting [100,122,46]
pg 12.3dc is stuck peering for 343796.439591, current state peering, last
acting [55,90,125]
pg 12.3dd is stuck peering for 345621.663979, current state peering, last
acting [83,38,125]
pg 12.3de is stuck peering for 348026.449482, current state peering, last
acting [38,113,82]
pg 12.3df is stuck peering for 350263.925579, current state peering, last
acting [41,104,87]
pg 12.3e0 is stuck peering for 8738.645205, current state peering, last
acting [57,86,108]
pg 12.3e1 is stuck peering for 397082.568164, current state peering, last
acting [124,46]
pg 12.3e2 is stuck peering for 345232.402459, current state peering, last
acting [80,114,65]
pg 12.3e3 is stuck peering for 347014.276511, current state peering, last
acting [63,102,83]
pg 12.3e4 is stuck peering for 345470.524144, current state peering, last
acting [91,38,71]
pg 12.3e5 is stuck peering for 346636.837554, current state peering, last
acting [64,85,118]
pg 12.3e6 is stuck peering for 398952.293609, current state peering, last
acting [92,36,75]
pg 12.3e7 is stuck peering for 346973.264600, current state peering, last
acting [31,94,53]
pg 12.3e8 is stuck peering for 370098.248268, current state peering, last
acting [119,90,72]
pg 12.3e9 is stuck peering for 345134.069457, current state peering, last
acting [96,105,36]
pg 12.3ea is stuck peering for 346305.043394, current state peering, last
acting [94,103,51]
pg 12.3eb is stuck peering for 388515.112735, current state peering, last
acting [57,116,59]
pg 12.3ec is stuck peering for 348097.249845, current state peering, last
acting [56,111,84]
pg 12.3ed is stuck peering for 346636.835287, current state peering, last
acting [64,106,101]
pg 12.3ee is stuck peering for 398197.856231, current state peering, last
acting [53,105,80]
pg 12.3ef is stuck peering for 347061.858678, current state peering, last
acting [47,64,80]
pg 12.3f0 is stuck peering for 371495.723196, current state peering, last
acting [77,115,81]
pg 12.3f1 is stuck peering for 27539.717691, current state peering, last
acting [123,69,48]
pg 12.3f2 is stuck peering for 346973.596729, current state peering, last
acting [31,80,45]
pg 12.3f3 is stuck peering for 345419.834162, current state peering, last
acting [108,89,40]
pg 12.3f4 is stuck peering for 347400.170304, current state peering, last
acting [82,67,104]
pg 12.3f5 is stuck peering for 346793.349638, current state peering, last
acting [116,51,68]
pg 12.3f6 is stuck peering for 372361.763947, current state peering, last
acting [114,46,93]
pg 12.3f7 is stuck inactive for 346840.292765, current state activating,
last acting [125,77,47]
pg 12.3f8 is stuck peering for 347004.967439, current state peering, last
acting [42,31,116]
pg 12.3f9 is stuck peering for 346894.489185, current state peering, last
acting [40,94,67]
pg 12.3fa is stuck peering for 395041.494033, current state peering, last
acting [58,97,112]
pg 12.3fb is stuck peering for 346337.742759, current state peering, last
acting [79,55,61]
pg 12.3fc is stuck peering for 347634.039502, current state peering, last
acting [66,54,101]
pg 12.3fd is stuck peering for 345340.666831, current state peering, last
acting [112,32,87]
pg 12.3fe is stuck peering for 345777.554974, current state peering, last
acting [98,30,44]
pg 12.3ff is stuck peering for 18040.716533, current state peering, last
acting [86,51,59]
PG_DEGRADED Degraded data redundancy: 208352/95157861 objects degraded
(0.219%), 9 pgs degraded, 9 pgs undersized
pg 7.3b is stuck undersized for 1119305.639931, current state
active+undersized+degraded, last acting [29,19]
pg 7.5a is stuck undersized for 351332.251298, current state
active+undersized+degraded, last acting [29,11]
pg 8.b9 is stuck undersized for 351332.246585, current state
active+undersized+degraded, last acting [22,17]
pg 8.c1 is stuck undersized for 351332.257178, current state
active+undersized+degraded, last acting [24,14]
pg 8.db is stuck undersized for 350987.698147, current state
active+undersized+degraded, last acting [24,7]
pg 8.1c2 is stuck undersized for 350999.603413, current state
active+undersized+degraded, last acting [20,2]
pg 9.32 is stuck undersized for 351332.258240, current state
active+undersized+degraded, last acting [21,11]
pg 9.a5 is stuck undersized for 351332.266130, current state
active+undersized+degraded, last acting [25,14]
pg 9.df is stuck undersized for 351333.298597, current state
active+undersized+degraded, last acting [25,19]
PG_NOT_DEEP_SCRUBBED 2 pgs not deep-scrubbed in time
pg 8.db not deep-scrubbed since 2020-10-24 07:09:12.599242
pg 7.3b not deep-scrubbed since 2020-10-24 14:10:59.877193
PG_NOT_SCRUBBED 2 pgs not scrubbed in time
pg 8.db not scrubbed since 2020-10-24 07:09:12.599242
pg 7.3b not scrubbed since 2020-10-24 14:10:59.877193
RECENT_CRASH 3 daemons have recently crashed
osd.70 crashed on host starfish-osd-05 at 2020-10-30 09:16:06.981832Z
osd.57 crashed on host starfish-osd-04 at 2020-11-06 10:07:47.868835Z
mds.starfish-mon-01 crashed on host starfish-mon-01 at 2020-11-02
18:36:25.266426Z
SLOW_OPS 1184 slow ops, oldest one blocked for 1792 sec, daemons
[osd.100,osd.101,osd.102,osd.103,osd.104,osd.105,osd.106,osd.107,osd.108,osd.109]...
have slow ops.
- The 9 degraded / undersized PGs are in a different pool whose OSDs need to
be reweighted. OSDs 1-29 are under another root in the CRUSH map.
- When querying one of the PGs that are in a "stuck peering" state, there are
a lot of ".handle_connect_reply_2 connect got BADAUTHORIZER" replies.
- The OSD logs show the following messages (they disappear for a while if the
OSD is restarted):
2020-11-10 11:58:58.671 7f90430da700 0 auth: could not find secret_id=14160
2020-11-10 11:58:58.671 7f90430da700 0 cephx: verify_authorizer could not get
service secret for service osd secret_id=14160
2020-11-10 11:58:58.671 7f90430da700 0 --1-
[v2:10.100.0.7:6851/483865,v1:10.100.0.7:6854/483865] >>
v1:10.100.0.3:6815/8007973 conn(0x5632b1a64000 0x5632a64fb000 :6854
s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2: got
bad authorizer, auth_reply_len=0
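
To see whether the stuck PGs cluster around particular OSDs (for example the
osd.100-109 range reported with slow ops), the "ceph health detail" output
above can be tallied per OSD. This is only a rough sketch: the regular
expression is an assumption based on the line format pasted above, and the
names PG_LINE and tally_acting_osds are made up for illustration, not part of
any Ceph tooling.

```python
import re
from collections import Counter

# Assumed line format, taken from the pasted "ceph health detail" output:
#   pg 12.3cd is stuck inactive for 8939.938831, current state peering,
#   last acting [111,75,53]
# \s+ is used between tokens so that entries wrapped by the mail client
# (as in this message) still match.
PG_LINE = re.compile(
    r"pg\s+(?P<pgid>\S+)\s+is\s+stuck\s+(?P<kind>\w+)\s+for\s+(?P<secs>[\d.]+),"
    r"\s+current\s+state\s+(?P<state>[\w+]+),"
    r"\s+last\s+acting\s+\[(?P<acting>[\d,]+)\]"
)


def tally_acting_osds(health_detail_text, states=("peering", "activating")):
    """Count how often each OSD id appears in the acting set of PGs
    whose current state is one of `states`."""
    counts = Counter()
    for match in PG_LINE.finditer(health_detail_text):
        if match.group("state") in states:
            counts.update(int(osd) for osd in match.group("acting").split(","))
    return counts


# Three entries copied from the output above (two peering, one undersized).
sample = """\
pg 12.3cd is stuck inactive for 8939.938831, current state peering, last
acting [111,75,53]
pg 12.3d2 is stuck peering for 16385.796714, current state peering, last
acting [48,75,105]
pg 7.3b is stuck undersized for 1119305.639931, current state
active+undersized+degraded, last acting [29,19]
"""

counts = tally_acting_osds(sample)
for osd, n in counts.most_common(5):
    print(f"osd.{osd}: {n} stuck PG(s)")
```

Feeding the full output through tally_acting_osds and sorting by count should
show whether a handful of OSDs or hosts appear in most of the stuck acting
sets, or whether the problem is spread evenly across the pool.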
Cheers !
Regards,
Shehzaad