Good morning,

I've looked fairly intensively through the list and it seems we are
rather hard hit by this. The trouble started yesterday, originally on a
mixed 14.2.9 and 14.2.16 cluster (osds, mons were all 14.2.16).

We started phasing in 7 new OSDs, 6 of them throttled by reweighting to
0.1.
Symptoms are many unknown PGs, PGs stuck in peering for a very long time
(hours), slow activation and the infamous BADAUTHORIZER message. Client
I/O is almost 0, not only on the pool with the new OSDs, but also on
other pools.

We tried restarting all OSDs one by one, which seemed to clear the
unknown PGs; however, after around 1h they returned to the unknown state.

Setting --osd-max-backfills=.. and --osd-recovery-max-active=... to
1 does not improve the situation, so we are currently running both of
them at 7.
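
For reference, a common way to change these two throttles at runtime is
"ceph tell ... injectargs". The Python sketch below only builds that
command line so it can be reviewed before running it; the helper name
build_injectargs_command and the default target "osd.*" are illustrative
assumptions, and this is not necessarily the exact invocation used on
our cluster.

```python
import shlex


def build_injectargs_command(max_backfills, recovery_max_active, target="osd.*"):
    """Return the argv list for a `ceph tell <target> injectargs ...` call.

    Hypothetical helper for illustration: it only assembles the command,
    it does not run it.
    """
    if max_backfills < 1 or recovery_max_active < 1:
        raise ValueError("both throttles must be >= 1")
    # injectargs takes the options as a single string argument.
    options = (
        f"--osd-max-backfills {max_backfills} "
        f"--osd-recovery-max-active {recovery_max_active}"
    )
    return ["ceph", "tell", target, "injectargs", options]


def as_shell_line(argv):
    """Render an argv list as a copy-pasteable, shell-quoted line."""
    return " ".join(shlex.quote(part) for part in argv)
```

Calling as_shell_line(build_injectargs_command(1, 1)) gives a line that
can be pasted into a shell on a node with an admin keyring; passing the
list to subprocess.run would execute it directly.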

An excerpt from one of the osds which heavily logs this issue is below.

Any pointers on how this problem was solved on Nautilus before would be
much appreciated, as the issue started late last night.

Best regards,

Nico


2021-04-13 07:38:37.275 7f159a469700  4 rocksdb: (Original Log Time 
2021/04/13-07:38:37.275255) EVENT_LOG_v1 {"time_micros": 1618292317275255, 
"job": 5, "event": "compaction_finished", "compaction_time_micros": 5288403, 
"compaction_time_cpu_micros": 2940370, "output_level": 2, "num_output_files": 
6, "total_output_size": 370673432, "num_input_records": 983055, 
"num_output_records": 641964, "num_subcompactions": 1, "output_compression": 
"NoCompression", "num_single_delete_mismatches": 0, 
"num_single_delete_fallthrough": 0, "lsm_state": [0, 5, 35, 0, 0, 0, 0]}
2021-04-13 07:38:37.275 7f159a469700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 
1618292317275255, "job": 5, "event": "table_file_deletion", "file_number": 
421257}
2021-04-13 07:38:37.275 7f159a469700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 
1618292317275255, "job": 5, "event": "table_file_deletion", "file_number": 
401289}
2021-04-13 07:38:37.275 7f159a469700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 
1618292317275255, "job": 5, "event": "table_file_deletion", "file_number": 
401288}
2021-04-13 07:38:37.275 7f159a469700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 
1618292317275255, "job": 5, "event": "table_file_deletion", "file_number": 
401278}
2021-04-13 07:38:37.275 7f159a469700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 
1618292317275255, "job": 5, "event": "table_file_deletion", "file_number": 
401277}
2021-04-13 07:38:37.275 7f159a469700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 
1618292317275255, "job": 5, "event": "table_file_deletion", "file_number": 
401258}
2021-04-13 07:38:37.275 7f159a469700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 
1618292317275255, "job": 5, "event": "table_file_deletion", "file_number": 
401257}
2021-04-13 07:38:37.275 7f159a469700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 
1618292317275255, "job": 5, "event": "table_file_deletion", "file_number": 
401256}
2021-04-13 07:38:37.275 7f159a469700  4 rocksdb: [db/compaction_job.cc:1645] 
[default] [JOB 6] Compacting 1@1 + 5@2 files to L2, score 1.17
2021-04-13 07:38:37.275 7f159a469700  4 rocksdb: [db/compaction_job.cc:1649] 
[default] Compaction start summary: Base version 5 Base level 1, inputs: 
[421255(65MB)], [401318(65MB) 401319(65MB) 401320(65MB) 401321(65MB) 
401322(65MB)]

2021-04-13 07:38:37.275 7f159a469700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 
1618292317275255, "job": 6, "event": "compaction_started", "compaction_reason": 
"LevelMaxLevelSize", "files_L1": [421255], "files_L2": [401318, 401319, 401320, 
401321, 401322], "score": 1.17224, "input_data_size": 413630461}
2021-04-13 07:38:37.307 7f15a0704700  0 --1- 
[v2:[2a0a:e5c0:2:1:21b:21ff:febc:7cb6]:6816/5534,v1:[2a0a:e5c0:2:1:21b:21ff:febc:7cb6]:6817/5534]
 >> v1:[2a0a:e5c0:2:1:21b:21ff:febc:7cb6]:6862/8751 conn(0x55d520fbcd80 
0x55d520ffa000 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 
l=0).handle_connect_reply_2 connect got BADAUTHORIZER
2021-04-13 07:38:37.315 7f15a372c700  0 --1- 
[v2:[2a0a:e5c0:2:1:21b:21ff:febc:7cb6]:6816/5534,v1:[2a0a:e5c0:2:1:21b:21ff:febc:7cb6]:6817/5534]
 >> v1:[2a0a:e5c0:2:1:21b:21ff:febc:5060]:6835/13651 conn(0x55d521b1bf80 
0x55d521b19800 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 
l=0).handle_connect_reply_2 connect got BADAUTHORIZER
2021-04-13 07:38:37.707 7f15a0704700  0 --1- 
[v2:[2a0a:e5c0:2:1:21b:21ff:febc:7cb6]:6816/5534,v1:[2a0a:e5c0:2:1:21b:21ff:febc:7cb6]:6817/5534]
 >> v1:[2a0a:e5c0:2:1:21b:21ff:febc:7cb6]:6862/8751 conn(0x55d520fbcd80 
0x55d520ffa000 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 
l=0).handle_connect_reply_2 connect got BADAUTHORIZER
2021-04-13 07:38:37.715 7f15a372c700  0 --1- 
[v2:[2a0a:e5c0:2:1:21b:21ff:febc:7cb6]:6816/5534,v1:[2a0a:e5c0:2:1:21b:21ff:febc:7cb6]:6817/5534]
 >> v1:[2a0a:e5c0:2:1:21b:21ff:febc:5060]:6835/13651 conn(0x55d521b1bf80 
0x55d521b19800 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 
l=0).handle_connect_reply_2 connect got BADAUTHORIZER
2021-04-13 07:38:38.119 7f159a469700  4 rocksdb: [db/compaction_job.cc:1332] 
[default] [JOB 6] Generated table #421271: 130853 keys, 68923910 bytes
2021-04-13 07:38:38.119 7f159a469700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 
1618292318119320, "cf_name": "default", "job": 6, "event": 
"table_file_creation", "file_number": 421271, "file_size": 68923910, 
"table_properties": {"data_size": 67111841, "index_size": 1483972, 
"filter_size": 327237, "raw_key_size": 10909111, "raw_average_key_size": 83, 
"raw_value_size": 62457370, "raw_average_value_size": 477, "num_data_blocks": 
16807, "num_entries": 130853, "filter_policy_name": 
"rocksdb.BuiltinBloomFilter"}}
2021-04-13 07:38:38.331 7f15a0704700  0 --1- 
[v2:[2a0a:e5c0:2:1:21b:21ff:febc:7cb6]:6816/5534,v1:[2a0a:e5c0:2:1:21b:21ff:febc:7cb6]:6817/5534]
 >> v1:[2a0a:e5c0:2:1:21b:21ff:febb:68f0]:6884/29245 conn(0x55d521b1a000 
0x55d521b16800 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 
l=0).handle_connect_reply_2 connect got BADAUTHORIZER
2021-04-13 07:38:38.331 7f15a0704700  0 --1- 
[v2:[2a0a:e5c0:2:1:21b:21ff:febc:7cb6]:6816/5534,v1:[2a0a:e5c0:2:1:21b:21ff:febc:7cb6]:6817/5534]
 >> v1:[2a0a:e5c0:2:1:21b:21ff:febb:68f0]:6819/17174 conn(0x55d521b1c400 
0x55d521b38000 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 
l=0).handle_connect_reply_2 connect got BADAUTHORIZER
2021-04-13 07:38:38.507 7f15a0704700  0 --1- 
[v2:[2a0a:e5c0:2:1:21b:21ff:febc:7cb6]:6816/5534,v1:[2a0a:e5c0:2:1:21b:21ff:febc:7cb6]:6817/5534]
 >> v1:[2a0a:e5c0:2:1:21b:21ff:febc:7cb6]:6862/8751 conn(0x55d520fbcd80 
0x55d520ffa000 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 
l=0).handle_connect_reply_2 connect got BADAUTHORIZER
2021-04-13 07:38:38.515 7f15a372c700  0 --1- 
[v2:[2a0a:e5c0:2:1:21b:21ff:febc:7cb6]:6816/5534,v1:[2a0a:e5c0:2:1:21b:21ff:febc:7cb6]:6817/5534]
 >> v1:[2a0a:e5c0:2:1:21b:21ff:febc:5060]:6835/13651 conn(0x55d521b1bf80 
0x55d521b19800 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 
l=0).handle_connect_reply_2 connect got BADAUTHORIZER
2021-04-13 07:38:38.731 7f15a0704700  0 --1- 
[v2:[2a0a:e5c0:2:1:21b:21ff:febc:7cb6]:6816/5534,v1:[2a0a:e5c0:2:1:21b:21ff:febc:7cb6]:6817/5534]
 >> v1:[2a0a:e5c0:2:1:21b:21ff:febb:68f0]:6884/29245 conn(0x55d521b1a000 
0x55d521b16800 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 
l=0).handle_connect_reply_2 connect got BADAUTHORIZER
2021-04-13 07:38:38.731 7f15a0704700  0 --1- 
[v2:[2a0a:e5c0:2:1:21b:21ff:febc:7cb6]:6816/5534,v1:[2a0a:e5c0:2:1:21b:21ff:febc:7cb6]:6817/5534]
 >> v1:[2a0a:e5c0:2:1:21b:21ff:febb:68f0]:6819/17174 conn(0x55d521b1c400 
0x55d521b38000 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 
l=0).handle_connect_reply_2 connect got BADAUTHORIZER
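
To see which peers the BADAUTHORIZER rejections concentrate on, an
excerpt like the one above can be tallied per peer address. The
following is a minimal Python sketch, assuming the log text is available
as a string and that every record starts with a timestamp as in the
excerpt (wrapped continuation lines are joined back together). The
function names are made up for illustration.

```python
import re
from collections import Counter

# A record starts with a timestamp such as "2021-04-13 07:38:37.307 ".
RECORD_START = re.compile(
    r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+ ", re.MULTILINE
)
# Peer address after ">>", e.g. "v1:[2a0a:...:7cb6]:6862" (nonce dropped).
PEER = re.compile(r">>\s+(v[12]:\S+?)/\d+")


def split_records(text):
    """Split log text into records, re-joining hard-wrapped lines."""
    starts = [m.start() for m in RECORD_START.finditer(text)]
    records = []
    for i, begin in enumerate(starts):
        end = starts[i + 1] if i + 1 < len(starts) else len(text)
        # Collapse newlines and repeated spaces into single spaces.
        records.append(" ".join(text[begin:end].split()))
    return records


def count_badauthorizer_peers(text):
    """Count BADAUTHORIZER records per peer address."""
    counts = Counter()
    for record in split_records(text):
        if "BADAUTHORIZER" not in record:
            continue
        match = PEER.search(record)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Feeding it the contents of an OSD log file and printing
count_badauthorizer_peers(text).most_common() lists the peers with the
most rejected connection attempts first, which may help narrow down
whether a few OSDs or the whole cluster is affected.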



--
Sustainable and modern Infrastructures by ungleich.ch