Hi folks,
Our entire cluster is down at the moment.
We started upgrading from 12.2.13 to 14.2.7, beginning with the monitors. The
first monitor we upgraded crashed, so we reverted it to luminous and tried
another, which was fine. We upgraded the rest, and they all worked.
Then we upgraded the first one again, and after it became the leader, it died.
The second one then became the leader and died, then the third did the same,
leaving mons 4 and 5 unable to form a quorum.
We tried creating a single-monitor cluster by editing the monmap of mon05, and
it died in the same way, just without the paxos negotiation first.
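For reference, the single-mon attempt was the usual monmap surgery, roughly
this (a sketch; the exact invocations may have differed slightly, and
/tmp/monmap is just an illustrative path):

systemctl stop ceph-mon@ntr-mon05
ceph-mon -i ntr-mon05 --extract-monmap /tmp/monmap
monmaptool /tmp/monmap --rm ntr-mon01 --rm ntr-mon02 --rm ntr-mon03 --rm ntr-mon04
ceph-mon -i ntr-mon05 --inject-monmap /tmp/monmap
systemctl start ceph-mon@ntr-mon05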
We have tried reverting to a luminous (12.2.12) monitor backup taken a few
hours before the crash. The mon daemon will start, but after a while it is
flooded with blocked requests and unknown PGs. For better or worse, we removed
the “noout” flag, and 144 of 232 OSDs are now showing as down:
  cluster:
    id:     ababdd7f-1040-431b-962c-c45bea5424aa
    health: HEALTH_ERR
            noout,nobackfill,norecover flag(s) set
            101 osds down
            9 hosts (143 osds) down
            1 auth entities have invalid capabilities
            Long heartbeat ping times on back interface seen, longest is 15424.178 msec
            Long heartbeat ping times on front interface seen, longest is 14763.145 msec
            Reduced data availability: 521 pgs inactive, 48 pgs stale
            274 slow requests are blocked > 32 sec
            88 stuck requests are blocked > 4096 sec
            1303 slow ops, oldest one blocked for 174 sec, mon.ntr-mon01 has slow ops
            too many PGs per OSD (299 > max 250)

  services:
    mon: 1 daemons, quorum ntr-mon01 (age 3m)
    mgr: ntr-mon01(active, since 30m)
    mds: cephfs:1 {0=akld2e18u42=up:active(laggy or crashed)}
    osd: 223 osds: 66 up, 167 in
         flags noout,nobackfill,norecover
    rgw: 2 daemons active (ntr-rgw01, ntr-rgw02)

  data:
    pools:   14 pools, 15220 pgs
    objects: 35.26M objects, 134 TiB
    usage:   379 TiB used, 1014 TiB / 1.4 PiB avail
    pgs:     3.423% pgs unknown
             14651 active+clean
             521   unknown
             48    stale+active+clean

  io:
    client: 20 KiB/s rd, 439 KiB/s wr, 7 op/s rd, 54 op/s wr
These luminous OSD daemons are not actually down; they are all still running,
they just have no comms with the monitor (log excerpt and liveness check below):
2020-02-19 10:12:33.565680 7ff222e24700 1 osd.0 pg_epoch: 305104 pg[100.37as3(
v 129516'2 (0'0,129516'2] local-lis/les=297268/297269 n=0 ec=129502/129502
lis/c 297268/297268 les/c/f 297269/297358/0 297268/297268/161526)
[41,192,216,0,160,117]p41(0) r=3 lpr=305101 crt=129516'2 lcod 0'0 unknown
NOTIFY mbc={}] state<Start>: transitioning to Stray
2020-02-19 10:12:33.565861 7ff222e24700 1 osd.0 pg_epoch: 305104 pg[4.53c( v
305046'1933429 (304777'1931907,305046'1933429] local-lis/les=298009/298010
n=7350 ec=768/768 lis/c 298009/298009 les/c/f 298010/298010/0
297268/298009/298009) [0,61,103] r=0 lpr=305101 crt=305046'1933429 lcod 0'0
mlcod 0'0 unknown mbc={}] state<Start>: transitioning to Primary
2020-02-19 10:12:33.566742 7ff222e24700 1 osd.0 pg_epoch: 305104 pg[100.des4(
v 129516'1 (0'0,129516'1] local-lis/les=292010/292011 n=1 ec=129502/129502
lis/c 292010/292010 les/c/f 292011/292417/0 292010/292010/280955)
[149,62,209,187,0,98]p149(0) r=4 lpr=305072 crt=129516'1 lcod 0'0 unknown
NOTIFY mbc={}] state<Start>: transitioning to Stray
2020-02-19 10:12:33.566896 7ff23ccd9e00 0 osd.0 305104 done with init,
starting boot process
2020-02-19 10:12:33.566956 7ff23ccd9e00 1 osd.0 305104 start_boot
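(By "running" I mean the ceph-osd processes are alive and answer on their local
admin sockets, which doesn't require the mon. For example, on an OSD host, with
osd.0 as an arbitrary pick:

ceph daemon osd.0 status

which reports the daemon state and the osdmap epochs it is sitting on.)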
One oddity in our deployment is that there was a test MDS instance running
mimic. I shut it down, since the monitor backtrace includes an MDS call, but
the nautilus monitors still die the same way.
"mds": {
"ceph version 13.2.8 (5579a94fafbc1f9cc913a0f5d362953a5d9c3ae0) mimic
(stable)": 1
},
...
-11> 2020-02-18 09:50:00.800 7fd164a1a700 5
mon.ntr-mon02@1(leader).paxos(paxos recovering c 85448935..85449502)
is_readable = 0 - now=2020-02-18 09:50:00.804429 lease_expire=0.000000 has v0
lc 85449502
-10> 2020-02-18 09:50:00.800 7fd164a1a700 5
mon.ntr-mon02@1(leader).paxos(paxos recovering c 85448935..85449502)
is_readable = 0 - now=2020-02-18 09:50:00.804446 lease_expire=0.000000 has v0
lc 85449502
-9> 2020-02-18 09:50:00.800 7fd164a1a700 5
mon.ntr-mon02@1(leader).paxos(paxos recovering c 85448935..85449502)
is_readable = 0 - now=2020-02-18 09:50:00.804460 lease_expire=0.000000 has v0
lc 85449502
-8> 2020-02-18 09:50:00.800 7fd164a1a700 4 set_mon_vals no callback set
-7> 2020-02-18 09:50:00.800 7fd164a1a700 4 mgrc handle_mgr_map Got map
version 2301191
-6> 2020-02-18 09:50:00.804 7fd164a1a700 4 mgrc handle_mgr_map Active mgr
is now v1:10.31.88.17:6801/2924412
-5> 2020-02-18 09:50:00.804 7fd164a1a700 0 log_channel(cluster) log [DBG]
: monmap e25: 5 mons at
{ntr-mon01=v1:10.31.88.14:6789/0,ntr-mon02=v1:10.31.88.15:6789/0,ntr-mon03=v1:10.31.88.16:6789/0,ntr-mon04=v1:10.31.88.17:6789/0,ntr-mon05=v1:10.31.88.18:6789/0}
-4> 2020-02-18 09:50:00.804 7fd164a1a700 10 log_client _send_to_mon log to
self
-3> 2020-02-18 09:50:00.804 7fd164a1a700 10 log_client log_queue is 3
last_log 3 sent 2 num 3 unsent 1 sending 1
-2> 2020-02-18 09:50:00.804 7fd164a1a700 10 log_client will send
2020-02-18 09:50:00.806845 mon.ntr-mon02 (mon.1) 3 : cluster [DBG] monmap e25:
5 mons at
{ntr-mon01=v1:10.31.88.14:6789/0,ntr-mon02=v1:10.31.88.15:6789/0,ntr-mon03=v1:10.31.88.16:6789/0,ntr-mon04=v1:10.31.88.17:6789/0,ntr-mon05=v1:10.31.88.18:6789/0}
-1> 2020-02-18 09:50:00.804 7fd164a1a700 5
mon.ntr-mon02@1(leader).paxos(paxos active c 85448935..85449502) is_readable =
1 - now=2020-02-18 09:50:00.806920 lease_expire=2020-02-18 09:50:05.804479 has
v0 lc 85449502
0> 2020-02-18 09:50:00.812 7fd164a1a700 -1 *** Caught signal (Aborted) **
in thread 7fd164a1a700 thread_name:ms_dispatch
ceph version 14.2.7 (3d58626ebeec02d8385a4cefb92c6cbc3a45bfe8) nautilus
(stable)
1: (()+0x11390) [0x7fd171e98390]
2: (gsignal()+0x38) [0x7fd1715e5428]
3: (abort()+0x16a) [0x7fd1715e702a]
4: (__gnu_cxx::__verbose_terminate_handler()+0x135) [0x7fd173673bf5]
5: (__cxxabiv1::__terminate(void (*)())+0x6) [0x7fd173667bd6]
6: (()+0x8b6c21) [0x7fd173667c21]
7: (()+0x8c2e34) [0x7fd173673e34]
8: (std::__throw_out_of_range(char const*)+0x3f) [0x7fd17367f55f]
9: (MDSMonitor::maybe_resize_cluster(FSMap&, int)+0xcf0) [0x79ae00]
10: (MDSMonitor::tick()+0xc9) [0x79c669]
11: (MDSMonitor::on_active()+0x28) [0x785e88]
12: (PaxosService::_active()+0xdd) [0x6d4b2d]
13: (Context::complete(int)+0x9) [0x600789]
14: (void finish_contexts<std::__cxx11::list<Context*,
std::allocator<Context*> > >(CephContext*, std::__cxx11::list<Context*,
std::allocator<Context*> >&, int)+0xa8) [0x6299a8]
15: (Paxos::finish_round()+0x76) [0x6cb276]
16: (Paxos::handle_last(boost::intrusive_ptr<MonOpRequest>)+0xbff) [0x6cc47f]
17: (Paxos::dispatch(boost::intrusive_ptr<MonOpRequest>)+0x24b) [0x6ccf2b]
18: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x15c5)
[0x5fa6f5]
19: (Monitor::_ms_dispatch(Message*)+0x4d2) [0x5fad42]
20: (Monitor::ms_dispatch(Message*)+0x26) [0x62b046]
21: (Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x26)
[0x6270b6]
22: (DispatchQueue::entry()+0x1219) [0x7fd1732b7e59]
23: (DispatchQueue::DispatchThread::entry()+0xd) [0x7fd17336836d]
24: (()+0x76ba) [0x7fd171e8e6ba]
25: (clone()+0x6d) [0x7fd1716b741d]
...
Ceph versions output:
{
"mon": {
"ceph version 12.2.13 (584a20eb0237c657dc0567da126be145106aa47e)
luminous (stable)": 1,
"ceph version 14.2.7 (3d58626ebeec02d8385a4cefb92c6cbc3a45bfe8)
nautilus (stable)": 4
},
"mgr": {
"ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777)
luminous (stable)": 1,
"ceph version 12.2.13 (584a20eb0237c657dc0567da126be145106aa47e)
luminous (stable)": 1,
"ceph version 14.2.7 (3d58626ebeec02d8385a4cefb92c6cbc3a45bfe8)
nautilus (stable)": 2
},
"osd": {
"ceph version 12.2.11 (26dc3775efc7bb286a1d6d66faee0ba30ea23eee)
luminous (stable)": 175,
"ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777)
luminous (stable)": 32,
"ceph version 12.2.13 (584a20eb0237c657dc0567da126be145106aa47e)
luminous (stable)": 16
},
"mds": {
"ceph version 13.2.8 (5579a94fafbc1f9cc913a0f5d362953a5d9c3ae0) mimic
(stable)": 1
},
"rgw": {
"ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777)
luminous (stable)": 2
},
"overall": {
"ceph version 12.2.11 (26dc3775efc7bb286a1d6d66faee0ba30ea23eee)
luminous (stable)": 175,
"ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777)
luminous (stable)": 35,
"ceph version 12.2.13 (584a20eb0237c657dc0567da126be145106aa47e)
luminous (stable)": 18,
"ceph version 13.2.8 (5579a94fafbc1f9cc913a0f5d362953a5d9c3ae0) mimic
(stable)": 1,
"ceph version 14.2.7 (3d58626ebeec02d8385a4cefb92c6cbc3a45bfe8)
nautilus (stable)": 6
}
}
We’ve filed a bug report describing the cascading crash above (when we upgraded
mon01 and it became the leader): https://tracker.ceph.com/issues/44185
(parts of this mail are copied from that report).
Right now we’re not sure what the best path to some sort of recovery would be.
All OSD daemons are still on luminous, so AFAICT we could rebuild the monitor
DB from the OSDs following
https://github.com/ceph/ceph/blob/luminous/doc/rados/troubleshooting/troubleshooting-mon.rst#recovery-using-osds
which describes a script along these lines:
#!/bin/bash

hosts="ntr-sto01 ntr-sto02"
ms=/tmp/mon-store/
mkdir $ms

# collect the cluster map from the OSDs on each host
for host in $hosts; do
  echo $host
  rsync -avz $ms root@$host:$ms
  rm -rf $ms
  ssh root@$host <<EOF
for osd in /var/lib/ceph/osd/ceph-*; do
  ceph-objectstore-tool --data-path \$osd --op update-mon-db --mon-store-path $ms
done
EOF
  rsync -avz root@$host:$ms $ms
done
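If we go down that path, my reading of the rest of that doc is that the
follow-up on a mon host would be roughly this (the mon ID and keyring path
below are guesses for our layout, so treat it as a sketch, not a recipe):

# rebuild a mon store from the collected maps
ceph-monstore-tool /tmp/mon-store rebuild -- --keyring /path/to/admin.keyring
# back up the existing store, then swap in the rebuilt one
mv /var/lib/ceph/mon/ceph-ntr-mon05/store.db /var/lib/ceph/mon/ceph-ntr-mon05/store.db.corrupted
mv /tmp/mon-store/store.db /var/lib/ceph/mon/ceph-ntr-mon05/store.db
chown -R ceph:ceph /var/lib/ceph/mon/ceph-ntr-mon05/store.db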
If this is our best option, should we feed the mon store produced by the above
script to a luminous or a nautilus mon daemon? Any other ideas to try at this
dark hour? : \
Cheers,
Sean