Hi,
Correct, SUSE's Ceph product was Salt-based; in this case, 14.2.22 was
shipped with SES 6. ;-)
Do you also have some MON logs from right before the crash, maybe with
a higher debug level? It could also make sense to stop client traffic
and the OSDs so that the MONs have a chance to recover. Unfortunately,
I can't really comment on the stack trace itself.
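If the surviving MON still responds on its admin socket, something
like the following should raise the relevant debug levels. This is
just a rough sketch; the mon ID "a" is a placeholder for your actual
MON name:

```
# Bump MON debug levels at runtime via the admin socket
# (mon ID "a" is a placeholder, adjust to your deployment):
ceph daemon mon.a config set debug_mon 20
ceph daemon mon.a config set debug_paxos 20
ceph daemon mon.a config set debug_ms 1

# Or set them in /etc/ceph/ceph.conf under [mon] before restarting:
#   [mon]
#       debug mon = 20
#       debug paxos = 20
#       debug ms = 1
```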
Maybe someone has a different idea, but if you can get one MON up, I
would probably reduce the monmap to a single MON to bring the cluster
back up. Back up all the MON stores first, just in case you have to
start over. Then extract the monmap, remove all MONs but one, and
inject the modified monmap into the MON you want to revive. The
procedure is described in [0]; just don't change the address, only
reduce the monmap. ;-)
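For reference, a rough sketch of what that could look like on the
command line. The mon IDs "a", "b", "c" and the default cluster name
"ceph" are placeholders here; [0] has the authoritative steps:

```
# With all MONs stopped, back up every MON store first
# (placeholder paths/IDs, adjust to your deployment):
cp -a /var/lib/ceph/mon/ceph-a /root/mon-a-store.backup

# Extract the current monmap from the MON you want to keep:
ceph-mon -i a --extract-monmap /tmp/monmap

# Inspect it, then remove the other MONs from the map:
monmaptool --print /tmp/monmap
monmaptool --rm b /tmp/monmap
monmaptool --rm c /tmp/monmap

# Inject the reduced monmap and start only that MON:
ceph-mon -i a --inject-monmap /tmp/monmap
systemctl start ceph-mon@a
```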
Regards,
Eugen
[0]
https://docs.ceph.com/en/latest/rados/operations/add-or-rm-mons/#changing-a-monitor-s-ip-address-advanced-method
Quoting Miles Goodhew <c...@m0les.com>:
Hi,
I've been called in by a client with an ancient SUSE-based Ceph
Nautilus (14.2.22) cluster whose MONs keep dying oddly.
Apparently the issue started with MDS daemons not working, and
eventually a MON restart killed the cluster.
OS: SLES 15-SP1 (out of support)
Ceph: 14.2.22 "Nautilus" (Deployed with Salt... I think)
3 MONs; 5 MDSs; 3 MGRs; 4 RGWs; 336 OSDs on 21 nodes.
Client services: "One of everything at least", but RBD/OpenStack,
S3/RGW and CephFS are the big ones.
After sorting through some of the logs, here are some things I know:
Disk space, RAM availability, inodes and network connectivity seem
OK to me. After shutting down all the MONs, MGRs and MDSes, one MON
can usually be started, but it sits there spamming out log messages
like "[SERVICE_ID](probing) e6 handle_auth_request failed to assign
global_id" (maybe 50 - 100 times per second). All the while, the
syslog shows "e6 get_health_metrics reporting [INCREASING_NUMBER]
slow ops" fairly often. This is probably due to OSDs and clients
still being active.
If I restart one of the other MONs, the running one will die with
a stack trace (limited here to C++/library-internal calls):
```
8: (std::__throw_out_of_range(char const*)+0x41) [0x7f2a5983fa07]
9: (MDSMonitor::maybe_resize_cluster(FSMap&, int)+0xcf0) [0x55b441e37490]
10: (MDSMonitor::tick()+0xc9) [0x55b441e38ce9]
11: (MDSMonitor::on_active()+0x28) [0x55b441e22fa8]
12: (PaxosService::_active()+0xdd) [0x55b441d7188d]
13: (Context::complete(int)+0x9) [0x55b441c888a9]
14: (void finish_contexts<std::__cxx11::list<Context*, std::allocator<Context*> > >(CephContext*, std::__cxx11::list<Context*, std::allocator<Context*> >&, int)+0xa8) [0x55b441cb2408]
15: (Paxos::finish_round()+0x76) [0x55b441d681b6]
16: (Paxos::handle_last(boost::intrusive_ptr<MonOpRequest>)+0xc1f) [0x55b441d693df]
17: (Paxos::dispatch(boost::intrusive_ptr<MonOpRequest>)+0x233) [0x55b441d69e23]
18: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x1668) [0x55b441c820b8]
19: (Monitor::_ms_dispatch(Message*)+0xa3a) [0x55b441c82b5a]
20: (Monitor::ms_dispatch(Message*)+0x26) [0x55b441cb3646]
21: (Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x26) [0x55b441cb00b6]
22: (DispatchQueue::entry()+0x1279) [0x7f2a5b188379]
23: (DispatchQueue::DispatchThread::entry()+0xd) [0x7f2a5b238a5d]
24: (()+0x8539) [0x7f2a59db7539]
25: (clone()+0x3f) [0x7f2a58f87ecf]
```
Anyone got any clues about how to diagnose or, better yet, repair
this? Sorry, I know this is a bit half-baked, but I'm trying to dump
this help request at COB to see if I can hook anyone's interest
overnight.
Thanks for at least reading this far,
M0les.
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io