Hi,
Correct, SUSE's Ceph product was Salt-based; in this case, 14.2.22 was
shipped with SES 6. ;-)
Do you also have some MON logs from right before the crash, maybe with
a higher debug level? It could also make sense to stop client traffic
and the OSDs so that the MONs have a chance to recover. Unfortunately,
I can't really comment on the stack trace itself.
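If the surviving MON still responds on its admin socket, something
like the following should raise the relevant debug levels. This is
just a rough sketch; the mon ID "a" is a placeholder for your actual
MON name:

```
# Bump MON debug levels at runtime via the admin socket
# (mon ID "a" is a placeholder, adjust to your deployment):
ceph daemon mon.a config set debug_mon 20
ceph daemon mon.a config set debug_paxos 20
ceph daemon mon.a config set debug_ms 1

# Or set them in /etc/ceph/ceph.conf under [mon] before restarting:
#   [mon]
#       debug mon = 20
#       debug paxos = 20
#       debug ms = 1
```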
Maybe someone has a different idea, but if you can get one MON up, I
would probably reduce the monmap to a single MON to bring the cluster
back up. Back up all the MON stores first, just in case you have to
start over. Then extract the monmap, remove all MONs but one, and
inject the modified monmap into the MON you want to revive. The
procedure is described in [0]; just don't change the address, only
reduce the monmap. ;-)
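For reference, a rough sketch of what that could look like on the
command line. The mon IDs "a", "b", "c" and the default cluster name
"ceph" are placeholders here; [0] has the authoritative steps:

```
# With all MONs stopped, back up every MON store first
# (placeholder paths/IDs, adjust to your deployment):
cp -a /var/lib/ceph/mon/ceph-a /root/mon-a-store.backup

# Extract the current monmap from the MON you want to keep:
ceph-mon -i a --extract-monmap /tmp/monmap

# Inspect it, then remove the other MONs from the map:
monmaptool --print /tmp/monmap
monmaptool --rm b /tmp/monmap
monmaptool --rm c /tmp/monmap

# Inject the reduced monmap and start only that MON:
ceph-mon -i a --inject-monmap /tmp/monmap
systemctl start ceph-mon@a
```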
Regards,
Eugen
[0]
https://docs.ceph.com/en/latest/rados/operations/add-or-rm-mons/#changing-a-monitor-s-ip-address-advanced-method
Quoting Miles Goodhew <c...@m0les.com>:
Hi,
I've been called in by a client with an ancient SUSE-based Ceph
Nautilus (14.2.22) cluster whose MONs keep dying oddly.
Apparently the issue started with MDS daemons not working, and
eventually a MON restart killed the cluster.
OS: SLES 15-SP1 (out of support)
Ceph: 14.2.22 "Nautilus" (Deployed with Salt... I think)
3 MONs; 5 MDSs; 3 MGRs; 4 RGWs; 336 OSDs on 21 nodes.
Client services: "One of everything at least", but RBD/OpenStack,
S3/RGW and CephFS are the big ones.
After sorting through some of the logs, here are some things I know:
Disk space, RAM availability, inodes and network connectivity seem
OK to me. After shutting down all the MONs, MGRs and MDSes, one MON
can usually be started, but it sits there spamming out log messages
like "[SERVICE_ID](probing) e6 handle_auth_request failed to assign
global_id" (maybe 50 - 100 times per second). All the while, the
syslog shows "e6 get_health_metrics reporting [INCREASING_NUMBER]
slow ops" fairly often. This is probably due to OSDs and clients
still being active.
If I restart one of the other MONs, the running one will die with
a stack trace (limited here to C++/library-internal calls):
```
8: (std::__throw_out_of_range(char const*)+0x41) [0x7f2a5983fa07]
9: (MDSMonitor::maybe_resize_cluster(FSMap&, int)+0xcf0) [0x55b441e37490]
10: (MDSMonitor::tick()+0xc9) [0x55b441e38ce9]
11: (MDSMonitor::on_active()+0x28) [0x55b441e22fa8]
12: (PaxosService::_active()+0xdd) [0x55b441d7188d]
13: (Context::complete(int)+0x9) [0x55b441c888a9]
14: (void finish_contexts<std::__cxx11::list<Context*, std::allocator<Context*> > >(CephContext*, std::__cxx11::list<Context*, std::allocator<Context*> >&, int)+0xa8) [0x55b441cb2408]
15: (Paxos::finish_round()+0x76) [0x55b441d681b6]
16: (Paxos::handle_last(boost::intrusive_ptr<MonOpRequest>)+0xc1f) [0x55b441d693df]
17: (Paxos::dispatch(boost::intrusive_ptr<MonOpRequest>)+0x233) [0x55b441d69e23]
18: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x1668) [0x55b441c820b8]
19: (Monitor::_ms_dispatch(Message*)+0xa3a) [0x55b441c82b5a]
20: (Monitor::ms_dispatch(Message*)+0x26) [0x55b441cb3646]
21: (Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x26) [0x55b441cb00b6]
22: (DispatchQueue::entry()+0x1279) [0x7f2a5b188379]
23: (DispatchQueue::DispatchThread::entry()+0xd) [0x7f2a5b238a5d]
24: (()+0x8539) [0x7f2a59db7539]
25: (clone()+0x3f) [0x7f2a58f87ecf]
```
Anyone got any clues about how to diagnose or, better yet, repair
this? Sorry, I know this is a bit half-baked, but I'm trying to dump
this help request at COB to see if I can hook anyone's interest
overnight.
Thanks for at least reading this far,
M0les.
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io