Eugen,
Sorry, I forgot to add that this is what the monmap looks like now (IPs/names sanitised):
```
min_mon_release 14 (nautilus)
0: [v2:IP_MON3:3300/0,v1:IP_MON3:6789/0] mon3
1: v1:IP_MON1:6789/0 mon1
2: v1:IP_MON2:6789/0 mon2
```

Not sure why mon3 has the v2 + v1 setup and mon1/2 don't.
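My best guess is that `ceph mon enable-msgr2` only ever took effect on mon3. Once there's a working quorum again, I assume the addresses can be evened out with the standard msgr2 commands, roughly like this (untested on this cluster; the mon names and IP_MON* placeholders are the sanitised ones from the map above):

```
# Check what each mon currently advertises
ceph mon dump

# Ask mons still on the old default port to add their v2 (port 3300) addresses ...
ceph mon enable-msgr2

# ... or pin them explicitly per mon if that doesn't take
ceph mon set-addrs mon1 [v2:IP_MON1:3300,v1:IP_MON1:6789]
ceph mon set-addrs mon2 [v2:IP_MON2:3300,v1:IP_MON2:6789]
```

That's cosmetic next to the real problem, though, so it can wait.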
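Eugen: for the higher debug level you asked about, the plan before the next start attempt is to bump the usual mon debug settings on the surviving node, something like this (the values are just a guess at what's useful):

```
# /etc/ceph/ceph.conf on the mon host; the same can be set at runtime with
# "ceph daemon mon.mon3 config set debug_mon 20" via the admin socket
[mon]
debug mon = 20
debug paxos = 20
debug ms = 1
```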
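As for actually getting a quorum back: the single-mon reduction Eugen described (the procedure in [0] below) boils down to roughly the following, as far as I can tell. This is only a sketch - it assumes mon3 is the survivor, the default "ceph" cluster name and mon store path, and that every mon store has been backed up first:

```
# With ALL mons stopped, keep a copy of each mon store before touching anything
cp -a /var/lib/ceph/mon/ceph-mon3 /root/mon3-store-backup

# Extract the current monmap from the mon we want to keep
ceph-mon -i mon3 --extract-monmap /tmp/monmap

# Remove the other two mons from the map
monmaptool /tmp/monmap --rm mon1
monmaptool /tmp/monmap --rm mon2

# Inject the reduced map and start only mon3
ceph-mon -i mon3 --inject-monmap /tmp/monmap
systemctl start ceph-mon@mon3
```

If that gives a working single-mon quorum, the other two can be clobbered and redeployed one at a time afterwards, as discussed below.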
Thanks again,

M0les.

On Wed, 18 Jun 2025, at 17:42, Miles Goodhew wrote:
> Hi Eugen,
> Thanks for your response.
>
> Out of interest, things that I've done overnight are stopping all the
> daemons (OSDs and RGWs were the ones still running) - so I'm just dealing
> with the 3 MONs now. Trying different start-sequences, I can determine:
>
> * mon3 was the last one working
> * Starting mon1 will kill mon3 (and prevent it starting) with that crash
>   mentioned in the original email
> * Similarly starting mon2 will kill both mon1 and mon3 in the same way
> * Only mon3 gets the fast spamming of "e6 handle_auth_request failed to
>   assign global_id" log messages when it's running.
> * Dumping the monmap results in the same file on all 3 mons.
>
> As for your suggestion of reducing the monmap to 1 node and rebuilding, we
> were also thinking of heading down that path. I'm hoping that deploying a
> temporary 4th mon on a new node might be able to get two nodes running
> (without killing the "old" one). Probably using mon3, because it's likely the
> most up-to-date. If that works, we could try clobbering and redeploying the
> other two "old" mon daemons and removing the temporary one to get back to the
> original 3 mons. As you say: using their original IP addresses (one of the
> clients is Openstack/RBD, which can be sentimental about mon IPs).
>
> I'm just in a bit of decision paralysis about which mon to take as the
> survivor. All can run _individually_, but only mon2 will survive a group
> start. mon3 was the last one working, but it has the mysterious "failed to
> assign global ID" errors. I'm leaning toward using mon3... or mon2.
>
> Thanks for listening,
>
> M0les.
>
> On Wed, 18 Jun 2025, at 17:04, Eugen Block wrote:
>> Hi,
>>
>> correct, SUSE's Ceph product was Salt-based, in this case 14.2.22 was
>> shipped with SES 6. ;-)
>>
>> Do you also have some mon logs from right before the crash, maybe with
>> a higher debug level? It could make sense to stop client traffic and
>> OSDs as well to be able to recover. But unfortunately, I can't really
>> comment on the stack trace.
>>
>> Maybe someone has a different idea, but if you get one MON up, I would
>> probably reduce the monmap to 1 MON to bring the cluster back up. Back
>> up all the MON stores, just in case you have to start over. Then
>> extract the monmap, remove all but one, and inject the modified monmap
>> into the MON you want to revive. The procedure is described here [0].
>> Just don't change the address but only reduce the monmap. ;-)
>>
>> Regards,
>> Eugen
>>
>> [0] https://docs.ceph.com/en/latest/rados/operations/add-or-rm-mons/#changing-a-monitor-s-ip-address-advanced-method
>>
>> Zitat von Miles Goodhew <c...@m0les.com>:
>>
>> > Hi,
>> > I've been called in by a client with an ancient SUSE-based Ceph
>> > Nautilus (14.2.22) whose MONs keep dying oddly.
>> > Apparently the issue started with MDS daemons not working and
>> > eventually a MON restart killed the cluster.
>> >
>> > OS: SLES 15-SP1 (out of support)
>> > Ceph: 14.2.22 "Nautilus" (Deployed with Salt... I think)
>> > 3 MONs; 5 MDSs; 3 MGRs; 4 RGWs; 336 OSDs on 21 nodes.
>> > Client services: "One of everything at least", but RBD/Openstack,
>> > S3/RGW and CephFS are big ones.
>> >
>> > After sorting out some of the logs, here are some things I know:
>> > Disk space, RAM availability, inodes and network connectivity seem
>> > OK to me. After shutting down all the MONs, MGRs and MDSes, one MON
>> > can usually be started, but it sits there spamming out log messages
>> > like "[SERVICE_ID](probing) e6 handle_auth_request failed to assign
>> > global_id" (maybe 50 - 100 times per second). All the while the
>> > syslog shows `e6 get_health_metrics reporting [INCREASING_NUMBER]
>> > slow ops` fairly often. This is probably due to OSDs and clients
>> > being active.
>> >
>> > If I restart one of the other MONs, the running one will die with
>> > a stack trace at (limiting to C++/library-internal calls):
>> >
>> > ```
>> > 8: (std::__throw_out_of_range(char const*)+0x41) [0x7f2a5983fa07]
>> > 9: (MDSMonitor::maybe_resize_cluster(FSMap&, int)+0xcf0) [0x55b441e37490]
>> > 10: (MDSMonitor::tick()+0xc9) [0x55b441e38ce9]
>> > 11: (MDSMonitor::on_active()+0x28) [0x55b441e22fa8]
>> > 12: (PaxosService::_active()+0xdd) [0x55b441d7188d]
>> > 13: (Context::complete(int)+0x9) [0x55b441c888a9]
>> > 14: (void finish_contexts<std::__cxx11::list<Context*, std::allocator<Context*> > >(CephContext*, std::__cxx11::list<Context*, std::allocator<Context*> >&, int)+0xa8) [0x55b441cb2408]
>> > 15: (Paxos::finish_round()+0x76) [0x55b441d681b6]
>> > 16: (Paxos::handle_last(boost::intrusive_ptr<MonOpRequest>)+0xc1f) [0x55b441d693df]
>> > 17: (Paxos::dispatch(boost::intrusive_ptr<MonOpRequest>)+0x233) [0x55b441d69e23]
>> > 18: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x1668) [0x55b441c820b8]
>> > 19: (Monitor::_ms_dispatch(Message*)+0xa3a) [0x55b441c82b5a]
>> > 20: (Monitor::ms_dispatch(Message*)+0x26) [0x55b441cb3646]
>> > 21: (Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x26) [0x55b441cb00b6]
>> > 22: (DispatchQueue::entry()+0x1279) [0x7f2a5b188379]
>> > 23: (DispatchQueue::DispatchThread::entry()+0xd) [0x7f2a5b238a5d]
>> > 24: (()+0x8539) [0x7f2a59db7539]
>> > 25: (clone()+0x3f) [0x7f2a58f87ecf]
>> > ```
>> >
>> > Anyone got any clues about how to diagnose or, better yet, repair this?
>> >
>> > Sorry, I know this is a bit half-baked, but I'm trying to dump this
>> > help request at COB to see if I can hook anyone's interest overnight.
>> >
>> > Thanks for at least reading this far,
>> >
>> > M0les.

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io