Eugen,
  Sorry, I forgot to add that this is what the monmap looks like now (IPs/names 
sanitised):

```
min_mon_release 14 (nautilus)
0: [v2:IP_MON3:3300/0,v1:IP_MON3:6789/0] mon3
1: v1:IP_MON1:6789/0 mon1
2: v1:IP_MON2:6789/0 mon2
```

Not sure why mon3 has the v2 + v1 address setup while mon1/mon2 only have v1.
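
My working assumption is that "ceph mon enable-msgr2" was only ever 
completed for mon3, so mon1/mon2 never picked up a v2 (port 3300) address. 
Once we have a quorum again, something like this (an untested sketch based 
on the standard Nautilus msgr2 steps) should confirm and/or fix it:

```
ceph mon dump            # check which mons only list a v1 address
ceph mon enable-msgr2    # ask the v1-only mons to also bind the msgr2 port
ceph mon dump            # re-check: all mons should show [v2:...,v1:...]
```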

Thanks again,

M0les.

On Wed, 18 Jun 2025, at 17:42, Miles Goodhew wrote:
> Hi Eugen,
>   Thanks for your response.
> 
>   Out of interest, what I've done overnight is stop all the remaining 
> daemons (OSDs and RGWs were the ones still running), so I'm just dealing 
> with the 3 MONs now. Trying different start sequences, I can determine:
> 
> * mon3 was the last one working
> * Starting mon1 will kill mon3 (and prevent it starting) with that crash 
> mentioned in the original email
> * Similarly starting mon2 will kill both mon1 and mon3 in the same way
> * Only mon3 gets the fast spamming of "e6 handle_auth_request failed to 
> assign global_id" log messages when it's running.
> * Dumping the monmap results in the same file on all 3 mons.
> 
> As for your suggestion of reducing the monmap to 1 node and rebuilding, we 
> were also thinking of heading down that path. I'm hoping that deploying a 
> temporary 4th mon on a new node might let us get two mons running 
> (without killing the "old" one), probably using mon3 because it's likely the 
> most up-to-date. If that works, we could try clobbering and redeploying the 
> other two "old" mon daemons and removing the temporary one to get back to the 
> original 3 mons. As you say, we'd keep their original IP addresses (one of the 
> clients is OpenStack/RBD, which can be sentimental about mon IPs).
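> 
> In case it's useful, this is roughly the mechanics I have in mind for the 
> temporary 4th mon (an untested sketch; the "tempmon" ID, TEMP_IP and the 
> paths are placeholders, and the SES layout may differ slightly):
> 
> ```
> # With the existing mons stopped, pull the current monmap from mon3's store
> ceph-mon -i mon3 --extract-monmap /tmp/monmap
> 
> # Add the temporary mon to the map, push the updated map back into mon3,
> # then build the new mon's store from that map plus a copy of the "mon."
> # keyring (each mon keeps one under its data directory)
> monmaptool --add tempmon TEMP_IP:6789 /tmp/monmap
> ceph-mon -i mon3 --inject-monmap /tmp/monmap
> ceph-mon -i tempmon --mkfs --monmap /tmp/monmap \
>          --keyring /var/lib/ceph/mon/ceph-mon3/keyring
> ```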
> 
> I'm just in a bit of decision paralysis about which mon to take as the 
> survivor. All can run _individually_, but only mon2 will survive a group 
> start. mon3 was the last one working, but it has the mysterious "failed to 
> assign global_id" errors. I'm leaning toward using mon3... or mon2.
> 
> Thanks for listening,
> 
> M0les.
> 
> 
> On Wed, 18 Jun 2025, at 17:04, Eugen Block wrote:
>> Hi,
>> 
>> correct, SUSE's Ceph product was Salt-based; in this case, 14.2.22 was  
>> shipped with SES 6. ;-)
>> 
>> Do you also have some mon logs from right before the crash, maybe with  
>> a higher debug level? It could make sense to stop client traffic and  
>> OSDs as well to be able to recover. But unfortunately, I can't really  
>> comment on the stack trace.
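>> 
>> For a crash in the mon itself, the usual trick would be to run the  
>> surviving mon in the foreground with the mon/paxos debug levels turned  
>> up, something like this (a sketch, with "mon3" as an example):
>> 
>> ```
>> # Any config option can be passed on the command line; -d keeps the
>> # daemon in the foreground and logs to stderr
>> ceph-mon -i mon3 -d --debug-mon 20 --debug-paxos 20 --debug-ms 1
>> ```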
>> 
>> Maybe someone has a different idea, but if you get one MON up, I would  
>> probably reduce the monmap to 1 MON to bring the cluster back up. Back  
>> up all the MON stores, just in case you have to start over. Then  
>> extract the monmap, remove all but one, and inject the modified monmap  
>> into the MON you want to revive. The procedure is described here [0].  
>> Just don't change the address but only reduce the monmap. ;-)
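>> 
>> In very rough shell terms, that would be something like this (a sketch  
>> only, assuming mon3 is the survivor and the default /var/lib/ceph  
>> paths; see [0] for the full procedure):
>> 
>> ```
>> # Back up every mon store first, e.g.:
>> cp -a /var/lib/ceph/mon/ceph-mon3 /root/mon3-store-backup
>> 
>> # With all mons stopped: extract the map, drop the other two, re-inject
>> ceph-mon -i mon3 --extract-monmap /tmp/monmap
>> monmaptool --print /tmp/monmap
>> monmaptool --rm mon1 /tmp/monmap
>> monmaptool --rm mon2 /tmp/monmap
>> ceph-mon -i mon3 --inject-monmap /tmp/monmap
>> # then start only mon3 and check whether it forms a quorum on its own
>> ```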
>> 
>> Regards,
>> Eugen
>> 
>> [0]  
>> https://docs.ceph.com/en/latest/rados/operations/add-or-rm-mons/#changing-a-monitor-s-ip-address-advanced-method
>> 
>> Zitat von Miles Goodhew <c...@m0les.com>:
>> 
>> > Hi,
>> >   I've been called in by a client with an ancient SUSE-based Ceph  
>> > Nautilus (14.2.22) cluster whose MONs keep dying oddly.
>> >   Apparently the issue started with the MDS daemons not working, and  
>> > eventually a MON restart killed the cluster.
>> >
>> > OS: SLES 15-SP1 (out of support)
>> > Ceph: 14.2.22 "Nautilus" (Deployed with Salt... I think)
>> > 3 MONs; 5 MDSs; 3 MGRs; 4 RGWs; 336 OSDs on 21 nodes.
>> > Client services: "One of everything at least", but RBD/Openstack,  
>> > S3/RGW and CephFS are big ones.
>> >
>> >   After sorting out some of the logs, here are some things I know:  
>> > disk space, RAM availability, inodes and network connectivity seem  
>> > OK to me. After shutting down all the MONs, MGRs and MDSes, one MON  
>> > can usually be started, but it sits there spamming out log messages  
>> > like "[SERVICE_ID](probing) e6 handle_auth_request failed to assign  
>> > global_id" (maybe 50 - 100 times per second). All the while, the  
>> > syslog shows "e6 get_health_metrics reporting [INCREASING_NUMBER]  
>> > slow ops" fairly often. This is probably due to OSDs and clients  
>> > still being active.
>> >
>> >   If I restart one of the other MONs, the running one will die with  
>> > a stack trace like this (limited to the C++/library-internal calls):
>> >
>> > ```
>> > 8: (std::__throw_out_of_range(char const*)+0x41) [0x7f2a5983fa07]
>> > 9: (MDSMonitor::maybe_resize_cluster(FSMap&, int)+0xcf0) [0x55b441e37490]
>> > 10: (MDSMonitor::tick()+0xc9) [0x55b441e38ce9]
>> > 11: (MDSMonitor::on_active()+0x28) [0x55b441e22fa8]
>> > 12: (PaxosService::_active()+0xdd) [0x55b441d7188d]
>> > 13: (Context::complete(int)+0x9) [0x55b441c888a9]
>> > 14: (void finish_contexts<std::__cxx11::list<Context*,  
>> > std::allocator<Context*> > >(CephContext*,  
>> > std::__cxx11::list<Context*, std::allocator<Context*> >&, int)+0xa8)  
>> > [0x55b441cb2408]
>> > 15: (Paxos::finish_round()+0x76) [0x55b441d681b6]
>> > 16: (Paxos::handle_last(boost::intrusive_ptr<MonOpRequest>)+0xc1f)  
>> > [0x55b441d693df]
>> > 17: (Paxos::dispatch(boost::intrusive_ptr<MonOpRequest>)+0x233)  
>> > [0x55b441d69e23]
>> > 18:  
>> > (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x1668)  
>> > [0x55b441c820b8]
>> > 19: (Monitor::_ms_dispatch(Message*)+0xa3a) [0x55b441c82b5a]
>> > 20: (Monitor::ms_dispatch(Message*)+0x26) [0x55b441cb3646]
>> > 21: (Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message>  
>> > const&)+0x26) [0x55b441cb00b6]
>> > 22: (DispatchQueue::entry()+0x1279) [0x7f2a5b188379]
>> > 23: (DispatchQueue::DispatchThread::entry()+0xd) [0x7f2a5b238a5d]
>> > 24: (()+0x8539) [0x7f2a59db7539]
>> > 25: (clone()+0x3f) [0x7f2a58f87ecf]
>> > ```
>> >
>> > Anyone got any clues about how to diagnose or, better yet, repair this?
>> >
>> > Sorry, I know this is a bit half-baked, but I'm trying to dump this  
>> > help request at COB to see if I can hook anyone's interest overnight.
>> >
>> > Thanks for at least reading this far,
>> >
>> > M0les.
> 
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
