Hi, I've been called in by a client with an ancient SUSE-based Ceph Nautilus (14.2.22) cluster whose MONs keep dying oddly. Apparently the issue started with MDS daemons not working, and eventually a MON restart killed the cluster.

OS: SLES 15-SP1 (out of support)
Ceph: 14.2.22 "Nautilus" (Deployed with Salt... I think)
3 MONs; 5 MDSs; 3 MGRs; 4 RGWs; 336 OSDs on 21 nodes.
Client services: "One of everything at least", but RBD/OpenStack, S3/RGW and CephFS are the big ones.

After sorting through some of the logs, here are some things I know:

Disk space, RAM availability, inodes and network connectivity seem OK to me.

After shutting down all the MONs, MGRs and MDSes, one MON can usually be started, but it sits there spamming out log messages like "[SERVICE_ID](probing) e6 handle_auth_request failed to assign global_id" (maybe 50 - 100 times per second). All the while, the syslog shows "e6 get_health_metrics reporting [INCREASING_NUMBER] slow ops" fairly often. This is probably due to OSDs and clients being active.

If I restart one of the other MONs, the running one will die with the following stack trace (limited to the C++/library-internal calls):

```
8: (std::__throw_out_of_range(char const*)+0x41) [0x7f2a5983fa07]
9: (MDSMonitor::maybe_resize_cluster(FSMap&, int)+0xcf0) [0x55b441e37490]
10: (MDSMonitor::tick()+0xc9) [0x55b441e38ce9]
11: (MDSMonitor::on_active()+0x28) [0x55b441e22fa8]
12: (PaxosService::_active()+0xdd) [0x55b441d7188d]
13: (Context::complete(int)+0x9) [0x55b441c888a9]
14: (void finish_contexts<std::__cxx11::list<Context*, std::allocator<Context*> > >(CephContext*, std::__cxx11::list<Context*, std::allocator<Context*> >&, int)+0xa8) [0x55b441cb2408]
15: (Paxos::finish_round()+0x76) [0x55b441d681b6]
16: (Paxos::handle_last(boost::intrusive_ptr<MonOpRequest>)+0xc1f) [0x55b441d693df]
17: (Paxos::dispatch(boost::intrusive_ptr<MonOpRequest>)+0x233) [0x55b441d69e23]
18: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x1668) [0x55b441c820b8]
19: (Monitor::_ms_dispatch(Message*)+0xa3a) [0x55b441c82b5a]
20: (Monitor::ms_dispatch(Message*)+0x26) [0x55b441cb3646]
21: (Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x26) [0x55b441cb00b6]
22: (DispatchQueue::entry()+0x1279) [0x7f2a5b188379]
23: (DispatchQueue::DispatchThread::entry()+0xd) [0x7f2a5b238a5d]
24: (()+0x8539) [0x7f2a59db7539]
25: (clone()+0x3f) [0x7f2a58f87ecf]
```

Has anyone got any clues about how to diagnose or, better yet, repair this?

Sorry, I know this is a bit half-baked, but I'm trying to dump this help request at COB to see if I can hook anyone's interest overnight.

Thanks for at least reading this far,
M0les.
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io