Hi,
  I've been called in by a client with an ancient SUSE-based Ceph Nautilus 
(14.2.22) cluster whose MONs keep dying oddly.
  Apparently the issue started with MDS daemons not working, and eventually a 
MON restart killed the cluster.

OS: SLES 15-SP1 (out of support)
Ceph: 14.2.22 "Nautilus" (Deployed with Salt... I think)
3 MONs; 5 MDSs; 3 MGRs; 4 RGWs; 336 OSDs on 21 nodes.
Client services: "One of everything at least", but RBD/OpenStack, S3/RGW and 
CephFS are the big ones.

  After sorting through some of the logs, here is what I know: disk space, 
RAM availability, inodes and network connectivity all seem OK to me. After 
shutting down all the MONs, MGRs and MDSes, one MON can usually be started, but 
it sits there spamming out log messages like "[SERVICE_ID](probing) e6 
handle_auth_request failed to assign global_id" (maybe 50 - 100 times per 
second). All the while, the syslog fairly often shows "e6 get_health_metrics 
reporting [INCREASING_NUMBER] slow ops". This is probably due to OSDs and 
clients still being active.
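
  A rough way to put a firmer number on that message rate is to bucket the matching log lines by second. The sketch below is only that: the timestamp layout at the start of each line and the helper names `messages_per_second` and `summarize` are assumptions, not anything taken from Ceph itself, so the regex may need adjusting to the actual MON log format.

```python
import re
from collections import Counter

# Assumed line shape (not verified against this cluster's logs):
#   2024-01-01 12:00:00.123456 7f00 0 mon.a@0(probing) e6 handle_auth_request ...
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2})")
NEEDLE = "handle_auth_request failed to assign global_id"


def messages_per_second(lines, needle=NEEDLE):
    """Count lines containing `needle`, bucketed by whole-second timestamp."""
    counts = Counter()
    for line in lines:
        if needle not in line:
            continue
        match = TIMESTAMP.match(line)
        if match:  # lines without a recognisable timestamp are skipped
            counts[match.group(1)] += 1
    return counts


def summarize(counts):
    """Return (seconds_seen, total_messages, peak_per_second)."""
    if not counts:
        return (0, 0, 0)
    return (len(counts), sum(counts.values()), max(counts.values()))
```

  Feeding it an open log file, for example `summarize(messages_per_second(open(path)))` with `path` pointing at the MON log, gives the number of seconds in which the message appeared, the total count and the busiest second.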

  If I restart one of the other MONs, the running one dies with the following 
stack trace (trimmed to the C++/library internal calls):

```
8: (std::__throw_out_of_range(char const*)+0x41) [0x7f2a5983fa07]
9: (MDSMonitor::maybe_resize_cluster(FSMap&, int)+0xcf0) [0x55b441e37490]
10: (MDSMonitor::tick()+0xc9) [0x55b441e38ce9]
11: (MDSMonitor::on_active()+0x28) [0x55b441e22fa8]
12: (PaxosService::_active()+0xdd) [0x55b441d7188d]
13: (Context::complete(int)+0x9) [0x55b441c888a9]
14: (void finish_contexts<std::__cxx11::list<Context*, std::allocator<Context*> > >(CephContext*, std::__cxx11::list<Context*, std::allocator<Context*> >&, int)+0xa8) [0x55b441cb2408]
15: (Paxos::finish_round()+0x76) [0x55b441d681b6]
16: (Paxos::handle_last(boost::intrusive_ptr<MonOpRequest>)+0xc1f) [0x55b441d693df]
17: (Paxos::dispatch(boost::intrusive_ptr<MonOpRequest>)+0x233) [0x55b441d69e23]
18: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x1668) [0x55b441c820b8]
19: (Monitor::_ms_dispatch(Message*)+0xa3a) [0x55b441c82b5a]
20: (Monitor::ms_dispatch(Message*)+0x26) [0x55b441cb3646]
21: (Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x26) [0x55b441cb00b6]
22: (DispatchQueue::entry()+0x1279) [0x7f2a5b188379]
23: (DispatchQueue::DispatchThread::entry()+0xd) [0x7f2a5b238a5d]
24: (()+0x8539) [0x7f2a59db7539]
25: (clone()+0x3f) [0x7f2a58f87ecf]
```

Does anyone have any clues about how to diagnose or, better yet, repair this?

Sorry, I know this is a bit half-baked, but I'm trying to dump this help 
request at COB to see if I can hook anyone's interest overnight.

Thanks for at least reading this far,

M0les.
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
