Cool, that's fantastic news! And a great analysis, too! I'm glad you got it back up and client operations could resume. Happy to help!

Quoting Miles Goodhew <c...@m0les.com>:

On Thu, 19 Jun 2025, at 18:39, Eugen Block wrote:
Quoting Miles Goodhew <c...@m0les.com>:

> On Thu, 19 Jun 2025, at 17:48, Eugen Block wrote:
>> Too bad. :-/ Could you increase the debug log level to 20? Maybe it
>> gets a bit clearer where exactly it fails.
>
> I guess that's in `ceph.conf` with:
>
> [mon]
>     debug_mon = 20
> ?

Correct.
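
(Side note: if the mon stays up long enough to answer on its admin socket, the same level can usually be bumped at runtime rather than via ceph.conf, something like the following — the mon id here is just a placeholder:)

```
  # assumes the mon's admin socket is reachable; substitute the actual mon id
  ceph daemon mon.$(hostname -s) config set debug_mon 20/20
```

With a mon that crashes on startup, though, setting it in ceph.conf beforehand is the reliable route.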

Some progress has been made!
The mon's print_map() output for the MDS map shows an "in" set of [0,1,2,3] (i.e. size 4, although only 2 are actually perceived as "up") and a max_mds value of 2. With the log level raised to 20, the last dout we see is the one on line 1810 ("in 4 max 2"); none of the four subsequent dout lines (1818, 1834, 1847 or 1855) ever appear. That must mean the crash happens in one of the calls in between: line 1816 (mds_map.isresizable, called twice), 1845 (mds_map.get_info) or 1846 (mds_map.is_active).
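
A backtrace from the crashed mon would pin down which of those calls it actually is; a minimal sketch, assuming systemd-coredump is catching the cores and ceph debug symbols are installed:

```
  # list core dumps from ceph-mon and open the newest one in gdb
  coredumpctl list ceph-mon
  coredumpctl gdb ceph-mon
  # then inside gdb: "bt" (or "thread apply all bt") shows the failing call chain
```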

We were toying with ways of setting values that would make this code bail out earlier, and decided that the MDS service was not the most important part of the cluster (the OpenStack cluster on top of it was more important).

So as a test, we used `ceph-kvstore-tool` to simply trim the "mds*" prefixes off the DB:

```
  ceph-kvstore-tool leveldb ${DB_PATH} rm-prefix mdsmap
  ceph-kvstore-tool leveldb ${DB_PATH} rm-prefix mds_health
  ceph-kvstore-tool leveldb ${DB_PATH} rm-prefix mds_metadata
  ceph-kvstore-tool leveldb ${DB_PATH} rm health mdsmap
```

(I suspect the "mdsmap" part was the most important, but we're mostly just going by feel at this level).
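
(For anyone tempted to repeat this: the rm-prefix operations are destructive, so take a copy of the mon store first while the mon is stopped. A rough sketch, with the mon id and paths to be adjusted:)

```
  systemctl stop ceph-mon@$(hostname -s)
  # plain filesystem copy of the store directory...
  cp -a ${DB_PATH} ${DB_PATH}.bak
  # ...or let the tool do it
  ceph-kvstore-tool leveldb ${DB_PATH} store-copy ${DB_PATH}.copy
```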

To our surprised delight, the MONs all started up and formed a quorum. We then started up all the MGRs without issue. We progressively started all the OSDs (with only minor rebalancing from one genuinely unhealthy disk).

The cluster got back to a "nominally operational except for CephFS" state and the OpenStack cluster was verified and repaired. All green at COB. The RGW services were restarted and verified as operational by their clients.
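
(There's nothing exotic behind a claim like "all green" — it's the standard battery of checks, e.g.:)

```
  ceph -s              # overall status / HEALTH_OK
  ceph health detail   # spells out any remaining warnings
  ceph osd tree        # all OSDs up/in, bar the known-bad disk
```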

So we're leaving it like this for now and conducting a review on Monday. There are a few immediate bits of maintenance that can be done, but this whole incident lights a fire under the "let's get this updated and moved to a supported OS/hardware" plan.

Thanks again for all your help, Eugen - much appreciated!!!

M0les.
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

