Cool, that's fantastic news! And a great analysis, too! I'm glad you got it back up and client operations could resume. Happy to help!

Quoting Miles Goodhew <c...@m0les.com>:

On Thu, 19 Jun 2025, at 18:39, Eugen Block wrote:
Quoting Miles Goodhew <c...@m0les.com>:

> On Thu, 19 Jun 2025, at 17:48, Eugen Block wrote:
>> Too bad. :-/ Could you increase the debug log level to 20? Maybe it
>> gets a bit clearer where exactly it fails.
>
> I guess that's in `ceph.conf` with:
>
> [mon]
>     debug_mon = 20
> ?

Correct.
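
(Side note: if the mon stays up long enough to answer on its admin socket, the same level can usually be bumped at runtime rather than via ceph.conf, something like the following — the mon id here is just a placeholder:)

```
  # assumes the mon's admin socket is reachable; substitute the actual mon id
  ceph daemon mon.$(hostname -s) config set debug_mon 20/20
```

With a mon that crashes on startup, though, setting it in ceph.conf beforehand is the reliable route.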

Some progress has been made!
The mon's print_map() output for the MDS map shows an "in" set of [0,1,2,3] (i.e. size 4, although only 2 are actually perceived as "up") and a max_mds value of 2. With the log level raised to 20, the last dout we see is the one on line 1810 ("in 4 max 2"); none of the four subsequent dout lines (1818, 1834, 1847 or 1855) ever appear. That must mean the crash happens in one of the calls in between: line 1816 (mds_map.isresizable, called twice), 1845 (mds_map.get_info) or 1846 (mds_map.is_active).
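
A backtrace from the crashed mon would pin down which of those calls it actually is; a minimal sketch, assuming systemd-coredump is catching the cores and ceph debug symbols are installed:

```
  # list core dumps from ceph-mon and open the newest one in gdb
  coredumpctl list ceph-mon
  coredumpctl gdb ceph-mon
  # then inside gdb: "bt" (or "thread apply all bt") shows the failing call chain
```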

We were toying with ways of setting values that would make this code bail out earlier, and decided that the MDS service was not the most important part of the cluster (the OpenStack cluster on top of it was more important).

So as a test, we used `ceph-kvstore-tool` to simply trim the "mds*" prefixes off the DB:

```
  ceph-kvstore-tool leveldb ${DB_PATH} rm-prefix mdsmap
  ceph-kvstore-tool leveldb ${DB_PATH} rm-prefix mds_health
  ceph-kvstore-tool leveldb ${DB_PATH} rm-prefix mds_metadata
  ceph-kvstore-tool leveldb ${DB_PATH} rm health mdsmap
```

(I suspect the "mdsmap" part was the most important, but we're mostly just going by feel at this level).
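
(For anyone tempted to repeat this: the rm-prefix operations are destructive, so take a copy of the mon store first while the mon is stopped. A rough sketch, with the mon id and paths to be adjusted:)

```
  systemctl stop ceph-mon@$(hostname -s)
  # plain filesystem copy of the store directory...
  cp -a ${DB_PATH} ${DB_PATH}.bak
  # ...or let the tool do it
  ceph-kvstore-tool leveldb ${DB_PATH} store-copy ${DB_PATH}.copy
```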

To our surprised delight, the MONs all started up and formed a quorum. We then started up all the MGRs without issue. We progressively started all the OSDs (with only minor rebalancing from one genuinely unhealthy disk).

The cluster got back to a "nominally operational except for CephFS" state and the OpenStack cluster was verified and repaired. All green at COB. The RGW services were restarted and verified as operational by their clients.
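
(There's nothing exotic behind a claim like "all green" — it's the standard battery of checks, e.g.:)

```
  ceph -s              # overall status / HEALTH_OK
  ceph health detail   # spells out any remaining warnings
  ceph osd tree        # all OSDs up/in, bar the known-bad disk
```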

So we're leaving it like this for now and conducting a review on Monday. There are a few immediate bits of maintenance that can be done, but this whole incident lights a fire under the "let's get this updated and moved to a supported OS/hardware" plan.

Thanks again for all your help, Eugen - much appreciated!!!

M0les.
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

