[ceph-users] Re: MDS Repeatedly Crashing/Restarting - Unable to get CephFS Active

Eugen Block Mon, 19 May 2025 23:03:27 -0700

Hi,

I don't think I've had to use a journal backup yet. Either the backup of the journal failed because it was corrupted, or the disaster recovery procedure worked out.

But assume that you would need to import the backup:


cephfs-journal-tool [options] journal import <path> [--force]

and then retry the FS recovery. But I also can't remember anyone on this list reporting that they successfully restored the journal from a backup and then recovered the FS in a second attempt.
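A minimal sketch of what such an import could look like. The file system name `cephfs`, rank `0`, and file name `backup.bin` are placeholders, not values from this thread, and the `run` helper and `DRY_RUN` switch are only illustrative. The `--rank=<fs_name>:<rank>` form follows the cephfs-journal-tool documentation. By default the script only prints the command.

```shell
#!/bin/sh
# Placeholders (assumptions, not taken from this thread): adjust before use.
FS_NAME="${FS_NAME:-cephfs}"
RANK="${RANK:-0}"
BACKUP="${BACKUP:-backup.bin}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the command, 0 = really run it

run() {
    # Print the command; execute it only when DRY_RUN=0.
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

import_journal() {
    # When really executing, refuse to import a missing or empty backup file.
    if [ "$DRY_RUN" = "0" ] && [ ! -s "$BACKUP" ]; then
        echo "backup file $BACKUP is missing or empty" >&2
        return 1
    fi
    run cephfs-journal-tool --rank="${FS_NAME}:${RANK}" journal import "$BACKUP"
}

import_journal
```

Setting `DRY_RUN=0` runs the import for real; the recovery steps would then have to be retried separately.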



Quoting Alexander Patrakov <patra...@gmail.com>:

Hi Eugen,

I have never seen any instructions on how to use such a backup if
disaster recovery fails. Do you know the procedure?

On Tue, May 20, 2025 at 1:23 AM Eugen Block <ebl...@nde.ag> wrote:


Hi,

not sure if it was related to journal replay, but have you checked for
memory issues? What's the mds memory target? Any traces of an oom
killer?
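A rough sketch of how these checks could be scripted on the MDS host. It assumes that "memory target" refers to the `mds_cache_memory_limit` option and that OOM-killer activity shows up in the host kernel log; the `oom_lines` helper is hypothetical. Under cephadm the `ceph` CLI may only be available inside `cephadm shell`, so that call is skipped when the binary is missing.

```shell
#!/bin/sh
# Keep only kernel-log lines that look like OOM-killer activity (reads stdin).
oom_lines() {
    grep -i -E 'out of memory|oom-kill|oom_reaper' || true
}

# Show the configured MDS cache memory limit, if the ceph CLI is available.
if command -v ceph >/dev/null 2>&1; then
    ceph config get mds mds_cache_memory_limit
fi

# Scan the kernel ring buffer for OOM-killer traces, if dmesg is available.
if command -v dmesg >/dev/null 2>&1; then
    dmesg -T 2>/dev/null | oom_lines
fi
```

An empty result from the kernel log scan only means no matching lines are still in the ring buffer; older events may have rotated out.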

The next thing I would do is inspect the journals for both purge_queue and mdlog:

cephfs-journal-tool --rank=<fs_name>:<rank> --journal=mdlog journal inspect
cephfs-journal-tool --rank=<fs_name>:<rank> --journal=purge_queue journal inspect

The --rank and --journal parameters might be in the wrong place here,
I'm writing this without immediate access to a cephfs-journal-tool.

In case the journals are okay, create a backup as described in the
docs [0]. Then you might have to go through the disaster recovery
steps (for this cephfs only).

[0] https://docs.ceph.com/en/latest/cephfs/disaster-recovery/
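A minimal sketch of the backup step, assuming a file system named `cephfs`, rank `0`, and an output file `backup.bin` (all placeholders, not values from this thread). The `run` helper and `DRY_RUN` switch are illustrative; the `journal export` subcommand is the one shown in the disaster recovery docs. By default the command is only printed.

```shell
#!/bin/sh
# Placeholders (assumptions, not taken from this thread): adjust before use.
FS_NAME="${FS_NAME:-cephfs}"
RANK="${RANK:-0}"
BACKUP="${BACKUP:-backup.bin}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the command, 0 = really run it

run() {
    # Print the command; execute it only when DRY_RUN=0.
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

export_journal() {
    # Dump the metadata journal of the given rank to a local file.
    run cephfs-journal-tool --rank="${FS_NAME}:${RANK}" journal export "$BACKUP"
}

export_journal
```

Keeping the exported file somewhere outside the cluster is probably wise before any of the later recovery steps modify the journal.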

Quoting Kasper Rasmussen <kasper_steenga...@hotmail.com>:

> Ceph Version: 18.2.7
>
> I've just migrated to cephadm and upgraded from Pacific to Reef
> 18.2.7 last week.
> Everything went well except for some minor issues with BlueFS spillover.
>
>
> Today the MDS of a specific fs refuses to start, and ceph orch ps
> shows the daemons with status "error".
> I have three other CephFS file systems that still work (though I haven't tested
> if they can fail over).
>
> I've restarted the MDSs - no luck (the selected MDS just starts/crashes
> in a loop until it gives up)
> I've deployed 2 new MDSs - no luck, same issue
>
> In all scenarios I see in ceph fs status that an MDS is chosen. FS
> status goes to "replay" or "replay(laggy)".
> On the host with the MDS I see the MDS container crash after
> way less than 5 mins, and the status reported by ceph orch ps is error.
>
> (btw - mds_beacon_grace has been set to 360)
>
> I've managed to get a good 500 lines of log out with info like this:
>
> << ----------------- LOG EXAMPLE START ----------------- >>
>     -7> 2025-05-19T16:05:02.840+0000 7f6739bb8640 10 monclient:
> _check_auth_tickets
>     -6> 2025-05-19T16:05:02.840+0000 7f6739bb8640 10 monclient:
> _check_auth_rotating have uptodate secrets (they expire after
> 2025-05-19T16:04:32.845551+0000)
>     -5> 2025-05-19T16:05:02.860+0000 7f673e3c1640 10 monclient:
> get_auth_request con 0x5616e9616c00 auth_method 0
>     -4> 2025-05-19T16:05:02.916+0000 7f673dbc0640 10 monclient:
> get_auth_request con 0x5616e7422800 auth_method 0
>     -3> 2025-05-19T16:05:02.968+0000 7f673d3bf640 10 monclient:
> get_auth_request con 0x5616f5eac800 auth_method 0
>     -2> 2025-05-19T16:05:02.972+0000 7f6736bb2640  2 mds.0.cache
> Memory usage:  total 574800, rss 343772, heap 207124, baseline
> 182548, 0 / 7535 inodes have caps, 0 caps, 0 caps per inode
>     -1> 2025-05-19T16:05:03.676+0000 7f67333ab640 -1
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/18.2.7/rpm/el9/BUILD/ceph-18.2.7/src/include/interval_set.h: In function 'void interval_set<T, C>::erase(T, T, std::function<bool(T, T)>) [with T = inodeno_t; C = std::map]' thread 7f67333ab640 time 2025-05-19T16:05:03.680495+0000
> start)
>
> ceph version 18.2.7 (6b0e988052ec84cf2d4a54ff9bbbc5e720b621ad) reef (stable)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x11e) [0x7f67406e6d2c]
>  2: /usr/lib64/ceph/libceph-common.so.2(+0x16beeb) [0x7f67406e6eeb]
>  3: /usr/bin/ceph-mds(+0x1f16fe) [0x5616e04d46fe]
>  4: /usr/bin/ceph-mds(+0x1f1745) [0x5616e04d4745]
>  5: (EMetaBlob::replay(MDSRank*, LogSegment*, int,
> MDPeerUpdate*)+0x4bdc) [0x5616e0709a4c]
>  6: (EUpdate::replay(MDSRank*)+0x5d) [0x5616e0711afd]
>  7: (MDLog::_replay_thread()+0x75e) [0x5616e06bc02e]
>  8: /usr/bin/ceph-mds(+0x1404b1) [0x5616e04234b1]
>  9: /lib64/libc.so.6(+0x8a21a) [0x7f674009721a]
>  10: clone()
>
>      0> 2025-05-19T16:05:03.676+0000 7f67333ab640 -1 *** Caught
> signal (Aborted) **
>  in thread 7f67333ab640 thread_name:mds-log-replay
>
> ceph version 18.2.7 (6b0e988052ec84cf2d4a54ff9bbbc5e720b621ad) reef (stable)
>  1: /lib64/libc.so.6(+0x3ebf0) [0x7f674004bbf0]
>  2: /lib64/libc.so.6(+0x8bf5c) [0x7f6740098f5c]
>  3: raise()
>  4: abort()
>  5: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x178) [0x7f67406e6d86]
>  6: /usr/lib64/ceph/libceph-common.so.2(+0x16beeb) [0x7f67406e6eeb]
>  7: /usr/bin/ceph-mds(+0x1f16fe) [0x5616e04d46fe]
>  8: /usr/bin/ceph-mds(+0x1f1745) [0x5616e04d4745]
>  9: (EMetaBlob::replay(MDSRank*, LogSegment*, int,
> MDPeerUpdate*)+0x4bdc) [0x5616e0709a4c]
>  10: (EUpdate::replay(MDSRank*)+0x5d) [0x5616e0711afd]
>  11: (MDLog::_replay_thread()+0x75e) [0x5616e06bc02e]
>  12: /usr/bin/ceph-mds(+0x1404b1) [0x5616e04234b1]
>  13: /lib64/libc.so.6(+0x8a21a) [0x7f674009721a]
>  14: clone()
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is
> needed to interpret this.
> << ----------------- LOG EXAMPLE END ----------------- >>
>
>
> But to be honest, out of all those lines, I don't know what to
> provide (all 500+ might be a bit too much)
>
>
> I really need this FS back online, so help will be very much appreciated
>
>
>
>
> _______________________________________________
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io






--
Alexander Patrakov


