Hi, you need to fix the stale PGs first. Do they belong to one of the CephFS pools? If so, your journal issue most likely stems from those stale PGs.
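
To check, something along these lines should show which pool the stale PGs belong to (run from the Rook toolbox; the deployment name rook-ceph-tools is the Rook default, adjust if yours differs). The number before the dot in a PG id is the pool id:

    kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- bash
    # list stuck/stale PGs; PG ids look like <pool_id>.<hash>
    ceph pg dump_stuck stale
    # map pool ids to pool names
    ceph osd lspools

If the pool ids of the stale PGs match the rookfs metadata or data pool, the MDS slow-request and journal symptoms are almost certainly a consequence of those PGs, and no metadata repair will help until they are active again.
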
Anth All via ceph-users <[email protected]> wrote on Mon, Feb 2, 2026 at 19:58:
> Hi all,
> I’m running a Ceph cluster managed by Rook on Kubernetes, and my CephFS
> metadata/journal appears to be in a bad state. I’d like to get advice
> before I attempt any destructive metadata repair operations.
> Below is a summary of the situation.
>
> *Environment*
>
> - Orchestrator: Rook/Ceph on Kubernetes (namespace: rook-ceph)
> - Ceph version: 18.2.2 (Reef, stable) for mon/mgr/osd/mds
> - Filesystem:
>   - Name: rookfs
>   - One metadata pool
>   - One data pool
>
> *Cluster health*
> ceph status shows:
>
> - Health: HEALTH_WARN
> - Warnings:
>   - 1 MDS reports slow metadata IOs
>   - 1 MDS reports slow requests
>   - Reduced data availability: some PGs in stale state
>   - A large number of daemons have recently crashed
>
> MDS section reports:
>
> - 1/1 MDS daemons up, 1 hot standby
>
> *CephFS state and MDS status*
> ceph fs dump for rookfs shows:
>
> - Filesystem rookfs is marked damaged.
> - max_mds = 1.
> - in set is empty.
> - up set is {0=<mds_id>}.
> - Flags mention allow_standby_replay.
>
> So, the filesystem is marked damaged in the fsmap, while one MDS is still
> up:active on rank 0.
> ceph tell mds.* status confirms that the MDS for rookfs is:
>
> - state: up:active
> - fs_name: rookfs
> - whoami: 0
>
> I ran the following commands on the CephFS journal:
>
> 1. Journal reset:
>
>        cephfs-journal-tool --rank=rookfs:0 journal reset
>
>    This completed and indicated a new journal start offset.
>
> 2. Journal inspection:
>
>        cephfs-journal-tool --rank=rookfs:0 journal inspect
>
>    Output:
>    - Bad entry start ptr (...) at certain offsets
>    - Overall journal integrity: DAMAGED
>    - Corrupt regions reported, including a range up to ffffffffffffffff
>
> So, even after the reset, cephfs-journal-tool reports the journal as
> DAMAGED with corrupt regions.
> Listing the metadata pool shows at least the mds_snaptable object, so the
> metadata pool is not empty.
>
> *Current behaviour*
>
> - ceph fs status is sometimes very slow or appears to hang.
> - Ceph health reports:
>   - “MDSs report slow metadata IOs”
>   - “MDSs report slow requests”
>   - Stale PGs in the cluster
> - The filesystem rookfs is marked damaged in ceph fs dump, but the
>   MDS is still up:active on rank 0.
>
> Any guidance or best practices for handling this kind of journal corruption
> and damaged filesystem in a Rook/Kubernetes setup would be greatly
> appreciated, including precautions you would strongly recommend before
> running the heavy-repair commands.
>
> Best regards,
> Anthony
> _______________________________________________
> ceph-users mailing list -- [email protected]
> To unsubscribe send an email to [email protected]
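
Regarding precautions before any further heavy-repair commands: take offline copies of the journal and the metadata pool first, roughly like this (the metadata pool name rookfs-metadata is an assumption based on Rook defaults, substitute your actual pool name):

    # keep a copy of the journal objects before touching them again
    cephfs-journal-tool --rank=rookfs:0 journal export /tmp/rookfs-journal.bin
    # serialize the whole metadata pool to a file as a raw backup
    rados -p rookfs-metadata export /tmp/rookfs-metadata-pool.bin

That way you can at least re-import the raw objects if a repair attempt makes things worse. But again, sort out the stale PGs before anything else.
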
