Hi Alexander

Thanks man

Forgot to mention: the Ceph version is 18.2.7.

Is this described anywhere - bug tracker / docs?

Also, regarding what you wrote: "The same huge-swap recommendation applies to the
recovery operations."

Should I, if I fail over the MDS in its current state, expect it to chew
through huge amounts of RAM, requiring me to add 1 TB of swap?

BR. Kasper


________________________________
From: Alexander Patrakov <patra...@gmail.com>
Sent: Wednesday, June 18, 2025 09:11
To: Kasper Rasmussen <kasper_steenga...@hotmail.com>
Cc: ceph-users <ceph-users@ceph.io>
Subject: Re: [ceph-users] CephFS scrub resulting in MDS_CACHE_OVERSIZED

Hello Kasper,

This is known. Next time, please add at least 1 TB of swap before the scrub, 
and ignore the warning while the MDS is chewing through all the directories and 
files.

The same huge-swap recommendation applies to the recovery operations.
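
If it helps, a rough sketch of adding the swap file on the MDS host (the path is
just an example, and the filesystem must support swap files, e.g. ext4 or XFS):

    # create and enable a 1 TB swap file
    fallocate -l 1T /var/lib/mds-swapfile
    chmod 600 /var/lib/mds-swapfile
    mkswap /var/lib/mds-swapfile
    swapon /var/lib/mds-swapfile

Remove it again with swapoff and rm once it is no longer needed.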

On Wed, Jun 18, 2025 at 3:01 PM Kasper Rasmussen
<kasper_steenga...@hotmail.com> wrote:
After starting a recursive scrub on a CephFS with a lot of files, the MDS cache
went oversized.

Scrub command: ceph... scrub start / recursive,repair,force

I kept an eye on the MDS memory usage, since I was warned that it might go
crazy, and after 2-3 hours I started getting the warning:

[WRN] MDS_CACHE_OVERSIZED: 1 MDSs report oversized cache
    mds.generic-mds.<host>.asddje(mds.0): MDS cache is too large (63GB/36GB); 
1250394 inodes in use by clients, 28888 stray files
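
(If I read the warning right, the 36GB is the configured mds_cache_memory_limit,
which should be checkable with something like

    ceph config get mds mds_cache_memory_limit

on any node with the admin keyring.)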


I then paused the scrub, resulting in this scrub status:


{
    "status": "PAUSED (22837086 inodes in the stack)",
    "scrubs": {
        "27f0e32a-bc8c-443d-b1f0-534474798ddf": {
            "path": "/",
            "tag": "27f0e32a-bc8c-443d-b1f0-534474798ddf",
            "options": "recursive,repair,force"
        }
    }
}

I expected the cache size to go down again, but it didn't.
After more than 12 hours with no change, I opted to abort the scrub, again
expecting that the inodes in the stack would be offloaded from memory.

The status after the abort command:

{
    "status": "PAUSED (0 inodes in the stack)",
    "scrubs": {}
}

But still no changes to the cache size.

Since the status after the abort command had "PAUSED" in it, I resumed the 
scrub, resulting in status:

{
    "status": "no active scrubs running",
    "scrubs": {}
}

Still no changes to the cache size.
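
For reference, the pause/abort/resume commands I refer to above are the standard
scrub control commands, roughly of this form (actual file system name elided):

    ceph tell mds.<fsname>:0 scrub pause
    ceph tell mds.<fsname>:0 scrub abort
    ceph tell mds.<fsname>:0 scrub resume
    ceph tell mds.<fsname>:0 scrub status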

The log from the MDS at the standard log level was:

debug 2025-06-03T06:48:24.122+0000 7f319065d640 1 mds.generic-mds.<host>.asddje 
asok_command: scrub start {path=/,prefix=scrub 
start,scrubops=[recursive,repair,force]} (starting...)
debug 2025-06-03T06:48:24.122+0000 7f318864d640 0 log_channel(cluster) log 
[INF] : scrub queued for path: /
debug 2025-06-03T06:48:24.122+0000 7f318864d640 0 log_channel(cluster) log 
[INF] : scrub summary: idle+waiting paths [/]
debug 2025-06-03T06:48:24.122+0000 7f318864d640 0 log_channel(cluster) log 
[INF] : scrub summary: active paths [/]
debug 2025-06-03T06:48:24.126+0000 7f3189e50640 1 
mds.0.cache.dir(0x10041e16a55) mismatch between head items and fnode.fragstat! 
printing dentries
debug 2025-06-03T06:48:24.126+0000 7f3189e50640 1 
mds.0.cache.dir(0x10041e16a55) get_num_head_items() = 38; 
fnode.fragstat.nfiles=28 fnode.fragstat.nsubdirs=11
debug 2025-06-03T06:48:24.126+0000 7f3189e50640 1 
mds.0.cache.dir(0x10041e16a55) mismatch between child accounted_rstats and my 
rstats!
debug 2025-06-03T06:48:24.126+0000 7f3189e50640 1 
mds.0.cache.dir(0x10041e16a55) total of child dentries: n(v0 
rc2025-06-03T06:48:11.042059+0000 b1661845634 127=95+32)
debug 2025-06-03T06:48:24.126+0000 7f3189e50640 1 
mds.0.cache.dir(0x10041e16a55) my rstats: n(v544237 
rc2025-06-03T06:48:11.042059+0000 b1661845650 128=96+32)
debug 2025-06-03T06:49:38.689+0000 7f319065d640 1 mds.generic-mds.<host>.asddje 
asok_command: scrub status {prefix=scrub status} (starting...)
debug 2025-06-03T06:51:49.782+0000 7f319065d640 1 mds.generic-mds.<host>.asddje 
asok_command: scrub status {prefix=scrub status} (starting...)
debug 2025-06-03T06:55:39.654+0000 7f319065d640 1 mds.generic-mds.<host>.asddje 
asok_command: scrub status {prefix=scrub status} (starting...)
debug 2025-06-03T07:00:56.205+0000 7f319065d640 1 mds.generic-mds.<host>.asddje 
asok_command: scrub status
..
..
From here on, the log alternates between
- asok_command: scrub status {prefix=scrub status} (starting...)
- Updating MDS map to version xxxxxx from mon.3
until I pause the scrub.

Extracts from the perf dump from the MDS:

"mds": {
..
..
..
"inodes": 23121955,
"inodes_top": 3684,
"inodes_bottom": 1728,
"inodes_pin_tail": 23116543,
"inodes_pinned": 23116691,
"inodes_expired": 39049803601,
"inodes_with_caps": 84593,
..
..

}
..
..
"mds_mem": {
     "ino": 23114378,
     "ino+": 38966647328,
     "ino-": 38943532950,
     "dir": 513065,
     "dir+": 130921896,
     "dir-": 130408831,
     "dn": 23121954,
     "dn+": 39349549680,
     "dn-": 39326427726,
     "cap": 87280,
     "cap+": 6964477825,
     "cap-": 6964390545,
     "rss": 79730620,
     "heap": 223508
},
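
For reference, the counters above were pulled with the usual perf dump command,
something like:

    ceph tell mds.<mds-name> perf dump
    # or, directly on the MDS host:
    ceph daemon mds.<mds-name> perf dump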


I have been reluctant to just fail the MDS to clear the memory, but when I
finally came around to doing so, I got the error:

"Error EPERM: MDS has one of two health warnings which could extend recovery: 
MDS_TRIM or MDS_CACHE_OVERSIZED. MDS failover is not recommended since it might 
cause unexpected file system unavailability. If you wish to proceed, pass 
--yes-i-really-mean-it"
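
For reference, the fail command I attempted was of this form (MDS name elided),
and presumably the override is the same command with the flag appended:

    ceph mds fail <mds-name>
    # override, per the warning above:
    ceph mds fail <mds-name> --yes-i-really-mean-it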

At this moment, the number of stray files reported in the MDS_CACHE_OVERSIZED
warning is up by a factor of 10 (approx. 280000).

Which made me pause.
This seems like a bug. But to be honest, I don't quite know what to expect if I
just execute it with "--yes-i-really-mean-it".
Will the MDS eat a huge amount of RAM during replay? (I've seen this before
during a failover, where the MDS ate almost 200 GB of RAM, even though the cache
was not oversized.)
Any advice on how to proceed?

BR. Kasper


--
Alexander Patrakov
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
