Hi Alexander,

Thanks man. Forgot to mention: the Ceph version is 18.2.7. Is this described anywhere - bug tracker / docs?

Also, when you write "The same huge-swap recommendation applies to the recovery operations" - should I, if I fail over the MDS in its current state, expect that it will chew through huge amounts of RAM, requiring me to add 1 TB of swap?

BR. Kasper
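(For reference, the 1 TB of swap discussed here is plain Linux swap on the MDS host; creating it would look roughly like this - the path and size are illustrative, not taken from the thread.)

    # A swap file is safest when created with dd; per swapon(8), fallocate'd
    # files may be treated as having holes on some filesystems.
    dd if=/dev/zero of=/var/ceph-mds-swap bs=1M count=1048576   # ~1 TiB
    chmod 600 /var/ceph-mds-swap
    mkswap /var/ceph-mds-swap
    swapon /var/ceph-mds-swap
    swapon --show    # confirm it is active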
________________________________
From: Alexander Patrakov <patra...@gmail.com>
Sent: Wednesday, June 18, 2025 09:11
To: Kasper Rasmussen <kasper_steenga...@hotmail.com>
Cc: ceph-users <ceph-users@ceph.io>
Subject: Re: [ceph-users] CephFS scrub resulting in MDS_CACHE_OVERSIZED

Hello Kasper,

This is known. Next time, please add at least 1 TB of swap before the scrub, and ignore the warning while the MDS is chewing through all the directories and files.

The same huge-swap recommendation applies to the recovery operations.

On Wed, Jun 18, 2025 at 3:01 PM Kasper Rasmussen <kasper_steenga...@hotmail.com> wrote:

After starting a recursive scrub on a CephFS with a lot of files, the MDS cache went oversized.

Scrub command: ceph... scrub start / recursive,repair,force

I kept an eye on the MDS memory usage - since I was warned that it might go crazy - and after 2-3 hours I started getting the warning

[WRN] MDS_CACHE_OVERSIZED: 1 MDSs report oversized cache
    mds.generic-mds.<host>.asddje(mds.0): MDS cache is too large (63GB/36GB); 1250394 inodes in use by clients, 28888 stray files

I then paused the scrub, resulting in scrub status

{
    "status": "PAUSED (22837086 inodes in the stack)",
    "scrubs": {
        "27f0e32a-bc8c-443d-b1f0-534474798ddf": {
            "path": "/",
            "tag": "27f0e32a-bc8c-443d-b1f0-534474798ddf",
            "options": "recursive,repair,force"
        }
    }
}

and expected the cache size to go down again - but it didn't.

After 12+ hours with no change, I opted to abort the scrub - again expecting that the inodes in the stack would be offloaded from memory. The status after the abort command:

{
    "status": "PAUSED (0 inodes in the stack)",
    "scrubs": {}
}

But still no change to the cache size.

Since the status after the abort command still said "PAUSED", I resumed the scrub, resulting in status:

{
    "status": "no active scrubs running",
    "scrubs": {}
}

Still no change to the cache size.

The log from the MDS at the standard log level was:

debug 2025-06-03T06:48:24.122+0000 7f319065d640 1 mds.generic-mds.<host>.asddje asok_command: scrub start {path=/,prefix=scrub start,scrubops=[recursive,repair,force]} (starting...)
debug 2025-06-03T06:48:24.122+0000 7f318864d640 0 log_channel(cluster) log [INF] : scrub queued for path: /
debug 2025-06-03T06:48:24.122+0000 7f318864d640 0 log_channel(cluster) log [INF] : scrub summary: idle+waiting paths [/]
debug 2025-06-03T06:48:24.122+0000 7f318864d640 0 log_channel(cluster) log [INF] : scrub summary: active paths [/]
debug 2025-06-03T06:48:24.126+0000 7f3189e50640 1 mds.0.cache.dir(0x10041e16a55) mismatch between head items and fnode.fragstat! printing dentries
debug 2025-06-03T06:48:24.126+0000 7f3189e50640 1 mds.0.cache.dir(0x10041e16a55) get_num_head_items() = 38; fnode.fragstat.nfiles=28 fnode.fragstat.nsubdirs=11
debug 2025-06-03T06:48:24.126+0000 7f3189e50640 1 mds.0.cache.dir(0x10041e16a55) mismatch between child accounted_rstats and my rstats!
debug 2025-06-03T06:48:24.126+0000 7f3189e50640 1 mds.0.cache.dir(0x10041e16a55) total of child dentries: n(v0 rc2025-06-03T06:48:11.042059+0000 b1661845634 127=95+32)
debug 2025-06-03T06:48:24.126+0000 7f3189e50640 1 mds.0.cache.dir(0x10041e16a55) my rstats: n(v544237 rc2025-06-03T06:48:11.042059+0000 b1661845650 128=96+32)
debug 2025-06-03T06:49:38.689+0000 7f319065d640 1 mds.generic-mds.<host>.asddje asok_command: scrub status {prefix=scrub status} (starting...)
debug 2025-06-03T06:51:49.782+0000 7f319065d640 1 mds.generic-mds.<host>.asddje asok_command: scrub status {prefix=scrub status} (starting...)
debug 2025-06-03T06:55:39.654+0000 7f319065d640 1 mds.generic-mds.<host>.asddje asok_command: scrub status {prefix=scrub status} (starting...)
debug 2025-06-03T07:00:56.205+0000 7f319065d640 1 mds.generic-mds.<host>.asddje asok_command: scrub status
..
..

From here on it's either
- asok_command: scrub status {prefix=scrub status} (starting...)
- Updating MDS map to version xxxxxx from mon.3
until I pause the scrub.
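(For completeness: counters like the ones quoted below live on the MDS admin socket and can be read with something along these lines, run where the MDS daemon runs - inside its container on a containerized deployment. The daemon name is the one from the warning above.)

    # Performance counters, including the inode/dentry/cap gauges quoted below
    ceph daemon mds.generic-mds.<host>.asddje perf dump
    # Current cache usage versus the configured mds_cache_memory_limit
    ceph daemon mds.generic-mds.<host>.asddje cache status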
Extracts from the perf dump from the MDS:

"mds": {
    ..
    ..
    ..
    "inodes": 23121955,
    "inodes_top": 3684,
    "inodes_bottom": 1728,
    "inodes_pin_tail": 23116543,
    "inodes_pinned": 23116691,
    "inodes_expired": 39049803601,
    "inodes_with_caps": 84593,
    ..
    ..
}
..
..
"mds_mem": {
    "ino": 23114378,
    "ino+": 38966647328,
    "ino-": 38943532950,
    "dir": 513065,
    "dir+": 130921896,
    "dir-": 130408831,
    "dn": 23121954,
    "dn+": 39349549680,
    "dn-": 39326427726,
    "cap": 87280,
    "cap+": 6964477825,
    "cap-": 6964390545,
    "rss": 79730620,
    "heap": 223508
},

I have been reluctant to just fail the MDS to clear the memory, but when I finally came around to doing so I got the error

"Error EPERM: MDS has one of two health warnings which could extend recovery: MDS_TRIM or MDS_CACHE_OVERSIZED. MDS failover is not recommended since it might cause unexpected file system unavailability. If you wish to proceed, pass --yes-i-really-mean-it"

At this moment the number of strays reported in the MDS_CACHE_OVERSIZED warning is up by a factor of 10 (approx. 280000), which made me pause. This seems like a bug.

But to be honest I don't quite know what to expect if I just execute with "--yes-i-really-mean-it". Will the MDS eat a huge amount of RAM during replay? (I've seen this before during a failover, where the MDS ate almost 200GB of RAM even though the cache was not oversized.)

Any advice on how to proceed?

BR. Kasper

--
Alexander Patrakov
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
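(For anyone weighing the same decision: the failover itself is only the command hinted at by the EPERM message quoted above, ideally after adding swap on the standby as sketched near the top of the thread. The file system name "cephfs" and rank 0 below are placeholders, not taken from the thread.)

    # Fail rank 0 so a standby takes over; the flag is the one named in the
    # EPERM message above.
    ceph mds fail cephfs:0 --yes-i-really-mean-it

    # Then follow replay/rejoin and the new daemon's memory usage.
    ceph fs status
    ceph health detail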