Hi Dan, thanks for your continued help, I really appreciate it. Just to clarify:
> Your MDS is burning CPU (you see that with top) but it's unresponsive.

Did you mean "is *not* burning CPU"? The MDS is idle: *no* CPU load, yet unresponsive. See below for a more detailed description of my observations.

For the investigation you ask for I need to run perf inside the container, but I get the error "No permission to enable cycles:u event." Do you know how I can get perf to work inside the Docker container? I use the official image from quay.io and run it with privileged=true.

New observations: I had to bring up the OSDs on the host and can now confirm that the heartbeat failure is not related to swapping. This time the MDS needed to start swapping much earlier, and it just continues to fill the cache. Slower this time, but it makes good progress.

The MDS loads cache items until it's done (it always stops at about the same number, which decreases slowly with each restart; similarly, the reported stray count goes down a bit every restart) and then serves a few requests. Very shortly after that, the requests/s drop to 0 (dashboard), the heartbeat messages show up in the log, and the MDS stops responding to daemon queries.

From then on, for a longer period of time, it continues to do something until it settles at a certain memory consumption. During this period memory usage keeps growing. After that, CPU usage goes down to almost nothing and memory usage stays constant. From then on nothing seems to happen any more. So there is a short time window where stray items are evaluated, but the MDS gets thrown off rather quickly somehow.

Dan, the host I'm working on is quite busy; it has 80 OSDs. I can stop these again for a while to be able to see what the MDS is doing with the low CPU load. Output like top -H is dominated by the OSDs. Unless you really meant "high CPU load", in which case you might want to ask me to pull out some other information regarding the opposite question: "why is the MDS idle yet unresponsive?"

Thanks for your help so far!
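For reference, here is what I believe gates perf in a container; this is a sketch of the usual knobs, not verified against this exact quay.io image, so please correct me if the cause is something else. The "No permission" error from perf normally comes from the host-wide kernel.perf_event_paranoid sysctl, which a container inherits even when privileged:

```shell
# Sketch only (assumptions: Linux host, root on the host).
# perf_event_open() is gated by the host's kernel.perf_event_paranoid value,
# so a privileged container still sees the host's restriction.

# Inspect the current setting (the container sees the same value as the host):
cat /proc/sys/kernel/perf_event_paranoid

# On the host, relax it temporarily (a value of 1 or lower usually
# allows per-process user-space events like cycles:u):
# sysctl -w kernel.perf_event_paranoid=1

# Alternatively, grant the capability explicitly instead of relying
# on privileged=true alone:
# docker run --cap-add SYS_ADMIN --security-opt seccomp=unconfined ...
# (kernels >= 5.8 also support the finer-grained CAP_PERFMON: --cap-add PERFMON)
```

If you can confirm which of these applies here, I'll rerun perf top -p against the MDS pid.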
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Dan van der Ster <dan.vanders...@clyso.com>
Sent: Saturday, January 11, 2025 3:04 AM
To: Frank Schilder
Cc: Bailey Allison; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: Help needed, ceph fs down due to large stray dir

Hi Frank,

I don't think the blocklists are related. (Those are blocking the previous running instances of the MDS on that host, not the current.)

Your MDS is burning CPU (you see that with top) but it's unresponsive. Any of these will be closer to finding a clue what it's doing:

perf top -p <pidof ceph-mds>
top -H
uwpmp

Cheers, Dan
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io