Hi Dan,

thanks for your continued help, I really appreciate it. Just to clarify:

> Your MDS is burning CPU (you see that with top) but it's unresponsive.

Did you mean "is *not* burning CPU"? The MDS is idle - *no* CPU load, yet 
unresponsive. See below for a more detailed description of observations.

To run the investigation you asked for I need perf inside the container, but I 
get the error "No permission to enable cycles:u event." Do you know how I can 
get perf to work inside the Docker container? I use the official image from 
quay.io and run it with privileged=true.
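In case it matters, here is what I checked on the host so far (my assumption: 
the container shares the host kernel, so the kernel.perf_event_paranoid sysctl 
would apply inside it too; the CAP_PERFMON hint is just a guess on my side):

```shell
# Check the host-wide paranoid level; a value of 2 or higher blocks
# unprivileged hardware events such as cycles:u, and this sysctl is
# shared between host and container (same kernel).
cat /proc/sys/kernel/perf_event_paranoid

# Untested things I could try next:
#   sudo sysctl -w kernel.perf_event_paranoid=1   # relax on the host
#   docker run --cap-add PERFMON ...              # kernels >= 5.8
```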

New observations: I had to bring the OSDs on the host back up and can now 
confirm that the heartbeat failure is not related to swapping. This time the 
MDS needed to start swapping much earlier, and it just continues to fill the 
cache. Slower this time, but it makes good progress.

The MDS loads cache items until it's done (it always stops at about the same 
number, which decreases slowly with every restart; similarly, the reported 
stray count goes down a bit every restart) and then serves a few requests. Very 
shortly after that, the requests/s drops to 0 (dashboard), the heartbeat 
messages show up in the log and the MDS stops responding to daemon queries.

From then on, for a longer period of time it continues to do something until 
it settles at a certain memory consumption. During this period memory usage is 
growing. After that period, CPU usage goes down to almost nothing and memory 
usage stays constant. From then on nothing seems to happen any more.

So there is a short time window where stray items are evaluated, but the MDS 
gets thrown off rather quickly somehow.

Dan, the host I'm working on is quite busy; it has 80 OSDs. I can stop these 
again for a while to be able to see what the MDS is doing under the low CPU 
load. Output like top -H is dominated by the OSDs.
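As a workaround for the noisy output, I could also restrict the per-thread 
view to the MDS process only (the pidof lookup and the process name ceph-mds 
are assumptions on my side; the wchan column might show what the idle threads 
are sleeping on):

```shell
# Per-thread CPU usage for just the ceph-mds process, sorted by CPU,
# so the 80 OSDs don't flood the output. Process name is an assumption.
ps -T -p "$(pidof ceph-mds)" -o tid,pcpu,stat,wchan,comm --sort=-pcpu | head -20
```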

Unless you really meant "high CPU load", in which case you might want to ask me 
to pull out some other information regarding the opposite question "why is the 
MDS idle yet unresponsive".

Thanks for your help so far! Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Dan van der Ster <dan.vanders...@clyso.com>
Sent: Saturday, January 11, 2025 3:04 AM
To: Frank Schilder
Cc: Bailey Allison; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: Help needed, ceph fs down due to large stray dir

Hi Frank,

I don't think the blocklists are related. (Those are blocking the
previous running instances of the mds on that host, not the current).

Your MDS is burning CPU (you see that with top) but it's unresponsive.

Any of these will get us closer to a clue about what it's doing:

perf top -p <pidof ceph-mds>
top -H
uwpmp

Cheers, Dan
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
