And another small piece of information:

I needed to do another restart. This time I managed to capture the approximate 
length of the period for which the MDS is up and responsive after loading the 
cache (it reports stats). It's pretty much exactly 60 seconds. This smells like 
a timeout. Is there any MDS/CephFS-related timeout with a 60s default 
somewhere?

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Frank Schilder <fr...@dtu.dk>
Sent: Saturday, January 11, 2025 12:46 PM
To: Dan van der Ster
Cc: Bailey Allison; ceph-users@ceph.io
Subject: [ceph-users] Re: Help needed, ceph fs down due to large stray dir

Hi all,

my hopes are down again. The MDS might look busy, but I'm not sure it's doing 
anything interesting. I now see a lot of these in the log (heartbeat messages 
stripped):

2025-01-11T12:35:50.712+0100 7ff888375700 -1 monclient: _check_auth_rotating 
possible clock skew, rotating keys expired way too early (before 
2025-01-11T11:35:50.713867+0100)
2025-01-11T12:35:51.712+0100 7ff888375700 -1 monclient: _check_auth_rotating 
possible clock skew, rotating keys expired way too early (before 
2025-01-11T11:35:51.714027+0100)
2025-01-11T12:35:52.712+0100 7ff888375700 -1 monclient: _check_auth_rotating 
possible clock skew, rotating keys expired way too early (before 
2025-01-11T11:35:52.714335+0100)
2025-01-11T12:35:53.084+0100 7ff88cb7e700  0 auth: could not find 
secret_id=51092
2025-01-11T12:35:53.084+0100 7ff88cb7e700  0 cephx: verify_authorizer could not 
get service secret for service mds secret_id=51092
2025-01-11T12:35:53.353+0100 7ff88cb7e700  0 auth: could not find 
secret_id=51092
2025-01-11T12:35:53.353+0100 7ff88cb7e700  0 cephx: verify_authorizer could not 
get service secret for service mds secret_id=51092
2025-01-11T12:35:53.536+0100 7ff88cb7e700  0 auth: could not find 
secret_id=51092
2025-01-11T12:35:53.536+0100 7ff88cb7e700  0 cephx: verify_authorizer could not 
get service secret for service mds secret_id=51092
2025-01-11T12:35:53.573+0100 7ff88cb7e700  0 auth: could not find 
secret_id=51092
2025-01-11T12:35:53.573+0100 7ff88cb7e700  0 cephx: verify_authorizer could not 
get service secret for service mds secret_id=51092

Looks like the auth key for the MDS expired and cannot be renewed. Is there a 
grace period for that as well?
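
For what it is worth, the "before" timestamps in the clock-skew warnings above 
sit almost exactly one hour behind the timestamps of the log lines themselves. 
Below is a minimal Python sketch for checking this on a longer log. It assumes 
the wrapped log lines have been re-joined into one line per entry; the names 
SKEW_RE, parse_ts and skew_gaps are made up for illustration and are not part 
of Ceph. Whether the one-hour gap corresponds to a configured ticket lifetime 
is an assumption that would need to be checked against the cluster config.

```python
import re
from datetime import datetime

# Matches the monclient clock-skew warning (one entry per line, un-wrapped).
SKEW_RE = re.compile(
    r"^(?P<logged>\S+)\s+\S+\s+-?\d+\s+monclient: _check_auth_rotating "
    r"possible clock skew, rotating keys expired way too early "
    r"\(before (?P<cutoff>\S+)\)"
)


def parse_ts(text):
    # Timestamps look like 2025-01-11T12:35:50.712+0100
    return datetime.strptime(text, "%Y-%m-%dT%H:%M:%S.%f%z")


def skew_gaps(log_text):
    """Seconds between each warning's log time and its 'before' timestamp."""
    gaps = []
    for line in log_text.splitlines():
        m = SKEW_RE.match(line)
        if m:
            delta = parse_ts(m["logged"]) - parse_ts(m["cutoff"])
            gaps.append(delta.total_seconds())
    return gaps


# The three warnings quoted in this mail, re-joined to one line each.
sample = "\n".join([
    "2025-01-11T12:35:50.712+0100 7ff888375700 -1 monclient: "
    "_check_auth_rotating possible clock skew, rotating keys expired way "
    "too early (before 2025-01-11T11:35:50.713867+0100)",
    "2025-01-11T12:35:51.712+0100 7ff888375700 -1 monclient: "
    "_check_auth_rotating possible clock skew, rotating keys expired way "
    "too early (before 2025-01-11T11:35:51.714027+0100)",
    "2025-01-11T12:35:52.712+0100 7ff888375700 -1 monclient: "
    "_check_auth_rotating possible clock skew, rotating keys expired way "
    "too early (before 2025-01-11T11:35:52.714335+0100)",
])

gaps = skew_gaps(sample)
for gap in gaps:
    print(f"{gap:.3f} s")
```

For the three lines quoted here, every gap comes out within a few 
milliseconds of 3600 seconds.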

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Frank Schilder <fr...@dtu.dk>
Sent: Saturday, January 11, 2025 11:41 AM
To: Dan van der Ster
Cc: Bailey Allison; ceph-users@ceph.io
Subject: [ceph-users] Re: Help needed, ceph fs down due to large stray dir

Hi all,

new update: after sleeping after the final MDS restart, the MDS is doing 
something! It is still unresponsive, but it does show a CPU load of 150-200% 
and I really, really hope that this is the trimming of stray items.

I will try to find out if I can get perf to work inside the container. For now, 
to facilitate troubleshooting, I will add a swap disk to every MDS host just to 
be on the safe side if stuff fails over.

Just to get my hopes back: can someone (from the dev team) let me know if it is 
expected that an MDS is unresponsive during stray evaluation?

Thanks and best regards!
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io