Hi Xiubo,

It happened again. This time we might be able to pull logs from the client 
node. Please take a look at my interim actions below - thanks!

I'm in a bit of a bind: I'm on holiday with a terrible network connection and 
can't do much. My first priority is to secure the cluster and prevent any 
damage from this issue. I evicted the client by ID on the MDS that reported 
the warning, using the client ID from the warning message. For some reason 
the client ended up blocked on two MDSes after this command, and one of them 
is an ordinary stand-by daemon. I'm not sure whether this is expected.
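
For reference, this is roughly the command I used (from memory, so the MDS 
name and client ID below are placeholders, not the real values):

  # Evict the client session by ID on the MDS that logged the warning;
  # <mds-name> and <client-id> stand in for the actual values:
  ceph tell mds.<mds-name> client evict id=<client-id>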

Main question: is this sufficient to prevent any damaging IO on the cluster? 
I'm thinking here of the MDS eating through all its RAM until it crashes hard 
in an unrecoverable state (which was described as a consequence in an old 
post about this warning). If this is a safe state, I can leave the cluster 
like this until I return from holidays.
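
Whenever the connection allows, I plan to run roughly the following checks to 
confirm the client stays out (again, names are placeholders):

  # Confirm the evicted client no longer shows up in the MDS session list:
  ceph tell mds.<mds-name> client ls

  # Eviction should also have added the client to the OSD blocklist
  # (older releases call this the blacklist):
  ceph osd blocklist ls

  # General MDS / cluster health:
  ceph fs status
  ceph health detail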

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Xiubo Li <xiu...@redhat.com>
Sent: Friday, July 28, 2023 11:37 AM
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: MDS stuck in rejoin


On 7/26/23 22:13, Frank Schilder wrote:
> Hi Xiubo.
>
>> ... I am more interested in the kclient side logs. Just want to
>> know why that oldest request got stuck so long.
> I'm afraid I'm a bad admin in this case. I don't have logs from the host any 
> more; I would have needed the output of dmesg, and that is gone. In case it 
> happens again I will try to pull the info out.
>
> The tracker https://tracker.ceph.com/issues/22885 sounds a lot more violent 
> than our situation. We had no problems with the MDSes, the cache didn't grow, 
> and the relevant MDS was also not put into read-only mode. It was just this 
> warning showing all the time; health was OK otherwise. I think the warning 
> had been there for at least 16h before I failed the MDS.
>
> The MDS log contains nothing, this is the only line mentioning this client:
>
> 2023-07-20T00:22:05.518+0200 7fe13df59700  0 log_channel(cluster) log [WRN] : 
> client.145678382 does not advance its oldest_client_tid (16121616), 100000 
> completed requests recorded in session

Okay, if so it's hard to say or dig out what happened on the client side and
why it didn't advance the tid.

Thanks

- Xiubo


> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
>
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io