Dear Xiubo,

I managed to collect some information. It looks like there is nothing in the
dmesg log around the time the client failed to advance its TID. I collected
short snippets around the critical time below. I have the full logs in case
you are interested; they are large files, so I will need to upload them
separately.

I also have a dump of "mds session ls" output for clients that showed the same
issue later. Unfortunately, there is no consistent log information for a single
incident.
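
In case it helps to reproduce the dump, something along these lines should pull
the session info for a single client (the MDS name and client id below are just
the ones from this incident, and the jq filter is optional):

  # dump all sessions on the MDS reporting the warning
  ceph tell mds.ceph-11 session ls > session_ls.json
  # narrow it down to the client from the warning, if jq is available
  jq '.[] | select(.id == 145678457)' session_ls.json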

Here is the summary; please let me know if uploading the full package makes sense:

- Status:

On July 29, 2023

ceph status/df/pool stats/health detail at 01:05:14:
  cluster:
    health: HEALTH_WARN
            1 pools nearfull

ceph status/df/pool stats/health detail at 01:05:28:
  cluster:
    health: HEALTH_WARN
            1 clients failing to advance oldest client/flush tid
            1 pools nearfull

[...]

On July 31, 2023

ceph status/df/pool stats/health detail at 10:36:16:
  cluster:
    health: HEALTH_WARN
            1 clients failing to advance oldest client/flush tid
            1 pools nearfull

  cluster:
    health: HEALTH_WARN
            1 pools nearfull

- client evict command (date, time, command):

2023-07-31 10:36  ceph tell mds.ceph-11 client evict id=145678457

We have a 1h time difference between the timestamp of the command and the
dmesg timestamps; this is probably just a timezone offset. However, there also
seems to be a strange 10-minute delay between issuing the evict command and it
showing up in dmesg on the client.
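
To rule out such offsets and to confirm the eviction took effect, a rough check
along these lines should do (assuming journald on the client node; on older
releases the blocklist command is still called "blacklist"):

  # confirm the evicted client's address is on the OSD blocklist
  ceph osd blocklist ls
  # compare against kernel-side ceph messages in UTC on the client node
  journalctl -k --utc | grep -i ceph | tail -n 50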

- dmesg:

[Fri Jul 28 12:59:14 2023] beegfs: enabling unsafe global rkey
[Fri Jul 28 12:59:14 2023] beegfs: enabling unsafe global rkey
[Fri Jul 28 12:59:14 2023] beegfs: enabling unsafe global rkey
[Fri Jul 28 12:59:14 2023] beegfs: enabling unsafe global rkey
[Fri Jul 28 12:59:14 2023] beegfs: enabling unsafe global rkey
[Fri Jul 28 12:59:14 2023] beegfs: enabling unsafe global rkey
[Fri Jul 28 16:07:47 2023] slurm.epilog.cl (24175): drop_caches: 3
[Sat Jul 29 18:21:30 2023] libceph: mds2 192.168.32.75:6801 socket closed (con state OPEN)
[Sat Jul 29 18:21:30 2023] libceph: mds2 192.168.32.75:6801 socket closed (con state OPEN)
[Sat Jul 29 18:21:30 2023] libceph: mds2 192.168.32.75:6801 socket closed (con state OPEN)
[Sat Jul 29 18:21:42 2023] ceph: mds2 reconnect start
[Sat Jul 29 18:21:42 2023] ceph: mds2 reconnect start
[Sat Jul 29 18:21:43 2023] ceph: mds2 reconnect start
[Sat Jul 29 18:21:43 2023] ceph: mds2 reconnect success
[Sat Jul 29 18:21:43 2023] ceph: mds2 reconnect success
[Sat Jul 29 18:21:43 2023] ceph: mds2 reconnect success
[Sat Jul 29 18:26:39 2023] ceph: mds2 reconnect start
[Sat Jul 29 18:26:39 2023] ceph: mds2 reconnect start
[Sat Jul 29 18:26:39 2023] ceph: mds2 reconnect start
[Sat Jul 29 18:26:40 2023] ceph: mds2 reconnect success
[Sat Jul 29 18:26:40 2023] ceph: mds2 reconnect success
[Sat Jul 29 18:26:40 2023] ceph: mds2 reconnect success
[Sat Jul 29 18:26:49 2023] ceph: update_snap_trace error -22
[Sat Jul 29 18:26:49 2023] ceph: mds2 recovery completed
[Sat Jul 29 18:26:49 2023] ceph: mds2 recovery completed
[Sat Jul 29 18:26:49 2023] ceph: mds2 recovery completed
[Sun Jul 30 16:37:55 2023] slurm.epilog.cl (43668): drop_caches: 3
[Mon Jul 31 01:00:20 2023] slurm.epilog.cl (73347): drop_caches: 3
[Mon Jul 31 09:46:41 2023] libceph: mds0 192.168.32.81:6801 socket closed (con state OPEN)
[Mon Jul 31 09:46:41 2023] libceph: mds3 192.168.32.87:6801 socket closed (con state OPEN)
[Mon Jul 31 09:46:41 2023] libceph: mds7 192.168.32.88:6801 socket closed (con state OPEN)
[Mon Jul 31 09:46:41 2023] libceph: mds5 192.168.32.78:6801 socket closed (con state OPEN)
[Mon Jul 31 09:46:41 2023] libceph: mds4 192.168.32.73:6801 socket closed (con state OPEN)
[Mon Jul 31 09:46:41 2023] libceph: mds1 192.168.32.80:6801 socket closed (con state OPEN)
[Mon Jul 31 09:46:41 2023] libceph: mds2 192.168.32.75:6801 socket closed (con state OPEN)
[Mon Jul 31 09:46:41 2023] libceph: mds3 192.168.32.87:6801 connection reset
[Mon Jul 31 09:46:41 2023] libceph: reset on mds3
[Mon Jul 31 09:46:41 2023] ceph: mds3 closed our session
[Mon Jul 31 09:46:41 2023] ceph: mds3 reconnect start
[Mon Jul 31 09:46:41 2023] libceph: mds7 192.168.32.88:6801 connection reset
[Mon Jul 31 09:46:41 2023] libceph: reset on mds7
[Mon Jul 31 09:46:41 2023] ceph: mds7 closed our session
[Mon Jul 31 09:46:41 2023] ceph: mds7 reconnect start
[Mon Jul 31 09:46:41 2023] libceph: mds2 192.168.32.75:6801 connection reset
[Mon Jul 31 09:46:41 2023] libceph: reset on mds2
[Mon Jul 31 09:46:41 2023] ceph: mds2 closed our session
[Mon Jul 31 09:46:41 2023] ceph: mds2 reconnect start
[Mon Jul 31 09:46:41 2023] libceph: mds4 192.168.32.73:6801 connection reset
[Mon Jul 31 09:46:41 2023] libceph: reset on mds4
[Mon Jul 31 09:46:41 2023] ceph: mds4 closed our session
[Mon Jul 31 09:46:41 2023] ceph: mds4 reconnect start
[Mon Jul 31 09:46:41 2023] libceph: mds1 192.168.32.80:6801 connection reset
[Mon Jul 31 09:46:41 2023] libceph: reset on mds1
[Mon Jul 31 09:46:41 2023] ceph: mds1 closed our session
[Mon Jul 31 09:46:41 2023] ceph: mds1 reconnect start
[Mon Jul 31 09:46:41 2023] libceph: mds0 192.168.32.81:6801 connection reset
[Mon Jul 31 09:46:41 2023] libceph: reset on mds0
[Mon Jul 31 09:46:41 2023] ceph: mds0 closed our session
[Mon Jul 31 09:46:41 2023] ceph: mds0 reconnect start
[Mon Jul 31 09:46:41 2023] libceph: mds5 192.168.32.78:6801 connection reset
[Mon Jul 31 09:46:41 2023] libceph: reset on mds5
[Mon Jul 31 09:46:41 2023] ceph: mds5 closed our session
[Mon Jul 31 09:46:41 2023] ceph: mds5 reconnect start
[Mon Jul 31 09:46:41 2023] ceph: mds2 reconnect denied
[Mon Jul 31 09:46:41 2023] ceph: mds1 reconnect denied
[Mon Jul 31 09:46:41 2023] ceph: mds0 reconnect denied
[Mon Jul 31 09:46:41 2023] ceph: mds5 reconnect denied
[Mon Jul 31 09:46:41 2023] ceph: mds3 reconnect denied
[Mon Jul 31 09:46:41 2023] ceph: mds7 reconnect denied
[Mon Jul 31 09:46:41 2023] ceph: mds4 reconnect denied
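
If this happens again, I can also try to capture the kernel client's in-flight
MDS requests and session state before evicting; something along these lines on
the client node might show which request got stuck (requires root and debugfs
mounted, and depending on the kernel version not all of these files may exist):

  # pending MDS requests of the kernel client
  cat /sys/kernel/debug/ceph/*/mdsc
  # MDS session state and caps held by the client
  cat /sys/kernel/debug/ceph/*/mds_sessions
  cat /sys/kernel/debug/ceph/*/caps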

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Xiubo Li <xiu...@redhat.com>
Sent: Monday, July 31, 2023 12:14 PM
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: MDS stuck in rejoin


On 7/31/23 16:50, Frank Schilder wrote:
> Hi Xiubo,
>
> it's a kernel client. I actually made a mistake when trying to evict the 
> client and my command didn't do anything. I did another evict and this time 
> the client IP showed up in the blacklist. Furthermore, the warning 
> disappeared. I asked for the dmesg logs from the client node.

Yeah, after the client's sessions are closed the corresponding warning
should be cleared.

Thanks

> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Xiubo Li <xiu...@redhat.com>
> Sent: Monday, July 31, 2023 4:12 AM
> To: Frank Schilder; ceph-users@ceph.io
> Subject: Re: [ceph-users] Re: MDS stuck in rejoin
>
> Hi Frank,
>
> On 7/30/23 16:52, Frank Schilder wrote:
>> Hi Xiubo,
>>
>> it happened again. This time, we might be able to pull logs from the client 
>> node. Please take a look at my intermediate action below - thanks!
>>
>> I am in a bit of a calamity: I'm on holidays with a terrible network 
>> connection and can't do much. My first priority is securing the cluster to 
>> avoid damage caused by this issue. I did an MDS evict by client ID on the 
>> MDS reporting the warning, using the client ID reported in the warning. For 
>> some reason the client got blocked on 2 MDSes after this command, one of 
>> which is an ordinary stand-by daemon. Not sure if this is expected.
>>
>> Main question: is this sufficient to prevent any damaging IO on the cluster? 
>> I'm thinking here about the MDS eating through all its RAM until it crashes 
>> hard in an irrecoverable state (that was described as a consequence in an 
>> old post about this warning). If this is a safe state, I can keep it in this 
>> state until I return from holidays.
> Yeah, I think so.
>
> BTW, are you using the kclients or user space clients? I checked both
> kclient and libcephfs; it seems buggy in libcephfs, which could cause
> this issue. But for kclient it's okay till now.
>
> Thanks
>
> - Xiubo
>
>> Best regards,
>> =================
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>>
>> ________________________________________
>> From: Xiubo Li <xiu...@redhat.com>
>> Sent: Friday, July 28, 2023 11:37 AM
>> To: Frank Schilder; ceph-users@ceph.io
>> Subject: Re: [ceph-users] Re: MDS stuck in rejoin
>>
>>
>> On 7/26/23 22:13, Frank Schilder wrote:
>>> Hi Xiubo.
>>>
>>>> ... I am more interested in the kclient side logs. Just want to
>>>> know why that oldest request got stuck so long.
>>> I'm afraid I'm a bad admin in this case. I don't have logs from the host 
>>> any more, I would have needed the output of dmesg and this is gone. In case 
>>> it happens again I will try to pull the info out.
>>>
>>> The tracker https://tracker.ceph.com/issues/22885 sounds a lot more violent 
>>> than our situation. We had no problems with the MDSes, the cache didn't 
>>> grow and the relevant one was also not put into read-only mode. It was just 
>>> this warning showing all the time, health was OK otherwise. I think the 
>>> warning was there for at least 16h before I failed the MDS.
>>>
>>> The MDS log contains nothing, this is the only line mentioning this client:
>>>
>>> 2023-07-20T00:22:05.518+0200 7fe13df59700  0 log_channel(cluster) log [WRN] 
>>> : client.145678382 does not advance its oldest_client_tid (16121616), 
>>> 100000 completed requests recorded in session
>> Okay, if so it's hard to say or dig out what happened in the client and
>> why it didn't advance the tid.
>>
>> Thanks
>>
>> - Xiubo
>>
>>
>>> Best regards,
>>> =================
>>> Frank Schilder
>>> AIT Risø Campus
>>> Bygning 109, rum S14
>>>
>>>
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
