Dear Xiubo,

The nearfull pool is an RBD pool and has nothing to do with the file system.
All pools backing the file system have plenty of capacity.
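
For completeness, this is roughly how I checked it (the commands are generic,
our pool names are omitted):

  ceph health detail | grep -i nearfull   # names the pool behind the warning
  ceph df                                 # per-pool usage; only the RBD pool is close to full
  ceph fs status                          # shows which data/metadata pools the file system uses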

I think we have an idea of what kind of workload caused the issue. We had a user
run a computation that reads the same file over and over again. He started 100
such jobs in parallel and our storage servers were at 400% load. I saw 167K
read IOPS on an HDD pool that has an aggregate raw IOPS budget of ca. 11K.
Clearly, most of this was served from RAM.

It is possible that this extreme load situation triggered a race that remained 
undetected/unreported. There is literally no related message in any logs near 
the time the warning started popping up. It shows up out of nowhere.

We asked the user to change his workflow to use a local RAM disk for the input
files. I don't think we can reproduce the problem anytime soon.
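
The pattern we suggested for the job scripts is roughly the following (the
paths and the program name are made up for illustration):

  # stage the input once into node-local RAM disk instead of re-reading it
  # from CephFS in a tight loop
  SCRATCH=/dev/shm/$SLURM_JOB_ID
  mkdir -p "$SCRATCH"
  cp /ceph/project/input.dat "$SCRATCH/"
  ./compute --input "$SCRATCH/input.dat"
  rm -rf "$SCRATCH"   # free the RAM when the job is done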

As for the bug fixes, I'm eagerly waiting for this one and another. Any idea
when they might show up in distro kernels?

Thanks and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Xiubo Li <xiu...@redhat.com>
Sent: Tuesday, August 8, 2023 2:57 AM
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: MDS stuck in rejoin


On 8/7/23 21:54, Frank Schilder wrote:
> Dear Xiubo,
>
> I managed to collect some information. It looks like there is nothing in the 
> dmesg log around the time the client failed to advance its TID. I collected 
> short snippets around the critical time below. I have full logs in case you
> are interested; they are large files, so I will need to do an upload for that.
>
> I also have a dump of "mds session ls" output for clients that showed the 
> same issue later. Unfortunately, no consistent log information for a single 
> incident.
>
> Here is the summary; please let me know if uploading the full package makes
> sense:
>
> - Status:
>
> On July 29, 2023
>
> ceph status/df/pool stats/health detail at 01:05:14:
>    cluster:
>      health: HEALTH_WARN
>              1 pools nearfull
>
> ceph status/df/pool stats/health detail at 01:05:28:
>    cluster:
>      health: HEALTH_WARN
>              1 clients failing to advance oldest client/flush tid
>              1 pools nearfull

Okay, then this could be the root cause.

If the pool is nearfull, it could block flushing the journal logs to the pool,
and then the MDS couldn't send the safe replies to the requests, which would
leave them blocked like this.

Could you fix the pool nearfull issue first and then check whether you see
it again?
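
For example, something like this can show where the nearfull state comes from
and, if it is only the cluster-wide threshold, relieve it temporarily (the
ratio below is just an example value, please pick one that is safe for your
cluster):

  ceph df                              # per-pool usage
  ceph osd df                          # per-OSD usage
  ceph osd set-nearfull-ratio 0.90     # example value only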


> [...]
>
> On July 31, 2023
>
> ceph status/df/pool stats/health detail at 10:36:16:
>    cluster:
>      health: HEALTH_WARN
>              1 clients failing to advance oldest client/flush tid
>              1 pools nearfull
>
>    cluster:
>      health: HEALTH_WARN
>              1 pools nearfull
>
> - client evict command (date, time, command):
>
> 2023-07-31 10:36  ceph tell mds.ceph-11 client evict id=145678457
>
> We have a 1h time difference between the date stamp of the command and the 
> dmesg date stamps. However, there seems to be a weird 10min delay from 
> issuing the evict command until it shows up in dmesg on the client.
>
> - dmesg:
>
> [Fri Jul 28 12:59:14 2023] beegfs: enabling unsafe global rkey
> [Fri Jul 28 12:59:14 2023] beegfs: enabling unsafe global rkey
> [Fri Jul 28 12:59:14 2023] beegfs: enabling unsafe global rkey
> [Fri Jul 28 12:59:14 2023] beegfs: enabling unsafe global rkey
> [Fri Jul 28 12:59:14 2023] beegfs: enabling unsafe global rkey
> [Fri Jul 28 12:59:14 2023] beegfs: enabling unsafe global rkey
> [Fri Jul 28 16:07:47 2023] slurm.epilog.cl (24175): drop_caches: 3
> [Sat Jul 29 18:21:30 2023] libceph: mds2 192.168.32.75:6801 socket closed 
> (con state OPEN)
> [Sat Jul 29 18:21:30 2023] libceph: mds2 192.168.32.75:6801 socket closed 
> (con state OPEN)
> [Sat Jul 29 18:21:30 2023] libceph: mds2 192.168.32.75:6801 socket closed 
> (con state OPEN)
> [Sat Jul 29 18:21:42 2023] ceph: mds2 reconnect start
> [Sat Jul 29 18:21:42 2023] ceph: mds2 reconnect start
> [Sat Jul 29 18:21:43 2023] ceph: mds2 reconnect start
> [Sat Jul 29 18:21:43 2023] ceph: mds2 reconnect success
> [Sat Jul 29 18:21:43 2023] ceph: mds2 reconnect success
> [Sat Jul 29 18:21:43 2023] ceph: mds2 reconnect success
> [Sat Jul 29 18:26:39 2023] ceph: mds2 reconnect start
> [Sat Jul 29 18:26:39 2023] ceph: mds2 reconnect start
> [Sat Jul 29 18:26:39 2023] ceph: mds2 reconnect start
> [Sat Jul 29 18:26:40 2023] ceph: mds2 reconnect success
> [Sat Jul 29 18:26:40 2023] ceph: mds2 reconnect success
> [Sat Jul 29 18:26:40 2023] ceph: mds2 reconnect success
> [Sat Jul 29 18:26:49 2023] ceph: update_snap_trace error -22

This is a known bug and we have fixed it on both the kclient and the ceph side:

https://tracker.ceph.com/issues/61200

https://tracker.ceph.com/issues/61217

Thanks

- Xiubo


> [Sat Jul 29 18:26:49 2023] ceph: mds2 recovery completed
> [Sat Jul 29 18:26:49 2023] ceph: mds2 recovery completed
> [Sat Jul 29 18:26:49 2023] ceph: mds2 recovery completed
> [Sun Jul 30 16:37:55 2023] slurm.epilog.cl (43668): drop_caches: 3
> [Mon Jul 31 01:00:20 2023] slurm.epilog.cl (73347): drop_caches: 3
> [Mon Jul 31 09:46:41 2023] libceph: mds0 192.168.32.81:6801 socket closed 
> (con state OPEN)
> [Mon Jul 31 09:46:41 2023] libceph: mds3 192.168.32.87:6801 socket closed 
> (con state OPEN)
> [Mon Jul 31 09:46:41 2023] libceph: mds7 192.168.32.88:6801 socket closed 
> (con state OPEN)
> [Mon Jul 31 09:46:41 2023] libceph: mds5 192.168.32.78:6801 socket closed 
> (con state OPEN)
> [Mon Jul 31 09:46:41 2023] libceph: mds4 192.168.32.73:6801 socket closed 
> (con state OPEN)
> [Mon Jul 31 09:46:41 2023] libceph: mds1 192.168.32.80:6801 socket closed 
> (con state OPEN)
> [Mon Jul 31 09:46:41 2023] libceph: mds2 192.168.32.75:6801 socket closed 
> (con state OPEN)
> [Mon Jul 31 09:46:41 2023] libceph: mds3 192.168.32.87:6801 connection reset
> [Mon Jul 31 09:46:41 2023] libceph: reset on mds3
> [Mon Jul 31 09:46:41 2023] ceph: mds3 closed our session
> [Mon Jul 31 09:46:41 2023] ceph: mds3 reconnect start
> [Mon Jul 31 09:46:41 2023] libceph: mds7 192.168.32.88:6801 connection reset
> [Mon Jul 31 09:46:41 2023] libceph: reset on mds7
> [Mon Jul 31 09:46:41 2023] ceph: mds7 closed our session
> [Mon Jul 31 09:46:41 2023] ceph: mds7 reconnect start
> [Mon Jul 31 09:46:41 2023] libceph: mds2 192.168.32.75:6801 connection reset
> [Mon Jul 31 09:46:41 2023] libceph: reset on mds2
> [Mon Jul 31 09:46:41 2023] ceph: mds2 closed our session
> [Mon Jul 31 09:46:41 2023] ceph: mds2 reconnect start
> [Mon Jul 31 09:46:41 2023] libceph: mds4 192.168.32.73:6801 connection reset
> [Mon Jul 31 09:46:41 2023] libceph: reset on mds4
> [Mon Jul 31 09:46:41 2023] ceph: mds4 closed our session
> [Mon Jul 31 09:46:41 2023] ceph: mds4 reconnect start
> [Mon Jul 31 09:46:41 2023] libceph: mds1 192.168.32.80:6801 connection reset
> [Mon Jul 31 09:46:41 2023] libceph: reset on mds1
> [Mon Jul 31 09:46:41 2023] ceph: mds1 closed our session
> [Mon Jul 31 09:46:41 2023] ceph: mds1 reconnect start
> [Mon Jul 31 09:46:41 2023] libceph: mds0 192.168.32.81:6801 connection reset
> [Mon Jul 31 09:46:41 2023] libceph: reset on mds0
> [Mon Jul 31 09:46:41 2023] ceph: mds0 closed our session
> [Mon Jul 31 09:46:41 2023] ceph: mds0 reconnect start
> [Mon Jul 31 09:46:41 2023] libceph: mds5 192.168.32.78:6801 connection reset
> [Mon Jul 31 09:46:41 2023] libceph: reset on mds5
> [Mon Jul 31 09:46:41 2023] ceph: mds5 closed our session
> [Mon Jul 31 09:46:41 2023] ceph: mds5 reconnect start
> [Mon Jul 31 09:46:41 2023] ceph: mds2 reconnect denied
> [Mon Jul 31 09:46:41 2023] ceph: mds1 reconnect denied
> [Mon Jul 31 09:46:41 2023] ceph: mds0 reconnect denied
> [Mon Jul 31 09:46:41 2023] ceph: mds5 reconnect denied
> [Mon Jul 31 09:46:41 2023] ceph: mds3 reconnect denied
> [Mon Jul 31 09:46:41 2023] ceph: mds7 reconnect denied
> [Mon Jul 31 09:46:41 2023] ceph: mds4 reconnect denied
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Xiubo Li <xiu...@redhat.com>
> Sent: Monday, July 31, 2023 12:14 PM
> To: Frank Schilder; ceph-users@ceph.io
> Subject: Re: [ceph-users] Re: MDS stuck in rejoin
>
>
> On 7/31/23 16:50, Frank Schilder wrote:
>> Hi Xiubo,
>>
>> It's a kernel client. I actually made a mistake when trying to evict the
>> client and my command didn't do anything. I did another evict, and this time
>> the client IP showed up in the blacklist. Furthermore, the warning
>> disappeared. I asked for the dmesg logs from the client node.
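>>
>> For reference, the commands were roughly of this form (the MDS name, client
>> ID and IP here are placeholders, not the real ones):
>>
>>   ceph tell mds.<name> client evict id=<client-id>
>>   ceph osd blacklist ls | grep <client-ip>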
> Yeah, after the client's sessions are closed the corresponding warning
> should be cleared.
>
> Thanks
>
>> Best regards,
>> =================
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>>
>> ________________________________________
>> From: Xiubo Li <xiu...@redhat.com>
>> Sent: Monday, July 31, 2023 4:12 AM
>> To: Frank Schilder; ceph-users@ceph.io
>> Subject: Re: [ceph-users] Re: MDS stuck in rejoin
>>
>> Hi Frank,
>>
>> On 7/30/23 16:52, Frank Schilder wrote:
>>> Hi Xiubo,
>>>
>>> It happened again. This time, we might be able to pull logs from the client
>>> node. Please take a look at my intermediate action below - thanks!
>>>
>>> I am in a bit of a bind: I'm on holiday with a terrible network
>>> connection and can't do much. My first priority is securing the cluster to
>>> avoid damage caused by this issue. I did an MDS evict by client ID on the
>>> MDS reporting the warning, using the client ID reported in the warning. For
>>> some reason the client got blocked on 2 MDSes after this command, one of
>>> which is an ordinary stand-by daemon. Not sure if this is expected.
>>>
>>> Main question: is this sufficient to prevent any damaging IO on the 
>>> cluster? I'm thinking here about the MDS eating through all its RAM until 
>>> it crashes hard in an irrecoverable state (that was described as a 
>>> consequence in an old post about this warning). If this is a safe state, I 
>>> can keep it in this state until I return from holidays.
>> Yeah, I think so.
>>
>> BTW, are you using the kclient or the user-space clients? I checked both
>> the kclient and libcephfs; it seems buggy in libcephfs, which could cause
>> this issue. But the kclient looks okay so far.
>>
>> Thanks
>>
>> - Xiubo
>>
>>> Best regards,
>>> =================
>>> Frank Schilder
>>> AIT Risø Campus
>>> Bygning 109, rum S14
>>>
>>> ________________________________________
>>> From: Xiubo Li <xiu...@redhat.com>
>>> Sent: Friday, July 28, 2023 11:37 AM
>>> To: Frank Schilder; ceph-users@ceph.io
>>> Subject: Re: [ceph-users] Re: MDS stuck in rejoin
>>>
>>>
>>> On 7/26/23 22:13, Frank Schilder wrote:
>>>> Hi Xiubo.
>>>>
>>>>> ... I am more interested in the kclient side logs. Just want to
>>>>> know why that oldest request got stuck so long.
>>>> I'm afraid I'm a bad admin in this case. I don't have logs from the host 
>>>> any more; I would have needed the output of dmesg and that is gone. In
>>>> case it happens again, I will try to pull the info out.
>>>>
>>>> The tracker https://tracker.ceph.com/issues/22885 sounds a lot more 
>>>> violent than our situation. We had no problems with the MDSes, the cache 
>>>> didn't grow and the relevant one was also not put into read-only mode. It 
>>>> was just this warning showing all the time; health was OK otherwise. I
>>>> think the warning was there for at least 16h before I failed the MDS.
>>>>
>>>> The MDS log contains nothing; this is the only line mentioning this client:
>>>>
>>>> 2023-07-20T00:22:05.518+0200 7fe13df59700  0 log_channel(cluster) log 
>>>> [WRN] : client.145678382 does not advance its oldest_client_tid 
>>>> (16121616), 100000 completed requests recorded in session
>>> Okay, if so it's hard to say or to dig out what happened in the client and
>>> why it didn't advance the tid.
>>>
>>> Thanks
>>>
>>> - Xiubo
>>>
>>>
>>>> Best regards,
>>>> =================
>>>> Frank Schilder
>>>> AIT Risø Campus
>>>> Bygning 109, rum S14
>>>>
>>>>
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
