Hi Xiubo, great! I'm not sure if we observed this particular issue, but we did have the "does not advance its oldest_client_tid" message show up in a context that might be related.
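In case it helps others who find this thread, the rough sequence we use to track down and clear the warning looks like the commands below. The MDS name and client id are placeholders and I'm quoting from memory, so please double-check the syntax against your release:

  # health detail names the client(s) behind the
  # "failing to advance oldest client/flush tid" warning
  ceph health detail

  # inspect the sessions on the MDS that reports the warning; the dump
  # should include a completed_requests counter for the stuck client
  ceph tell mds.<mds-name> session ls

  # last resort: evict the client and confirm it landed on the blocklist
  # (the client node usually needs a remount afterwards)
  ceph tell mds.<mds-name> client evict id=<client-id>
  ceph osd blocklist ls    # "ceph osd blacklist ls" on older releases

If I understand the MDS behaviour correctly, the warning fires once a session accumulates more completed requests than mds_max_completed_requests (100000 by default), which would match the "100000 completed requests recorded in session" log line quoted further down in this thread.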
If this fix is not too large, it would be really great if it could be
included in the last Pacific point release.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Xiubo Li <xiu...@redhat.com>
Sent: Wednesday, November 8, 2023 1:38 AM
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: MDS stuck in rejoin

Hi Frank,

Recently I found a new possible case that could cause this, please see
https://github.com/ceph/ceph/pull/54259. This is just a ceph-side fix; after
this we need to fix it in the kclient too, which hasn't been done yet.

Thanks

- Xiubo

On 8/8/23 17:44, Frank Schilder wrote:
> Dear Xiubo,
>
> the nearfull pool is an RBD pool and has nothing to do with the file system.
> All pools for the file system have plenty of capacity.
>
> I think we have an idea what kind of workload caused the issue. We had a user
> run a computation that reads the same file over and over again. He started
> 100 such jobs in parallel and our storage servers were at 400% load. I saw
> 167K read IOP/s on an HDD pool that has an aggregated raw IOP/s budget of ca.
> 11K. Clearly, most of this was served from RAM.
>
> It is possible that this extreme load situation triggered a race that
> remained undetected/unreported. There is literally no related message in any
> logs near the time the warning started popping up. It shows up out of nowhere.
>
> We asked the user to change his workflow to use a local RAM disk for the
> input files. I don't think we can reproduce the problem anytime soon.
>
> About the bug fixes, I'm eagerly waiting for this and another one. Any idea
> when they might show up in distro kernels?
>
> Thanks and best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Xiubo Li <xiu...@redhat.com>
> Sent: Tuesday, August 8, 2023 2:57 AM
> To: Frank Schilder; ceph-users@ceph.io
> Subject: Re: [ceph-users] Re: MDS stuck in rejoin
>
>
> On 8/7/23 21:54, Frank Schilder wrote:
>> Dear Xiubo,
>>
>> I managed to collect some information. It looks like there is nothing in the
>> dmesg log around the time the client failed to advance its TID. I collected
>> short snippets around the critical time below. I have full logs in case you
>> are interested. They are large files, I will need to do an upload for that.
>>
>> I also have a dump of "mds session ls" output for clients that showed the
>> same issue later. Unfortunately, no consistent log information for a single
>> incident.
>>
>> Here is the summary, please let me know if uploading the full package makes
>> sense:
>>
>> - Status:
>>
>> On July 29, 2023
>>
>> ceph status/df/pool stats/health detail at 01:05:14:
>>   cluster:
>>     health: HEALTH_WARN
>>             1 pools nearfull
>>
>> ceph status/df/pool stats/health detail at 01:05:28:
>>   cluster:
>>     health: HEALTH_WARN
>>             1 clients failing to advance oldest client/flush tid
>>             1 pools nearfull
> Okay, then this could be the root cause.
>
> If the pool is nearfull, it could block flushing the journal logs to the pool,
> and then the MDS couldn't send the safe replies to the requests and would
> block them like this.
>
> Could you fix the pool nearfull issue first and then check whether you see
> it again?
>
>
>> [...]
>>
>> On July 31, 2023
>>
>> ceph status/df/pool stats/health detail at 10:36:16:
>>   cluster:
>>     health: HEALTH_WARN
>>             1 clients failing to advance oldest client/flush tid
>>             1 pools nearfull
>>
>>   cluster:
>>     health: HEALTH_WARN
>>             1 pools nearfull
>>
>> - client evict command (date, time, command):
>>
>> 2023-07-31 10:36 ceph tell mds.ceph-11 client evict id=145678457
>>
>> We have a 1h time difference between the date stamp of the command and the
>> dmesg date stamps. However, there seems to be a weird 10min delay from
>> issuing the evict command until it shows up in dmesg on the client.
>>
>> - dmesg:
>>
>> [Fri Jul 28 12:59:14 2023] beegfs: enabling unsafe global rkey
>> [Fri Jul 28 12:59:14 2023] beegfs: enabling unsafe global rkey
>> [Fri Jul 28 12:59:14 2023] beegfs: enabling unsafe global rkey
>> [Fri Jul 28 12:59:14 2023] beegfs: enabling unsafe global rkey
>> [Fri Jul 28 12:59:14 2023] beegfs: enabling unsafe global rkey
>> [Fri Jul 28 12:59:14 2023] beegfs: enabling unsafe global rkey
>> [Fri Jul 28 16:07:47 2023] slurm.epilog.cl (24175): drop_caches: 3
>> [Sat Jul 29 18:21:30 2023] libceph: mds2 192.168.32.75:6801 socket closed (con state OPEN)
>> [Sat Jul 29 18:21:30 2023] libceph: mds2 192.168.32.75:6801 socket closed (con state OPEN)
>> [Sat Jul 29 18:21:30 2023] libceph: mds2 192.168.32.75:6801 socket closed (con state OPEN)
>> [Sat Jul 29 18:21:42 2023] ceph: mds2 reconnect start
>> [Sat Jul 29 18:21:42 2023] ceph: mds2 reconnect start
>> [Sat Jul 29 18:21:43 2023] ceph: mds2 reconnect start
>> [Sat Jul 29 18:21:43 2023] ceph: mds2 reconnect success
>> [Sat Jul 29 18:21:43 2023] ceph: mds2 reconnect success
>> [Sat Jul 29 18:21:43 2023] ceph: mds2 reconnect success
>> [Sat Jul 29 18:26:39 2023] ceph: mds2 reconnect start
>> [Sat Jul 29 18:26:39 2023] ceph: mds2 reconnect start
>> [Sat Jul 29 18:26:39 2023] ceph: mds2 reconnect start
>> [Sat Jul 29 18:26:40 2023] ceph: mds2 reconnect success
>> [Sat Jul 29 18:26:40 2023] ceph: mds2 reconnect success
>> [Sat Jul 29 18:26:40 2023] ceph: mds2 reconnect success
>> [Sat Jul 29 18:26:49 2023] ceph: update_snap_trace error -22
> This is a known bug and we have fixed it on both the kclient and ceph side:
>
> https://tracker.ceph.com/issues/61200
>
> https://tracker.ceph.com/issues/61217
>
> Thanks
>
> - Xiubo
>
>
>> [Sat Jul 29 18:26:49 2023] ceph: mds2 recovery completed
>> [Sat Jul 29 18:26:49 2023] ceph: mds2 recovery completed
>> [Sat Jul 29 18:26:49 2023] ceph: mds2 recovery completed
>> [Sun Jul 30 16:37:55 2023] slurm.epilog.cl (43668): drop_caches: 3
>> [Mon Jul 31 01:00:20 2023] slurm.epilog.cl (73347): drop_caches: 3
>> [Mon Jul 31 09:46:41 2023] libceph: mds0 192.168.32.81:6801 socket closed (con state OPEN)
>> [Mon Jul 31 09:46:41 2023] libceph: mds3 192.168.32.87:6801 socket closed (con state OPEN)
>> [Mon Jul 31 09:46:41 2023] libceph: mds7 192.168.32.88:6801 socket closed (con state OPEN)
>> [Mon Jul 31 09:46:41 2023] libceph: mds5 192.168.32.78:6801 socket closed (con state OPEN)
>> [Mon Jul 31 09:46:41 2023] libceph: mds4 192.168.32.73:6801 socket closed (con state OPEN)
>> [Mon Jul 31 09:46:41 2023] libceph: mds1 192.168.32.80:6801 socket closed (con state OPEN)
>> [Mon Jul 31 09:46:41 2023] libceph: mds2 192.168.32.75:6801 socket closed (con state OPEN)
>> [Mon Jul 31 09:46:41 2023] libceph: mds3 192.168.32.87:6801 connection reset
>> [Mon Jul 31 09:46:41 2023] libceph: reset on mds3
>> [Mon Jul 31 09:46:41 2023] ceph: mds3 closed our session
>> [Mon Jul 31 09:46:41 2023] ceph: mds3 reconnect start
>> [Mon Jul 31 09:46:41 2023] libceph: mds7 192.168.32.88:6801 connection reset
>> [Mon Jul 31 09:46:41 2023] libceph: reset on mds7
>> [Mon Jul 31 09:46:41 2023] ceph: mds7 closed our session
>> [Mon Jul 31 09:46:41 2023] ceph: mds7 reconnect start
>> [Mon Jul 31 09:46:41 2023] libceph: mds2 192.168.32.75:6801 connection reset
>> [Mon Jul 31 09:46:41 2023] libceph: reset on mds2
>> [Mon Jul 31 09:46:41 2023] ceph: mds2 closed our session
>> [Mon Jul 31 09:46:41 2023] ceph: mds2 reconnect start
>> [Mon Jul 31 09:46:41 2023] libceph: mds4 192.168.32.73:6801 connection reset
>> [Mon Jul 31 09:46:41 2023] libceph: reset on mds4
>> [Mon Jul 31 09:46:41 2023] ceph: mds4 closed our session
>> [Mon Jul 31 09:46:41 2023] ceph: mds4 reconnect start
>> [Mon Jul 31 09:46:41 2023] libceph: mds1 192.168.32.80:6801 connection reset
>> [Mon Jul 31 09:46:41 2023] libceph: reset on mds1
>> [Mon Jul 31 09:46:41 2023] ceph: mds1 closed our session
>> [Mon Jul 31 09:46:41 2023] ceph: mds1 reconnect start
>> [Mon Jul 31 09:46:41 2023] libceph: mds0 192.168.32.81:6801 connection reset
>> [Mon Jul 31 09:46:41 2023] libceph: reset on mds0
>> [Mon Jul 31 09:46:41 2023] ceph: mds0 closed our session
>> [Mon Jul 31 09:46:41 2023] ceph: mds0 reconnect start
>> [Mon Jul 31 09:46:41 2023] libceph: mds5 192.168.32.78:6801 connection reset
>> [Mon Jul 31 09:46:41 2023] libceph: reset on mds5
>> [Mon Jul 31 09:46:41 2023] ceph: mds5 closed our session
>> [Mon Jul 31 09:46:41 2023] ceph: mds5 reconnect start
>> [Mon Jul 31 09:46:41 2023] ceph: mds2 reconnect denied
>> [Mon Jul 31 09:46:41 2023] ceph: mds1 reconnect denied
>> [Mon Jul 31 09:46:41 2023] ceph: mds0 reconnect denied
>> [Mon Jul 31 09:46:41 2023] ceph: mds5 reconnect denied
>> [Mon Jul 31 09:46:41 2023] ceph: mds3 reconnect denied
>> [Mon Jul 31 09:46:41 2023] ceph: mds7 reconnect denied
>> [Mon Jul 31 09:46:41 2023] ceph: mds4 reconnect denied
>>
>> Best regards,
>> =================
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>>
>> ________________________________________
>> From: Xiubo Li <xiu...@redhat.com>
>> Sent: Monday, July 31, 2023 12:14 PM
>> To: Frank Schilder; ceph-users@ceph.io
>> Subject: Re: [ceph-users] Re: MDS stuck in rejoin
>>
>>
>> On 7/31/23 16:50, Frank Schilder wrote:
>>> Hi Xiubo,
>>>
>>> it's a kernel client. I actually made a mistake when trying to evict the
>>> client and my command didn't do anything. I did another evict and this time
>>> the client IP showed up in the blacklist. Furthermore, the warning
>>> disappeared. I asked for the dmesg logs from the client node.
>> Yeah, after the client's sessions are closed the corresponding warning
>> should be cleared.
>>
>> Thanks
>>
>>> Best regards,
>>> =================
>>> Frank Schilder
>>> AIT Risø Campus
>>> Bygning 109, rum S14
>>>
>>> ________________________________________
>>> From: Xiubo Li <xiu...@redhat.com>
>>> Sent: Monday, July 31, 2023 4:12 AM
>>> To: Frank Schilder; ceph-users@ceph.io
>>> Subject: Re: [ceph-users] Re: MDS stuck in rejoin
>>>
>>> Hi Frank,
>>>
>>> On 7/30/23 16:52, Frank Schilder wrote:
>>>> Hi Xiubo,
>>>>
>>>> it happened again. This time, we might be able to pull logs from the
>>>> client node. Please take a look at my intermediate action below - thanks!
>>>>
>>>> I am in a bit of a calamity, I'm on holidays with a terrible network
>>>> connection and can't do much. My first priority is securing the cluster to
>>>> avoid damage caused by this issue.
>>>> I did an MDS evict by client ID on the MDS reporting the warning with the
>>>> client ID reported in the warning. For some reason the client got blocked
>>>> on 2 MDSes after this command; one of these is an ordinary stand-by
>>>> daemon. Not sure if this is expected.
>>>>
>>>> Main question: is this sufficient to prevent any damaging IO on the
>>>> cluster? I'm thinking here about the MDS eating through all its RAM until
>>>> it crashes hard in an irrecoverable state (that was described as a
>>>> consequence in an old post about this warning). If this is a safe state, I
>>>> can keep it in this state until I return from holidays.
>>> Yeah, I think so.
>>>
>>> BTW, are you using the kclients or user space clients? I checked both the
>>> kclient and libcephfs, and it seems buggy in libcephfs, which could cause
>>> this issue. But for the kclient it looks okay so far.
>>>
>>> Thanks
>>>
>>> - Xiubo
>>>
>>>> Best regards,
>>>> =================
>>>> Frank Schilder
>>>> AIT Risø Campus
>>>> Bygning 109, rum S14
>>>>
>>>> ________________________________________
>>>> From: Xiubo Li <xiu...@redhat.com>
>>>> Sent: Friday, July 28, 2023 11:37 AM
>>>> To: Frank Schilder; ceph-users@ceph.io
>>>> Subject: Re: [ceph-users] Re: MDS stuck in rejoin
>>>>
>>>>
>>>> On 7/26/23 22:13, Frank Schilder wrote:
>>>>> Hi Xiubo.
>>>>>
>>>>>> ... I am more interested in the kclient side logs. Just want to
>>>>>> know why that oldest request got stuck so long.
>>>>> I'm afraid I'm a bad admin in this case. I don't have logs from the host
>>>>> any more, I would have needed the output of dmesg and this is gone. In
>>>>> case it happens again I will try to pull the info out.
>>>>>
>>>>> The tracker https://tracker.ceph.com/issues/22885 sounds a lot more
>>>>> violent than our situation. We had no problems with the MDSes, the cache
>>>>> didn't grow and the relevant one was also not put into read-only mode. It
>>>>> was just this warning showing all the time, health was OK otherwise. I
>>>>> think the warning was there for at least 16h before I failed the MDS.
>>>>>
>>>>> The MDS log contains nothing, this is the only line mentioning this
>>>>> client:
>>>>>
>>>>> 2023-07-20T00:22:05.518+0200 7fe13df59700 0 log_channel(cluster) log
>>>>> [WRN] : client.145678382 does not advance its oldest_client_tid
>>>>> (16121616), 100000 completed requests recorded in session
>>>> Okay, if so it's hard to say and dig out what happened in the client and
>>>> why it didn't advance the tid.
>>>>
>>>> Thanks
>>>>
>>>> - Xiubo
>>>>
>>>>
>>>>> Best regards,
>>>>> =================
>>>>> Frank Schilder
>>>>> AIT Risø Campus
>>>>> Bygning 109, rum S14
>>>>>
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io