Dear Xiubo,

I managed to collect some information. It looks like there is nothing in the dmesg log around the time the client failed to advance its TID. I collected short snippets around the critical time below. I have the full logs in case you are interested; these are large files, so I will need to arrange an upload for them.
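In case you want the same information from other nodes, this is roughly how the snippets below were collected (a sketch only; the jq filter is just illustrative, the mds name ceph-11 and client id 145678457 are the ones from this incident, and newer releases call the blacklist "blocklist"):

  # dmesg on the client with human-readable timestamps, with some context
  # around the ceph/libceph messages:
  dmesg -T | grep -E -B2 -A2 'ceph|libceph'

  # session table of the MDS that reports the warning; pick out the
  # offending client by the id shown in "ceph health detail":
  ceph tell mds.ceph-11 session ls | jq '.[] | select(.id == 145678457)'

  # after the evict, check that the client address actually ended up on
  # the OSD blacklist:
  ceph osd blacklist ls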
I also have a dump of "mds session ls" output for clients that showed the same issue later. Unfortunately, there is no consistent log information for a single incident. Here is the summary; please let me know if uploading the full package makes sense:

- Status:

On July 29, 2023

ceph status/df/pool stats/health detail at 01:05:14:
  cluster:
    health: HEALTH_WARN
            1 pools nearfull

ceph status/df/pool stats/health detail at 01:05:28:
  cluster:
    health: HEALTH_WARN
            1 clients failing to advance oldest client/flush tid
            1 pools nearfull

[...]

On July 31, 2023

ceph status/df/pool stats/health detail at 10:36:16:
  cluster:
    health: HEALTH_WARN
            1 clients failing to advance oldest client/flush tid
            1 pools nearfull

  cluster:
    health: HEALTH_WARN
            1 pools nearfull

- client evict command (date, time, command):

2023-07-31 10:36  ceph tell mds.ceph-11 client evict id=145678457

We have a 1h time difference between the date stamp of the command and the dmesg date stamps. However, there seems to be a weird 10 min delay from issuing the evict command until it shows up in dmesg on the client.

- dmesg:

[Fri Jul 28 12:59:14 2023] beegfs: enabling unsafe global rkey
[Fri Jul 28 12:59:14 2023] beegfs: enabling unsafe global rkey
[Fri Jul 28 12:59:14 2023] beegfs: enabling unsafe global rkey
[Fri Jul 28 12:59:14 2023] beegfs: enabling unsafe global rkey
[Fri Jul 28 12:59:14 2023] beegfs: enabling unsafe global rkey
[Fri Jul 28 12:59:14 2023] beegfs: enabling unsafe global rkey
[Fri Jul 28 16:07:47 2023] slurm.epilog.cl (24175): drop_caches: 3
[Sat Jul 29 18:21:30 2023] libceph: mds2 192.168.32.75:6801 socket closed (con state OPEN)
[Sat Jul 29 18:21:30 2023] libceph: mds2 192.168.32.75:6801 socket closed (con state OPEN)
[Sat Jul 29 18:21:30 2023] libceph: mds2 192.168.32.75:6801 socket closed (con state OPEN)
[Sat Jul 29 18:21:42 2023] ceph: mds2 reconnect start
[Sat Jul 29 18:21:42 2023] ceph: mds2 reconnect start
[Sat Jul 29 18:21:43 2023] ceph: mds2 reconnect start
[Sat Jul 29 18:21:43 2023] ceph: mds2 reconnect success
[Sat Jul 29 18:21:43 2023] ceph: mds2 reconnect success
[Sat Jul 29 18:21:43 2023] ceph: mds2 reconnect success
[Sat Jul 29 18:26:39 2023] ceph: mds2 reconnect start
[Sat Jul 29 18:26:39 2023] ceph: mds2 reconnect start
[Sat Jul 29 18:26:39 2023] ceph: mds2 reconnect start
[Sat Jul 29 18:26:40 2023] ceph: mds2 reconnect success
[Sat Jul 29 18:26:40 2023] ceph: mds2 reconnect success
[Sat Jul 29 18:26:40 2023] ceph: mds2 reconnect success
[Sat Jul 29 18:26:49 2023] ceph: update_snap_trace error -22
[Sat Jul 29 18:26:49 2023] ceph: mds2 recovery completed
[Sat Jul 29 18:26:49 2023] ceph: mds2 recovery completed
[Sat Jul 29 18:26:49 2023] ceph: mds2 recovery completed
[Sun Jul 30 16:37:55 2023] slurm.epilog.cl (43668): drop_caches: 3
[Mon Jul 31 01:00:20 2023] slurm.epilog.cl (73347): drop_caches: 3
[Mon Jul 31 09:46:41 2023] libceph: mds0 192.168.32.81:6801 socket closed (con state OPEN)
[Mon Jul 31 09:46:41 2023] libceph: mds3 192.168.32.87:6801 socket closed (con state OPEN)
[Mon Jul 31 09:46:41 2023] libceph: mds7 192.168.32.88:6801 socket closed (con state OPEN)
[Mon Jul 31 09:46:41 2023] libceph: mds5 192.168.32.78:6801 socket closed (con state OPEN)
[Mon Jul 31 09:46:41 2023] libceph: mds4 192.168.32.73:6801 socket closed (con state OPEN)
[Mon Jul 31 09:46:41 2023] libceph: mds1 192.168.32.80:6801 socket closed (con state OPEN)
[Mon Jul 31 09:46:41 2023] libceph: mds2 192.168.32.75:6801 socket closed (con state OPEN)
[Mon Jul 31 09:46:41 2023] libceph: mds3 192.168.32.87:6801 connection reset
[Mon Jul 31 09:46:41 2023] libceph: reset on mds3
[Mon Jul 31 09:46:41 2023] ceph: mds3 closed our session
[Mon Jul 31 09:46:41 2023] ceph: mds3 reconnect start
[Mon Jul 31 09:46:41 2023] libceph: mds7 192.168.32.88:6801 connection reset
[Mon Jul 31 09:46:41 2023] libceph: reset on mds7
[Mon Jul 31 09:46:41 2023] ceph: mds7 closed our session
[Mon Jul 31 09:46:41 2023] ceph: mds7 reconnect start
[Mon Jul 31 09:46:41 2023] libceph: mds2 192.168.32.75:6801 connection reset
[Mon Jul 31 09:46:41 2023] libceph: reset on mds2
[Mon Jul 31 09:46:41 2023] ceph: mds2 closed our session
[Mon Jul 31 09:46:41 2023] ceph: mds2 reconnect start
[Mon Jul 31 09:46:41 2023] libceph: mds4 192.168.32.73:6801 connection reset
[Mon Jul 31 09:46:41 2023] libceph: reset on mds4
[Mon Jul 31 09:46:41 2023] ceph: mds4 closed our session
[Mon Jul 31 09:46:41 2023] ceph: mds4 reconnect start
[Mon Jul 31 09:46:41 2023] libceph: mds1 192.168.32.80:6801 connection reset
[Mon Jul 31 09:46:41 2023] libceph: reset on mds1
[Mon Jul 31 09:46:41 2023] ceph: mds1 closed our session
[Mon Jul 31 09:46:41 2023] ceph: mds1 reconnect start
[Mon Jul 31 09:46:41 2023] libceph: mds0 192.168.32.81:6801 connection reset
[Mon Jul 31 09:46:41 2023] libceph: reset on mds0
[Mon Jul 31 09:46:41 2023] ceph: mds0 closed our session
[Mon Jul 31 09:46:41 2023] ceph: mds0 reconnect start
[Mon Jul 31 09:46:41 2023] libceph: mds5 192.168.32.78:6801 connection reset
[Mon Jul 31 09:46:41 2023] libceph: reset on mds5
[Mon Jul 31 09:46:41 2023] ceph: mds5 closed our session
[Mon Jul 31 09:46:41 2023] ceph: mds5 reconnect start
[Mon Jul 31 09:46:41 2023] ceph: mds2 reconnect denied
[Mon Jul 31 09:46:41 2023] ceph: mds1 reconnect denied
[Mon Jul 31 09:46:41 2023] ceph: mds0 reconnect denied
[Mon Jul 31 09:46:41 2023] ceph: mds5 reconnect denied
[Mon Jul 31 09:46:41 2023] ceph: mds3 reconnect denied
[Mon Jul 31 09:46:41 2023] ceph: mds7 reconnect denied
[Mon Jul 31 09:46:41 2023] ceph: mds4 reconnect denied

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Xiubo Li <xiu...@redhat.com>
Sent: Monday, July 31, 2023 12:14 PM
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: MDS stuck in rejoin

On 7/31/23 16:50, Frank Schilder wrote:
> Hi Xiubo,
>
> its a kernel client. I actually made a mistake when trying to evict the
> client and my command didn't do anything. I did another evict and this time
> the client IP showed up in the blacklist. Furthermore, the warning
> disappeared. I asked for the dmesg logs from the client node.

Yeah, after the client's sessions are closed the corresponding warning
should be cleared.

Thanks

> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Xiubo Li <xiu...@redhat.com>
> Sent: Monday, July 31, 2023 4:12 AM
> To: Frank Schilder; ceph-users@ceph.io
> Subject: Re: [ceph-users] Re: MDS stuck in rejoin
>
> Hi Frank,
>
> On 7/30/23 16:52, Frank Schilder wrote:
>> Hi Xiubo,
>>
>> it happened again. This time, we might be able to pull logs from the client
>> node. Please take a look at my intermediate action below - thanks!
>>
>> I am in a bit of a calamity, I'm on holidays with terrible network
>> connection and can't do much. My first priority is securing the cluster to
>> avoid damage caused by this issue. I did an MDS evict by client ID on the
>> MDS reporting the warning with the client ID reported in the warning.
>> For some reason the client got blocked on 2 MDSes after this command, one
>> of these is an ordinary stand-by daemon. Not sure if this is expected.
>>
>> Main question: is this sufficient to prevent any damaging IO on the cluster?
>> I'm thinking here about the MDS eating through all its RAM until it crashes
>> hard in an irrecoverable state (that was described as a consequence in an
>> old post about this warning). If this is a safe state, I can keep it in this
>> state until I return from holidays.
> Yeah, I think so.
>
> BTW, are u using the kclients or user space clients ? I checked both
> kclient and libcephfs, it seems buggy in libcephfs, which could cause
> this issue. But for kclient it's okay till now.
>
> Thanks
>
> - Xiubo
>
>> Best regards,
>> =================
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>>
>> ________________________________________
>> From: Xiubo Li <xiu...@redhat.com>
>> Sent: Friday, July 28, 2023 11:37 AM
>> To: Frank Schilder; ceph-users@ceph.io
>> Subject: Re: [ceph-users] Re: MDS stuck in rejoin
>>
>>
>> On 7/26/23 22:13, Frank Schilder wrote:
>>> Hi Xiubo.
>>>
>>>> ... I am more interested in the kclient side logs. Just want to
>>>> know why that oldest request got stuck so long.
>>> I'm afraid I'm a bad admin in this case. I don't have logs from the host
>>> any more, I would have needed the output of dmesg and this is gone. In case
>>> it happens again I will try to pull the info out.
>>>
>>> The tracker https://tracker.ceph.com/issues/22885 sounds a lot more violent
>>> than our situation. We had no problems with the MDSes, the cache didn't
>>> grow and the relevant one was also not put into read-only mode. It was just
>>> this warning showing all the time, health was OK otherwise. I think the
>>> warning was there for at least 16h before I failed the MDS.
>>>
>>> The MDS log contains nothing, this is the only line mentioning this client:
>>>
>>> 2023-07-20T00:22:05.518+0200 7fe13df59700  0 log_channel(cluster) log [WRN]
>>> : client.145678382 does not advance its oldest_client_tid (16121616),
>>> 100000 completed requests recorded in session
>> Okay, if so it's hard to say and dig out what has happened in client why
>> it didn't advance the tid.
>>
>> Thanks
>>
>> - Xiubo
>>
>>
>>> Best regards,
>>> =================
>>> Frank Schilder
>>> AIT Risø Campus
>>> Bygning 109, rum S14
>>>
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io