Hi,
We upgraded one Ceph cluster from 17.2.7 to 18.2.0, and since then we have
been having CephFS issues.
For example this morning:
"""
[root@naret-monitor01 ~]# ceph -s
cluster:
id: 63334166-d991-11eb-99de-40a6b72108d0
health: HEALTH_WARN
1 filesystem is degraded
3 clients failing to advance oldest client/flush tid
3 MDSs report slow requests
6 pgs not scrubbed in time
29 daemons have recently crashed
…
"""
The ceph orch, ceph crash, and ceph fs status commands were hanging.
After a "ceph mgr fail" those commands started to respond again.
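For completeness, the workaround was simply failing over the active mgr
(without an argument, "ceph mgr fail" targets the currently active manager):
"""
# fail the active ceph-mgr so that a standby takes over;
# this unblocked the hanging orch/crash/fs status commands for us
ceph mgr fail
"""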
Then I noticed that one MDS had most of the slow operations:
"""
[WRN] MDS_SLOW_REQUEST: 3 MDSs report slow requests
    mds.cephfs.naret-monitor01.nuakzo(mds.0): 18 slow requests are blocked > 30 secs
    mds.cephfs.naret-monitor01.uvevbf(mds.1): 1683 slow requests are blocked > 30 secs
    mds.cephfs.naret-monitor02.exceuo(mds.2): 1 slow requests are blocked > 30 secs
"""
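If it helps with the diagnosis, I can also dump the blocked and in-flight
operations from that MDS (assuming its tell interface still responds), e.g.:
"""
# dump the requests currently blocked on the slow MDS
ceph tell mds.cephfs.naret-monitor01.uvevbf dump_blocked_ops
# dump everything currently in flight, with age and state
ceph tell mds.cephfs.naret-monitor01.uvevbf dump_ops_in_flight
"""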
Then I tried to restart it with
"""
[root@naret-monitor01 ~]# ceph orch daemon restart mds.cephfs.naret-monitor01.uvevbf
Scheduled to restart mds.cephfs.naret-monitor01.uvevbf on host 'naret-monitor01'
"""
After that, the CephFS entered this state:
"""
[root@naret-monitor01 ~]# ceph fs status
cephfs - 198 clients
======
RANK      STATE                      MDS                    ACTIVITY       DNS    INOS   DIRS   CAPS
 0        active         cephfs.naret-monitor01.nuakzo   Reqs:    0 /s   17.2k  16.2k   1892  14.3k
 1        active         cephfs.naret-monitor02.ztdghf   Reqs:    0 /s   28.1k  10.3k    752   6881
 2     clientreplay      cephfs.naret-monitor02.exceuo                   63.0k   6491    541     66
 3        active         cephfs.naret-monitor03.lqppte   Reqs:    0 /s   16.7k  13.4k   8233    990
POOL TYPE USED AVAIL
cephfs.cephfs.meta metadata 5888M 18.5T
cephfs.cephfs.data data 119G 215T
cephfs.cephfs.data.e_4_2 data 2289G 3241T
cephfs.cephfs.data.e_8_3 data 9997G 470T
STANDBY MDS
cephfs.naret-monitor03.eflouf
cephfs.naret-monitor01.uvevbf
MDS version: ceph version 18.2.0 (5dd24139a1eada541a3bc16b6941c5dde975e26d) reef (stable)
"""
The file system is totally unresponsive: we can mount it on client nodes,
but any operation, even a simple ls, hangs.
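Given that rank 2 is stuck in clientreplay and we also have the "clients
failing to advance oldest client/flush tid" warning, I suppose the next step
would be to look at the client sessions on that MDS; something along these
lines (the session id in the evict example is a placeholder):
"""
# list client sessions on the rank stuck in clientreplay,
# looking for clients that do not advance their oldest tid
ceph tell mds.cephfs.naret-monitor02.exceuo session ls
# if a single client turns out to be the culprit, it could be
# evicted (placeholder session id)
ceph tell mds.cephfs.naret-monitor02.exceuo client evict id=<session-id>
"""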
During the night we had a lot of MDS crashes; I can share their content.
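In case it is useful, I would pull them from the crash module (the crash id
below is a placeholder):
"""
# list the crashes collected by the crash module
ceph crash ls
# show full metadata and backtrace for one crash
ceph crash info <crash-id>
"""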
Does anybody have any ideas on how to tackle this problem?
Best,
Giuseppe