[ceph-users] Re: Issue with CephFS (mds stuck in clientreplay status) since upgrade to 18.2.0.

2023-11-27 Thread Dan van der Ster
Hi Giuseppe,

There are likely one or two clients whose ops are blocking the reconnect/replay.
If you increase debug_mds you can perhaps find the guilty client and
disconnect it / block it from mounting.
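
Roughly along these lines (a sketch, not a recipe: adjust the mds name and the
client id to what you actually see; I am using your stuck rank as the example):

  # temporarily raise MDS debugging
  ceph config set mds debug_mds 10
  # look at the ops that are stuck and at the client sessions on that rank
  ceph tell mds.cephfs.naret-monitor02.exceuo dump_ops_in_flight
  ceph tell mds.cephfs.naret-monitor02.exceuo session ls
  # evict the offending client; eviction also blocklists it so it cannot
  # immediately remount and re-submit the same op
  ceph tell mds.cephfs.naret-monitor02.exceuo client evict id=<client-id>
  # drop the debug override again afterwards
  ceph config rm mds debug_mds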

Or, for a more disruptive recovery, you can try the "Deny all reconnect
to clients" option:
https://docs.ceph.com/en/reef/cephfs/troubleshooting/#avoiding-recovery-roadblocks
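
It is just an MDS config flag, roughly like this (assuming your release has it;
remember to unset it once the MDS is active again):

  # refuse reconnects from existing sessions while the MDS goes through recovery
  ceph config set mds mds_deny_all_reconnect true
  # ... wait for the MDS to become active, then
  ceph config set mds mds_deny_all_reconnect false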

Hope that helps,

Dan


On Mon, Nov 27, 2023 at 1:08 AM Lo Re Giuseppe  wrote:
>
> Hi,
> We have upgraded one ceph cluster from 17.2.7 to 18.2.0. Since then we are 
> having CephFS issues.
> For example this morning:
> “””
> [root@naret-monitor01 ~]# ceph -s
>   cluster:
> id: 63334166-d991-11eb-99de-40a6b72108d0
> health: HEALTH_WARN
> 1 filesystem is degraded
> 3 clients failing to advance oldest client/flush tid
> 3 MDSs report slow requests
> 6 pgs not scrubbed in time
> 29 daemons have recently crashed
> …
> “””
>
> The ceph orch, ceph crash and ceph fs status commands were hanging.
>
> After a “ceph mgr fail” those commands started to respond.
> Then I noticed that one mds had most of the slow operations:
>
> “””
> [WRN] MDS_SLOW_REQUEST: 3 MDSs report slow requests
>     mds.cephfs.naret-monitor01.nuakzo(mds.0): 18 slow requests are blocked > 30 secs
>     mds.cephfs.naret-monitor01.uvevbf(mds.1): 1683 slow requests are blocked > 30 secs
>     mds.cephfs.naret-monitor02.exceuo(mds.2): 1 slow requests are blocked > 30 secs
> “””
>
> Then I tried to restart it with
>
> “””
> [root@naret-monitor01 ~]# ceph orch daemon restart 
> mds.cephfs.naret-monitor01.uvevbf
> Scheduled to restart mds.cephfs.naret-monitor01.uvevbf on host 
> 'naret-monitor01'
> “””
>
> After that, the cephfs entered this situation:
> “””
> [root@naret-monitor01 ~]# ceph fs status
> cephfs - 198 clients
> ==
> RANK  STATE         MDS                            ACTIVITY     DNS    INOS   DIRS  CAPS
>  0    active        cephfs.naret-monitor01.nuakzo  Reqs: 0 /s   17.2k  16.2k  1892  14.3k
>  1    active        cephfs.naret-monitor02.ztdghf  Reqs: 0 /s   28.1k  10.3k   752  6881
>  2    clientreplay  cephfs.naret-monitor02.exceuo               63.0k   6491   541  66
>  3    active        cephfs.naret-monitor03.lqppte  Reqs: 0 /s   16.7k  13.4k  8233  990
>           POOL             TYPE      USED   AVAIL
>   cephfs.cephfs.meta       metadata  5888M  18.5T
>   cephfs.cephfs.data       data       119G   215T
>  cephfs.cephfs.data.e_4_2  data      2289G  3241T
>  cephfs.cephfs.data.e_8_3  data      9997G   470T
>  STANDBY MDS
> cephfs.naret-monitor03.eflouf
> cephfs.naret-monitor01.uvevbf
> MDS version: ceph version 18.2.0 (5dd24139a1eada541a3bc16b6941c5dde975e26d) 
> reef (stable)
> “””
>
> The file system is totally unresponsive (we can mount it on client nodes,
> but any operation, even a simple ls, hangs).
>
> During the night we had a lot of mds crashes; I can share their content.
>
> Does anybody have an idea on how to tackle this problem?
>
> Best,
>
> Giuseppe
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Issue with CephFS (mds stuck in clientreplay status) since upgrade to 18.2.0.

2023-11-27 Thread Lo Re Giuseppe
Hi David,
Thanks a lot for your reply.
Yes, we have heavy load from clients on the same subtree. We have multiple MDSs
that were set up in the hope of distributing the load among them, but this is
not really happening: in moments of high load we see most of the load on one
MDS.
We don't use pinning today.
We have placed more than one MDS on the same servers, since the memory
consumption allowed it.
Right now I have scaled the MDS service down to a single active MDS, as I have
learnt that using multiple active MDSs may have been a risky move. Still, it
worked reasonably well until we upgraded from 17.2.5 to 17.2.7 and now 18.2.0.
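
Concretely, by "scaled down" I mean something of this kind (a sketch of the
commands rather than the exact ones I ran; our filesystem is named cephfs):

  # go back to a single active MDS rank; the extra ranks stop and become standbys
  ceph fs set cephfs max_mds 1
  # check the result
  ceph fs status cephfs
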
I'll look more into the client stability as per your suggestion.

Best,

Giuseppe

On 27.11.2023, 10:41, "David C." <david.cas...@aevoo.fr> wrote:


Hi Giuseppe,


Could it be that you have clients heavily loading the MDS with concurrent
access on the same trees?


Perhaps also look at the stability of all your clients (even if there are
many) [dmesg -T, ...].


How are your 4 active MDSs configured (pinning?)?


Probably unrelated, but is it normal for 2 MDSs to be on the same host
"monitor-02"?


Regards,


*David CASIER*

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Issue with CephFS (mds stuck in clientreplay status) since upgrade to 18.2.0.

2023-11-27 Thread David C.
Hi Giuseppe,

Could it be that you have clients heavily loading the MDS with concurrent
access on the same trees?
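
If you want to check, something of this kind can show which clients/ops are
hitting a given MDS (a sketch; cephfs-top needs the mgr "stats" module, and I
am reusing one of your MDS names only as an example):

  # per-client performance metrics
  ceph mgr module enable stats
  cephfs-top
  # or dump the ops currently stuck on the busy rank
  ceph tell mds.cephfs.naret-monitor01.uvevbf dump_ops_in_flight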

Perhaps also look at the stability of all your clients (even if there are
many) [dmesg -T, ...].

How are your 4 active MDSs configured (pinning?)?
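
For reference, pinning a subtree to a rank is just an extended attribute on
the directory; the path below is only an example:

  # pin this directory and everything under it to MDS rank 1
  setfattr -n ceph.dir.pin -v 1 /mnt/cephfs/projectA
  # a value of -1 removes the pin again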

Probably unrelated, but is it normal for 2 MDSs to be on the same host
"monitor-02"?



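If you want one MDS daemon per host instead, a cephadm placement of this kind
should do it (a sketch, hostnames taken from your ceph fs status output):

  ceph orch apply mds cephfs --placement="3 naret-monitor01 naret-monitor02 naret-monitor03"
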
Regards,

*David CASIER*

On Mon, Nov 27, 2023 at 10:09, Lo Re Giuseppe  wrote:

> Hi,
> We have upgraded one ceph cluster from 17.2.7 to 18.2.0. Since then we are
> having CephFS issues.
> For example this morning:
> “””
> [root@naret-monitor01 ~]# ceph -s
>   cluster:
> id: 63334166-d991-11eb-99de-40a6b72108d0
> health: HEALTH_WARN
> 1 filesystem is degraded
> 3 clients failing to advance oldest client/flush tid
> 3 MDSs report slow requests
> 6 pgs not scrubbed in time
> 29 daemons have recently crashed
> …
> “””
>
> The ceph orch, ceph crash and ceph fs status commands were hanging.
>
> After a “ceph mgr fail” those commands started to respond.
> Then I noticed that one mds had most of the slow operations:
>
> “””
> [WRN] MDS_SLOW_REQUEST: 3 MDSs report slow requests
>     mds.cephfs.naret-monitor01.nuakzo(mds.0): 18 slow requests are blocked > 30 secs
>     mds.cephfs.naret-monitor01.uvevbf(mds.1): 1683 slow requests are blocked > 30 secs
>     mds.cephfs.naret-monitor02.exceuo(mds.2): 1 slow requests are blocked > 30 secs
> “””
>
> Then I tried to restart it with
>
> “””
> [root@naret-monitor01 ~]# ceph orch daemon restart
> mds.cephfs.naret-monitor01.uvevbf
> Scheduled to restart mds.cephfs.naret-monitor01.uvevbf on host
> 'naret-monitor01'
> “””
>
> After that, the cephfs entered this situation:
> “””
> [root@naret-monitor01 ~]# ceph fs status
> cephfs - 198 clients
> ==
> RANK  STATE         MDS                            ACTIVITY     DNS    INOS   DIRS  CAPS
>  0    active        cephfs.naret-monitor01.nuakzo  Reqs: 0 /s   17.2k  16.2k  1892  14.3k
>  1    active        cephfs.naret-monitor02.ztdghf  Reqs: 0 /s   28.1k  10.3k   752  6881
>  2    clientreplay  cephfs.naret-monitor02.exceuo               63.0k   6491   541  66
>  3    active        cephfs.naret-monitor03.lqppte  Reqs: 0 /s   16.7k  13.4k  8233  990
>           POOL             TYPE      USED   AVAIL
>   cephfs.cephfs.meta       metadata  5888M  18.5T
>   cephfs.cephfs.data       data       119G   215T
>  cephfs.cephfs.data.e_4_2  data      2289G  3241T
>  cephfs.cephfs.data.e_8_3  data      9997G   470T
>  STANDBY MDS
> cephfs.naret-monitor03.eflouf
> cephfs.naret-monitor01.uvevbf
> MDS version: ceph version 18.2.0
> (5dd24139a1eada541a3bc16b6941c5dde975e26d) reef (stable)
> “””
>
> The file system is totally unresponsive (we can mount it on client nodes,
> but any operation, even a simple ls, hangs).
>
> During the night we had a lot of mds crashes; I can share their content.
>
> Does anybody have an idea on how to tackle this problem?
>
> Best,
>
> Giuseppe
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io