Hmm. Glad it's working now, at least.

On Mon, Dec 13, 2021 at 9:10 AM Frank Schilder <[email protected]> wrote:
>
> Dear Gregory,
>
> thanks for your fast response. The situation started worsening shortly after 
> I sent my e-mail and I had to take action. More operations got stuck in the 
> active MDS, leading to a failure of journal trimming. I more or less went 
> through all combinations of what you wrote. In the end, the last option was 
> to fail rank 0. This also crashed rank 1, but then things started to clear 
> up. There were more blocked ops warnings every now and then, but these were 
> temporary. After a bit of watching the 2 MDSes struggle, the "stopping" one 
> finally became standby and everything is up and clean right now.
>
> The reason for this exercise is a (rare?) race condition between dirfrag 
> subtree exports (multi-active MDS) and certain client operations (rename, 
> create), which leads to a complete standstill of the file system (the fs 
> journal size eventually reaches the pool quota). Its seems to be caused by 
> different clients (servers) accessing the same sub-dir on the file system 
> concurrently. It is only one workload where we observe this, no useful 
> information so far.

Please let us know or file a ticket in the tracker if you get enough
details for us to spend some time debugging that. That's definitely
supposed to work fine, of course. :/
-Greg

>
> Thanks again and best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Gregory Farnum <[email protected]>
> Sent: 13 December 2021 17:39:55
> To: Frank Schilder
> Cc: ceph-users
> Subject: Re: [ceph-users] MDS stuck in stopping state
>
> This looks awkward — just from the ops, it seems mds.1 is trying to
> move some stray items (presumably snapshots of since-deleted files,
> from what you said?) into mds0's stray directory, and then mds.0 tries
> to get auth pins from mds.1 but that fails for some reason which isn't
> apparent from the dump.
>
> Somebody might be able to get farther along by tracing logs of mds.1
> rebooting, but my guess is that rebooting both servers will clear it
> up. You might also try increasing max_mds to 2 and seeing if that jogs
> things loose; I'm not sure what would be less disruptive for you.
> -Greg
>
>
> On Mon, Dec 13, 2021 at 5:37 AM Frank Schilder <[email protected]> wrote:
> >
> > Hi all, I needed to reduce the number of active MDS daemons from 4 to 1. 
> > Unfortunately, the last MDS to stop is stuck in stopping state. Ceph 
> > version is mimic 13.2.10. Each MDS has 3 blocked OPS, that seem to be 
> > related to deleted snapshots; more info below. I failed the MDS in stopping 
> > state already several times in the hope that the operations get flushed 
> > out. Before failing rank 0, I would appreciate if someone could look at 
> > this issue and advise on how to proceed safely.
> >
> > Some diagnostic info:
> >
> > # ceph fs status
> > con-fs2 - 1659 clients
> > =======
> > +------+----------+---------+---------------+-------+-------+
> > | Rank |  State   |   MDS   |    Activity   |  dns  |  inos |
> > +------+----------+---------+---------------+-------+-------+
> > |  0   |  active  | ceph-08 | Reqs:  176 /s | 2844k | 2775k |
> > |  1   | stopping | ceph-17 |               | 27.7k |   59  |
> > +------+----------+---------+---------------+-------+-------+
> > +---------------------+----------+-------+-------+
> > |         Pool        |   type   |  used | avail |
> > +---------------------+----------+-------+-------+
> > |    con-fs2-meta1    | metadata |  555M | 1261G |
> > |    con-fs2-meta2    |   data   |    0  | 1261G |
> > |     con-fs2-data    |   data   | 1321T | 5756T |
> > | con-fs2-data-ec-ssd |   data   |  252G | 4035G |
> > |    con-fs2-data2    |   data   |  389T | 5233T |
> > +---------------------+----------+-------+-------+
> > +-------------+
> > | Standby MDS |
> > +-------------+
> > |   ceph-09   |
> > |   ceph-24   |
> > |   ceph-14   |
> > |   ceph-16   |
> > |   ceph-12   |
> > |   ceph-23   |
> > |   ceph-10   |
> > |   ceph-15   |
> > |   ceph-13   |
> > |   ceph-11   |
> > +-------------+
> > MDS version: ceph version 13.2.10 
> > (564bdc4ae87418a232fc901524470e1a0f76d641) mimic (stable)
> >
> > # ceph status
> >   cluster:
> >     id:
> >     health: HEALTH_WARN
> >             2 MDSs report slow requests
> >
> >   services:
> >     mon: 5 daemons, quorum ceph-01,ceph-02,ceph-03,ceph-25,ceph-26
> >     mgr: ceph-01(active), standbys: ceph-02, ceph-03, ceph-25, ceph-26
> >     mds: con-fs2-2/2/1 up  {0=ceph-08=up:active,1=ceph-17=up:stopping}, 10 
> > up:standby
> >     osd: 1051 osds: 1050 up, 1050 in
> >
> >   data:
> >     pools:   13 pools, 17374 pgs
> >     objects: 1.01 G objects, 1.9 PiB
> >     usage:   2.3 PiB used, 9.2 PiB / 11 PiB avail
> >     pgs:     17352 active+clean
> >              20    active+clean+scrubbing+deep
> >              2     active+clean+scrubbing
> >
> >   io:
> >     client:   129 MiB/s rd, 175 MiB/s wr, 2.57 kop/s rd, 2.77 kop/s wr
> >
> > # ceph health detail
> > HEALTH_WARN 2 MDSs report slow requests
> > MDS_SLOW_REQUEST 2 MDSs report slow requests
> >     mdsceph-08(mds.0): 3 slow requests are blocked > 30 secs
> >     mdsceph-17(mds.1): 3 slow requests are blocked > 30 secs
> >
> > # ssh ceph-08 ceph daemon mds.ceph-08 dump_blocked_ops
> > {
> >     "ops": [
> >         {
> >             "description": "client_request(mds.1:126521 rename 
> > #0x100/stray5/1000eec35f7 #0x101/stray5/1000eec35f7 caller_uid=0, 
> > caller_gid=0{})",
> >             "initiated_at": "2021-12-13 13:08:59.430597",
> >             "age": 5034.983083,
> >             "duration": 5034.983109,
> >             "type_data": {
> >                 "flag_point": "acquired locks",
> >                 "reqid": "mds.1:126521",
> >                 "op_type": "client_request",
> >                 "client_info": {
> >                     "client": "mds.1",
> >                     "tid": 126521
> >                 },
> >                 "events": [
> >                     {
> >                         "time": "2021-12-13 13:08:59.430597",
> >                         "event": "initiated"
> >                     },
> >                     {
> >                         "time": "2021-12-13 13:08:59.430597",
> >                         "event": "header_read"
> >                     },
> >                     {
> >                         "time": "2021-12-13 13:08:59.430597",
> >                         "event": "throttled"
> >                     },
> >                     {
> >                         "time": "2021-12-13 13:08:59.430601",
> >                         "event": "all_read"
> >                     },
> >                     {
> >                         "time": "2021-12-13 13:09:00.730197",
> >                         "event": "dispatched"
> >                     },
> >                     {
> >                         "time": "2021-12-13 13:09:01.517306",
> >                         "event": "requesting remote authpins"
> >                     },
> >                     {
> >                         "time": "2021-12-13 13:09:01.557219",
> >                         "event": "failed to xlock, waiting"
> >                     },
> >                     {
> >                         "time": "2021-12-13 13:09:01.647692",
> >                         "event": "failed to wrlock, waiting"
> >                     },
> >                     {
> >                         "time": "2021-12-13 13:09:01.663629",
> >                         "event": "waiting for remote wrlocks"
> >                     },
> >                     {
> >                         "time": "2021-12-13 13:09:01.673789",
> >                         "event": "waiting for remote wrlocks"
> >                     },
> >                     {
> >                         "time": "2021-12-13 13:09:01.676523",
> >                         "event": "failed to xlock, waiting"
> >                     },
> >                     {
> >                         "time": "2021-12-13 13:09:01.691962",
> >                         "event": "failed to xlock, waiting"
> >                     },
> >                     {
> >                         "time": "2021-12-13 13:09:01.704202",
> >                         "event": "acquired locks"
> >                     }
> >                 ]
> >             }
> >         },
> >         {
> >             "description": "client_request(mds.1:1 rename 
> > #0x100/stray5/1000eec35f7 #0x101/stray5/1000eec35f7 caller_uid=0, 
> > caller_gid=0{})",
> >             "initiated_at": "2021-12-13 13:31:56.260453",
> >             "age": 3658.153227,
> >             "duration": 3658.153337,
> >             "type_data": {
> >                 "flag_point": "requesting remote authpins",
> >                 "reqid": "mds.1:1",
> >                 "op_type": "client_request",
> >                 "client_info": {
> >                     "client": "mds.1",
> >                     "tid": 1
> >                 },
> >                 "events": [
> >                     {
> >                         "time": "2021-12-13 13:31:56.260453",
> >                         "event": "initiated"
> >                     },
> >                     {
> >                         "time": "2021-12-13 13:31:56.260453",
> >                         "event": "header_read"
> >                     },
> >                     {
> >                         "time": "2021-12-13 13:31:56.260454",
> >                         "event": "throttled"
> >                     },
> >                     {
> >                         "time": "2021-12-13 13:31:56.260461",
> >                         "event": "all_read"
> >                     },
> >                     {
> >                         "time": "2021-12-13 13:31:56.260511",
> >                         "event": "dispatched"
> >                     },
> >                     {
> >                         "time": "2021-12-13 13:31:56.260604",
> >                         "event": "requesting remote authpins"
> >                     }
> >                 ]
> >             }
> >         },
> >         {
> >             "description": "client_request(mds.1:993 rename 
> > #0x100/stray5/1000eec35f7 #0x101/stray5/1000eec35f7 caller_uid=0, 
> > caller_gid=0{})",
> >             "initiated_at": "2021-12-13 13:15:31.979997",
> >             "age": 4642.433683,
> >             "duration": 4642.433850,
> >             "type_data": {
> >                 "flag_point": "requesting remote authpins",
> >                 "reqid": "mds.1:993",
> >                 "op_type": "client_request",
> >                 "client_info": {
> >                     "client": "mds.1",
> >                     "tid": 993
> >                 },
> >                 "events": [
> >                     {
> >                         "time": "2021-12-13 13:15:31.979997",
> >                         "event": "initiated"
> >                     },
> >                     {
> >                         "time": "2021-12-13 13:15:31.979997",
> >                         "event": "header_read"
> >                     },
> >                     {
> >                         "time": "2021-12-13 13:15:31.979998",
> >                         "event": "throttled"
> >                     },
> >                     {
> >                         "time": "2021-12-13 13:15:31.980003",
> >                         "event": "all_read"
> >                     },
> >                     {
> >                         "time": "2021-12-13 13:15:31.980079",
> >                         "event": "dispatched"
> >                     },
> >                     {
> >                         "time": "2021-12-13 13:15:31.980174",
> >                         "event": "requesting remote authpins"
> >                     },
> >                     {
> >                         "time": "2021-12-13 13:31:50.634734",
> >                         "event": "requesting remote authpins"
> >                     }
> >                 ]
> >             }
> >         }
> >     ],
> >     "complaint_time": 30.000000,
> >     "num_blocked_ops": 3
> > }
> >
> > # ssh ceph-17 ceph daemon mds.ceph-17 dump_blocked_ops
> > {
> >     "ops": [
> >         {
> >             "description": "rejoin:mds.1:126521",
> >             "initiated_at": "2021-12-13 13:30:34.602931",
> >             "age": 3791.164314,
> >             "duration": 3791.164335,
> >             "type_data": {
> >                 "flag_point": "dispatched",
> >                 "reqid": "mds.1:126521",
> >                 "op_type": "no_available_op_found",
> >                 "events": [
> >                     {
> >                         "time": "2021-12-13 13:30:34.602931",
> >                         "event": "initiated"
> >                     },
> >                     {
> >                         "time": "2021-12-13 13:30:34.602931",
> >                         "event": "header_read"
> >                     },
> >                     {
> >                         "time": "2021-12-13 13:30:34.602932",
> >                         "event": "throttled"
> >                     },
> >                     {
> >                         "time": "2021-12-13 13:30:34.602978",
> >                         "event": "all_read"
> >                     },
> >                     {
> >                         "time": "2021-12-13 13:30:34.605856",
> >                         "event": "dispatched"
> >                     }
> >                 ]
> >             }
> >         },
> >         {
> >             "description": "slave_request(mds.1:993.0 authpin)",
> >             "initiated_at": "2021-12-13 13:31:50.634857",
> >             "age": 3715.132388,
> >             "duration": 3715.132451,
> >             "type_data": {
> >                 "flag_point": "dispatched",
> >                 "reqid": "mds.1:993",
> >                 "op_type": "slave_request",
> >                 "master_info": {
> >                     "master": "mds.0"
> >                 },
> >                 "request_info": {
> >                     "attempt": 0,
> >                     "op_type": "authpin",
> >                     "lock_type": 0,
> >                     "object_info": "0x1000eec35f7.head",
> >                     "srcdnpath": "",
> >                     "destdnpath": "",
> >                     "witnesses": "",
> >                     "has_inode_export": false,
> >                     "inode_export_v": 0,
> >                     "op_stamp": "0.000000"
> >                 },
> >                 "events": [
> >                     {
> >                         "time": "2021-12-13 13:31:50.634857",
> >                         "event": "initiated"
> >                     },
> >                     {
> >                         "time": "2021-12-13 13:31:50.634857",
> >                         "event": "header_read"
> >                     },
> >                     {
> >                         "time": "2021-12-13 13:31:50.634858",
> >                         "event": "throttled"
> >                     },
> >                     {
> >                         "time": "2021-12-13 13:31:50.634867",
> >                         "event": "all_read"
> >                     },
> >                     {
> >                         "time": "2021-12-13 13:31:50.634893",
> >                         "event": "dispatched"
> >                     }
> >                 ]
> >             }
> >         },
> >         {
> >             "description": "slave_request(mds.1:1.0 authpin)",
> >             "initiated_at": "2021-12-13 13:31:56.260729",
> >             "age": 3709.506516,
> >             "duration": 3709.506631,
> >             "type_data": {
> >                 "flag_point": "dispatched",
> >                 "reqid": "mds.1:1",
> >                 "op_type": "slave_request",
> >                 "master_info": {
> >                     "master": "mds.0"
> >                 },
> >                 "request_info": {
> >                     "attempt": 0,
> >                     "op_type": "authpin",
> >                     "lock_type": 0,
> >                     "object_info": "0x1000eec35f7.head",
> >                     "srcdnpath": "",
> >                     "destdnpath": "",
> >                     "witnesses": "",
> >                     "has_inode_export": false,
> >                     "inode_export_v": 0,
> >                     "op_stamp": "0.000000"
> >                 },
> >                 "events": [
> >                     {
> >                         "time": "2021-12-13 13:31:56.260729",
> >                         "event": "initiated"
> >                     },
> >                     {
> >                         "time": "2021-12-13 13:31:56.260729",
> >                         "event": "header_read"
> >                     },
> >                     {
> >                         "time": "2021-12-13 13:31:56.260731",
> >                         "event": "throttled"
> >                     },
> >                     {
> >                         "time": "2021-12-13 13:31:56.260743",
> >                         "event": "all_read"
> >                     },
> >                     {
> >                         "time": "2021-12-13 13:31:56.264063",
> >                         "event": "dispatched"
> >                     }
> >                 ]
> >             }
> >         }
> >     ],
> >     "complaint_time": 30.000000,
> >     "num_blocked_ops": 3
> > }
> >
> > =================
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
> > _______________________________________________
> > ceph-users mailing list -- [email protected]
> > To unsubscribe send an email to [email protected]
> >
>

_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]

Reply via email to