Hi all, I needed to reduce the number of active MDS daemons from 4 to 1.
Unfortunately, the last MDS to stop is stuck in the stopping state. The Ceph
version is mimic 13.2.10. Both remaining MDS daemons (ranks 0 and 1) have 3
blocked ops each, which seem to be related to deleted snapshots; more info
below. I have already failed the MDS in the stopping state several times in the
hope that the operations would get flushed out. Before failing rank 0, I would
appreciate it if someone could look at this issue and advise on how to proceed
safely.
Some diagnostic info:
# ceph fs status
con-fs2 - 1659 clients
=======
+------+----------+---------+---------------+-------+-------+
| Rank | State | MDS | Activity | dns | inos |
+------+----------+---------+---------------+-------+-------+
| 0 | active | ceph-08 | Reqs: 176 /s | 2844k | 2775k |
| 1 | stopping | ceph-17 | | 27.7k | 59 |
+------+----------+---------+---------------+-------+-------+
+---------------------+----------+-------+-------+
| Pool | type | used | avail |
+---------------------+----------+-------+-------+
| con-fs2-meta1 | metadata | 555M | 1261G |
| con-fs2-meta2 | data | 0 | 1261G |
| con-fs2-data | data | 1321T | 5756T |
| con-fs2-data-ec-ssd | data | 252G | 4035G |
| con-fs2-data2 | data | 389T | 5233T |
+---------------------+----------+-------+-------+
+-------------+
| Standby MDS |
+-------------+
| ceph-09 |
| ceph-24 |
| ceph-14 |
| ceph-16 |
| ceph-12 |
| ceph-23 |
| ceph-10 |
| ceph-15 |
| ceph-13 |
| ceph-11 |
+-------------+
MDS version: ceph version 13.2.10 (564bdc4ae87418a232fc901524470e1a0f76d641) mimic (stable)
# ceph status
cluster:
id:
health: HEALTH_WARN
2 MDSs report slow requests
services:
mon: 5 daemons, quorum ceph-01,ceph-02,ceph-03,ceph-25,ceph-26
mgr: ceph-01(active), standbys: ceph-02, ceph-03, ceph-25, ceph-26
mds: con-fs2-2/2/1 up {0=ceph-08=up:active,1=ceph-17=up:stopping}, 10 up:standby
osd: 1051 osds: 1050 up, 1050 in
data:
pools: 13 pools, 17374 pgs
objects: 1.01 G objects, 1.9 PiB
usage: 2.3 PiB used, 9.2 PiB / 11 PiB avail
pgs: 17352 active+clean
20 active+clean+scrubbing+deep
2 active+clean+scrubbing
io:
client: 129 MiB/s rd, 175 MiB/s wr, 2.57 kop/s rd, 2.77 kop/s wr
# ceph health detail
HEALTH_WARN 2 MDSs report slow requests
MDS_SLOW_REQUEST 2 MDSs report slow requests
mdsceph-08(mds.0): 3 slow requests are blocked > 30 secs
mdsceph-17(mds.1): 3 slow requests are blocked > 30 secs
# ssh ceph-08 ceph daemon mds.ceph-08 dump_blocked_ops
{
"ops": [
{
"description": "client_request(mds.1:126521 rename #0x100/stray5/1000eec35f7 #0x101/stray5/1000eec35f7 caller_uid=0, caller_gid=0{})",
"initiated_at": "2021-12-13 13:08:59.430597",
"age": 5034.983083,
"duration": 5034.983109,
"type_data": {
"flag_point": "acquired locks",
"reqid": "mds.1:126521",
"op_type": "client_request",
"client_info": {
"client": "mds.1",
"tid": 126521
},
"events": [
{
"time": "2021-12-13 13:08:59.430597",
"event": "initiated"
},
{
"time": "2021-12-13 13:08:59.430597",
"event": "header_read"
},
{
"time": "2021-12-13 13:08:59.430597",
"event": "throttled"
},
{
"time": "2021-12-13 13:08:59.430601",
"event": "all_read"
},
{
"time": "2021-12-13 13:09:00.730197",
"event": "dispatched"
},
{
"time": "2021-12-13 13:09:01.517306",
"event": "requesting remote authpins"
},
{
"time": "2021-12-13 13:09:01.557219",
"event": "failed to xlock, waiting"
},
{
"time": "2021-12-13 13:09:01.647692",
"event": "failed to wrlock, waiting"
},
{
"time": "2021-12-13 13:09:01.663629",
"event": "waiting for remote wrlocks"
},
{
"time": "2021-12-13 13:09:01.673789",
"event": "waiting for remote wrlocks"
},
{
"time": "2021-12-13 13:09:01.676523",
"event": "failed to xlock, waiting"
},
{
"time": "2021-12-13 13:09:01.691962",
"event": "failed to xlock, waiting"
},
{
"time": "2021-12-13 13:09:01.704202",
"event": "acquired locks"
}
]
}
},
{
"description": "client_request(mds.1:1 rename #0x100/stray5/1000eec35f7 #0x101/stray5/1000eec35f7 caller_uid=0, caller_gid=0{})",
"initiated_at": "2021-12-13 13:31:56.260453",
"age": 3658.153227,
"duration": 3658.153337,
"type_data": {
"flag_point": "requesting remote authpins",
"reqid": "mds.1:1",
"op_type": "client_request",
"client_info": {
"client": "mds.1",
"tid": 1
},
"events": [
{
"time": "2021-12-13 13:31:56.260453",
"event": "initiated"
},
{
"time": "2021-12-13 13:31:56.260453",
"event": "header_read"
},
{
"time": "2021-12-13 13:31:56.260454",
"event": "throttled"
},
{
"time": "2021-12-13 13:31:56.260461",
"event": "all_read"
},
{
"time": "2021-12-13 13:31:56.260511",
"event": "dispatched"
},
{
"time": "2021-12-13 13:31:56.260604",
"event": "requesting remote authpins"
}
]
}
},
{
"description": "client_request(mds.1:993 rename #0x100/stray5/1000eec35f7 #0x101/stray5/1000eec35f7 caller_uid=0, caller_gid=0{})",
"initiated_at": "2021-12-13 13:15:31.979997",
"age": 4642.433683,
"duration": 4642.433850,
"type_data": {
"flag_point": "requesting remote authpins",
"reqid": "mds.1:993",
"op_type": "client_request",
"client_info": {
"client": "mds.1",
"tid": 993
},
"events": [
{
"time": "2021-12-13 13:15:31.979997",
"event": "initiated"
},
{
"time": "2021-12-13 13:15:31.979997",
"event": "header_read"
},
{
"time": "2021-12-13 13:15:31.979998",
"event": "throttled"
},
{
"time": "2021-12-13 13:15:31.980003",
"event": "all_read"
},
{
"time": "2021-12-13 13:15:31.980079",
"event": "dispatched"
},
{
"time": "2021-12-13 13:15:31.980174",
"event": "requesting remote authpins"
},
{
"time": "2021-12-13 13:31:50.634734",
"event": "requesting remote authpins"
}
]
}
}
],
"complaint_time": 30.000000,
"num_blocked_ops": 3
}
# ssh ceph-17 ceph daemon mds.ceph-17 dump_blocked_ops
{
"ops": [
{
"description": "rejoin:mds.1:126521",
"initiated_at": "2021-12-13 13:30:34.602931",
"age": 3791.164314,
"duration": 3791.164335,
"type_data": {
"flag_point": "dispatched",
"reqid": "mds.1:126521",
"op_type": "no_available_op_found",
"events": [
{
"time": "2021-12-13 13:30:34.602931",
"event": "initiated"
},
{
"time": "2021-12-13 13:30:34.602931",
"event": "header_read"
},
{
"time": "2021-12-13 13:30:34.602932",
"event": "throttled"
},
{
"time": "2021-12-13 13:30:34.602978",
"event": "all_read"
},
{
"time": "2021-12-13 13:30:34.605856",
"event": "dispatched"
}
]
}
},
{
"description": "slave_request(mds.1:993.0 authpin)",
"initiated_at": "2021-12-13 13:31:50.634857",
"age": 3715.132388,
"duration": 3715.132451,
"type_data": {
"flag_point": "dispatched",
"reqid": "mds.1:993",
"op_type": "slave_request",
"master_info": {
"master": "mds.0"
},
"request_info": {
"attempt": 0,
"op_type": "authpin",
"lock_type": 0,
"object_info": "0x1000eec35f7.head",
"srcdnpath": "",
"destdnpath": "",
"witnesses": "",
"has_inode_export": false,
"inode_export_v": 0,
"op_stamp": "0.000000"
},
"events": [
{
"time": "2021-12-13 13:31:50.634857",
"event": "initiated"
},
{
"time": "2021-12-13 13:31:50.634857",
"event": "header_read"
},
{
"time": "2021-12-13 13:31:50.634858",
"event": "throttled"
},
{
"time": "2021-12-13 13:31:50.634867",
"event": "all_read"
},
{
"time": "2021-12-13 13:31:50.634893",
"event": "dispatched"
}
]
}
},
{
"description": "slave_request(mds.1:1.0 authpin)",
"initiated_at": "2021-12-13 13:31:56.260729",
"age": 3709.506516,
"duration": 3709.506631,
"type_data": {
"flag_point": "dispatched",
"reqid": "mds.1:1",
"op_type": "slave_request",
"master_info": {
"master": "mds.0"
},
"request_info": {
"attempt": 0,
"op_type": "authpin",
"lock_type": 0,
"object_info": "0x1000eec35f7.head",
"srcdnpath": "",
"destdnpath": "",
"witnesses": "",
"has_inode_export": false,
"inode_export_v": 0,
"op_stamp": "0.000000"
},
"events": [
{
"time": "2021-12-13 13:31:56.260729",
"event": "initiated"
},
{
"time": "2021-12-13 13:31:56.260729",
"event": "header_read"
},
{
"time": "2021-12-13 13:31:56.260731",
"event": "throttled"
},
{
"time": "2021-12-13 13:31:56.260743",
"event": "all_read"
},
{
"time": "2021-12-13 13:31:56.264063",
"event": "dispatched"
}
]
}
}
],
"complaint_time": 30.000000,
"num_blocked_ops": 3
}
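For anyone who wants to line up the two dumps side by side, a minimal Python sketch along the following lines could reduce each `dump_blocked_ops` document to one row per op and list the request IDs that appear on both daemons. The function names and the trimmed `SAMPLE_DUMP` are hypothetical helpers for illustration, not part of Ceph; only the JSON field names (`ops`, `type_data`, `reqid`, `op_type`, `flag_point`, `age`, `events`) are taken from the output above. `strict=False` is used on the assumption that the text may have been pasted from a line-wrapped mail, where string values can contain raw line breaks.

```python
import json

# Trimmed, hypothetical sample in the same shape as the dumps above.
SAMPLE_DUMP = """
{
  "ops": [
    {
      "description": "slave_request(mds.1:993.0 authpin)",
      "age": 3715.132388,
      "type_data": {
        "flag_point": "dispatched",
        "reqid": "mds.1:993",
        "op_type": "slave_request",
        "events": [
          {"time": "2021-12-13 13:31:50.634857", "event": "initiated"},
          {"time": "2021-12-13 13:31:50.634893", "event": "dispatched"}
        ]
      }
    }
  ],
  "complaint_time": 30.000000,
  "num_blocked_ops": 1
}
"""


def summarize_blocked_ops(dump_text):
    """Return one summary dict per op found in a dump_blocked_ops JSON text."""
    # strict=False tolerates raw control characters (e.g. line breaks)
    # inside JSON strings, which strict parsing would reject.
    dump = json.loads(dump_text, strict=False)
    rows = []
    for op in dump.get("ops", []):
        type_data = op.get("type_data", {})
        events = type_data.get("events", [])
        rows.append({
            "reqid": type_data.get("reqid"),
            "op_type": type_data.get("op_type"),
            "flag_point": type_data.get("flag_point"),
            "age_s": op.get("age"),
            # The last recorded event shows where the op is currently parked.
            "last_event": events[-1]["event"] if events else None,
        })
    return rows


def reqids_in_common(rows_a, rows_b):
    """Return the sorted request IDs that occur in both summaries."""
    ids_a = {row["reqid"] for row in rows_a}
    ids_b = {row["reqid"] for row in rows_b}
    return sorted(ids_a & ids_b)


if __name__ == "__main__":
    for row in summarize_blocked_ops(SAMPLE_DUMP):
        print(row)
```

In practice the output of each `ceph daemon mds.<name> dump_blocked_ops` call would be saved to a file and its contents passed to `summarize_blocked_ops`; calling `reqids_in_common` on the two results then shows which requests are pending on both ranks.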
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14