Thanks again. Now my CephFS is back online!
I ended up building ceph-mon from source myself, with the following patch applied, and replacing only the mon leader seems to be sufficient.
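As far as I can tell, only the leader needs the patched binary, because the other mons forward commands to the leader and it is the one that proposes the map update. Which mon is currently the leader can be checked with, for example:

ceph quorum_status | jq -r .quorum_leader_name

(jq is just for readability here; the leader name also appears in the plain JSON output.)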
Now I’m interested in why such a routine automated minor version upgrade could
get the cluster into such a state in the first place.
diff --git a/src/mon/MDSMonitor.cc b/src/mon/MDSMonitor.cc
index 4373938..786f227 100644
--- a/src/mon/MDSMonitor.cc
+++ b/src/mon/MDSMonitor.cc
@@ -1526,7 +1526,7 @@ int MDSMonitor::filesystem_command(
ss << "removed mds gid " << gid;
return 0;
}
- } else if (prefix == "mds rmfailed") {
+ } else if (prefix == "mds addfailed") {
bool confirm = false;
cmd_getval(cmdmap, "yes_i_really_mean_it", confirm);
if (!confirm) {
@@ -1554,10 +1554,10 @@ int MDSMonitor::filesystem_command(
role.fscid,
[role](std::shared_ptr<Filesystem> fs)
{
- fs->mds_map.failed.erase(role.rank);
+ fs->mds_map.failed.insert(role.rank);
});
- ss << "removed failed mds." << role;
+ ss << "added failed mds." << role;
return 0;
/* TODO: convert to fs commands to update defaults */
} else if (prefix == "mds compat rm_compat") {
diff --git a/src/mon/MonCommands.h b/src/mon/MonCommands.h
index 463419b..5c6a927 100644
--- a/src/mon/MonCommands.h
+++ b/src/mon/MonCommands.h
@@ -334,7 +334,7 @@ COMMAND("mds repaired name=role,type=CephString",
COMMAND("mds rm "
"name=gid,type=CephInt,range=0",
"remove nonactive mds", "mds", "rw")
-COMMAND_WITH_FLAG("mds rmfailed name=role,type=CephString "
+COMMAND_WITH_FLAG("mds addfailed name=role,type=CephString "
"name=yes_i_really_mean_it,type=CephBool,req=false",
"remove failed rank", "mds", "rw", FLAG(HIDDEN))
COMMAND_WITH_FLAG("mds cluster_down", "take MDS cluster down", "mds", "rw",
FLAG(OBSOLETE))
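To verify the new command had the intended effect, the failed set in the MDSMap can be inspected before and after running it, for example with:

ceph fs dump | grep failed

The ranks that were removed by “rmfailed” should reappear there. (The exact dump format may vary a bit between releases.)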
From: Patrick Donnelly <[email protected]>
Sent: September 18, 2021 5:06
To: 胡 玮文 <[email protected]>
Cc: Eric Dold <[email protected]>; ceph-users <[email protected]>
Subject: Re: Cephfs - MDS all up:standby, not becoming up:active
On Fri, Sep 17, 2021 at 3:17 PM 胡 玮文 <[email protected]> wrote:
>
> > Did you run the command I suggested before or after you executed `rmfailed`
> > below?
>
> I ran “rmfailed” before reading your mail. Then the mons crashed. I fixed
> the crash by setting max_mds=2. Then I tried the command you suggested.
>
> By reading the code [1], I think I really need to undo the “rmfailed” to get
> my MDS out of standby state.
Exactly. If you install the repositories (available in about an hour) from:
https://shaman.ceph.com/repos/ceph/ceph-mds-addfailed-pacific/9a1ccf41c32446e1b31328e7d01ea8e4aaea8cbb/
for the monitors (only), and then run:
for i in 0 1; do ceph mds addfailed <fs_name>:$i --yes-i-really-mean-it ; done
it should fix it for you.
--
Patrick Donnelly, Ph.D.
He / Him / His
Principal Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D