So our datacenter lost power and two of our three monitors came back with
filesystem corruption. I tried to fix them, but it looks like the store.db
didn't make it.
I tried to rebuild the failed monitors from the working monitor's monmap
via the steps below (a consolidated sketch of the full sequence follows the
list):
1. sudo mv /var/lib/ceph/mon/ceph-$(hostname){,.BAK}
2. sudo ceph-mon -i {mon-id} --mkfs --monmap {tmp}/{map-filename} --keyring {tmp}/{key-filename}
3. ceph-mon -i `hostname` --extract-monmap /tmp/monmap
4. ceph-mon -i {mon-id} --inject-monmap {map-path}
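To spell that out, this is roughly the sequence I was following, in one
place; I am assuming kh13-8 is the surviving monitor here, and the keyring
path is a guess:
'''
# on the surviving monitor (assuming that is kh13-8), with its ceph-mon stopped:
sudo ceph-mon -i kh13-8 --extract-monmap /tmp/monmap
sudo cp /var/lib/ceph/mon/ceph-kh13-8/keyring /tmp/keyring   # keyring path is a guess

# on each failed monitor, after copying /tmp/monmap and /tmp/keyring over:
sudo mv /var/lib/ceph/mon/ceph-$(hostname){,.BAK}
sudo ceph-mon -i $(hostname) --mkfs --monmap /tmp/monmap --keyring /tmp/keyring
sudo ceph-mon -i $(hostname) --inject-monmap /tmp/monmap   # possibly redundant after --mkfs --monmap
'''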
For a brief moment I had a quorum, but any ceph CLI command resulted in
cephx errors. Now the two failed monitors have elected a quorum between
themselves, and the monitor that was working keeps getting kicked out of
the cluster:
'''
{
    "election_epoch": 402,
    "quorum": [
        0,
        1
    ],
    "quorum_names": [
        "kh11-8",
        "kh12-8"
    ],
    "quorum_leader_name": "kh11-8",
    "monmap": {
        "epoch": 1,
        "fsid": "a6ae50db-5c71-4ef8-885e-8137c7793da8",
        "modified": "0.000000",
        "created": "0.000000",
        "mons": [
            {
                "rank": 0,
                "name": "kh11-8",
                "addr": "10.64.64.134:6789\/0"
            },
            {
                "rank": 1,
                "name": "kh12-8",
                "addr": "10.64.64.143:6789\/0"
            },
            {
                "rank": 2,
                "name": "kh13-8",
                "addr": "10.64.64.151:6789\/0"
            }
        ]
    }
}
'''
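For what it's worth, I assume something like this would at least show me
each daemon's own view, since the local admin socket does not go through
cephx (the socket path below is the default, which I have not
double-checked on these hosts):
'''
# ask each mon daemon for its own view over its local admin socket
# (the admin socket does not use cephx, so it should work while auth is broken)
sudo ceph --admin-daemon /var/run/ceph/ceph-mon.$(hostname).asok mon_status
sudo ceph --admin-daemon /var/run/ceph/ceph-mon.$(hostname).asok quorum_status
'''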
At this point I am not sure what to do, as any ceph commands return cephx
errors and I can't verify whether the new "quorum" is actually valid. Is
there any way to regenerate the cephx authentication keys, or recover them
with hardware access to the nodes? Any advice on how to recover from what
seems to be a complete monitor failure?
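Is something along these lines even sane? The paths and caps below are my
guesses, cribbed from the manual-deployment docs, so please correct me if
this would make things worse:
'''
# does the surviving mon still have its keyring on disk? (paths are the defaults)
sudo cat /var/lib/ceph/mon/ceph-kh13-8/keyring
sudo cat /etc/ceph/ceph.client.admin.keyring

# if not, regenerate a mon. + client.admin keyring and feed it to --mkfs:
sudo ceph-authtool --create-keyring /tmp/keyring --gen-key -n mon. --cap mon 'allow *'
sudo ceph-authtool /tmp/keyring --gen-key -n client.admin \
    --cap mon 'allow *' --cap osd 'allow *' --cap mds 'allow *'
'''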
--
- Sean: I wrote this. -