Hi again,
I just wanted to conclude this thread. We managed to bring the OSDs
back up and reactivate the CephFS so Jacek has access to his data again.
It would be too much to summarize all of it, but a couple of things
are worth mentioning, at least for the curious readers here. ;-)
- After Jacek had bootstrapped a fresh cluster and tried to reactivate
the existing OSDs, he accidentally caused more inconsistencies.
Due to a mix-up of cephadm and non-cephadm commands and procedures,
a lot of cleanup was necessary.
- Among other things, removing the ceph-osd package fixed at least one
issue. Disabling the ceph-volume leftovers (from manually using
ceph-volume outside of cephadm) was also necessary.
- We managed to extract the mon store from the OSDs and brought back
the osdmap. But that wasn't enough.
- We had to fix the directory content of the OSDs. The first one
actually started successfully, so we proceeded with the second. And
then it happened again: all monitors crashed.
- Shortly before the crash I had noticed two strange OSD keyrings (no
idea how they got there). As soon as we tried to start one of the
OSDs with a strange keyring, the monitors failed. So apparently,
the original issue (crashing monitors) had been carried over into the
new cluster.
- We stopped the OSDs, restarted the monitors and removed the faulty
keys. After that, the monitors were stable again.
- We fixed the remaining OSD contents (keyrings, unit.run files etc.),
after which all OSDs came up successfully.
- We had to deploy new MDS daemons (necessary after mon store loss)
and then recreate the CephFS based on the existing metadata and data
pools. At first glance, all files were present.
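For reference, the last step can be sketched roughly as follows. This
is a minimal sketch based on the upstream procedure for recovering a
file system after mon store loss, not a transcript of what we ran. The
file system and pool names are placeholders, and the run helper is
hypothetical: it only prints the commands unless RUN=1 is set.

```shell
# Dry run by default: print each command instead of executing it.
# Set RUN=1 to actually run the commands against a cluster.
run() {
  if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "+ $*"; fi
}

# Arguments: <fs-name> <metadata-pool> <data-pool> (all placeholders).
recreate_cephfs() {
  # Ask the orchestrator to deploy MDS daemons for the file system.
  run ceph orch apply mds "$1"
  # Recreate the file system on top of the existing pools. --recover marks
  # rank 0 as existing but failed, so an MDS reads the metadata already in
  # RADOS instead of overwriting it.
  run ceph fs new "$1" "$2" "$3" --force --recover
  # Allow the MDS daemons to join the recovered file system.
  run ceph fs set "$1" joinable true
}
```

Calling "recreate_cephfs cephfs cephfs_metadata cephfs_data" prints the
three commands so they can be reviewed before anything is changed.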
So in the end, the recovery was successful, although in retrospect
all the cleanup and the cluster bootstrap hadn't been necessary. So my
advice is: first investigate the logs to find the root cause before
making such destructive decisions (wipe the cluster and rebuild).
Still, I consider it good practice and proof of Ceph's resilience when
it comes to user errors. ;-)
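To make the "investigate the logs first" advice a bit more concrete:
on a cephadm host the daemons run as systemd units named after the
cluster fsid, so their logs can be read with journalctl even while the
monitors are down. A small sketch; the function names are made up for
illustration.

```shell
# Build the systemd unit name cephadm uses for a daemon.
# Arguments: <cluster-fsid> <daemon-name>, e.g. mon.host1 or osd.3
ceph_unit() {
  printf 'ceph-%s@%s.service\n' "$1" "$2"
}

# Show the most recent log lines of one daemon (default: last 200).
# Arguments: <cluster-fsid> <daemon-name> [line-count]
daemon_log() {
  journalctl --no-pager -n "${3:-200}" -u "$(ceph_unit "$1" "$2")"
}
```

For example, "daemon_log <fsid> mon.<hostname>" would show the tail of
a monitor's log, assuming the monitor is named after its host.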
Regards,
Eugen
Quoting Eugen Block <[email protected]>:
We've continued this topic off list, it's way quicker that way. Due
to the different attempts to start the OSDs without the proper
preparation (and by mixing cephadm with non-cephadm commands), there
were some leftovers to clean up before making progress.
In the meantime we were able to gather the monmap info from the
first OSD. The same steps will be necessary for the remaining OSDs
before we'll be able to proceed with activating them.
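Gathering this information from the OSDs can be sketched as a loop
over the OSD data directories with ceph-objectstore-tool, along the
lines of the documented "recovery using OSDs" procedure. This is a
sketch, not the exact commands we used: the paths are placeholders,
the OSDs must be stopped while the tool runs, and the hypothetical run
helper only prints the commands unless RUN=1 is set.

```shell
# Dry run by default: print each command instead of executing it.
# Set RUN=1 to actually run the commands.
run() {
  if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "+ $*"; fi
}

# Collect cluster map information from each OSD into one mon store.
# Arguments: <mon-store-dir> <osd-data-dir>...
collect_mon_store() {
  ms=$1
  shift
  for osd in "$@"; do
    run ceph-objectstore-tool --data-path "$osd" --no-mon-config \
      --op update-mon-db --mon-store-path "$ms"
  done
}
```

With two OSD directories, "collect_mon_store /tmp/mon-store <osd-dir-1>
<osd-dir-2>" prints one ceph-objectstore-tool call per OSD, all writing
into the same mon store directory.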
I will conclude this thread once we've accomplished that.
Quoting Jacek Rużyczka <[email protected]>:
I've already tried that. No use. Cephadm has a problem with the hostname:
mixtile@blade3n1:~$ sudo ceph cephadm osd activate blade3n1
Error EIO: Module 'cephadm' has experienced an error and cannot handle
commands:
invalid literal for int() with base 10: 'blade3n1'
I know that one or two people have had this error so far, but I have not
found a remedy.
Neither did cephadm deploy work:
mixtile@blade3n1:~$ sudo cephadm deploy \
    --osd-fsid 9f7fd40d-0698-40b9-8718-62942b03e263 \
    --name osd.blade3n1 \
    --fsid 8aad3073-39a1-11f1-bf6e-f2704a1efa9b \
    --keyring /var/lib/ceph/8aad3073-39a1-11f1-bf6e-f2704a1efa9b/osd.blade3n1/keyring
Deprecated command used: <function command_deploy at 0xffffa333cea0>
Non-zero exit code 1 from /usr/bin/docker container inspect --format {{.State.Status}} ceph-8aad3073-39a1-11f1-bf6e-f2704a1efa9b-osd-blade3n1
/usr/bin/docker: stdout
/usr/bin/docker: stderr Error response from daemon: No such container: ceph-8aad3073-39a1-11f1-bf6e-f2704a1efa9b-osd-blade3n1
Non-zero exit code 1 from /usr/bin/docker container inspect --format {{.State.Status}} ceph-8aad3073-39a1-11f1-bf6e-f2704a1efa9b-osd.blade3n1
/usr/bin/docker: stdout
/usr/bin/docker: stderr Error response from daemon: No such container: ceph-8aad3073-39a1-11f1-bf6e-f2704a1efa9b-osd.blade3n1
Deploy daemon osd.blade3n1 ...
Shouldn't it create the necessary container itself?
BTW, I've found out that the utilities ceph-osd, ceph-base, and ceph-volume
were installed for some reason. I removed them, but that didn't help me
either.
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]