Stéphane,
Basically you cannot do anything in your cluster until the mon quorum
is restored, except managing it with cephadm to get back to a
functioning cluster. If 'ceph -s' doesn't return, it means you lost the
quorum: it is the only reason I'm aware of for this. As your cluster is
quite simple, it should be easy to check the state of the monitor
daemon on each host where one should run, using `cephadm ls` and/or
`podman ps`/`docker ps`. You should also be able to access the logs of
the monitor daemons.
In one of your messages yesterday you reported a log saying the RocksDB
store of one of the mons was corrupted. I personally never saw that, but
the first thing to do is to fix it, as it will prevent the mon from
starting. Follow the doc mentioned by Eugen to reduce your quorum to 1
mon (deleting the 2 broken ones from the monmap) if necessary (if you
don't find a way to start at least 2 mons). And as said in another
message, make sure you added the _admin label to the hosts where you
want to be able to use the ceph command, else the information required
to connect to the cluster will be missing. This is done with the 'ceph
orch host label add' command, which requires that you have fixed the
quorum issue first. One possibility, if you have one healthy mon and you
manage to reduce the quorum to 1, is to delete the 2 other mons and
re-add them as new mons so that they are reinitialized. This way you
will not lose anything. Look at the cephadm documentation to learn how
to remove and add daemons.
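
To illustrate why shrinking the monmap to a single mon restores the
quorum: the monitors need a strict majority of the mons listed in the
monmap. The following is only a minimal sketch of that arithmetic; the
function names quorum_size and has_quorum are made up for the example
and are not part of any Ceph tool.

```python
def quorum_size(total_mons: int) -> int:
    """Smallest number of mons that is a strict majority of the monmap."""
    if total_mons < 1:
        raise ValueError("a monmap needs at least one monitor")
    return total_mons // 2 + 1


def has_quorum(total_mons: int, healthy_mons: int) -> bool:
    """True when the healthy mons form a strict majority of the monmap."""
    return healthy_mons >= quorum_size(total_mons)


# Situation described in this thread: 3 mons in the monmap, 1 able to start.
print(has_quorum(3, 1))  # False
# After removing the 2 broken mons from the monmap: 1 mon listed, 1 healthy.
print(has_quorum(1, 1))  # True
```

With 3 mons in the monmap, 2 must be up; with only 1 listed, that single
mon is a majority on its own, which is why the monmap has to be edited
rather than just waiting for the surviving mon.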
One thing not fully clear to me is how you installed your different
hosts. It seems they are not configured exactly the same way, as on one
host the ceph command is not available whereas it is on the other ones.
Ceph doesn't need much from the OS when using cephadm, but it is pretty
important to ensure that all your Ceph hosts are deployed the same way,
with the same config, else you just add to the entropy...
I fully agree with you and Eugen that trying to fix things is a way to
learn a lot, but at the same time it is not very easy to help you with
the very limited information we have on what you did to end up in such
a strange situation... So if you don't manage to converge, maybe it is
better to restart from scratch, carefully following the instructions:
you will have plenty of other occasions to learn anyway!
Michel
On 22/07/2025 at 11:04, Stéphane Barthes wrote:
Hi Michel,
Does this mean I need to recover quorum before any fixing can happen?
Should I spin up a new VM and add a mon to the cluster via cephadm?
Would this allow me to have 2 running mons?
S. Barthes
On 22/07/2025 at 10:39, Michel Jouvin wrote:
Hi Stéphane,
'ceph -s' requires the mon quorum to be reached, else the Ceph
cluster hangs. cephadm does not use the Ceph cluster's internal
communication but builds a management cluster on top of it, so it
can manage the cluster even if the quorum is lost, but it cannot
provide any information that requires the quorum to be reached.
Michel
On 22/07/2025 at 10:33, Stéphane Barthes wrote:
Hi Malte,
Thanks for your reply. Here is some info:
ceph -s hangs and times out hunting for a mon after 300s.
But I can run cephadm shell. Is there a similar command under
cephadm shell?
ceph health detail: same as above.
I would like to repair it instead of wiping and restarting, as it is
(from my point of view) a good way to learn (and there is some data I'd
like to recover).
What is the problem with ubuntu 24? I did not see warnings regarding
this specific version in
https://docs.ceph.com/en/latest/cephadm/install/#cephadm-install-distros
Regards,
S. Barthes
T: +33 4 72 52 35 40 M: +33 6 14 73 18 34
InTest S.A.
4 Allée du Levant
69890 La Tour de Salvagny
On 22/07/2025 at 10:02, Malte Stroem wrote:
Hello Stéphane,
I think you're mixing up a lot of things!
You always have to show us the output of:
ceph -s
And more! Logs and stuff, e. g.:
ceph health detail
It is clear you missed something here and there.
It is repairable but since it is a test cluster, just delete it and
start again.
And follow the documentation for cephadm. And do not use Ubuntu 24.04.
Best,
Malte
On 22.07.25 09:02, Stéphane Barthes wrote:
Hello,
Today, things have degraded a bit more. The ceph-03 mon has failed and
will not restart. It shows the same kind of checksum error in the
RocksDB compaction during startup. As a consequence, I lost
quorum, and ceph commands hang.
Would it be wise to disable RocksDB compaction, in order to restart and
get quorum back? If yes, what is the exact syntax of the setting in
ceph.conf? I have seen one for OSDs, but I am not sure it would apply:
[osd]
osd_compact_on_start = true
If I can restart, I will try to mark the OSDs out and recreate them.
Last time I looked, the OSDs seemed fine in the dashboard. Since I have
no dashboard now, is there a command I can use to check their status?
Regards,
S. Barthes
On 21/07/2025 at 14:27, Stéphane Barthes wrote:
Michel,
cephadm shell starts on all 3 nodes without error, and each host
has the same ceph public key entry in the .ssh/authorized_keys file
of the root user.
ceph-01 also has ceph.pub in /etc/ceph with the same key (this is
the node I started the install from).
ceph-02 has no /etc/ceph folder.
ceph-03 has an /etc/ceph folder, but no ceph.pub file there.
S. Barthes
On 21/07/2025 at 12:36, Michel Jouvin wrote:
Hi Stéphane,
Sorry, I was busy and did not look at your previous answers... It
is a bit difficult for me to understand how you ended up in this
situation, but to me it is strange that ceph-02 complains about
a missing keyring, and the corrupted RocksDB store on a freshly
created cluster is also a bit strange. I don't think it makes
sense to destroy and recreate the OSD: I am running several
clusters with hundreds of OSDs and I never saw a mis-initialized
one. I'm afraid the problem is hiding something else. Because of
some misconfiguration, maybe one OSD is in a bad state and may
need to be reinitialized, but first we should get the 3 mons
running properly and `cephadm shell` working properly on the 3
hosts. And the RocksDB compaction issue, as I see it, is related
to your mon, not to an OSD.
Have you checked that the SSH configuration for cephadm is working
well from any host to any other one in your cluster? (With 3
hosts, it should be really straightforward to check.) The ceph-02
problem may be the sign of an SSH misconfiguration, as cephadm will
use an SSH connection to push the keyring, if I am right.
Michel
On 21/07/2025 at 12:17, Stéphane Barthes wrote:
Hi,
Should I just wipe the OSD and let Ceph rebuild it (as
suggested here:
https://ceph-users.ceph.narkive.com/LO6ebu9r/how-to-recover-from-corrupted-rocksdb)?
Which would be the suggested way:
cephadm rm-daemon osd.ceph-01
then
cephadm deploy ?
Regards,
S. Barthes
On 21/07/2025 at 10:33, Stéphane Barthes wrote:
Michel,
ceph-02 logs :
root@srvr-ceph-02:/# ceph log last debug cephadm
2025-07-21T08:16:54.814+0000 7efe1a884640 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No such file or directory
2025-07-21T08:16:54.814+0000 7efe1a884640 -1 AuthRegistry(0x7efe14064de0) no keyring found at /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin, disabling cephx
2025-07-21T08:16:54.818+0000 7efe1a884640 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No such file or directory
2025-07-21T08:16:54.818+0000 7efe1a884640 -1 AuthRegistry(0x7efe1a883000) no keyring found at /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin, disabling cephx
2025-07-21T08:16:54.818+0000 7efe137fe640 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [1]
2025-07-21T08:16:54.818+0000 7efe18e21640 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [1]
2025-07-21T08:16:57.818+0000 7efe137fe640 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [1]
2025-07-21T08:16:57.818+0000 7efe13fff640 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [1]
2025-07-21T08:17:00.818+0000 7efe137fe640 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [1]
2025-07-21T08:17:00.818+0000 7efe18e21640 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [1]
^CCluster connection aborted
root@srvr-ceph-02:/#
Regarding the ceph-01 log, there is a LOT. Looking from the
end, I see this:
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug -19> 2025-07-20T17:52:21.137+0000 7f359f42d8c0 2 auth: KeyRing::load: loaded key file /var/lib/ceph/mon/ceph-srvr-ceph-01/keyring
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug -18> 2025-07-20T17:52:21.137+0000 7f359f42d8c0 2 mon.srvr-ceph-01@-1(???) e5 init
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug -17> 2025-07-20T17:52:21.137+0000 7f359f42d8c0 4 mgrc handle_mgr_map Got map version 73
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug -16> 2025-07-20T17:52:21.137+0000 7f359f42d8c0 4 mgrc handle_mgr_map Active mgr is now [v2:10.32.100.24:6800/4015208663,v1:10.32.100.24:6801/4015208663]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug -15> 2025-07-20T17:52:21.137+0000 7f359f42d8c0 4 mgrc reconnect Starting new session with [v2:10.32.100.24:6800/4015208663,v1:10.32.100.24:6801/4015208663]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug -14> 2025-07-20T17:52:21.137+0000 7f359c206640 -1 mon.srvr-ceph-01@-1(???) e5 handle_auth_bad_method hmm, they didn't like 2 result (13) Permission denied
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug -13> 2025-07-20T17:52:21.137+0000 7f359f42d8c0 0 mon.srvr-ceph-01@-1(probing) e5 my rank is now 0 (was -1)
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug -12> 2025-07-20T17:52:21.161+0000 7f359d208640 3 rocksdb: [db/db_impl/db_impl_compaction_flush.cc:3026] Compaction error: Corruption: block checksum mismatch: stored = 3368055299, computed = 2100551158 in /var/lib/ceph/mon/ceph-srvr-ceph-01/store.db/061999.sst offset 10379525 size 91317
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug -11> 2025-07-20T17:52:21.161+0000 7f359d208640 4 rocksdb: (Original Log Time 2025/07/20-17:52:21.164193) [db/compaction/compaction_job.cc:812] [default] compacted to: base level 6 level multiplier 10.00 max bytes base 268435456 files[4 0 0 0 0 0 1] max score 0.00, MB/sec: 514.9 rd, 272.6 wr, level 6, files in(4, 1) out(0) MB in(4.0, 14.8) out(9.9), read-write-amplify(7.2) write-amplify(2.5) Corruption: block checksum mismatch: stored = 3368055299, computed = 2100551158 in /var/lib/ceph/mon/ceph-srvr-ceph-01/store.db/061999.sst offset 10379525 size 91317, records in: 25191, records dropped: 3
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug -10> 2025-07-20T17:52:21.161+0000 7f359d208640 4 rocksdb: (Original Log Time 2025/07/20-17:52:21.164212) EVENT_LOG_v1 {"time_micros": 1753033941164205, "job": 3, "event": "compaction_finished", "compaction_time_micros": 38166, "compaction_time_cpu_micros": 25133, "output_level": 6, "num_output_files": 0, "total_output_size": 10404253, "num_input_records": 25191, "num_output_records": 21216, "num_subcompactions": 1, "output_compression": "NoCompression", "num_single_delete_mismatches": 0, "num_single_delete_fallthrough": 0, "lsm_state": [4, 0, 0, 0, 0, 0, 1]}
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug -9> 2025-07-20T17:52:21.161+0000 7f359d208640 2 rocksdb: [db/db_impl/db_impl_compaction_flush.cc:2545] Waiting after background compaction error: Corruption: block checksum mismatch: stored = 3368055299, computed = 2100551158 in /var/lib/ceph/mon/ceph-srvr-ceph-01/store.db/061999.sst offset 10379525 size 91317, Accumulated background error counts: 1
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug -8> 2025-07-20T17:52:21.341+0000 7f359c206640 -1 mon.srvr-ceph-01@0(probing) e5 handle_auth_bad_method hmm, they didn't like 2 result (13) Permission denied
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug -7> 2025-07-20T17:52:21.741+0000 7f359c206640 -1 mon.srvr-ceph-01@0(probing) e5 handle_auth_bad_method hmm, they didn't like 2 result (13) Permission denied
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug -6> 2025-07-20T17:52:21.741+0000 7f359c206640 1 mon.srvr-ceph-01@0(probing) e5 handle_auth_request failed to assign global_id
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug -5> 2025-07-20T17:52:21.749+0000 7f35981fe640 5 mon.srvr-ceph-01@0(probing) e5 _ms_dispatch setting monitor caps on this connection
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug -4> 2025-07-20T17:52:21.749+0000 7f35981fe640 1 mon.srvr-ceph-01@0(synchronizing) e5 sync_obtain_latest_monmap
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug -3> 2025-07-20T17:52:21.749+0000 7f35981fe640 1 mon.srvr-ceph-01@0(synchronizing) e5 sync_obtain_latest_monmap obtained monmap e5
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: [285B blob data]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: PutCF( prefix = mon_sync key = 'latest_monmap' value size = 508)
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: PutCF( prefix = mon_sync key = 'in_sync' value size = 8)
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: PutCF( prefix = mon_sync key = 'last_committed_floor' value size = 8)
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug -1> 2025-07-20T17:52:21.749+0000 7f35981fe640 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/17.2.8/rpm/el9/BUILD/ceph-17.2.8/src/mon/MonitorDBStore.h: In function 'int MonitorDBStore::apply_transaction(MonitorDBStore::TransactionRef)' thread 7f35981fe640 time 2025-07-20T17:52:21.750611+0000
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/17.2.8/rpm/el9/BUILD/ceph-17.2.8/src/mon/MonitorDBStore.h: 355: ceph_abort_msg("failed to write to db")
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: ceph version 17.2.8 (f817ceb7f187defb1d021d6328fa833eb8e943b3) quincy (stable)
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xd3) [0x7f35a03a5469]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 2: /usr/bin/ceph-mon(+0x1e968e) [0x55e079c5768e]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 3: (Monitor::sync_start(entity_addrvec_t&, bool)+0x3b5) [0x55e079c8a145]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 4: (Monitor::handle_probe_reply(boost::intrusive_ptr<MonOpRequest>)+0x83f) [0x55e079c90baf]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 5: (Monitor::handle_probe(boost::intrusive_ptr<MonOpRequest>)+0x36c) [0x55e079c925dc]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 6: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x112d) [0x55e079cad20d]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 7: (Monitor::_ms_dispatch(Message*)+0x7c9) [0x55e079cadd89]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 8: /usr/bin/ceph-mon(+0x1f6d3e) [0x55e079c64d3e]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 9: (DispatchQueue::entry()+0x53a) [0x7f35a058d34a]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 10: /usr/lib64/ceph/libceph-common.so.2(+0x3bdea1) [0x7f35a061eea1]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 11: /lib64/libc.so.6(+0x89e92) [0x7f359fb6ce92]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 12: /lib64/libc.so.6(+0x10ef20) [0x7f359fbf1f20]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug 0> 2025-07-20T17:52:21.749+0000 7f35981fe640 -1 *** Caught signal (Aborted) **
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: in thread 7f35981fe640 thread_name:ms_dispatch
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: ceph version 17.2.8 (f817ceb7f187defb1d021d6328fa833eb8e943b3) quincy (stable)
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 1: /lib64/libc.so.6(+0x3e730) [0x7f359fb21730]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 2: /lib64/libc.so.6(+0x8bbdc) [0x7f359fb6ebdc]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 3: raise()
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 4: abort()
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 5: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x190) [0x7f35a03a5526]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 6: /usr/bin/ceph-mon(+0x1e968e) [0x55e079c5768e]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 7: (Monitor::sync_start(entity_addrvec_t&, bool)+0x3b5) [0x55e079c8a145]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 8: (Monitor::handle_probe_reply(boost::intrusive_ptr<MonOpRequest>)+0x83f) [0x55e079c90baf]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 9: (Monitor::handle_probe(boost::intrusive_ptr<MonOpRequest>)+0x36c) [0x55e079c925dc]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 10: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x112d) [0x55e079cad20d]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 11: (Monitor::_ms_dispatch(Message*)+0x7c9) [0x55e079cadd89]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 12: /usr/bin/ceph-mon(+0x1f6d3e) [0x55e079c64d3e]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 13: (DispatchQueue::entry()+0x53a) [0x7f35a058d34a]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 14: /usr/lib64/ceph/libceph-common.so.2(+0x3bdea1) [0x7f35a061eea1]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 15: /lib64/libc.so.6(+0x89e92) [0x7f359fb6ce92]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 16: /lib64/libc.so.6(+0x10ef20) [0x7f359fbf1f20]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
I do not know whether the logs are free of sensitive data that
would prevent emailing them. Searching the logs for "checksum
mismatch", I find many occurrences (138).
How can I fix this checksum issue?
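
A count like the 138 above says more when it is broken down per SST
file, since a single damaged file points at a different problem than
corruption spread over the whole store. What follows is only a sketch:
it assumes the mon log has been saved as plain text with one record per
line (not wrapped as in this mail), and the name count_mismatches is
made up for the example.

```python
import re
from collections import Counter

# Matches the RocksDB message seen in the mon log and captures the path
# of the SST file it names (everything up to the ".sst" suffix).
MISMATCH_RE = re.compile(r"block checksum mismatch: .*? in (\S+\.sst)")


def count_mismatches(log_text: str) -> Counter:
    """Count 'block checksum mismatch' occurrences per SST file."""
    return Counter(MISMATCH_RE.findall(log_text))


# Two records shortened from the ceph-01 log, plus one unrelated line.
sample = (
    "rocksdb: Compaction error: Corruption: block checksum mismatch: "
    "stored = 3368055299, computed = 2100551158 in "
    "/var/lib/ceph/mon/ceph-srvr-ceph-01/store.db/061999.sst "
    "offset 10379525 size 91317\n"
    "mon.srvr-ceph-01@0(probing) e5 handle_auth_request failed to assign global_id\n"
    "rocksdb: Waiting after background compaction error: Corruption: "
    "block checksum mismatch: stored = 3368055299, computed = 2100551158 in "
    "/var/lib/ceph/mon/ceph-srvr-ceph-01/store.db/061999.sst "
    "offset 10379525 size 91317, Accumulated background error counts: 1\n"
)

for sst_path, hits in count_mismatches(sample).most_common():
    print(hits, sst_path)
```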
Regards,
S. Barthes
On 21/07/2025 at 09:59, Michel Jouvin wrote:
Stéphane,
On ceph-02, I am not sure why the ceph command is not
installed as on the other nodes, if you installed it the same
way. One way to get access to the ceph command on this server
should be to execute:
cephadm shell
This will start a container where you have the ceph
environment installed and configured for your cluster.
The situation is not as bad as I thought when reading your first
message. You have the mon quorum, so at least the ceph command
should be usable. The first thing to do is probably to log in to
your ceph-01 node and try to understand why the mon daemon is
crashing. You may want to run on this node:
cephadm ls ---> Look for the exact daemon name corresponding
to the mon
cephadm logs --name $daemon_name
Apart from this, it is strange that ceph-03 reports a RADOS
error with 'ceph log last...'; this probably hides another
issue. Could you tell me what the same command says on ceph-02
(when run in cephadm shell)?
Michel
On 21/07/2025 at 09:44, Stéphane Barthes wrote:
Michel,
I ran "ceph log last debug cephadm" on my 3 nodes, and
"mileage varies"
ceph-01 :
some errors, and it ends with
2025-07-20T03:24:18.887889+0000 mgr.srvr-ceph-03.dhzbpe (mgr.134360) 1368 : cephadm [INF] Deploying daemon mon.srvr-ceph-03 on srvr-ceph-03
when I had to remove the mon daemon and redeploy on ceph-03.
ceph-02 :
root@srvr-ceph-02:~# ceph log last debug cephadm
Command 'ceph' not found, but can be installed with:
snap install microceph # version 18.2.4+snapc9f2b08f92, or
apt install ceph-common # version 17.2.7-0ubuntu0.22.04.2
See 'snap info microceph' for additional versions.
??? should I install ceph-common ???
ceph-03 :
root@srvr-ceph-03:~# ceph log last debug cephadm
Error initializing cluster client: ObjectNotFound('RADOS object not found (error calling conf_read_file)')
root@srvr-ceph-03:~#
FWIW : ceph health is :
root@srvr-ceph-01:~# ceph health detail
HEALTH_WARN 1 failed cephadm daemon(s); 1/3 mons down, quorum srvr-ceph-03,srvr-ceph-02; 10 daemons have recently crashed
[WRN] CEPHADM_FAILED_DAEMON: 1 failed cephadm daemon(s)
    daemon mon.srvr-ceph-01 on srvr-ceph-01 is in error state
[WRN] MON_DOWN: 1/3 mons down, quorum srvr-ceph-03,srvr-ceph-02
    mon.srvr-ceph-01 (rank 0) addr [v2:10.32.100.22:3300/0,v1:10.32.100.22:6789/0] is down (out of quorum)
[WRN] RECENT_CRASH: 10 daemons have recently crashed
    mon.srvr-ceph-01 crashed on host srvr-ceph-01 at 2025-07-20T17:50:10.202091Z
    mon.srvr-ceph-01 crashed on host srvr-ceph-01 at 2025-07-20T17:49:47.712267Z
    mon.srvr-ceph-01 crashed on host srvr-ceph-01 at 2025-07-20T17:50:21.464475Z
    mon.srvr-ceph-01 crashed on host srvr-ceph-01 at 2025-07-20T17:49:36.609442Z
    mon.srvr-ceph-01 crashed on host srvr-ceph-01 at 2025-07-20T17:49:58.966663Z
    mon.srvr-ceph-01 crashed on host srvr-ceph-01 at 2025-07-20T17:51:36.947240Z
    mon.srvr-ceph-01 crashed on host srvr-ceph-01 at 2025-07-20T17:52:21.751711Z
    mon.srvr-ceph-01 crashed on host srvr-ceph-01 at 2025-07-20T17:51:48.490875Z
    mon.srvr-ceph-01 crashed on host srvr-ceph-01 at 2025-07-20T17:51:59.651129Z
    mon.srvr-ceph-01 crashed on host srvr-ceph-01 at 2025-07-20T17:52:10.552756Z
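
The RECENT_CRASH entries above are not listed in time order. Sorting
them makes it easier to see whether the mon is stuck in a restart loop.
This is only a sketch working on the pasted text of 'ceph health
detail'; the name parse_crashes is made up for the example, and the
pattern tolerates the timestamp being wrapped onto the next line.

```python
import re
from datetime import datetime, timezone

# "<daemon> crashed on host <host> at <timestamp>Z"; \s+ allows a line
# wrap between "at" and the timestamp.
CRASH_RE = re.compile(r"(\S+) crashed on host (\S+) at\s+(\S+)Z")


def parse_crashes(health_text: str):
    """Return (daemon, host, UTC datetime) tuples sorted by crash time."""
    crashes = []
    for daemon, host, stamp in CRASH_RE.findall(health_text):
        when = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%f")
        crashes.append((daemon, host, when.replace(tzinfo=timezone.utc)))
    return sorted(crashes, key=lambda crash: crash[2])


# Three of the entries reported above, in their original order.
sample = (
    "mon.srvr-ceph-01 crashed on host srvr-ceph-01 at 2025-07-20T17:50:10.202091Z\n"
    "mon.srvr-ceph-01 crashed on host srvr-ceph-01 at\n"
    "2025-07-20T17:49:47.712267Z\n"
    "mon.srvr-ceph-01 crashed on host srvr-ceph-01 at 2025-07-20T17:49:58.966663Z\n"
)

for daemon, host, when in parse_crashes(sample):
    print(when.isoformat(), daemon, host)
```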
S. Barthes
On 21/07/2025 at 09:31, Michel Jouvin wrote:
Stephane,
If you are using cephadm, the OS (distribution and version) you
use should not matter. When using cephadm with several
servers (the general case!), it is important to properly set up
the SSH key used by cephadm for the communication
between nodes (cephadm is a sort of SSH-based management
cluster) and to check that you can log in from one node to
the other using SSH. Can you confirm that this is the case?
Also, cephadm has a specific log. I don't use the dashboard
much, so I am not sure how you display it there (it may be part of
the logs displayed by the dashboard), but you can access it
with the command:
ceph log last debug cephadm
Michel
On 21/07/2025 at 09:19, Stéphane Barthes wrote:
Hi,
Yes, I did use cephadm, to bootstrap the 1st node in the
cluster, installed cephadm on the other nodes, and used
the dashboard to add the nodes to the cluster.
Regards,
S. Barthes
On 21/07/2025 at 09:12, Michel Jouvin wrote:
Hi Stephane,
How did you configure your cluster? Have you been using
cephadm? If not, I really advise you to recreate your
cluster with cephadm, which includes a script to bootstrap
the cluster. In particular, if you don't have detailed
knowledge of the Ceph architecture and management, it will
ensure that your cluster is properly configured and let
you progressively learn about the Ceph details...
Best regards.
Michel
On 21/07/2025 at 09:02, Stéphane Barthes wrote:
Hello,
I am very new to Ceph and have set up a small cluster
to get started with it.
But so far my experience is not very impressive,
probably through lack of knowledge and good practices.
I started with Ubuntu 24, installed 3 VMs for a Ceph
cluster, and somehow could not get it running. Adding
nodes would fail when adding OSDs, with some weird error (I
found it on the web but could not solve the problem).
I then made a new cluster with 3 Ubuntu 22 VMs. Install
OK, start OK; I created 1 pool to test storing stuff
there and work my way towards crash testing. However, the
cluster dies during the weekly VM snapshot. It may not be a
good idea to run VM backups on a Ceph host, but I find
this a little surprising. (Crash testing started earlier
than expected.)
Bottom line is that, after the backup, the cluster is in
warning state with missing mons or logrotate, and
sometimes crashed machines. Restarting the service with
systemctl or rebooting the node usually fixes it.
I am now stuck in a situation I cannot fix:
- 1 machine that is a Ceph RBD client cannot authenticate: auth
method 'x' error -13. I have tried quite a few things,
and none unlocked the situation. I am currently trying
to reboot the machine, but the busy/stuck RBD device
seems to block it. I am not looking forward to hard
resetting it.
- The node with the mgr service will not restart mon or
logrotate. I did reboot it again today, but I guess this
is not how a node is expected to behave.
So my questions:
- How can I unlock my stuck Ceph client when this
kind of error occurs?
- Is it expected behavior that a client loses
access to the cluster, which more or less kills the machine?
- Where should I look in the Ceph node logs to
figure out what is going wrong and how to fix it, so that
it runs in a stable manner?
Regards,
--
S. Barthes
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io