Hi Stéphane,
'ceph -s' requires the mon quorum to be reached, otherwise the command hangs. cephadm does not use the Ceph cluster's internal communication; it builds an SSH-based management layer on top of it, so it can manage the cluster even if the quorum is lost, but it cannot provide any information that requires the quorum to be reached.
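For example, these should still work on each host even without quorum, since they only talk to the local daemons (the daemon name below is an example, take the real one from the cephadm ls output):
cephadm ls
cephadm logs --name mon.srvr-ceph-01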
Michel
On 22/07/2025 at 10:33, Stéphane Barthes wrote:
Hi Malte,
Thanks for your reply. Here is some info:
ceph -s hangs and times out mon hunting after 300 s.
But I can run cephadm shell. Is there a similar command under cephadm
shell?
ceph health detail : same as above.
I would like to repair it instead of wipe & restart, as it is (from
my point of view) a good way to learn (and there is some data I'd
like to recover).
What is the problem with Ubuntu 24.04? I did not see warnings regarding
this specific version in
https://docs.ceph.com/en/latest/cephadm/install/#cephadm-install-distros
Regards,
S. Barthes
T: +33 4 72 52 35 40 M: +33 6 14 73 18 34
InTest S.A.
4 Allée du Levant
69890 La Tour de Salvagny
On 22/07/2025 at 10:02, Malte Stroem wrote:
Hello Stéphane,
I think you're mixing up a lot of things!
You always have to show us the output of:
ceph -s
And more! Logs and stuff, e.g.:
ceph health detail
It is clear you missed something here and there.
It is repairable but since it is a test cluster, just delete it and
start again.
And follow the documentation for cephadm. And do not use Ubuntu 24.04.
Best,
Malte
On 22.07.25 09:02, Stéphane Barthes wrote:
Hello,
Today, things have degraded a bit more. The ceph-03 mon has failed and
will not restart. It shows the same kind of checksum error in the
rocksdb compact operation during startup. As a consequence, I lost
quorum, and ceph commands hang.
Would it be wise to disable the rocksdb compact, restart, and get
quorum back? If yes, what is the exact syntax of the setting in
ceph.conf? I have seen one for OSDs, but I am not sure whether it would apply:
[osd]
osd_compact_on_start = true
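If the mon side mirrors the OSD option, I suppose it would be something like this (just my guess from the option naming, I have not verified that it applies to the checksum error):
[mon]
mon_compact_on_start = false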
If I can restart, I will try to out the OSDs and recreate them.
Last time I looked, the OSDs seemed fine in the dashboard. Since I have
no dashboard, is there a command I can use to check their status?
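(I can at least run cephadm ls on each host, which seems to list the daemon containers and their state, e.g.:
cephadm ls | grep -A2 '"name": "osd'
but I do not know whether that says anything about the health of the data on the OSDs.)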
Regards,
S. Barthes
On 21/07/2025 at 14:27, Stéphane Barthes wrote:
Michel,
cephadm shell starts on all 3 nodes without error, and each host has
the same ceph public key entry in the .ssh/authorized_keys file of
the root user.
ceph-01 also has ceph.pub in /etc/ceph with the same key (this is
the node I started the install from).
ceph-02 has no /etc/ceph folder.
ceph-03 has a /etc/ceph folder, but no ceph.pub file there.
S. Barthes
On 21/07/2025 at 12:36, Michel Jouvin wrote:
Hi Stéphane,
Sorry, I was busy and did not look at your previous answers... It
is a bit difficult for me to understand how you ended up in this
situation, but it is strange to me that ceph-02 complains about a
missing keyring, and the corrupted RocksDB on a freshly created
cluster is also a bit strange. I don't think it makes sense
to destroy and recreate the OSD; I am running several clusters
with hundreds of OSDs and I never saw a mis-initialized one. The
problem is hiding something else, I'm afraid. Because of some
misconfiguration, maybe one OSD is in a bad state and may need to
be reinitialized, but first we should get the 3 mons running
properly and `cephadm shell` working properly on the 3 hosts. And
the RocksDB compaction issue, for me, is related to your mon, not
to an OSD.
Have you checked that the SSH configuration for cephadm is working
well from any host to any other one in your cluster (with 3 hosts,
it should be really straightforward to check)? The ceph-02 problem
may be a sign of SSH misconfiguration, as cephadm uses the SSH
connection to push the keyring, if I am right.
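If that is the mechanism, one way to make cephadm push ceph.conf and the admin keyring to a host is the _admin host label, something like (run from a node where the ceph command works):
ceph orch host label add srvr-ceph-02 _admin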
Michel
On 21/07/2025 at 12:17, Stéphane Barthes wrote:
Hi,
Should I just wipe the OSD and let Ceph rebuild it (as suggested
here: https://ceph-users.ceph.narkive.com/LO6ebu9r/how-to-recover-from-corrupted-rocksdb)?
What would the suggested way be:
cephadm rm-daemon osd.ceph-01
then
cephadm deploy ?
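From the cephadm help, it looks like rm-daemon also wants the cluster fsid, so maybe something like (osd.0 is just a placeholder, I would take the real name from cephadm ls):
cephadm rm-daemon --name osd.0 --fsid <cluster-fsid>
but I am not sure what the right counterpart to redeploy it is.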
Regards,
S. Barthes
On 21/07/2025 at 10:33, Stéphane Barthes wrote:
Michel,
ceph-02 logs :
root@srvr-ceph-02:/# ceph log last debug cephadm
2025-07-21T08:16:54.814+0000 7efe1a884640 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No such file or directory
2025-07-21T08:16:54.814+0000 7efe1a884640 -1 AuthRegistry(0x7efe14064de0) no keyring found at /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin, disabling cephx
2025-07-21T08:16:54.818+0000 7efe1a884640 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No such file or directory
2025-07-21T08:16:54.818+0000 7efe1a884640 -1 AuthRegistry(0x7efe1a883000) no keyring found at /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin, disabling cephx
2025-07-21T08:16:54.818+0000 7efe137fe640 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [1]
2025-07-21T08:16:54.818+0000 7efe18e21640 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [1]
2025-07-21T08:16:57.818+0000 7efe137fe640 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [1]
2025-07-21T08:16:57.818+0000 7efe13fff640 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [1]
2025-07-21T08:17:00.818+0000 7efe137fe640 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [1]
2025-07-21T08:17:00.818+0000 7efe18e21640 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [1]
^CCluster connection aborted
root@srvr-ceph-02:/#
Regarding the ceph-01 log, there is a LOT. Looking from the end,
I see this:
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug -19> 2025-07-20T17:52:21.137+0000 7f359f42d8c0 2 auth: KeyRing::load: loaded key file /var/lib/ceph/mon/ceph-srvr-ceph-01/keyring
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug -18> 2025-07-20T17:52:21.137+0000 7f359f42d8c0 2 mon.srvr-ceph-01@-1(???) e5 init
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug -17> 2025-07-20T17:52:21.137+0000 7f359f42d8c0 4 mgrc handle_mgr_map Got map version 73
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug -16> 2025-07-20T17:52:21.137+0000 7f359f42d8c0 4 mgrc handle_mgr_map Active mgr is now [v2:10.32.100.24:6800/4015208663,v1:10.32.100.24:6801/4015208663]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug -15> 2025-07-20T17:52:21.137+0000 7f359f42d8c0 4 mgrc reconnect Starting new session with [v2:10.32.100.24:6800/4015208663,v1:10.32.100.24:6801/4015208663]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug -14> 2025-07-20T17:52:21.137+0000 7f359c206640 -1 mon.srvr-ceph-01@-1(???) e5 handle_auth_bad_method hmm, they didn't like 2 result (13) Permission denied
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug -13> 2025-07-20T17:52:21.137+0000 7f359f42d8c0 0 mon.srvr-ceph-01@-1(probing) e5 my rank is now 0 (was -1)
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug -12> 2025-07-20T17:52:21.161+0000 7f359d208640 3 rocksdb: [db/db_impl/db_impl_compaction_flush.cc:3026] Compaction error: Corruption: block checksum mismatch: stored = 3368055299, computed = 2100551158 in /var/lib/ceph/mon/ceph-srvr-ceph-01/store.db/061999.sst offset 10379525 size 91317
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug -11> 2025-07-20T17:52:21.161+0000 7f359d208640 4 rocksdb: (Original Log Time 2025/07/20-17:52:21.164193) [db/compaction/compaction_job.cc:812] [default] compacted to: base level 6 level multiplier 10.00 max bytes base 268435456 files[4 0 0 0 0 0 1] max score 0.00, MB/sec: 514.9 rd, 272.6 wr, level 6, files in(4, 1) out(0) MB in(4.0, 14.8) out(9.9), read-write-amplify(7.2) write-amplify(2.5) Corruption: block checksum mismatch: stored = 3368055299, computed = 2100551158 in /var/lib/ceph/mon/ceph-srvr-ceph-01/store.db/061999.sst offset 10379525 size 91317, records in: 25191, records dropped: 3
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug -10> 2025-07-20T17:52:21.161+0000 7f359d208640 4 rocksdb: (Original Log Time 2025/07/20-17:52:21.164212) EVENT_LOG_v1 {"time_micros": 1753033941164205, "job": 3, "event": "compaction_finished", "compaction_time_micros": 38166, "compaction_time_cpu_micros": 25133, "output_level": 6, "num_output_files": 0, "total_output_size": 10404253, "num_input_records": 25191, "num_output_records": 21216, "num_subcompactions": 1, "output_compression": "NoCompression", "num_single_delete_mismatches": 0, "num_single_delete_fallthrough": 0, "lsm_state": [4, 0, 0, 0, 0, 0, 1]}
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug -9> 2025-07-20T17:52:21.161+0000 7f359d208640 2 rocksdb: [db/db_impl/db_impl_compaction_flush.cc:2545] Waiting after background compaction error: Corruption: block checksum mismatch: stored = 3368055299, computed = 2100551158 in /var/lib/ceph/mon/ceph-srvr-ceph-01/store.db/061999.sst offset 10379525 size 91317, Accumulated background error counts: 1
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug -8> 2025-07-20T17:52:21.341+0000 7f359c206640 -1 mon.srvr-ceph-01@0(probing) e5 handle_auth_bad_method hmm, they didn't like 2 result (13) Permission denied
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug -7> 2025-07-20T17:52:21.741+0000 7f359c206640 -1 mon.srvr-ceph-01@0(probing) e5 handle_auth_bad_method hmm, they didn't like 2 result (13) Permission denied
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug -6> 2025-07-20T17:52:21.741+0000 7f359c206640 1 mon.srvr-ceph-01@0(probing) e5 handle_auth_request failed to assign global_id
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug -5> 2025-07-20T17:52:21.749+0000 7f35981fe640 5 mon.srvr-ceph-01@0(probing) e5 _ms_dispatch setting monitor caps on this connection
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug -4> 2025-07-20T17:52:21.749+0000 7f35981fe640 1 mon.srvr-ceph-01@0(synchronizing) e5 sync_obtain_latest_monmap
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug -3> 2025-07-20T17:52:21.749+0000 7f35981fe640 1 mon.srvr-ceph-01@0(synchronizing) e5 sync_obtain_latest_monmap obtained monmap e5
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: [285B blob data]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: PutCF( prefix = mon_sync key = 'latest_monmap' value size = 508)
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: PutCF( prefix = mon_sync key = 'in_sync' value size = 8)
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: PutCF( prefix = mon_sync key = 'last_committed_floor' value size = 8)
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug -1> 2025-07-20T17:52:21.749+0000 7f35981fe640 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/17.2.8/rpm/el9/BUILD/ceph-17.2.8/src/mon/MonitorDBStore.h: In function 'int MonitorDBStore::apply_transaction(MonitorDBStore::TransactionRef)' thread 7f35981fe640 time 2025-07-20T17:52:21.750611+0000
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/17.2.8/rpm/el9/BUILD/ceph-17.2.8/src/mon/MonitorDBStore.h: 355: ceph_abort_msg("failed to write to db")
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: ceph version 17.2.8 (f817ceb7f187defb1d021d6328fa833eb8e943b3) quincy (stable)
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xd3) [0x7f35a03a5469]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 2: /usr/bin/ceph-mon(+0x1e968e) [0x55e079c5768e]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 3: (Monitor::sync_start(entity_addrvec_t&, bool)+0x3b5) [0x55e079c8a145]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 4: (Monitor::handle_probe_reply(boost::intrusive_ptr<MonOpRequest>)+0x83f) [0x55e079c90baf]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 5: (Monitor::handle_probe(boost::intrusive_ptr<MonOpRequest>)+0x36c) [0x55e079c925dc]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 6: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x112d) [0x55e079cad20d]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 7: (Monitor::_ms_dispatch(Message*)+0x7c9) [0x55e079cadd89]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 8: /usr/bin/ceph-mon(+0x1f6d3e) [0x55e079c64d3e]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 9: (DispatchQueue::entry()+0x53a) [0x7f35a058d34a]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 10: /usr/lib64/ceph/libceph-common.so.2(+0x3bdea1) [0x7f35a061eea1]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 11: /lib64/libc.so.6(+0x89e92) [0x7f359fb6ce92]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 12: /lib64/libc.so.6(+0x10ef20) [0x7f359fbf1f20]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug 0> 2025-07-20T17:52:21.749+0000 7f35981fe640 -1 *** Caught signal (Aborted) **
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: in thread 7f35981fe640 thread_name:ms_dispatch
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: ceph version 17.2.8 (f817ceb7f187defb1d021d6328fa833eb8e943b3) quincy (stable)
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 1: /lib64/libc.so.6(+0x3e730) [0x7f359fb21730]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 2: /lib64/libc.so.6(+0x8bbdc) [0x7f359fb6ebdc]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 3: raise()
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 4: abort()
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 5: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x190) [0x7f35a03a5526]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 6: /usr/bin/ceph-mon(+0x1e968e) [0x55e079c5768e]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 7: (Monitor::sync_start(entity_addrvec_t&, bool)+0x3b5) [0x55e079c8a145]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 8: (Monitor::handle_probe_reply(boost::intrusive_ptr<MonOpRequest>)+0x83f) [0x55e079c90baf]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 9: (Monitor::handle_probe(boost::intrusive_ptr<MonOpRequest>)+0x36c) [0x55e079c925dc]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 10: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x112d) [0x55e079cad20d]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 11: (Monitor::_ms_dispatch(Message*)+0x7c9) [0x55e079cadd89]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 12: /usr/bin/ceph-mon(+0x1f6d3e) [0x55e079c64d3e]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 13: (DispatchQueue::entry()+0x53a) [0x7f35a058d34a]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 14: /usr/lib64/ceph/libceph-common.so.2(+0x3bdea1) [0x7f35a061eea1]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 15: /lib64/libc.so.6(+0x89e92) [0x7f359fb6ce92]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 16: /lib64/libc.so.6(+0x10ef20) [0x7f359fbf1f20]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
I do not know whether the logs are purged of sensitive data that
would prevent emailing them in full. Searching for "checksum mismatch"
in the logs, there are many occurrences (138).
How can I fix this checksum issue?
Regards,
S. Barthes
On 21/07/2025 at 09:59, Michel Jouvin wrote:
Stéphane,
On ceph-02, I am not sure why the ceph command is not installed
as it is on the other nodes, if you installed them the same way. One
way to get access to the ceph command on this server should be
to execute:
cephadm shell
This will start a container where you have the ceph environment
installed and configured for your cluster.
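You can also run a single command without staying inside the container, e.g.:
cephadm shell -- ceph -s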
The situation is not as bad as I thought reading your first
message. You have the mon quorum, so at least the ceph command
should be usable. The first thing to do is probably to log on to
your ceph-01 node and try to understand why the mon daemon is
crashing. You may want to run on this node:
cephadm ls ---> Look for the exact daemon name corresponding
to the mon
cephadm logs --name $daemon_name
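cephadm logs is, if I remember correctly, a wrapper around journalctl for the daemon's systemd unit, so extra journalctl options can be passed after --, e.g.:
cephadm logs --name mon.srvr-ceph-01 -- -n 200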
Apart from this, it is strange that ceph-03 reports a RADOS
error with 'ceph log last...'; this probably hides another
issue. Could you tell us what the same command says on ceph-02
(when run in cephadm shell)?
Michel
On 21/07/2025 at 09:44, Stéphane Barthes wrote:
Michel,
I ran "ceph log last debug cephadm" on my 3 nodes, and
"mileage varies"
ceph-01 :
some errors, and it ends with
2025-07-20T03:24:18.887889+0000 mgr.srvr-ceph-03.dhzbpe (mgr.134360) 1368 : cephadm [INF] Deploying daemon mon.srvr-ceph-03 on srvr-ceph-03
from when I had to remove the mon daemon and redeploy it on ceph-03.
ceph-02 :
root@srvr-ceph-02:~# ceph log last debug cephadm
Command 'ceph' not found, but can be installed with:
snap install microceph # version 18.2.4+snapc9f2b08f92, or
apt install ceph-common # version 17.2.7-0ubuntu0.22.04.2
See 'snap info microceph' for additional versions.
??? should I install ceph-common ???
ceph-03 :
root@srvr-ceph-03:~# ceph log last debug cephadm
Error initializing cluster client: ObjectNotFound('RADOS
object not found (error calling conf_read_file)')
root@srvr-ceph-03:~#
FWIW, ceph health detail shows:
root@srvr-ceph-01:~# ceph health detail
HEALTH_WARN 1 failed cephadm daemon(s); 1/3 mons down, quorum srvr-ceph-03,srvr-ceph-02; 10 daemons have recently crashed
[WRN] CEPHADM_FAILED_DAEMON: 1 failed cephadm daemon(s)
    daemon mon.srvr-ceph-01 on srvr-ceph-01 is in error state
[WRN] MON_DOWN: 1/3 mons down, quorum srvr-ceph-03,srvr-ceph-02
    mon.srvr-ceph-01 (rank 0) addr [v2:10.32.100.22:3300/0,v1:10.32.100.22:6789/0] is down (out of quorum)
[WRN] RECENT_CRASH: 10 daemons have recently crashed
    mon.srvr-ceph-01 crashed on host srvr-ceph-01 at 2025-07-20T17:50:10.202091Z
    mon.srvr-ceph-01 crashed on host srvr-ceph-01 at 2025-07-20T17:49:47.712267Z
    mon.srvr-ceph-01 crashed on host srvr-ceph-01 at 2025-07-20T17:50:21.464475Z
    mon.srvr-ceph-01 crashed on host srvr-ceph-01 at 2025-07-20T17:49:36.609442Z
    mon.srvr-ceph-01 crashed on host srvr-ceph-01 at 2025-07-20T17:49:58.966663Z
    mon.srvr-ceph-01 crashed on host srvr-ceph-01 at 2025-07-20T17:51:36.947240Z
    mon.srvr-ceph-01 crashed on host srvr-ceph-01 at 2025-07-20T17:52:21.751711Z
    mon.srvr-ceph-01 crashed on host srvr-ceph-01 at 2025-07-20T17:51:48.490875Z
    mon.srvr-ceph-01 crashed on host srvr-ceph-01 at 2025-07-20T17:51:59.651129Z
    mon.srvr-ceph-01 crashed on host srvr-ceph-01 at 2025-07-20T17:52:10.552756Z
S. Barthes
On 21/07/2025 at 09:31, Michel Jouvin wrote:
Stephane,
If you are using cephadm, the OS (distribution and version) you
use should not matter. When using cephadm with several
servers (the general case!), it is important to properly set up
the SSH key used by cephadm for the communication
between nodes (cephadm is sort of an SSH-based management
cluster) and to check that you can log in from one node to
the other using SSH. Can you confirm that this is the case?
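Something simple like this, run from each of the 3 nodes in turn, should be enough to spot a broken direction (hostnames taken from your other messages):
for h in srvr-ceph-01 srvr-ceph-02 srvr-ceph-03; do ssh root@$h hostname; done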
Also, cephadm has a specific log. I don't use the dashboard
much, so I am not sure how you display it there (it may be part of the
logs displayed by the dashboard), but you can access it with
the command:
ceph log last debug cephadm
Michel
On 21/07/2025 at 09:19, Stéphane Barthes wrote:
Hi,
Yes, I did use cephadm to bootstrap the first node in the
cluster, installed cephadm on the other nodes, and used the
dashboard to add the nodes to the cluster.
Regards,
S. Barthes
On 21/07/2025 at 09:12, Michel Jouvin wrote:
Hi Stephane,
How did you configure your cluster? Have you been using
cephadm? If not, I really advise you to recreate your
cluster with cephadm, which includes a script to bootstrap
the cluster. In particular, if you don't have detailed
knowledge of Ceph architecture and management, it will
ensure that your cluster is properly configured and let you
progressively learn about Ceph details...
Best regards.
Michel
On 21/07/2025 at 09:02, Stéphane Barthes wrote:
Hello,
I am very new to Ceph and have started a small cluster to
get started with it.
But so far my experience is not very impressive, probably
due to a lack of knowledge and good practices.
I started with Ubuntu 24.04, installed 3 VMs for a Ceph
cluster, and somehow could not get it running. Adding
nodes would fail when adding OSDs with some weird error (I found
it on the web but could not solve the problem).
I then made a new cluster with 3 Ubuntu 22.04 VMs. Install OK,
start OK; I created one pool to test storing stuff there and
to work my way through crash testing. However, the cluster dies
during the weekly VM snapshot. It may not be a good idea to
run VM backups on a Ceph host, but I find this a little
surprising. (Crash testing started earlier than expected.)
The bottom line is that, after the backup, the cluster is in a
warning state with missing mons or failed logrotate, and
sometimes crashed machines. Restarting the service with
systemctl, or rebooting the node, usually fixes it.
I am now stuck in a situation I cannot fix:
    - One machine, a Ceph RBD client, cannot authenticate: auth
method 'x' error -13. I have tried quite a few things, and
none unlocked the situation. I am currently trying to
reboot the machine, but the busy/stuck RBD device seems to
block it. I am not looking forward to hard-resetting it.
    - The node with the mgr service will not restart mon or
logrotate. I did reboot it again today, but I guess this
is not how a node is expected to behave.
So my questions:
    - How can I unlock my stuck Ceph client when this
kind of error occurs?
    - Is it expected behavior that a client loses access
to the cluster, which more or less kills the machine?
    - Where should I look in the Ceph node logs to figure
out what is going wrong, and how to fix it, so that the cluster
runs in a stable manner?
Regards,
--
S. Barthes
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io