Hi,

> even “ceph -s” was hanging

That command contacts the MONs, so are you sure the cluster is healthy? If 'ceph -s' hangs, I suspect either a network or a MON (quorum) issue.

The default ceph uid/gid has always been 167 for as long as I've been working with Ceph, which is more than 10 years now, and it's the same within containers. And yes, there is a difference between cephadm setups and package-based installs. Did you make sure that you don't have any ceph packages installed other than cephadm and maybe ceph-common? There can be issues when both deb packages and containers are running for the same daemon. So if only the MGR is affected, I recommend making sure that no ceph-mgr package is installed.

It looks like the telemetry module is responsible for the crash. Can you turn that off? In the latest releases you can force-disable a module:

    ceph mgr module force disable telemetry

Can you check whether the MGR still crashes after that? Some debug logs (debug_mgr) might also be helpful for the devs. I couldn't find an existing tracker issue for this, so I'd suggest creating a new one.

Regards,
Eugen

On Sun, 8 Feb 2026 at 14:46, Daniel Brown via ceph-users <[email protected]> wrote:

>
> Greetings,
>
> Have been seeing repeated crashes on my mgr module. Seems to run for about
> 45 to 50 seconds and then boom. Cephadm setup here. Did (try to) enable a
> couple modules lately: iostat, stats, diskprediction_local, but have toggled
> them back off - unfortunately it hasn’t fixed the issue.
>
> I was seeing “no mgr” with ceph -s. To get around that I’ve tweaked the
> “StartLimitInterval” setting in the ceph-[CLUSTER-UID]@.service file down
> to 1m, and have 4x mgrs set up so that I can get a couple commands run
> before mgr crashes and another starts. The 30m default there seems… high,
> imo - I was having intervals with no mgr which makes it tough to do much
> with the cluster - even “ceph -s” was hanging. Everything else in the
> cluster seems “normal” - still serving data.
>
> One other note, which I think is generally unrelated: I did upgrade one
> of my cluster nodes from “Plucky Puffin” (25.04) ubuntu, to “Questing
> Quokka” (25.10) ubuntu. After the upgrade, cephadm managed containers
> didn’t want to start. I tracked that down to having the ceph user userid in
> /etc/passwd set at 64045, but the container seeming to want userid 167.
> Most things under /var/lib/ceph/[CLUSTER UID]/ … appear to be owned by
> user/group 167:167; I assume this is a default inside the container.
> Workaround here was to manually change the UID/GID for ceph in /etc/passwd
> and /etc/group. I’m going to imagine this is some collision between cephadm
> managed deployments, and how Ubuntu / apt installs cephadm.
>
> The aforementioned mgr Crashes look like:
>
> {
>     "assert_condition": "cursor != root",
>     "assert_file": "/ceph/rpmbuild/BUILD/ceph-20.2.0/src/mgr/PyFormatter.h",
>     "assert_func": "virtual void PyFormatter::close_section()",
>     "assert_line": 84,
>     "assert_msg": "/ceph/rpmbuild/BUILD/ceph-20.2.0/src/mgr/PyFormatter.h: In function 'virtual void PyFormatter::close_section()' thread ffff34e55700 time 2026-02-08T13:10:07.526894+0000\n/ceph/rpmbuild/BUILD/ceph-20.2.0/src/mgr/PyFormatter.h: 84: FAILED ceph_assert(cursor != root)\n",
>     "assert_thread_name": "telemetry",
>     "backtrace": [
>         "__kernel_rt_sigreturn()",
>         "/lib64/libc.so.6(+0x82a78) [0xffff83603a78]",
>         "raise()",
>         "abort()",
>         "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x190) [0xffff84039874]",
>         "/usr/bin/ceph-mgr(+0xcf540) [0xaaaacdf2f540]",
>         "(ActivePyModules::get_perf_schema_python(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xf6c) [0xaaaacdf46ec0]",
>         "/usr/bin/ceph-mgr(+0x105528) [0xaaaacdf65528]",
>         "/lib64/libpython3.9.so.1.0(+0xf7bfc) [0xffff84ae9bfc]",
>         "_PyEval_EvalFrameDefault()",
>         "/lib64/libpython3.9.so.1.0(+0xda3f0) [0xffff84acc3f0]",
>         "_PyEval_EvalFrameDefault()",
>         "/lib64/libpython3.9.so.1.0(+0xc47e8) [0xffff84ab67e8]",
>         "_PyFunction_Vectorcall()",
>         "_PyEval_EvalFrameDefault()",
>         "/lib64/libpython3.9.so.1.0(+0xc47e8) [0xffff84ab67e8]",
>         "_PyFunction_Vectorcall()",
>         "_PyEval_EvalFrameDefault()",
>         "/lib64/libpython3.9.so.1.0(+0xc47e8) [0xffff84ab67e8]",
>         "_PyFunction_Vectorcall()",
>         "_PyEval_EvalFrameDefault()",
>         "/lib64/libpython3.9.so.1.0(+0xc47e8) [0xffff84ab67e8]",
>         "_PyFunction_Vectorcall()",
>         "_PyEval_EvalFrameDefault()",
>         "/lib64/libpython3.9.so.1.0(+0xda3f0) [0xffff84acc3f0]",
>         "/lib64/libpython3.9.so.1.0(+0xea93c) [0xffff84adc93c]",
>         "/lib64/libpython3.9.so.1.0(+0xcf304) [0xffff84ac1304]",
>         "/lib64/libpython3.9.so.1.0(+0x197d78) [0xffff84b89d78]",
>         "_PyObject_CallMethod_SizeT()",
>         "(PyModuleRunner::serve()+0x6c) [0xaaaacdfdf6cc]",
>         "(PyModuleRunner::PyModuleRunnerThread::entry()+0x148) [0xaaaacdfdff08]"
>     ],
>     "ceph_version": "20.2.0",
>     "crash_id": "2026-02-08T13:10:07.528739Z_f18e6b74-438b-47db-9438-5a3861fdef2d",
>     "entity_name": "mgr.hc-945901a5cad1b6e3.mtijbv",
>     "os_id": "centos",
>     "os_name": "CentOS Stream",
>     "os_version": "9",
>     "os_version_id": "9",
>     "process_name": "ceph-mgr",
>     "stack_sig": "319d76a0d71a4644f9d65f592f6e621cca918d9205fc759ab7acf6944bc77fdd",
>     "timestamp": "2026-02-08T13:10:07.528739Z",
>     "utsname_hostname": "hc-945901a5cad1b6e3",
>     "utsname_machine": "aarch64",
>     "utsname_release": "6.14.0-1010-raspi",
>     "utsname_sysname": "Linux",
>     "utsname_version": "#10-Ubuntu SMP PREEMPT_DYNAMIC Tue Jul 15 19:09:05 UTC 2025"
> }
> _______________________________________________
> ceph-users mailing list -- [email protected]
> To unsubscribe send an email to [email protected]
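
When several crash reports pile up, it can help to reduce each one to the fields that point at the culprit. The following is a minimal Python sketch, not part of Ceph: `summarize_crash` is a hypothetical helper, and the sample dictionary only repeats a few fields from the crash report quoted above. It assumes the report has already been saved as JSON.

```python
import json


def summarize_crash(report: dict) -> str:
    """Return a one-line summary of a crash report dict (hypothetical helper)."""
    process = report.get("process_name", "unknown process")
    version = report.get("ceph_version", "unknown version")
    thread = report.get("assert_thread_name", "unknown thread")
    cond = report.get("assert_condition", "no assert recorded")
    func = report.get("assert_func", "unknown function")
    return f"{process} {version}: thread '{thread}' failed assert ({cond}) in {func}"


# A few fields copied from the crash report quoted in this thread.
sample = json.loads("""
{
  "assert_condition": "cursor != root",
  "assert_func": "virtual void PyFormatter::close_section()",
  "assert_thread_name": "telemetry",
  "ceph_version": "20.2.0",
  "process_name": "ceph-mgr"
}
""")

print(summarize_crash(sample))
```

Here the thread name in the summary is what suggests the telemetry module as the trigger.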

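
For the uid/gid mismatch described in the quoted message, a rough way to see which files are not owned by 167:167 is to walk the data directory and compare ownership. This is only a sketch under assumptions: `find_ownership_mismatches` is a hypothetical helper, the 167:167 default is taken from this thread, and the directory to scan (for example the cluster directory under /var/lib/ceph) has to be supplied by the caller. It only reports and changes nothing.

```python
import os


def find_ownership_mismatches(root, uid=167, gid=167):
    """List (path, uid, gid) for entries under root not owned by uid:gid.

    Hypothetical helper; the 167:167 default comes from the thread above.
    """
    # Check the root itself, then everything below it.
    # os.walk does not follow symlinked directories by default.
    candidates = [root]
    for dirpath, dirnames, filenames in os.walk(root):
        candidates.extend(os.path.join(dirpath, name) for name in dirnames + filenames)

    mismatches = []
    for path in candidates:
        st = os.lstat(path)  # lstat inspects a symlink itself, not its target
        if (st.st_uid, st.st_gid) != (uid, gid):
            mismatches.append((path, st.st_uid, st.st_gid))
    return mismatches
```

Calling it on the cluster data directory and printing the result gives a quick picture of whether the host's ceph user and the container's expectation disagree before anything is changed in /etc/passwd or /etc/group.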