There's a lot going on with your cluster... You seem to have broken
the mgr, which I assume is why you're not seeing any deployment
attempts. Sometimes the mgr ends up "broken" when some of its
background processes never finish; your missing ceph06 host might
cause that, but it's hard to say, since you've tried a lot of
different things.
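A quick way to confirm what the active mgr is actually doing would be
something like this (all standard commands, nothing specific to your
setup):

ceph mgr stat                      # which mgr is active
ceph orch ps --daemon-type mgr     # the mgr daemons as cephadm sees them
ceph log last cephadm              # recent cephadm events, if any deployments were attempted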
According to the mgr log, you've set the dashboard prometheus api host to:
http://ceph08.internal.mousetech.com:9095/api/v1
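You can double-check what the dashboard currently has configured with:

ceph dashboard get-prometheus-api-host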
Since you tried to move it to dell02, which failed, I assume that's
why the prometheus module is broken. What you did with your OSDs I
don't fully understand, to be honest, but I hope we can ignore that
for now. And just to have it mentioned: I strongly recommend
reviewing your host removal procedure; it doesn't seem well suited to
keeping your cluster in a healthy state.
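For reference, the usual sequence is roughly this (hostname is a
placeholder):

ceph orch host drain <host>        # moves/removes the daemons from the host
ceph orch ps <host>                # repeat until no daemons are listed for it
ceph orch host rm <host>
ceph orch host rm <host> --offline --force   # only if the host is already gone/unreachable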
If there's no chance to bring back ceph06.internal.mousetech.com, I'd
probably remove its leftover entries from the config-key store:
ceph config-key ls | grep host.ceph06.internal.mousetech.com
ceph config-key rm mgr/cephadm/host.ceph06.internal.mousetech.com
ceph config-key rm mgr/cephadm/host.ceph06.internal.mousetech.com.devices.0
(I'm just assuming that the keys will look like that.)
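If you want to see what cephadm actually stored there before
deleting, and catch any further leftovers, something like this should
work (assumes jq is installed; the exact key names are still my
assumption):

ceph config-key get mgr/cephadm/host.ceph06.internal.mousetech.com
for key in $(ceph config-key ls | jq -r '.[]' | grep ceph06.internal.mousetech.com); do
  ceph config-key rm "$key"
done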
Then I would disable the prometheus mgr module again (ceph mgr module
disable prometheus), and I would probably also reset your
prometheus-api-host:
ceph dashboard reset-prometheus-api-host
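(Later, once things have settled and you know where Prometheus will
actually run, re-enabling would look roughly like this; the hostname
is just a placeholder:

ceph mgr module enable prometheus
ceph dashboard set-prometheus-api-host http://<prometheus-host>:9095 )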
Then fail the mgr (ceph mgr fail) and wait a minute or two. If you
don't mind, share the ceph status after you've done those steps, and
we'll go from there.
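The output of these would help the most:

ceph -s
ceph health detail
ceph orch ps --refresh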
Quoting Tim Holloway <t...@mousetech.com>:
It gets worse.
It looks like the physical disk backing the two failing OSDs is
itself failing. I destroyed the host for one of them, which brought
back the nightmare of a deleted OSD getting permanently stuck in the
deleting state, just like in Pacific: because the OSD cannot be
restarted, the deletion can never complete.
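For reference, a removal stuck like this can usually be forced
through with something like the following (osd.3 is taken from the
health output below and should be double-checked first):

ceph orch osd rm status                   # is the removal still queued?
ceph orch daemon rm osd.3 --force         # drop the dead daemon from cephadm
ceph osd purge 3 --yes-i-really-mean-it   # remove the OSD from the CRUSH map and osdmap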
The deleted host was running a standby mds, and I needed a new mds,
so I told the system to create one on the dell02 machine. I got the
same behaviour as with prometheus: ceph orch ls shows the dell02
machine as having an mds that never started, an empty mds logfile was
created, but there are no systemd units and there's nothing in the
cephadm log about the creation of the mds.
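The deployment state can at least be inspected with the standard
orchestrator commands:

ceph orch ls mds
ceph orch ps --daemon-type mds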
The other cephadm log (/var/log/ceph/<fsid>/ceph.cephadm.log)
indicates attempts to decommission the old (ceph06) mds, but that
machine cannot be contacted as it no longer exists.
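Since ceph06 is gone for good, telling the orchestrator so should
stop those attempts, presumably something like:

ceph orch host rm ceph06.internal.mousetech.com --offline --force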
I've posted yesterday's and today's ceph.cephadm.log:
https://www.mousetech.com/share/ceph.cephadm.log-20250326.gz
https://www.mousetech.com/share/ceph.cephadm.log
Latest health report is dismal:
HEALTH_ERR 1 failed cephadm daemon(s); 1 hosts fail cephadm check;
insufficient standby MDS daemons available; 2 mgr modules have
failed; too many PGs per OSD (648 > max 560)
[WRN] CEPHADM_FAILED_DAEMON: 1 failed cephadm daemon(s)
daemon osd.3 on ceph06.internal.mousetech.com is in error state
[WRN] CEPHADM_HOST_CHECK_FAILED: 1 hosts fail cephadm check
host ceph06.internal.mousetech.com (10.0.1.56) failed check:
Can't communicate with remote host `10.0.1.56`, possibly because the
host is not reachable or python3 is not installed on the host.
[Errno 113] Connect call failed ('10.0.1.56', 22)
[WRN] MDS_INSUFFICIENT_STANDBY: insufficient standby MDS daemons available
have 0; want 1 more
[ERR] MGR_MODULE_ERROR: 2 mgr modules have failed
Module 'cephadm' has failed: 'ceph06.internal.mousetech.com'
Module 'prometheus' has failed: gaierror(-2, 'Name or service not known')
[WRN] TOO_MANY_PGS: too many PGs per OSD (648 > max 560)
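The TOO_MANY_PGS warning is separate from the cephadm trouble; once
things settle it could presumably be addressed along these lines
(pool name and target value are placeholders):

ceph osd pool autoscale-status
ceph osd pool set <pool> pg_autoscale_mode on    # let the autoscaler shrink pg_num
ceph osd pool set <pool> pg_num <smaller_value>  # or set it manually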
On 3/26/25 16:55, Tim Holloway wrote:
OSD mystery is solved.
Both OSDs were backed by LVM volumes imported as vdisks into the Ceph
VMs. Apparently something scrambled either the VM manager or the host
disk subsystem, because the VM disks were getting I/O errors and even
disappearing from the VMs.
I rebooted the physical machine and that cleared it. All OSDs are now
happy again.
...
Well, it looks like one OSD has been damaged permanently, so I purged it. (:
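If the failed disk gets replaced later, re-creating the OSD should be
roughly a matter of (host and device are placeholders):

ceph orch device ls <host> --refresh      # confirm the new device shows up as available
ceph orch daemon add osd <host>:<device>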
On 3/26/25 15:08, Tim Holloway wrote:
Sorry, duplicated a URL. The mgr log is
https://www.mousetech.com/share/ceph-mgr.log
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io