There's a lot going on with your cluster... You seem to have broken
the mgr, which I assume is why you're not seeing any deployment
attempts. Sometimes the mgr ends up "broken" when some of its
background processes never finish; your missing ceph06 host might
cause that, but it's hard to say, since you've tried a lot of
different things.
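A quick way to confirm what the active mgr is actually doing would be
something like this (all standard commands, nothing specific to your
setup):

ceph mgr stat                      # which mgr is active
ceph orch ps --daemon-type mgr     # the mgr daemons as cephadm sees them
ceph log last cephadm              # recent cephadm events, if any deployments were attempted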
According to the mgr log, you've set the dashboard prometheus api host to:
http://ceph08.internal.mousetech.com:9095/api/v1
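You can double-check what the dashboard currently has configured with:

ceph dashboard get-prometheus-api-host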
Since you tried to move it to dell02, which failed, I assume that's
why the prometheus module is broken. What you did with your OSDs I
don't fully understand, to be honest, but I hope we can ignore that
for now. And just to have it mentioned: I strongly recommend
reviewing your host removal procedure; it doesn't seem well suited to
keeping your cluster in a healthy state.
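For reference, the usual sequence is roughly this (hostname is a
placeholder):

ceph orch host drain <host>        # moves/removes the daemons from the host
ceph orch ps <host>                # repeat until no daemons are listed for it
ceph orch host rm <host>
ceph orch host rm <host> --offline --force   # only if the host is already gone/unreachable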
If there's no chance to bring back ceph06.internal.mousetech.com, I'd
probably remove its leftover entries from the config-key store:
ceph config-key ls | grep host.ceph06.internal.mousetech.com
ceph config-key rm mgr/cephadm/host.ceph06.internal.mousetech.com
ceph config-key rm mgr/cephadm/host.ceph06.internal.mousetech.com.devices.0
(I'm just assuming that the keys will look like that.)
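If you want to see what cephadm actually stored there before
deleting, and catch any further leftovers, something like this should
work (assumes jq is installed; the exact key names are still my
assumption):

ceph config-key get mgr/cephadm/host.ceph06.internal.mousetech.com
for key in $(ceph config-key ls | jq -r '.[]' | grep ceph06.internal.mousetech.com); do
  ceph config-key rm "$key"
done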
Then I would disable the prometheus mgr module again (ceph mgr module
disable prometheus), and I would probably also reset your
prometheus-api-host:
ceph dashboard reset-prometheus-api-host
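(Later, once things have settled and you know where Prometheus will
actually run, re-enabling would look roughly like this; the hostname
is just a placeholder:

ceph mgr module enable prometheus
ceph dashboard set-prometheus-api-host http://<prometheus-host>:9095 )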
Then fail the mgr (ceph mgr fail) and wait a minute or two. If you
don't mind, share the ceph status after you've done those steps, and
we'll go from there.
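The output of these would help the most:

ceph -s
ceph health detail
ceph orch ps --refresh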
Quoting Tim Holloway <t...@mousetech.com>:
It gets worse.
It looks like the physical disk backing the two failing OSDs is
itself failing. I destroyed the host for one of them, which brought
back the nightmare of a deleted OSD getting permanently stuck in the
deleting state, just like in Pacific: because the OSD cannot be
restarted, the deletion can never complete.
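For reference, a removal stuck like this can usually be forced
through with something like the following (osd.3 is taken from the
health output below and should be double-checked first):

ceph orch osd rm status                   # is the removal still queued?
ceph orch daemon rm osd.3 --force         # drop the dead daemon from cephadm
ceph osd purge 3 --yes-i-really-mean-it   # remove the OSD from the CRUSH map and osdmap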
The deleted host was running a standby mds, and I needed a new mds,
so I told the system to create one on the dell02 machine. I got the
same behaviour as with prometheus: ceph orch ls shows the dell02
machine as having an mds that never started, an empty mds logfile was
created, but there are no systemd units and there's nothing in the
cephadm log about the creation of the mds.
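The deployment state can at least be inspected with the standard
orchestrator commands:

ceph orch ls mds
ceph orch ps --daemon-type mds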
The other cephadm log (/var/log/ceph/<fsid>/ceph.cephadm.log)
indicates attempts to decommission the old (ceph06) mds, but that
machine cannot be contacted as it no longer exists.
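Since ceph06 is gone for good, telling the orchestrator so should
stop those attempts, presumably something like:

ceph orch host rm ceph06.internal.mousetech.com --offline --force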
I've posted yesterday's and today's ceph.cephadm.log:
https://www.mousetech.com/share/ceph.cephadm.log-20250326.gz
https://www.mousetech.com/share/ceph.cephadm.log
Latest health report is dismal:
HEALTH_ERR 1 failed cephadm daemon(s); 1 hosts fail cephadm check;
insufficient standby MDS daemons available; 2 mgr modules have
failed; too many PGs per OSD (648 > max 560)
[WRN] CEPHADM_FAILED_DAEMON: 1 failed cephadm daemon(s)
daemon osd.3 on ceph06.internal.mousetech.com is in error state
[WRN] CEPHADM_HOST_CHECK_FAILED: 1 hosts fail cephadm check
host ceph06.internal.mousetech.com (10.0.1.56) failed check:
Can't communicate with remote host `10.0.1.56`, possibly because the
host is not reachable or python3 is not installed on the host.
[Errno 113] Connect call failed ('10.0.1.56', 22)
[WRN] MDS_INSUFFICIENT_STANDBY: insufficient standby MDS daemons available
have 0; want 1 more
[ERR] MGR_MODULE_ERROR: 2 mgr modules have failed
Module 'cephadm' has failed: 'ceph06.internal.mousetech.com'
Module 'prometheus' has failed: gaierror(-2, 'Name or service not known')
[WRN] TOO_MANY_PGS: too many PGs per OSD (648 > max 560)
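The TOO_MANY_PGS warning is separate from the cephadm trouble; once
things settle it could presumably be addressed along these lines
(pool name and target value are placeholders):

ceph osd pool autoscale-status
ceph osd pool set <pool> pg_autoscale_mode on    # let the autoscaler shrink pg_num
ceph osd pool set <pool> pg_num <smaller_value>  # or set it manually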
On 3/26/25 16:55, Tim Holloway wrote:
OSD mystery is solved.
Both OSDs were backed by LVM volumes imported as vdisks into the Ceph
VMs. Apparently something scrambled either the VM manager or the host
disk subsystem, because the VM disks were getting I/O errors and even
disappearing from the VMs.
I rebooted the physical machine and that cleared it. All OSDs are now
happy again.
...
Well, it looks like one OSD has been damaged permanently, so I purged it. (:
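If the failed disk gets replaced later, re-creating the OSD should be
roughly a matter of (host and device are placeholders):

ceph orch device ls <host> --refresh      # confirm the new device shows up as available
ceph orch daemon add osd <host>:<device>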
On 3/26/25 15:08, Tim Holloway wrote:
Sorry, duplicated a URL. The mgr log is
https://www.mousetech.com/share/ceph-mgr.log
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io