Greetings group!
We recently reinstalled a cluster from scratch using cephadm and Reef, and
it came up with no issues. We then upgraded two existing cephadm clusters
from Quincy to Reef. Both came up fine, but the Grafana graphs on both
clusters (which were working before the upgrade) are now blank. There is
also a firing Prometheus alert (PrometheusJobMissing) that states the
following:
The prometheus job that scrapes from Ceph is no longer defined, this will
effectively mean you'll have no metrics or alerts for the cluster. Please
review the job definitions in the prometheus.yml file of the prometheus
instance.
summary: The scrape job for Ceph is missing from Prometheus
When I look at the prometheus.yml file on the performance monitoring node,
this is what is there (I replaced the IP with x.x.x.x):
global:
  scrape_interval: 10s
  evaluation_interval: 10s
rule_files:
  - /etc/prometheus/alerting/*
alerting:
  alertmanagers:
    - scheme: http
      http_sd_configs:
        - url: http://x.x.x.x:8765/sd/prometheus/sd-config?service=alertmanager
scrape_configs:
  - job_name: 'ceph'
    honor_labels: true
    http_sd_configs:
    - url: http://x.x.x.x:8765/sd/prometheus/sd-config?service=mgr-prometheus
  - job_name: 'node'
    http_sd_configs:
    - url: http://x.x.x.x:8765/sd/prometheus/sd-config?service=node-exporter
  - job_name: 'ceph-exporter'
    honor_labels: true
    http_sd_configs:
    - url: http://x.x.x.x:8765/sd/prometheus/sd-config?service=ceph-exporter
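For anyone comparing the file between an upgraded and a from-scratch cluster, here is a minimal sketch that lists the scrape jobs and their service discovery URLs from a prometheus.yml like the one above. The function name list_scrape_jobs is hypothetical, and this is a line-based scan using only the standard library rather than a real YAML parser, so it assumes the layout cephadm generated here.

```python
import re
import sys


def list_scrape_jobs(config_text):
    """Return a list of (job_name, [urls]) found after 'scrape_configs:'.

    Assumption: one 'job_name:' line per job, and each discovery URL sits
    on its own line (or on the line after 'url:' if the paste was wrapped).
    """
    # Ignore everything before scrape_configs (e.g. the alertmanager URL).
    _, _, scrape_section = config_text.partition("scrape_configs:")
    jobs = []
    for line in scrape_section.splitlines():
        job = re.match(r"\s*-\s*job_name:\s*['\"]?([^'\"\s]+)['\"]?", line)
        if job:
            jobs.append((job.group(1), []))
            continue
        url = re.search(r"(https?://\S+)", line)
        if url and jobs:
            jobs[-1][1].append(url.group(1))
    return jobs


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python3 list_jobs.py /path/to/prometheus.yml
    with open(sys.argv[1], encoding="utf-8") as handle:
        for name, urls in list_scrape_jobs(handle.read()):
            print(name, urls)
```

In the file shown above the 'ceph' job is present, so if the alert is firing, the more likely explanation is that the job has no targets because the discovery URL does not answer, not that the job definition is missing.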
When I run "netstat -ntlp" on the active mgr node, I see port 8765 in use
by docker. However, when I try to open the URLs listed in prometheus.yml in
Chrome, the page times out. If I do the same against the active manager on
the cluster that was installed from scratch (not upgraded), each URL
returns output (different for each URL).
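The browser test can also be scripted so it is easy to repeat from the Prometheus host against each mgr. This is a minimal sketch, assuming the same port (8765) and URL path as in the prometheus.yml above; the names sd_url and probe are hypothetical, and the host is passed on the command line.

```python
import sys
import urllib.request

# The four service names that appear in the prometheus.yml above.
SERVICES = ["alertmanager", "mgr-prometheus", "node-exporter", "ceph-exporter"]


def sd_url(host, service, port=8765):
    """Build the service discovery URL in the form used by prometheus.yml."""
    return f"http://{host}:{port}/sd/prometheus/sd-config?service={service}"


def probe(url, timeout=5.0):
    """Return (ok, detail); timeouts and refused connections give ok=False."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return True, f"HTTP {resp.status}: {body[:200]}"
    except OSError as exc:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return False, f"{type(exc).__name__}: {exc}"


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python3 probe_sd.py <active-mgr-ip>
    for service in SERVICES:
        ok, detail = probe(sd_url(sys.argv[1], service))
        print(("OK   " if ok else "FAIL ") + service + ": " + detail)
```

Running it from the Prometheus node rather than a workstation helps separate a firewall or routing problem from an endpoint that listens but never answers.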
So it appears to me that the service discovery function is not working on
clusters upgraded from Quincy. Also, the ceph-exporter service was not
installed during the upgrade. I added it manually when I noticed it was
missing (while comparing the from-scratch cluster to the upgraded one).
Not sure if this will help or is even related, but I saw this in the
cephadm log:
2023-11-15T04:22:30.789998+0000 mgr. CEPH-MON-01.mlmups (mgr.144601) 753 : cephadm 4 host CEPH-MON-02 `cephadm gather-facts` failed: Cannot decode JSON:
Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 1425, in _run_cephadm_json
    return json.loads(''.join(out))
  File "/lib64/python3.6/json/__init__.py", line 354, in loads
    return _default_decoder.decode(s)
  File "/lib64/python3.6/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/lib64/python3.6/json/decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
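On the traceback itself: "Expecting value: line 1 column 1 (char 0)" is what Python's json module raises when the first character of the input is not valid JSON, and that includes an empty string. So the output cephadm collected from CEPH-MON-02 was presumably either empty or plain text such as an error message. A minimal sketch of that distinction follows; describe_output is a hypothetical helper, not part of cephadm.

```python
import json


def describe_output(out):
    """Say whether a captured command output is empty, non-JSON, or JSON."""
    if not out.strip():
        return "empty output"
    try:
        json.loads(out)
    except json.JSONDecodeError as exc:
        # Show the start of the text so the real error message is visible.
        return f"not JSON ({exc}); starts with {out[:60]!r}"
    return "valid JSON"


if __name__ == "__main__":
    for sample in ["", "some error text", '{"hostname": "example"}']:
        print(repr(sample), "->", describe_output(sample))
```

Capturing the raw output of the gather-facts command on the affected host and passing it through a check like this would show which case applies.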
Is there any way to fix the service discovery?
Thanks!
-Brent
Existing Clusters:
Test: Reef 18.2.0 (all virtual on NVMe)
US Production (HDD): Reef 18.2.0 with 11 osd servers, 3 mons, 4 gateways, 2 iscsi gateways
UK Production (HDD): Nautilus 14.2.22 with 18 osd servers, 3 mons, 4 gateways, 2 iscsi gateways
US Production (SSD): Reef 18.2.0 cephadm with 6 osd servers, 5 mons, 4 gateways
UK Production (SSD): Reef 18.2.0 cephadm with 7 osd servers, 5 mons, 4 gateways