Dear Ceph community,
we are in the curious situation that typical orchestrator queries
provide wrong or outdated information about various services.
For example, `ceph orch ls` reports wrong counts for running services,
and `ceph orch ps` reports many OSDs as "starting" and many daemons
with an old version (15.2.14, although we are on 16.2.7).
The refresh times also seem way off (capital M == months?).
However, the cluster is healthy (`ceph status` is happy), and
spot-checking affected services with systemctl confirms that the
daemons are up and running.
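If there is a way to force the orchestrator to re-poll the hosts
instead of serving its cached state, I would be happy to try it. My
(possibly wrong) understanding is that the --refresh flag on the query
commands should do exactly that:

   # ask cephadm to refresh its cached daemon/device inventory
   ceph orch ps --refresh
   ceph orch device ls --refresh

so I would also be interested to hear whether these are expected to
clear such stale state, or whether they read from the same (apparently
stuck) cache.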
We have already tried the following, without success:
a) re-registering cephadm as the orchestrator backend:
0|0[root@osd-1 ~]# ceph orch pause
0|0[root@osd-1 ~]# ceph orch set backend ''
0|0[root@osd-1 ~]# ceph mgr module disable cephadm
0|0[root@osd-1 ~]# ceph orch ls
Error ENOENT: No orchestrator configured (try `ceph orch set backend`)
0|0[root@osd-1 ~]# ceph mgr module enable cephadm
0|0[root@osd-1 ~]# ceph orch set backend 'cephadm'
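If it helps with debugging, I can also share the backend status from
after the re-registration:

   # shows the registered backend and whether the orchestrator is paused
   ceph orch status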
b) failing over the MGR (hoping it would restart/reset the
orchestrator module):
0|0[root@osd-1 ~]# ceph status | grep mgr
    mgr: osd-1(active, since 6m), standbys: osd-5.jcfyqe, osd-4.oylrhe, osd-3
0|0[root@osd-1 ~]# ceph mgr fail
0|0[root@osd-1 ~]# ceph status | grep mgr
    mgr: osd-5.jcfyqe(active, since 7s), standbys: osd-4.oylrhe, osd-3, osd-1
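Unless someone has a better idea, my next step would be to raise
cephadm's log level, as described in the troubleshooting docs, and
watch what the serve loop is doing:

   # send cephadm's debug messages to the cluster log ...
   ceph config set mgr mgr/cephadm/log_to_cluster_level debug
   # ... and read the most recent ones
   ceph log last cephadm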
Is there any other way to reset the orchestrator state/connection?
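For example, would it be safe to clear cephadm's cached per-host
state? If I understand correctly, the MGR keeps that cache in the
config-key store under the mgr/cephadm/ prefix (an assumption on my
part, please correct me):

   # inspect cephadm's cached host state (keys like mgr/cephadm/host.<hostname>)
   ceph config-key dump | grep mgr/cephadm/host

I have not deleted any of these keys yet, since I don't know whether
the MGR would rebuild them cleanly.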
I have added the relevant outputs below.
I also went through the MGR logs and found an issue with querying the
Docker registry.
A few weeks ago I attempted to upgrade the MGRs to 16.2.9 because of a
different bug, but the upgrade never went through, apparently because
cephadm was unable to pull the image. Interestingly, I can pull the
image manually with `docker pull`, but cephadm cannot.
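To narrow this down, I suppose I could run the pull through cephadm
itself on the MGR host, which (as far as I understand; an assumption
on my part) exercises the same code path the orchestrator uses:

   # a manual pull works for me, see above
   docker pull quay.io/ceph/ceph:v16.2.9
   # pulling via cephadm should reproduce the orchestrator's failure
   cephadm --image quay.io/ceph/ceph:v16.2.9 pull

And since the old upgrade attempt may still be registered, checking
and, if necessary, cancelling it is probably also worthwhile:

   ceph orch upgrade status
   ceph orch upgrade stop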
I also get an error from `ceph orch upgrade ls` when checking the
available versions. I'm not sure whether this is relevant to the
orchestrator problem we have, but to be safe I have included that
output below as well.
Thank you for all your help!
Best Wishes,
Mathias
0|0[root@osd-1 ~]# ceph status
  cluster:
    id:     55633ec3-6c0c-4a02-990c-0f87e0f7a01f
    health: HEALTH_OK

  services:
    mon:           5 daemons, quorum osd-1,osd-2,osd-5,osd-4,osd-3 (age 86m)
    mgr:           osd-5.jcfyqe(active, since 21m), standbys: osd-4.oylrhe, osd-3, osd-1
    mds:           1/1 daemons up, 1 standby
    osd:           270 osds: 270 up (since 13d), 270 in (since 5w)
    cephfs-mirror: 1 daemon active (1 hosts)
    rgw:           3 daemons active (3 hosts, 2 zones)

  data:
    volumes: 1/1 healthy
    pools:   17 pools, 6144 pgs
    objects: 692.54M objects, 1.2 PiB
    usage:   1.8 PiB used, 1.7 PiB / 3.5 PiB avail
    pgs:     6114 active+clean
             29   active+clean+scrubbing+deep
             1    active+clean+scrubbing

  io:
    client: 0 B/s rd, 421 MiB/s wr, 52 op/s rd, 240 op/s wr
0|0[root@osd-1 ~]# ceph orch ls
NAME                       PORTS                   RUNNING  REFRESHED   AGE  PLACEMENT
alertmanager               ?:9093,9094             0/1      -           8M   count:1
cephfs-mirror                                      0/1      -           5M   count:1
crash                                              2/6      7M ago      4M   *
grafana                    ?:3000                  0/1      -           8M   count:1
ingress.rgw.default        172.16.39.131:443,1967  0/2      -           4M   osd-1
ingress.rgw.ext            172.16.39.132:443,1968  4/2      7M ago      4M   osd-5
ingress.rgw.ext-website    172.16.39.133:443,1969  0/2      -           4M   osd-3
mds.cephfs                                         2/2      9M ago      4M   count-per-host:1;label:mds
mgr                                                5/5      9M ago      9M   count:5
mon                                                5/5      9M ago      9M   count:5
node-exporter              ?:9100                  2/6      7M ago      7w   *
osd.all-available-devices                          0        -           5w   *
osd.osd                                            54       <deleting>  7M   label:osd
osd.unmanaged                                      180      9M ago      -    <unmanaged>
prometheus                 ?:9095                  0/2      -           8M   count:2
rgw.cubi                                           4/0      9M ago      -    <unmanaged>
rgw.default                ?:8100                  2/1      7M ago      4M   osd-1
rgw.ext                    ?:8100                  2/1      7M ago      4M   osd-5
rgw.ext-website            ?:8200                  0/1      -           4M   osd-3
0|0[root@osd-1 ~]# ceph orch ps | grep starting | head -n 3
osd.0   osd-1  starting  -  -  -  3072M  <unknown>  <unknown>  <unknown>
osd.1   osd-2  starting  -  -  -  3072M  <unknown>  <unknown>  <unknown>
osd.10  osd-1  starting  -  -  -  3072M  <unknown>  <unknown>  <unknown>
0|0[root@osd-1 ~]# ceph orch ps | grep 15.2.14 | head -n 3
mds.cephfs.osd-1.fhmalo  osd-1  running (9M)  9M ago  9M  370M   -      15.2.14  d4c4064fa0de  f138649b2e4f
mds.cephfs.osd-2.vqanmk  osd-2  running (9M)  9M ago  9M  3119M  -      15.2.14  d4c4064fa0de  a2752217770f
osd.100                  osd-1  running (9M)  9M ago  9M  3525M  3072M  15.2.14  d4c4064fa0de  1ea3fc9c3caf
0|0[root@osd-1 ~]# cephadm version
Using recent ceph image quay.io/ceph/ceph@sha256:bb6a71f7f481985f6d3b358e3b9ef64c6755b3db5aa53198e0aac38be5c8ae54
ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)
0|0[root@osd-1 ~]# ceph versions
{
    "mon": {
        "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)": 5
    },
    "mgr": {
        "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)": 4
    },
    "osd": {
        "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)": 270
    },
    "mds": {
        "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)": 2
    },
    "cephfs-mirror": {
        "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)": 1
    },
    "rgw": {
        "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)": 3
    },
    "overall": {
        "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)": 285
    }
}
From the MGR logs:
Jun 29 09:00:07 osd-5 bash[9702]: debug 2022-06-29T07:00:07.046+0000 7fdd4e467700  0 [cephadm ERROR cephadm.serve] cephadm exited with an error code: 1, stderr:Pulling container image quay.io/ceph/ceph:v16.2.9...
Jun 29 09:00:07 osd-5 bash[9702]: Non-zero exit code 1 from /bin/docker pull quay.io/ceph/ceph:v16.2.9
Jun 29 09:00:07 osd-5 bash[9702]: /bin/docker: stderr Error response from daemon: Get "https://quay.io/v2/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Jun 29 09:00:07 osd-5 bash[9702]: ERROR: Failed command: /bin/docker pull quay.io/ceph/ceph:v16.2.9
Jun 29 09:00:07 osd-5 bash[9702]: Traceback (most recent call last):
Jun 29 09:00:07 osd-5 bash[9702]:   File "/usr/share/ceph/mgr/cephadm/serve.py", line 1363, in _remote_connection
Jun 29 09:00:07 osd-5 bash[9702]:     yield (conn, connr)
Jun 29 09:00:07 osd-5 bash[9702]:   File "/usr/share/ceph/mgr/cephadm/serve.py", line 1256, in _run_cephadm
Jun 29 09:00:07 osd-5 bash[9702]:     code, '\n'.join(err)))
0|0[root@osd-1 ~]# ceph orch upgrade ls
Error EINVAL: Traceback (most recent call last):
  File "/usr/share/ceph/mgr/mgr_module.py", line 1384, in _handle_command
    return self.handle_command(inbuf, cmd)
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 168, in handle_command
    return dispatch[cmd['prefix']].call(self, cmd, inbuf)
  File "/usr/share/ceph/mgr/mgr_module.py", line 397, in call
    return self.func(mgr, **kwargs)
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 107, in <lambda>
    wrapper_copy = lambda *l_args, **l_kwargs: wrapper(*l_args, **l_kwargs)  # noqa: E731
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 96, in wrapper
    return func(*args, **kwargs)
  File "/usr/share/ceph/mgr/orchestrator/module.py", line 1337, in _upgrade_ls
    r = raise_if_exception(completion)
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 225, in raise_if_exception
    raise e
requests.exceptions.ConnectionError: None: Max retries exceeded with url: /v2/ceph/ceph/tags/list (Caused by None)
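Both failures look as if the active MGR host cannot reach quay.io,
even though an interactive docker pull can. To rule out a plain
connectivity issue, I would check the registry endpoint directly from
the active MGR host (quay.io normally answers /v2/ with a 401 when it
is reachable):

   # 10-second timeout, mirroring the "context deadline exceeded" above
   curl -sS -m 10 https://quay.io/v2/

If anyone has an idea what else could make cephadm's pull time out
while a manual pull succeeds (e.g. proxy settings that are only
visible to my interactive shell), I'd be grateful for pointers.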