We noticed that our DNS settings were inconsistent and partially wrong.
NetworkManager had somehow set unusable nameservers in the
/etc/resolv.conf of our hosts.
In particular, the DNS settings in the MGR containers needed fixing
as well.
I fixed /etc/resolv.conf on our hosts and in the container of the
active MGR daemon.
This fixed all the issues I described, including the output of
`ceph orch ps` and `ceph orch ls` as well as registry queries such as
`docker pull` and `ceph orch upgrade ls`.
Afterwards, I was able to do the upgrade to Quincy.
And as far as I can tell, the newly deployed MGR containers picked up
the proper DNS settings from the hosts.
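
For reference, the checks looked roughly like this (the daemon name is
our currently active MGR, taken from the `ceph status` output quoted
below; adjust it to your own active MGR daemon):

0|0[root@osd-5 ~]# cat /etc/resolv.conf
0|0[root@osd-5 ~]# cephadm enter --name mgr.osd-5.jcfyqe
(and then cat /etc/resolv.conf again inside that container shell)
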
Best, Mathias
On 6/29/2022 10:45 AM, Mathias Kuhring wrote:
> Dear Ceph community,
>
> we are in the curious situation that typical orchestrator queries
> provide wrong or outdated information about different services.
> E.g. `ceph orch ls` reports wrong numbers of running services.
> Or `ceph orch ps` reports many OSDs as "starting" and many services
> with an old version (15.2.14, but we are on 16.2.7).
> Also, the refresh times seem way off (capital M == months?).
> However, the cluster is healthy (`ceph status` is happy).
> And spot checks of affected services with systemctl also show that
> they are up and OK.
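> For example, a spot check along these lines (the fsid is our cluster
> id from the `ceph status` output further below, osd.0 is one of the
> daemons that `ceph orch ps` lists as "starting"):
> 0|0[root@osd-1 ~]# systemctl status ceph-55633ec3-6c0c-4a02-990c-0f87e0f7a01f@osd.0.service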
>
> We already tried the following without success:
>
> a) re-registering cephadm as orchestrator backend
> 0|0[root@osd-1 ~]# ceph orch pause
> 0|0[root@osd-1 ~]# ceph orch set backend ''
> 0|0[root@osd-1 ~]# ceph mgr module disable cephadm
> 0|0[root@osd-1 ~]# ceph orch ls
> Error ENOENT: No orchestrator configured (try `ceph orch set backend`)
> 0|0[root@osd-1 ~]# ceph mgr module enable cephadm
> 0|0[root@osd-1 ~]# ceph orch set backend 'cephadm'
>
> b) a failover of the MGR (hoping it would restart/reset the
> orchestrator module)
> 0|0[root@osd-1 ~]# ceph status | grep mgr
>     mgr: osd-1(active, since 6m), standbys: osd-5.jcfyqe, osd-4.oylrhe, osd-3
> 0|0[root@osd-1 ~]# ceph mgr fail
> 0|0[root@osd-1 ~]# ceph status | grep mgr
>     mgr: osd-5.jcfyqe(active, since 7s), standbys: osd-4.oylrhe, osd-3, osd-1
>
> Is there any other way to somehow reset the orchestrator
> information/connection?
> I added different relevant outputs below.
>
> I also went through the MGR logs and found an issue with querying the
> container registry.
> I attempted to upgrade the MGRs to 16.2.9 a few weeks ago due to a
> different bug, but that upgrade never went through, apparently because
> cephadm was not able to pull the image.
> Interestingly, I'm able to pull the image manually with `docker pull`,
> but cephadm is not.
> I also get an error with `ceph orch upgrade ls` when checking for
> available versions.
> I'm not sure if this is relevant to the orchestrator problem we have,
> but to be safe, I also added the logs/output below.
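>
> In case it helps to narrow this down: the registry endpoint from the
> log below can also be probed directly on the host of the active MGR
> (osd-5 in our case). A timeout there reproduces the error, while any
> HTTP status code means quay.io itself is reachable:
> 0|0[root@osd-5 ~]# curl -sS --max-time 10 -o /dev/null -w '%{http_code}\n' https://quay.io/v2/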
>
> Thank you for all your help!
>
> Best Wishes,
> Mathias
>
>
> 0|0[root@osd-1 ~]# ceph status
>   cluster:
>     id:     55633ec3-6c0c-4a02-990c-0f87e0f7a01f
>     health: HEALTH_OK
>
>   services:
>     mon: 5 daemons, quorum osd-1,osd-2,osd-5,osd-4,osd-3 (age 86m)
>     mgr: osd-5.jcfyqe(active, since 21m), standbys: osd-4.oylrhe, osd-3, osd-1
>     mds: 1/1 daemons up, 1 standby
>     osd: 270 osds: 270 up (since 13d), 270 in (since 5w)
>     cephfs-mirror: 1 daemon active (1 hosts)
>     rgw: 3 daemons active (3 hosts, 2 zones)
>
>   data:
>     volumes: 1/1 healthy
>     pools:   17 pools, 6144 pgs
>     objects: 692.54M objects, 1.2 PiB
>     usage:   1.8 PiB used, 1.7 PiB / 3.5 PiB avail
>     pgs:     6114 active+clean
>              29 active+clean+scrubbing+deep
>              1 active+clean+scrubbing
>
>   io:
>     client: 0 B/s rd, 421 MiB/s wr, 52 op/s rd, 240 op/s wr
>
> 0|0[root@osd-1 ~]# ceph orch ls
> NAME                       PORTS                   RUNNING  REFRESHED   AGE  PLACEMENT
> alertmanager               ?:9093,9094             0/1      -           8M   count:1
> cephfs-mirror                                      0/1      -           5M   count:1
> crash                                              2/6      7M ago      4M   *
> grafana                    ?:3000                  0/1      -           8M   count:1
> ingress.rgw.default        172.16.39.131:443,1967  0/2      -           4M   osd-1
> ingress.rgw.ext            172.16.39.132:443,1968  4/2      7M ago      4M   osd-5
> ingress.rgw.ext-website    172.16.39.133:443,1969  0/2      -           4M   osd-3
> mds.cephfs                                         2/2      9M ago      4M   count-per-host:1;label:mds
> mgr                                                5/5      9M ago      9M   count:5
> mon                                                5/5      9M ago      9M   count:5
> node-exporter              ?:9100                  2/6      7M ago      7w   *
> osd.all-available-devices                          0        -           5w   *
> osd.osd                                            54       <deleting>  7M   label:osd
> osd.unmanaged                                      180      9M ago      -    <unmanaged>
> prometheus                 ?:9095                  0/2      -           8M   count:2
> rgw.cubi                                           4/0      9M ago      -    <unmanaged>
> rgw.default                ?:8100                  2/1      7M ago      4M   osd-1
> rgw.ext                    ?:8100                  2/1      7M ago      4M   osd-5
> rgw.ext-website            ?:8200                  0/1      -           4M   osd-3
>
> 0|0[root@osd-1 ~]# ceph orch ps | grep starting | head -n 3
> osd.0    osd-1  starting  -  -  -  3072M  <unknown>  <unknown>  <unknown>
> osd.1    osd-2  starting  -  -  -  3072M  <unknown>  <unknown>  <unknown>
> osd.10   osd-1  starting  -  -  -  3072M  <unknown>  <unknown>  <unknown>
>
> 0|0[root@osd-1 ~]# ceph orch ps | grep 15.2.14 | head -n 3
> mds.cephfs.osd-1.fhmalo  osd-1  running (9M)  9M ago  9M  370M   -      15.2.14  d4c4064fa0de  f138649b2e4f
> mds.cephfs.osd-2.vqanmk  osd-2  running (9M)  9M ago  9M  3119M  -      15.2.14  d4c4064fa0de  a2752217770f
> osd.100                  osd-1  running (9M)  9M ago  9M  3525M  3072M  15.2.14  d4c4064fa0de  1ea3fc9c3caf
>
> 0|0[root@osd-1 ~]# cephadm version
> Using recent ceph image
> quay.io/ceph/ceph@sha256:bb6a71f7f481985f6d3b358e3b9ef64c6755b3db5aa53198e0aac38be5c8ae54
> ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)
>
> 0|0[root@osd-1 ~]# ceph versions
> {
>     "mon": {
>         "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)": 5
>     },
>     "mgr": {
>         "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)": 4
>     },
>     "osd": {
>         "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)": 270
>     },
>     "mds": {
>         "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)": 2
>     },
>     "cephfs-mirror": {
>         "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)": 1
>     },
>     "rgw": {
>         "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)": 3
>     },
>     "overall": {
>         "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)": 285
>     }
> }
>
> From MGR logs:
> Jun 29 09:00:07 osd-5 bash[9702]: debug 2022-06-29T07:00:07.046+0000
> 7fdd4e467700 0 [cephadm ERROR cephadm.serve] cephadm exited with an
> error code: 1, stderr:Pulling container image
> quay.io/ceph/ceph:v16.2.9...
> Jun 29 09:00:07 osd-5 bash[9702]: Non-zero exit code 1 from
> /bin/docker pull quay.io/ceph/ceph:v16.2.9
> Jun 29 09:00:07 osd-5 bash[9702]: /bin/docker: stderr Error response
> from daemon: Get "https://quay.io/v2/": context deadline exceeded
> (Client.Timeout exceeded while awaiting headers)
> Jun 29 09:00:07 osd-5 bash[9702]: ERROR: Failed command: /bin/docker
> pull quay.io/ceph/ceph:v16.2.9
> Jun 29 09:00:07 osd-5 bash[9702]: Traceback (most recent call last):
> Jun 29 09:00:07 osd-5 bash[9702]: File
> "/usr/share/ceph/mgr/cephadm/serve.py", line 1363, in _remote_connection
> Jun 29 09:00:07 osd-5 bash[9702]: yield (conn, connr)
> Jun 29 09:00:07 osd-5 bash[9702]: File
> "/usr/share/ceph/mgr/cephadm/serve.py", line 1256, in _run_cephadm
> Jun 29 09:00:07 osd-5 bash[9702]: code, '\n'.join(err)))
>
> 0|0[root@osd-1 ~]# ceph orch upgrade ls
> Error EINVAL: Traceback (most recent call last):
> File "/usr/share/ceph/mgr/mgr_module.py", line 1384, in _handle_command
> return self.handle_command(inbuf, cmd)
> File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 168, in
> handle_command
> return dispatch[cmd['prefix']].call(self, cmd, inbuf)
> File "/usr/share/ceph/mgr/mgr_module.py", line 397, in call
> return self.func(mgr, **kwargs)
> File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 107, in
> <lambda>
> wrapper_copy = lambda *l_args, **l_kwargs: wrapper(*l_args,
> **l_kwargs) # noqa: E731
> File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 96, in
> wrapper
> return func(*args, **kwargs)
> File "/usr/share/ceph/mgr/orchestrator/module.py", line 1337, in
> _upgrade_ls
> r = raise_if_exception(completion)
> File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 225, in
> raise_if_exception
> raise e
> requests.exceptions.ConnectionError: None: Max retries exceeded with
> url: /v2/ceph/ceph/tags/list (Caused by None)
>
--
Mathias Kuhring
Dr. rer. nat.
Bioinformatician
HPC & Core Unit Bioinformatics
Berlin Institute of Health at Charité (BIH)
E-Mail: [email protected]
Mobile: +49 172 3475576
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]