Hi everyone,
I'm facing a weird issue with one of my pacific clusters.
Brief intro:
- 5 nodes, Ubuntu 20.04, on 16.2.7 (ceph01…05)
- bootstrapped with cephadm from a then-recent quay.io image (around a year ago)
- approx. 200 TB capacity, 5% used
- 5 OSDs (2 HDD / 2 SSD / 1 NVMe) on each node
- each node runs a MON, so 5 MONs in charge
- 3 RGW
- 2 MGR
- 3 MDS (2 active, 1 standby)
The cluster is serving S3 objects and CephFS for k8s PVCs and is doing very well.
But: during regular maintenance I found a heavily rotating store.db on EVERY node.
Taking a closer look, I found weird stuff going on in the #####.log.
The log is growing at a rate of approx. 400k/s and rotates when it reaches a
certain size.
store.db
-rw-r--r-- 1 ceph ceph 11445745 Jan 13 09:53 1546576.log
-rw-r--r-- 1 ceph ceph 67352998 Jan 13 09:53 1546578.sst
-rw-r--r-- 1 ceph ceph 67349926 Jan 13 09:53 1546579.sst
-rw-r--r-- 1 ceph ceph 67363989 Jan 13 09:53 1546580.sst
-rw-r--r-- 1 ceph ceph 41063487 Jan 13 09:53 1546581.sst
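For what it's worth, this is roughly how I estimated the growth rate — a minimal sketch of my own; the `growth_rate_kib` helper and the WAL path are mine, not anything from cephadm:

```shell
# Hypothetical helper: growth between two byte-size samples, reported in KiB/s.
growth_rate_kib() {
  size1=$1; size2=$2; interval=$3
  echo $(( (size2 - size1) / 1024 / interval ))
}

# Sample the active RocksDB WAL twice (path/filename taken from my listing):
# s1=$(stat -c %s /var/lib/ceph/mon/*/store.db/1546576.log)
# sleep 10
# s2=$(stat -c %s /var/lib/ceph/mon/*/store.db/1546576.log)
# growth_rate_kib "$s1" "$s2" 10
```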
executing refresh((['ceph01', 'ceph02', 'ceph03', 'ceph04', 'ceph05'],)) failed.
Traceback (most recent call last):
File "/lib/python3.6/site-packages/execnet/gateway_bootstrap.py", line 48, in bootstrap_exec
s = io.read(1)
File "/lib/python3.6/site-packages/execnet/gateway_base.py", line 402, in read
raise EOFError("expected %d bytes, got %d" % (numbytes, len(buf)))
EOFError: expected 1 bytes, got 0
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/share/ceph/mgr/cephadm/serve.py", line 1357, in _remote_connection
conn, connr = self.mgr._get_connection(addr)
File "/usr/share/ceph/mgr/cephadm/module.py", line 1340, in _get_connection
sudo=True if self.ssh_user != 'root' else False)
File "/lib/python3.6/site-packages/remoto/backends/__init__.py", line 35, in __init__
self.gateway = self._make_gateway(hostname)
File "/lib/python3.6/site-packages/remoto/backends/__init__.py", line 46, in _make_gateway
self._make_connection_string(hostname)
File "/lib/python3.6/site-packages/execnet/multi.py", line 134, in makegateway
gw = gateway_bootstrap.bootstrap(io, spec)
File "/lib/python3.6/site-packages/execnet/gateway_bootstrap.py", line 102, in bootstrap
bootstrap_exec(io, spec)
File "/lib/python3.6/site-packages/execnet/gateway_bootstrap.py", line 53, in bootstrap_exec
raise HostNotFound(io.remoteaddress)
execnet.gateway_bootstrap.HostNotFound: -F /tmp/cephadm-conf-6p_ae5op -i
/tmp/cephadm-identity-hc1rt28x ubuntuadmin@<< IP_OF_CEPH-01 REPLACED >>
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/share/ceph/mgr/cephadm/utils.py", line 76, in do_work
return f(*arg)
File "/usr/share/ceph/mgr/cephadm/serve.py", line 312, in refresh
with self._remote_connection(host) as tpl:
File "/lib64/python3.6/contextlib.py", line 81, in __enter__
return next(self.gen)
File "/usr/share/ceph/mgr/cephadm/serve.py", line 1391, in _remote_connection
raise OrchestratorError(msg) from e
orchestrator._interface.OrchestratorError: Failed to connect to ceph01 (<< IP_OF_CEPH-01 REPLACED >>).
Please make sure that the host is reachable and accepts connections using the
cephadm SSH key
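In case it helps others: this is how I reproduced the SSH attempt by hand. The `build_ssh_cmd` helper and the /tmp paths are my own; the two `ceph` commands for pulling the config and key are from the cephadm troubleshooting docs:

```shell
# Rebuild the ssh invocation visible in the traceback
# (ssh -F <config> -i <identity> <user@host>).
build_ssh_cmd() {
  printf 'ssh -F %s -i %s %s' "$1" "$2" "$3"
}

# On a MGR host, fetch the config and key cephadm actually uses
# (commands from the cephadm troubleshooting docs):
#   ceph cephadm get-ssh-config > /tmp/cephadm-ssh-config
#   ceph config-key get mgr/cephadm/ssh_identity_key > /tmp/cephadm-ssh-key
#   chmod 0600 /tmp/cephadm-ssh-key
#   $(build_ssh_cmd /tmp/cephadm-ssh-config /tmp/cephadm-ssh-key ubuntuadmin@ceph01)
```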
...
... [some binary stuff here] …
...
ceph01.sjtrntß$Skd???>ö#?c????Z+Removing orphan daemon mds.cephfs.ceph02…cephadm
ceph01.sjtrntß$Skd???>ö#?cXx??Z-Removing daemon mds.cephfs.ceph02 from
ceph01cephadm
ceph01.sjtrntß$Skd???>_#?cԕ?0?Z"Removing key for mds.cephfs.ceph02cephadm
ceph01.sjtrntß$Skd???>_#?cUƾ0?Z=Reconfiguring mds.cephfs.ceph02 (unknown last
config time)...cephadm
ceph01.sjtrntß$Skd???>_#?cE?"2?Z0Reconfiguring daemon mds.cephfs.ceph02 on
ceph01cephadm
ceph01.sjtrntß$Skd???>`#?c??&?Zcephadm exited with an error code: 1,
stderr:Non-zero exit code 1 from /usr/bin/docker container inspect --format
{{.State.Status}} ceph-<<cluster-ID REPLACED>>-mds-cephfs-ceph02
/usr/bin/docker: stdout
/usr/bin/docker: stderr Error: No such container: ceph-<<cluster-ID
REPLACED>>-mds-cephfs-ceph02
Non-zero exit code 1 from /usr/bin/docker container inspect --format
{{.State.Status}} ceph-<<cluster-ID REPLACED>>-mds.cephfs.ceph02
/usr/bin/docker: stdout
/usr/bin/docker: stderr Error: No such container: ceph-<<cluster-ID
REPLACED>>-mds.cephfs.ceph02
Reconfig daemon mds.cephfs.ceph02 ...
ERROR: cannot reconfig, data path /var/lib/ceph/<<cluster-ID
REPLACED>>/mds.cephfs.ceph02 does not exist
Traceback (most recent call last):
File "/usr/share/ceph/mgr/cephadm/serve.py", line 1363, in _remote_connection
yield (conn, connr)
File "/usr/share/ceph/mgr/cephadm/serve.py", line 1256, in _run_cephadm
code, '\n'.join(err)))
orchestrator._interface.OrchestratorError: cephadm exited with an error code:
1, stderr:Non-zero exit code 1 from /usr/bin/docker container inspect --format
{{.State.Status}} ceph-<<cluster-ID REPLACED>>-mds-cephfs-ceph02
/usr/bin/docker: stdout
/usr/bin/docker: stderr Error: No such container: ceph-<<cluster-ID
REPLACED>>-mds-cephfs-ceph02
Non-zero exit code 1 from /usr/bin/docker container inspect --format
{{.State.Status}} ceph-<<cluster-ID REPLACED>>-mds.cephfs.ceph02
/usr/bin/docker: stdout
/usr/bin/docker: stderr Error: No such container: ceph-<<cluster-ID
REPLACED>>-mds.cephfs.ceph02
Reconfig daemon mds.cephfs.ceph02 ...
ERROR: cannot reconfig, data path /var/lib/ceph/<<cluster-ID
REPLACED>>/mds.cephfs.ceph02 does not existcephadm
Unable to add a Daemon without Service.
?t
Please use `ceph orch apply ...` to create a Service.
Note, you might want to create the service with "unmanaged=true"
Traceback (most recent call last):
File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 125, in wrapper
return OrchResult(f(*args, **kwargs))
File "/usr/share/ceph/mgr/cephadm/module.py", line 2440, in add_daemon
ret.extend(self._add_daemon(d_type, spec))
File "/usr/share/ceph/mgr/cephadm/module.py", line 2378, in _add_daemon
raise OrchestratorError('Unable to add a Daemon without Service.\n'
orchestrator._interface.OrchestratorError: Unable to add a Daemon without
Service.
Please use `ceph orch apply ...` to create a Service.
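For completeness, my reading of that hint: re-create the service with a spec fed to `ceph orch apply -i <file>`. The `service_id` matches my filesystem name from the log lines above, but the host placement and `unmanaged: true` are just my guess at what the message suggests, not verified advice:

```yaml
service_type: mds
service_id: cephfs
placement:
  hosts:        # assumed: the nodes that should carry MDS daemons
    - ceph01
    - ceph02
    - ceph03
unmanaged: true  # per the note in the error message
```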
I'm confused by cephadm's attempts to do "things" to a ceph02 daemon that is
obviously not residing on node ceph01. Almost the same log lines appear on
each MON host in its store.db.
All in all it looks far from healthy, and I'm really concerned about that.
Any help is highly appreciated! Thanks a lot.
Cheers,
Jürgen
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]