Hi Andrew,
we have had bad experiences with Ubuntu's automatic updates, especially
when packages from systemd, dbus, and Docker were updated.
For example, one effect was internal communication errors; only a
restart of the node helped.
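If it is useful, a minimal sketch of how those updates can be held back
on Ubuntu (the package names are illustrative, check what is actually
installed on your nodes):

  # hold the packages whose automatic upgrades have caused trouble
  sudo apt-mark hold systemd dbus docker.io containerd

  # or disable the timers that drive the automatic apt runs entirely
  sudo systemctl disable --now apt-daily.timer apt-daily-upgrade.timer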
Cheers, Joachim
___________________________________
Clyso GmbH - Ceph Foundation Member
[email protected]
https://www.clyso.com
On 07.08.2021 at 11:04, Andrew Walker-Brown wrote:
Thanks David,
Spent some more time digging through the logs/Google. Also had a further
two nodes fail this morning (different nodes).
Looks like it's related to apt auto-updates on Ubuntu 20.04, although we
don't run unattended upgrades. Docker appears to get a terminate signal,
which shuts down/restarts all the containers, but some don't come back
cleanly.
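For anyone hitting the same thing, this is roughly how it can be
confirmed (a sketch using the stock Ubuntu log locations):

  # does apt activity line up with the Docker restarts?
  grep -iE 'docker|containerd|systemd' /var/log/apt/history.log

  # which apt timers are scheduled to fire?
  systemctl list-timers 'apt-daily*'

  # double-check unattended-upgrades really is off
  apt-config dump | grep -i unattended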
There were also some legacy unused interfaces/bonds in the netplan
config. Anyway, cleaned all that up... so hopefully it's resolved.
Cheers,
A.
From: David Caro <[email protected]>
Sent: 06 August 2021 09:20
To: Andrew Walker-Brown <[email protected]>
Cc: Marc <[email protected]>; [email protected]
Subject: Re: [ceph-users] Re: All OSDs on one host down
On 08/06 07:59, Andrew Walker-Brown wrote:
Hi Marc,
Yes, I'm probably doing just that.
The ceph admin guides aren’t exactly helpful on this. The cluster was deployed
using cephadm and it’s been running perfectly until now.
Wouldn’t running “journalctl -u ceph-osd@5” on host ceph-004 show me the logs
for osd.5 on that host?
On my containerized setup, the services that cephadm created are:
dcaro@node1:~ $ sudo systemctl list-units | grep ceph
[email protected]
loaded active
running Ceph crash.node1 for d49b287a-b680-11eb-95d4-e45f010c03a8
ceph-d49b287a-b680-11eb-95d4-e45f010c03a8@mgr.node1.mhqltg.service
loaded active
running Ceph mgr.node1.mhqltg for d49b287a-b680-11eb-95d4-e45f010c03a8
[email protected]
loaded active
running Ceph mon.node1 for d49b287a-b680-11eb-95d4-e45f010c03a8
[email protected]
loaded active
running Ceph osd.3 for d49b287a-b680-11eb-95d4-e45f010c03a8
[email protected]
loaded active
running Ceph osd.7 for d49b287a-b680-11eb-95d4-e45f010c03a8
system-ceph\x2dd49b287a\x2db680\x2d11eb\x2d95d4\x2de45f010c03a8.slice
loaded active
active system-ceph\x2dd49b287a\x2db680\x2d11eb\x2d95d4\x2de45f010c03a8.slice
ceph-d49b287a-b680-11eb-95d4-e45f010c03a8.target
loaded active
active Ceph cluster d49b287a-b680-11eb-95d4-e45f010c03a8
ceph.target
loaded active
active All Ceph clusters and services
where the string after 'ceph-' is the fsid of the cluster.
Hope that helps (you can also use systemctl list-units to find the
specific units on your own setup).
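So with the unit naming above, the osd.5 logs you asked about should be
reachable on its host with something like this (swap in your own
cluster's fsid; mine here is just the example):

  sudo journalctl -u ceph-d49b287a-b680-11eb-95d4-e45f010c03a8@osd.5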
Cheers,
A
From: Marc <[email protected]>
Sent: 06 August 2021 08:54
To: Andrew Walker-Brown <[email protected]>; [email protected]
Subject: RE: All OSDs on one host down
I've tried restarting one of the OSDs but that fails; journalctl shows
osd not found... not convinced I've got the systemctl command right.
Are you not mixing 'non-container commands' with 'container commands'?
As in, if you execute this journalctl outside of the container, it will
not find anything, of course.
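Purely as an illustration of host-side commands that do work with a
cephadm deployment (run on the OSD host itself):

  # list the daemons cephadm manages on this host
  sudo cephadm ls

  # fetch the journal of one daemon without knowing the exact unit name
  # (add --fsid if more than one cluster lives on the host)
  sudo cephadm logs --name osd.5

  # or open a shell inside the ceph container
  sudo cephadm shell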
--
David Caro
SRE - Cloud Services
Wikimedia Foundation <https://wikimediafoundation.org/>
PGP Signature: 7180 83A2 AC8B 314F B4CE 1171 4071 C7E1 D262 69C3
"Imagine a world in which every single human being can freely share in the
sum of all knowledge. That's our commitment."
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]