Hi Andrew,

We have had bad experiences with Ubuntu's auto-update, especially when it 
updates packages for systemd, dbus and Docker. For example, one effect was 
internal communication errors; only a restart of the node helped.

Cheers, Joachim

___________________________________
Clyso GmbH - Ceph Foundation Member
[email protected]
https://www.clyso.com

On 07.08.2021 at 11:04, Andrew Walker-Brown wrote:
Thanks David,

Spent some more time digging through the logs/Google.  Also had a further 2 
nodes fail this morning (different nodes).

Looks like it’s related to apt auto-updates on Ubuntu 20.04, although we don’t 
run unattended upgrades.  Docker appears to get a terminate signal which 
shuts down/restarts all the containers, but some don’t come back cleanly.  
There were also some legacy unused interfaces/bonds in the netplan config.
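For reference, a quick way to see whether apt's periodic jobs are active, and 
the override that disables them (a sketch using the stock Ubuntu keys; the 
path /etc/apt/apt.conf.d/20auto-upgrades is the usual location, adjust for 
your setup):

```shell
# Override that disables apt's periodic jobs (standard Ubuntu/Debian keys;
# typically placed in /etc/apt/apt.conf.d/20auto-upgrades):
override='APT::Periodic::Update-Package-Lists "0";
APT::Periodic::Unattended-Upgrade "0";'
printf '%s\n' "$override"

# The currently effective settings can be inspected with apt's own tool
# (apt-config ships with apt on Ubuntu/Debian); "1" means enabled:
#   apt-config dump APT::Periodic
```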

Anyway, cleaned all that up...so hopefully it’s resolved.

Cheers,

A.



Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for Windows 10

From: David Caro <[email protected]>
Sent: 06 August 2021 09:20
To: Andrew Walker-Brown <[email protected]>
Cc: Marc <[email protected]>; [email protected]
Subject: Re: [ceph-users] Re: All OSDs on one host down

On 08/06 07:59, Andrew Walker-Brown wrote:
Hi Marc,

Yes, I’m probably doing just that.

The ceph admin guides aren’t exactly helpful on this.  The cluster was deployed 
using cephadm and it’s been running perfectly until now.

Wouldn’t running “journalctl -u ceph-osd@5” on host ceph-004 show me the logs 
for osd.5 on that host?
On my containerized setup, the services that cephadm created are:

dcaro@node1:~ $ sudo systemctl list-units | grep ceph
   [email protected]            loaded active running  Ceph crash.node1 for d49b287a-b680-11eb-95d4-e45f010c03a8
   ceph-d49b287a-b680-11eb-95d4-e45f010c03a8@mgr.node1.mhqltg.service       loaded active running  Ceph mgr.node1.mhqltg for d49b287a-b680-11eb-95d4-e45f010c03a8
   [email protected]              loaded active running  Ceph mon.node1 for d49b287a-b680-11eb-95d4-e45f010c03a8
   [email protected]                  loaded active running  Ceph osd.3 for d49b287a-b680-11eb-95d4-e45f010c03a8
   [email protected]                  loaded active running  Ceph osd.7 for d49b287a-b680-11eb-95d4-e45f010c03a8
   system-ceph\x2dd49b287a\x2db680\x2d11eb\x2d95d4\x2de45f010c03a8.slice    loaded active active   system-ceph\x2dd49b287a\x2db680\x2d11eb\x2d95d4\x2de45f010c03a8.slice
   ceph-d49b287a-b680-11eb-95d4-e45f010c03a8.target                         loaded active active   Ceph cluster d49b287a-b680-11eb-95d4-e45f010c03a8
   ceph.target                                                              loaded active active   All Ceph clusters and services

where the string after 'ceph-' is the fsid of the cluster.
Hope that helps (you can also use systemctl list-units to find the 
specific units on your own cluster).
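To spell that out for the question above: with cephadm the per-daemon unit 
name embeds the cluster fsid, so the logs for osd.5 come from a unit built 
like this (using the fsid from David's listing as a stand-in; substitute the 
output of "ceph fsid" on your own cluster):

```shell
# cephadm names containerized daemon units as: ceph-<fsid>@<daemon>.service
fsid="d49b287a-b680-11eb-95d4-e45f010c03a8"   # example fsid from the listing above
daemon="osd.5"
unit="ceph-${fsid}@${daemon}"
echo "$unit"

# Then, on the host actually running that OSD:
#   sudo journalctl -u "$unit"
```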


Cheers,
A






From: Marc <[email protected]>
Sent: 06 August 2021 08:54
To: Andrew Walker-Brown <[email protected]>; [email protected]
Subject: RE: All OSDs on one host down

I’ve tried restarting one of the OSDs but that fails; journalctl shows
"osd not found"... not convinced I’ve got the systemctl command right.

You are mixing 'non-container commands' with 'container commands'. That is, 
if you execute this journalctl outside of the container context it will not 
find anything, of course.
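In practice, cephadm also ships helpers that resolve the containerized unit 
for you, so the unit name never has to be assembled by hand (a sketch; 
"osd.5" is the daemon from the question above, and these subcommands exist on 
recent cephadm versions):

```shell
# cephadm's log helper wraps journalctl for the matching containerized unit:
#   sudo cephadm logs --name osd.5
# and "cephadm shell" opens a container with the cluster's config and
# keyring, where the usual ceph commands work:
#   sudo cephadm shell -- ceph osd tree
cmd="cephadm logs --name osd.5"
echo "$cmd"
```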


_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
--
David Caro
SRE - Cloud Services
Wikimedia Foundation <https://wikimediafoundation.org/>
PGP Signature: 7180 83A2 AC8B 314F B4CE  1171 4071 C7E1 D262 69C3

"Imagine a world in which every single human being can freely share in the
sum of all knowledge. That's our commitment."

