[ceph-users] Not timing out watcher

2021-08-07 Thread li jerry
Hello all
I use RBD as a data disk for KVM.
When the KVM server suffers an abnormal power failure, the RBD watcher often
takes around 15 minutes to clear automatically.

Environmental information:
ceph: 15.2.12
os: ubuntu 20.04
kernel: 5.4.0-42-generic
libvirt: 6.0.0
QEMU: 4.2.1


Is there any setting that allows the RBD watcher to be cleared quickly when the
server is powered off abnormally?
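
For reference, a minimal way to inspect and evict a stale watcher by hand (a
sketch; on Octopus the eviction command is still spelled "blacklist", on newer
releases it is "blocklist"):

# rbd status <pool>/<image>
# ceph osd blacklist add <client_addr>

rbd status lists the current watchers with their client addresses; blacklisting
that address evicts the client, which drops the watch. The <pool>/<image> and
<client_addr> placeholders are illustrative; take the actual address from the
rbd status output.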


Thanks everyone for your help!!!



-Jerry

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: rbd object mapping

2021-08-07 Thread Tony Liu
There are two types of "object": the RBD-image-object and the 8 MiB block
objects. When an RBD image is created, one RBD-image-object is created and
12800 8 MiB block objects are allocated. That whole RBD-image-object is mapped
to a single PG, which is mapped to 3 OSDs (replica 3). That means all user data
on that RBD image is stored on those 3 OSDs. Is my understanding correct?

I doubt it, because then, for example, on a Ceph cluster with a bunch of 2TB
drives, a user wouldn't be able to create an RBD image bigger than 2TB. I don't
believe that's true. So, what am I missing here?
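
One way to check this directly (a sketch: RBD data objects are named
<block_name_prefix>.<object number in 16 hex digits>, with the prefix taken
from the rbd info output quoted further down):

# ceph osd map vm rbd_data.affa8fb94beb7e.0000000000000000
# ceph osd map vm rbd_data.affa8fb94beb7e.0000000000000010

The first command maps the image's first data object, the second a later one;
each data object gets its own placement, so the two will generally report
different PGs and OSD sets. The image's data is therefore spread across many
PGs and OSDs, not pinned to a single one.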

Thanks!
Tony

From: Konstantin Shalygin 
Sent: August 7, 2021 11:35 AM
To: Tony Liu
Cc: ceph-users; d...@ceph.io
Subject: Re: [ceph-users] rbd object mapping

The ceph osd map command shows where an object with any given object name would
be placed in the specified pool according to your CRUSH map, and which OSDs
will serve that PG. You can type any object name: it shows the future placement
of a new object, or the placement of an existing one. That is how the algorithm
works.

12800 means that your 100 GiB image consists of 12800 objects of 8 MiB each in
pool vm. All these objects are prefixed with the RBD header (block_name_prefix
seems to be the modern name for this).


Cheers,
k

On 7 Aug 2021, at 21:27, Tony Liu 
mailto:tonyliu0...@hotmail.com>> wrote:

This shows one RBD image is treated as one object, and it's mapped to one PG.
"object" here means a RBD image.

# ceph osd map vm fcb09c9c-4cd9-44d8-a20b-8961c6eedf8e_disk
osdmap e18381 pool 'vm' (4) object 'fcb09c9c-4cd9-44d8-a20b-8961c6eedf8e_disk' 
-> pg 4.c7a78d40 (4.0) -> up ([4,17,6], p4) acting ([4,17,6], p4)

When show the info of this image, what's that "12800 objects" mean?
And what's that "order 23 (8 MiB objects)" mean?
What's "objects" here?

# rbd info vm/fcb09c9c-4cd9-44d8-a20b-8961c6eedf8e_disk
rbd image 'fcb09c9c-4cd9-44d8-a20b-8961c6eedf8e_disk':
   size 100 GiB in 12800 objects
   order 23 (8 MiB objects)
   snapshot_count: 0
   id: affa8fb94beb7e
   block_name_prefix: rbd_data.affa8fb94beb7e
   format: 2


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: rbd object mapping

2021-08-07 Thread Konstantin Shalygin
The ceph osd map command shows where an object with any given object name would
be placed in the specified pool according to your CRUSH map, and which OSDs
will serve that PG. You can type any object name: it shows the future placement
of a new object, or the placement of an existing one. That is how the algorithm
works.

12800 means that your 100 GiB image consists of 12800 objects of 8 MiB each in
pool vm. All these objects are prefixed with the RBD header (block_name_prefix
seems to be the modern name for this).
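
(Put differently: order 23 means each object is 2^23 bytes = 8 MiB, and a
100 GiB image split into 8 MiB pieces gives 100 * 1024 / 8 = 12800 objects.)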


Cheers,
k

> On 7 Aug 2021, at 21:27, Tony Liu  wrote:
> 
> This shows one RBD image is treated as one object, and it's mapped to one PG.
> "object" here means a RBD image.
> 
> # ceph osd map vm fcb09c9c-4cd9-44d8-a20b-8961c6eedf8e_disk
> osdmap e18381 pool 'vm' (4) object 
> 'fcb09c9c-4cd9-44d8-a20b-8961c6eedf8e_disk' -> pg 4.c7a78d40 (4.0) -> up 
> ([4,17,6], p4) acting ([4,17,6], p4)
> 
> When show the info of this image, what's that "12800 objects" mean?
> And what's that "order 23 (8 MiB objects)" mean?
> What's "objects" here?
> 
> # rbd info vm/fcb09c9c-4cd9-44d8-a20b-8961c6eedf8e_disk 
> rbd image 'fcb09c9c-4cd9-44d8-a20b-8961c6eedf8e_disk':
>size 100 GiB in 12800 objects
>order 23 (8 MiB objects)
>snapshot_count: 0
>id: affa8fb94beb7e
>block_name_prefix: rbd_data.affa8fb94beb7e
>format: 2
> 

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] rbd object mapping

2021-08-07 Thread Tony Liu
Hi,

This shows one RBD image is treated as one object, and it's mapped to one PG.
"object" here means a RBD image.

# ceph osd map vm fcb09c9c-4cd9-44d8-a20b-8961c6eedf8e_disk
osdmap e18381 pool 'vm' (4) object 'fcb09c9c-4cd9-44d8-a20b-8961c6eedf8e_disk' 
-> pg 4.c7a78d40 (4.0) -> up ([4,17,6], p4) acting ([4,17,6], p4)

When show the info of this image, what's that "12800 objects" mean?
And what's that "order 23 (8 MiB objects)" mean?
What's "objects" here?

# rbd info vm/fcb09c9c-4cd9-44d8-a20b-8961c6eedf8e_disk 
rbd image 'fcb09c9c-4cd9-44d8-a20b-8961c6eedf8e_disk':
size 100 GiB in 12800 objects
order 23 (8 MiB objects)
snapshot_count: 0
id: affa8fb94beb7e
block_name_prefix: rbd_data.affa8fb94beb7e
format: 2


Thanks!
Tony
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] BUG #51821 - client is using insecure global_id reclaim

2021-08-07 Thread Daniel Persson
Hi everyone.

It was suggested that I ask for help here instead of in the bug tracker, so I
will try that.

https://tracker.ceph.com/issues/51821?next_issue_id=51820_issue_id=51824

I have a problem that I can't seem to figure out how to resolve.

AUTH_INSECURE_GLOBAL_ID_RECLAIM: client is using insecure global_id reclaim
AUTH_INSECURE_GLOBAL_ID_RECLAIM_ALLOWED: mons are allowing insecure
global_id reclaim


Both of these have to do with reclaiming the global_id and ensuring that no
client can steal or reuse another client's ID. I understand the reason for this
and want to resolve the issue.
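
For context, the relevant knobs (a sketch; option names as documented for the
Octopus/Pacific global_id fix):

# ceph health detail
# ceph config get mon auth_allow_insecure_global_id_reclaim
# ceph config set mon auth_allow_insecure_global_id_reclaim false

ceph health detail should list the exact clients that are still reclaiming
insecurely; the config get/set pair shows whether the mons still allow insecure
reclaim and disallows it once every client has been upgraded to a patched
version.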

Currently, I have three different clients.

* One Windows client using the latest Ceph-Dokan build. (ceph version
15.0.0-22274-g5656003758 (5656003758614f8fd2a8c49c2e7d4f5cd637b0ea) pacific
(rc))
* One Linux Debian build using the built packages for that kernel. (
4.19.0-17-amd64)
* And one client that I've built from source for a Raspberry Pi, as there is
no arm build for the Pacific release. (5.11.0-1015-raspi)

If I switch over to disallowing global_id reclaim, none of these clients can
connect, and running "ceph status" on one of my nodes also fails.

All of them give the same error message:

monclient(hunting): handle_auth_bad_method server allowed_methods [2]
but i only support [2]


Has anyone encountered this problem and have any suggestions?

PS. The reason I have 3 different hosts is that this is a test environment
where I try to resolve and look at issues before we upgrade our production
environment to pacific. DS.

Best regards
Daniel
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: All OSDs on one host down

2021-08-07 Thread Clyso GmbH - Ceph Foundation Member

We have been working with and using cephadm for more than 2 years.

For this and other reasons we have changed our update strategy to immutable
infrastructure and are currently in the middle of migrating to different
flavours of https://github.com/gardenlinux/gardenlinux.


___
Clyso GmbH - Ceph Foundation Member
supp...@clyso.com
https://www.clyso.com

On 07.08.2021 at 11:51, Andrew Walker-Brown wrote:


Yeah, I think that’s along the lines of what I’ve faced here.
Hopefully I’ve managed to disable the auto updates.


Sent from Mail  for 
Windows


From: Clyso GmbH - Ceph Foundation Member
Sent: 07 August 2021 10:46
To: Andrew Walker-Brown; David Caro
Cc: Marc; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: All OSDs on one host down

Hi Andrew,

We have had bad experiences with Ubuntu's auto update, especially when
updating packages like systemd, dbus and docker.
For example, one effect was internal communication errors; only a
restart of the node helped.

Cheers, Joachim

___
Clyso GmbH - Ceph Foundation Member
supp...@clyso.com
https://www.clyso.com



On 07.08.2021 at 11:04, Andrew Walker-Brown wrote:
> Thanks David,
>
> Spent some more time digging in the logs/google.  Also had a further 2
> nodes fail this morning (different nodes).
>
> Looks like it’s related to apt auto-updates on Ubuntu 20.04, although we
> don’t run unattended upgrades.  Docker appears to get a terminate signal
> which shuts down/restarts all the containers but some don’t come back
> cleanly.  There were also some legacy unused interfaces/bonds in the
> netplan config.

>
> Anyway, cleaned all that up...so hopefully it’s resolved.
>
> Cheers,
>
> A.
>
>
>
> Sent from Mail for Windows 10
>
> From: David Caro
> Sent: 06 August 2021 09:20
> To: Andrew Walker-Brown
> Cc: Marc; ceph-users@ceph.io

> Subject: Re: [ceph-users] Re: All OSDs on one host down
>
> On 08/06 07:59, Andrew Walker-Brown wrote:
>> Hi Marc,
>>
>> Yes i’m probably doing just that.
>>
>> The ceph admin guides aren’t exactly helpful on this.  The cluster 
was deployed using cephadm and it’s been running perfectly until now.

>>
>> Wouldn’t running “journalctl -u ceph-osd@5” on host ceph-004 show 
me the logs for osd.5 on that host?

> On my containerized setup, the services that cephadm created are:
>
> dcaro@node1:~ $ sudo systemctl list-units | grep ceph
> ceph-d49b287a-b680-11eb-95d4-e45f010c03a8@crash.node1.service loaded active running   Ceph crash.node1 for d49b287a-b680-11eb-95d4-e45f010c03a8
> ceph-d49b287a-b680-11eb-95d4-e45f010c03a8@mgr.node1.mhqltg.service loaded active running   Ceph mgr.node1.mhqltg for d49b287a-b680-11eb-95d4-e45f010c03a8
> ceph-d49b287a-b680-11eb-95d4-e45f010c03a8@mon.node1.service loaded active running   Ceph mon.node1 for d49b287a-b680-11eb-95d4-e45f010c03a8
> ceph-d49b287a-b680-11eb-95d4-e45f010c03a8@osd.3.service loaded active running   Ceph osd.3 for d49b287a-b680-11eb-95d4-e45f010c03a8
> 

[ceph-users] Re: All OSDs on one host down

2021-08-07 Thread Clyso GmbH - Ceph Foundation Member

Hi Andrew,

We have had bad experiences with Ubuntu's auto update, especially when
updating packages like systemd, dbus and docker.
For example, one effect was internal communication errors; only a
restart of the node helped.


Cheers, Joachim

___
Clyso GmbH - Ceph Foundation Member
supp...@clyso.com
https://www.clyso.com

On 07.08.2021 at 11:04, Andrew Walker-Brown wrote:

Thanks David,

Spent some more time digging in the logs/google.  Also had a further 2 nodes 
fail this morning (different nodes).

Looks like it’s related to apt auto-updates on Ubuntu 20.04, although we don’t
run unattended upgrades.  Docker appears to get a terminate signal which
shuts down/restarts all the containers but some don’t come back cleanly.
There were also some legacy unused interfaces/bonds in the netplan config.

Anyway, cleaned all that up...so hopefully it’s resolved.

Cheers,

A.



Sent from Mail for Windows 10

From: David Caro
Sent: 06 August 2021 09:20
To: Andrew Walker-Brown
Cc: Marc; 
ceph-users@ceph.io
Subject: Re: [ceph-users] Re: All OSDs on one host down

On 08/06 07:59, Andrew Walker-Brown wrote:

Hi Marc,

Yes i’m probably doing just that.

The ceph admin guides aren’t exactly helpful on this.  The cluster was deployed 
using cephadm and it’s been running perfectly until now.

Wouldn’t running “journalctl -u ceph-osd@5” on host ceph-004 show me the logs 
for osd.5 on that host?

On my containerized setup, the services that cephadm created are:

dcaro@node1:~ $ sudo systemctl list-units | grep ceph
   ceph-d49b287a-b680-11eb-95d4-e45f010c03a8@crash.node1.service
 loaded active 
running   Ceph crash.node1 for d49b287a-b680-11eb-95d4-e45f010c03a8
   ceph-d49b287a-b680-11eb-95d4-e45f010c03a8@mgr.node1.mhqltg.service   
 loaded active 
running   Ceph mgr.node1.mhqltg for d49b287a-b680-11eb-95d4-e45f010c03a8
   ceph-d49b287a-b680-11eb-95d4-e45f010c03a8@mon.node1.service  
 loaded active 
running   Ceph mon.node1 for d49b287a-b680-11eb-95d4-e45f010c03a8
   ceph-d49b287a-b680-11eb-95d4-e45f010c03a8@osd.3.service  
 loaded active 
running   Ceph osd.3 for d49b287a-b680-11eb-95d4-e45f010c03a8
   ceph-d49b287a-b680-11eb-95d4-e45f010c03a8@osd.7.service  
 loaded active 
running   Ceph osd.7 for d49b287a-b680-11eb-95d4-e45f010c03a8
   system-ceph\x2dd49b287a\x2db680\x2d11eb\x2d95d4\x2de45f010c03a8.slice
 loaded active 
activesystem-ceph\x2dd49b287a\x2db680\x2d11eb\x2d95d4\x2de45f010c03a8.slice
   ceph-d49b287a-b680-11eb-95d4-e45f010c03a8.target 
 loaded active 
activeCeph cluster d49b287a-b680-11eb-95d4-e45f010c03a8
   ceph.target  
 loaded active 
activeAll Ceph clusters and services

where the string after 'ceph-' is the fsid of the cluster.
Hope that helps (you can use the systemctl list-units also to search the 
specific ones on yours).



Cheers,
A





Sent from Mail for Windows 10

From: Marc
Sent: 06 August 2021 08:54
To: Andrew Walker-Brown; 
ceph-users@ceph.io
Subject: RE: All OSDs on one host down


I’ve tried restarting one of the OSDs but that fails; journalctl shows
"osd not found". Not convinced I’ve got the systemctl command right.


You are probably mixing 'non-container commands' with 'container commands'. As in,
if you execute this journalctl outside of the container it will not find
anything, of course.


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

--
David Caro
SRE - Cloud Services
Wikimedia Foundation 
PGP Signature: 7180 83A2 AC8B 314F B4CE  1171 4071 C7E1 D262 69C3

"Imagine a world in which every single human being can freely share in the
sum of all knowledge. That's our commitment."

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io

[ceph-users] Re: All OSDs on one host down

2021-08-07 Thread mabi
Indeed, if you upgrade Docker, for example via APT unattended-upgrades, the
Docker daemon will get restarted, meaning all your containers get restarted
too :( That's just how Docker works.

You might want to switch to podman instead of Docker in order to avoid that. I 
use podman precisely for this reason.

‐‐‐ Original Message ‐‐‐

On Saturday, August 7th, 2021 at 11:04 AM, Andrew Walker-Brown 
 wrote:

> Thanks David,
>
> Spent some more time digging in the logs/google. Also had a further 2 nodes 
> fail this morning (different nodes).
>
> Looks like it’s related to apt auto-updates on Ubuntu 20.04, although we
> don’t run unattended upgrades. Docker appears to get a terminate signal which
> shuts down/restarts all the containers but some don’t come back cleanly.
> There were also some legacy unused interfaces/bonds in the netplan config.
>
> Anyway, cleaned all that up...so hopefully it’s resolved.
>
> Cheers,
>
> A.
>
> Sent from Mail for Windows 10
>
> From: David Caro <dc...@wikimedia.org>
>
> Sent: 06 August 2021 09:20
>
> To: Andrew Walker-Brown <andrew_jbr...@hotmail.com>
>
> Cc: Marc <m...@f1-outsourcing.eu>; ceph-users@ceph.io
>
> Subject: Re: [ceph-users] Re: All OSDs on one host down
>
> On 08/06 07:59, Andrew Walker-Brown wrote:
>
> > Hi Marc,
> >
> > Yes i’m probably doing just that.
> >
> > The ceph admin guides aren’t exactly helpful on this. The cluster was 
> > deployed using cephadm and it’s been running perfectly until now.
> >
> > Wouldn’t running “journalctl -u ceph-osd@5” on host ceph-004 show me the 
> > logs for osd.5 on that host?
>
> On my containerized setup, the services that cephadm created are:
>
> dcaro@node1:~ $ sudo systemctl list-units | grep ceph
>
> ceph-d49b287a-b680-11eb-95d4-e45f010c03a8@crash.node1.service loaded active 
> running Ceph crash.node1 for d49b287a-b680-11eb-95d4-e45f010c03a8
>
> ceph-d49b287a-b680-11eb-95d4-e45f010c03a8@mgr.node1.mhqltg.service loaded 
> active running Ceph mgr.node1.mhqltg for d49b287a-b680-11eb-95d4-e45f010c03a8
>
> ceph-d49b287a-b680-11eb-95d4-e45f010c03a8@mon.node1.service loaded active 
> running Ceph mon.node1 for d49b287a-b680-11eb-95d4-e45f010c03a8
>
> ceph-d49b287a-b680-11eb-95d4-e45f010c03a8@osd.3.service loaded active running 
> Ceph osd.3 for d49b287a-b680-11eb-95d4-e45f010c03a8
>
> ceph-d49b287a-b680-11eb-95d4-e45f010c03a8@osd.7.service loaded active running 
> Ceph osd.7 for d49b287a-b680-11eb-95d4-e45f010c03a8
>
> system-ceph\x2dd49b287a\x2db680\x2d11eb\x2d95d4\x2de45f010c03a8.slice loaded 
> active active 
> system-ceph\x2dd49b287a\x2db680\x2d11eb\x2d95d4\x2de45f010c03a8.slice
>
> ceph-d49b287a-b680-11eb-95d4-e45f010c03a8.target loaded active active Ceph 
> cluster d49b287a-b680-11eb-95d4-e45f010c03a8
>
> ceph.target loaded active active All Ceph clusters and services
>
> where the string after 'ceph-' is the fsid of the cluster.
>
> Hope that helps (you can use the systemctl list-units also to search the 
> specific ones on yours).
>
> > Cheers,
> >
> > A
> >
> > Sent from Mail for Windows 10
> >
> > From: Marc <m...@f1-outsourcing.eu>
> >
> > Sent: 06 August 2021 08:54
> >
> > To: Andrew Walker-Brown <andrew_jbr...@hotmail.com>; ceph-users@ceph.io
> >
> > Subject: RE: All OSDs on one host down
> >
> > > I’ve tried restarting one of the OSDs but that fails; journalctl shows
> > >
> > > "osd not found". Not convinced I’ve got the systemctl command right.
> >
> > You are probably mixing 'non-container commands' with 'container commands'. As
> > in, if you execute this journalctl outside of the container it will not
> > find anything, of course.
> >
> > ceph-users mailing list -- ceph-users@ceph.io
> >
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
> David Caro
>
> SRE - Cloud Services
>
> Wikimedia Foundation https://wikimediafoundation.org/
>
> PGP Signature: 7180 83A2 AC8B 314F B4CE 1171 4071 C7E1 D262 69C3
>
> "Imagine a world in which every single human being can freely share in the
>
> sum of all knowledge. That's our commitment."
>
> ceph-users mailing list -- ceph-users@ceph.io
>
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: All OSDs on one host down

2021-08-07 Thread E Taka
A few hours ago we had the same problem, also with Ubuntu 20.04, and it
coincided in time with the latest docker update, which was triggered from
Puppet. In the end, all the containers came back up without a reboot. Thanks
for the hint.

Note to myself: change the package parameter for the Ubuntu package
'docker.io' from 'latest' to 'installed'.
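
For hosts that are not managed through Puppet, a rough equivalent on plain
Ubuntu (a sketch using the stock apt-mark tool) would be:

# apt-mark hold docker.io
# apt-mark unhold docker.io

Holding the package keeps unattended or automatic apt runs from upgrading the
Docker engine (and with it restarting every container); unhold it again when
you actually want to apply the upgrade in a maintenance window.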

On Sat, 7 Aug 2021 at 11:05, Andrew Walker-Brown wrote:
>
> Thanks David,
>
> Spent some more time digging in the logs/google.  Also had a further 2 nodes 
> fail this morning (different nodes).
>
> Looks like it’s related to apt auto-updates on Ubuntu 20.04, although we
> don’t run unattended upgrades.  Docker appears to get a terminate signal
> which shuts down/restarts all the containers but some don’t come back cleanly.
> There were also some legacy unused interfaces/bonds in the netplan config.
>
> Anyway, cleaned all that up...so hopefully it’s resolved.
>
> Cheers,
>
> A.
>
>
>
> Sent from Mail for Windows 10
>
> From: David Caro
> Sent: 06 August 2021 09:20
> To: Andrew Walker-Brown
> Cc: Marc; 
> ceph-users@ceph.io
> Subject: Re: [ceph-users] Re: All OSDs on one host down
>
> On 08/06 07:59, Andrew Walker-Brown wrote:
> > Hi Marc,
> >
> > Yes i’m probably doing just that.
> >
> > The ceph admin guides aren’t exactly helpful on this.  The cluster was 
> > deployed using cephadm and it’s been running perfectly until now.
> >
> > Wouldn’t running “journalctl -u ceph-osd@5” on host ceph-004 show me the 
> > logs for osd.5 on that host?
>
> On my containerized setup, the services that cephadm created are:
>
> dcaro@node1:~ $ sudo systemctl list-units | grep ceph
>   ceph-d49b287a-b680-11eb-95d4-e45f010c03a8@crash.node1.service   
>   loaded 
> active running   Ceph crash.node1 for d49b287a-b680-11eb-95d4-e45f010c03a8
>   ceph-d49b287a-b680-11eb-95d4-e45f010c03a8@mgr.node1.mhqltg.service  
>   loaded 
> active running   Ceph mgr.node1.mhqltg for 
> d49b287a-b680-11eb-95d4-e45f010c03a8
>   ceph-d49b287a-b680-11eb-95d4-e45f010c03a8@mon.node1.service 
>   loaded 
> active running   Ceph mon.node1 for d49b287a-b680-11eb-95d4-e45f010c03a8
>   ceph-d49b287a-b680-11eb-95d4-e45f010c03a8@osd.3.service 
>   loaded 
> active running   Ceph osd.3 for d49b287a-b680-11eb-95d4-e45f010c03a8
>   ceph-d49b287a-b680-11eb-95d4-e45f010c03a8@osd.7.service 
>   loaded 
> active running   Ceph osd.7 for d49b287a-b680-11eb-95d4-e45f010c03a8
>   system-ceph\x2dd49b287a\x2db680\x2d11eb\x2d95d4\x2de45f010c03a8.slice   
>   loaded 
> active active
> system-ceph\x2dd49b287a\x2db680\x2d11eb\x2d95d4\x2de45f010c03a8.slice
>   ceph-d49b287a-b680-11eb-95d4-e45f010c03a8.target
>   loaded 
> active activeCeph cluster d49b287a-b680-11eb-95d4-e45f010c03a8
>   ceph.target 
>   loaded 
> active activeAll Ceph clusters and services
>
> where the string after 'ceph-' is the fsid of the cluster.
> Hope that helps (you can use the systemctl list-units also to search the 
> specific ones on yours).
>
>
> >
> > Cheers,
> > A
> >
> >
> >
> >
> >
> > Sent from Mail for Windows 
> > 10
> >
> > From: Marc
> > Sent: 06 August 2021 08:54
> > To: Andrew Walker-Brown; 
> > ceph-users@ceph.io
> > Subject: RE: All OSDs on one host down
> >
> > >
> > > I’ve tried restarting one of the OSDs but that fails; journalctl shows
> > > "osd not found". Not convinced I’ve got the systemctl command right.
> > >
> >
> > You are probably mixing 'non-container commands' with 'container commands'. As
> > in, if you execute this journalctl outside of the container it will not
> > find anything, of course.
> >
> >
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
> --
> David Caro
> SRE - Cloud Services
> Wikimedia Foundation 
> PGP Signature: 7180 83A2 AC8B 314F B4CE  1171 4071 C7E1 D262 69C3
>
> "Imagine a world in which every single human being can freely share in the
> sum of all knowledge. That's our commitment."

[ceph-users] Re: All OSDs on one host down

2021-08-07 Thread Andrew Walker-Brown
Yeah, I think that’s along the lines of what I’ve faced here.  Hopefully I’ve
managed to disable the auto updates.

Sent from Mail for Windows

From: Clyso GmbH - Ceph Foundation Member
Sent: 07 August 2021 10:46
To: Andrew Walker-Brown; David 
Caro
Cc: Marc; 
ceph-users@ceph.io
Subject: Re: [ceph-users] Re: All OSDs on one host down

Hi Andrew,

We have had bad experiences with Ubuntu's auto update, especially when
updating packages like systemd, dbus and docker.
For example, one effect was internal communication errors; only a
restart of the node helped.

Cheers, Joachim

___
Clyso GmbH - Ceph Foundation Member
supp...@clyso.com
https://www.clyso.com

On 07.08.2021 at 11:04, Andrew Walker-Brown wrote:
> Thanks David,
>
> Spent some more time digging in the logs/google.  Also had a further 2 nodes 
> fail this morning (different nodes).
>
> Looks like it’s related to apt auto-updates on Ubuntu 20.04, although we
> don’t run unattended upgrades.  Docker appears to get a terminate signal
> which shuts down/restarts all the containers but some don’t come back cleanly.
> There were also some legacy unused interfaces/bonds in the netplan config.
>
> Anyway, cleaned all that up...so hopefully it’s resolved.
>
> Cheers,
>
> A.
>
>
>
> Sent from Mail for Windows 10
>
> From: David Caro
> Sent: 06 August 2021 09:20
> To: Andrew Walker-Brown
> Cc: Marc; 
> ceph-users@ceph.io
> Subject: Re: [ceph-users] Re: All OSDs on one host down
>
> On 08/06 07:59, Andrew Walker-Brown wrote:
>> Hi Marc,
>>
>> Yes i’m probably doing just that.
>>
>> The ceph admin guides aren’t exactly helpful on this.  The cluster was 
>> deployed using cephadm and it’s been running perfectly until now.
>>
>> Wouldn’t running “journalctl -u ceph-osd@5” on host ceph-004 show me the 
>> logs for osd.5 on that host?
> On my containerized setup, the services that cephadm created are:
>
> dcaro@node1:~ $ sudo systemctl list-units | grep ceph
>ceph-d49b287a-b680-11eb-95d4-e45f010c03a8@crash.node1.service  
>loaded 
> active running   Ceph crash.node1 for d49b287a-b680-11eb-95d4-e45f010c03a8
>ceph-d49b287a-b680-11eb-95d4-e45f010c03a8@mgr.node1.mhqltg.service 
>loaded 
> active running   Ceph mgr.node1.mhqltg for 
> d49b287a-b680-11eb-95d4-e45f010c03a8
>ceph-d49b287a-b680-11eb-95d4-e45f010c03a8@mon.node1.service
>loaded 
> active running   Ceph mon.node1 for d49b287a-b680-11eb-95d4-e45f010c03a8
>ceph-d49b287a-b680-11eb-95d4-e45f010c03a8@osd.3.service
>loaded 
> active running   Ceph osd.3 for d49b287a-b680-11eb-95d4-e45f010c03a8
>ceph-d49b287a-b680-11eb-95d4-e45f010c03a8@osd.7.service
>loaded 
> active running   Ceph osd.7 for d49b287a-b680-11eb-95d4-e45f010c03a8
>system-ceph\x2dd49b287a\x2db680\x2d11eb\x2d95d4\x2de45f010c03a8.slice  
>loaded 
> active active
> system-ceph\x2dd49b287a\x2db680\x2d11eb\x2d95d4\x2de45f010c03a8.slice
>ceph-d49b287a-b680-11eb-95d4-e45f010c03a8.target   
>loaded 
> active activeCeph cluster d49b287a-b680-11eb-95d4-e45f010c03a8
>ceph.target
>loaded 
> active activeAll Ceph clusters and services
>
> where the string after 'ceph-' is the fsid of the cluster.
> Hope that helps (you can use the systemctl list-units also to search the
> specific ones on yours).

[ceph-users] Re: All OSDs on one host down

2021-08-07 Thread Andrew Walker-Brown
Thanks David,

Spent some more time digging in the logs/google.  Also had a further 2 nodes 
fail this morning (different nodes).

Looks like it’s related to apt auto-updates on Ubuntu 20.04, although we don’t
run unattended upgrades.  Docker appears to get a terminate signal which
shuts down/restarts all the containers but some don’t come back cleanly.
There were also some legacy unused interfaces/bonds in the netplan config.

Anyway, cleaned all that up...so hopefully it’s resolved.

Cheers,

A.



Sent from Mail for Windows 10

From: David Caro
Sent: 06 August 2021 09:20
To: Andrew Walker-Brown
Cc: Marc; 
ceph-users@ceph.io
Subject: Re: [ceph-users] Re: All OSDs on one host down

On 08/06 07:59, Andrew Walker-Brown wrote:
> Hi Marc,
>
> Yes i’m probably doing just that.
>
> The ceph admin guides aren’t exactly helpful on this.  The cluster was 
> deployed using cephadm and it’s been running perfectly until now.
>
> Wouldn’t running “journalctl -u ceph-osd@5” on host ceph-004 show me the logs 
> for osd.5 on that host?

On my containerized setup, the services that cephadm created are:

dcaro@node1:~ $ sudo systemctl list-units | grep ceph
  ceph-d49b287a-b680-11eb-95d4-e45f010c03a8@crash.node1.service 
loaded active 
running   Ceph crash.node1 for d49b287a-b680-11eb-95d4-e45f010c03a8
  ceph-d49b287a-b680-11eb-95d4-e45f010c03a8@mgr.node1.mhqltg.service
loaded active 
running   Ceph mgr.node1.mhqltg for d49b287a-b680-11eb-95d4-e45f010c03a8
  ceph-d49b287a-b680-11eb-95d4-e45f010c03a8@mon.node1.service   
loaded active 
running   Ceph mon.node1 for d49b287a-b680-11eb-95d4-e45f010c03a8
  ceph-d49b287a-b680-11eb-95d4-e45f010c03a8@osd.3.service   
loaded active 
running   Ceph osd.3 for d49b287a-b680-11eb-95d4-e45f010c03a8
  ceph-d49b287a-b680-11eb-95d4-e45f010c03a8@osd.7.service   
loaded active 
running   Ceph osd.7 for d49b287a-b680-11eb-95d4-e45f010c03a8
  system-ceph\x2dd49b287a\x2db680\x2d11eb\x2d95d4\x2de45f010c03a8.slice 
loaded active 
activesystem-ceph\x2dd49b287a\x2db680\x2d11eb\x2d95d4\x2de45f010c03a8.slice
  ceph-d49b287a-b680-11eb-95d4-e45f010c03a8.target  
loaded active 
activeCeph cluster d49b287a-b680-11eb-95d4-e45f010c03a8
  ceph.target   
loaded active 
activeAll Ceph clusters and services

where the string after 'ceph-' is the fsid of the cluster.
Hope that helps (you can use the systemctl list-units also to search the 
specific ones on yours).
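
To act on a single OSD under cephadm, the unit name therefore combines the
cluster fsid with the daemon id. A sketch (substitute your own fsid, e.g. from
"ceph fsid" or "cephadm ls", and the OSD id in question):

# systemctl restart ceph-<fsid>@osd.5.service
# journalctl -u ceph-<fsid>@osd.5.service

That is the cephadm equivalent of the bare-metal "journalctl -u ceph-osd@5"
mentioned above.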


>
> Cheers,
> A
>
>
>
>
>
> Sent from Mail for Windows 10
>
> From: Marc
> Sent: 06 August 2021 08:54
> To: Andrew Walker-Brown; 
> ceph-users@ceph.io
> Subject: RE: All OSDs on one host down
>
> >
> > I’ve tried restarting one of the OSDs but that fails; journalctl shows
> > "osd not found". Not convinced I’ve got the systemctl command right.
> >
>
> You are probably mixing 'non-container commands' with 'container commands'. As in,
> if you execute this journalctl outside of the container it will not find
> anything, of course.
>
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

--
David Caro
SRE - Cloud Services
Wikimedia Foundation 
PGP Signature: 7180 83A2 AC8B 314F B4CE  1171 4071 C7E1 D262 69C3

"Imagine a world in which every single human being can freely share in the
sum of all knowledge. That's our commitment."

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io