On Sun, Jan 20, 2019 at 11:30 PM Brian Topping <[email protected]> wrote:
>
> Hi all, looks like I might have pooched something. Between the two nodes I
> have, I moved all the PGs to one machine, reformatted the other machine,
> rebuilt that machine, and moved the PGs back. In both cases, I did this by
> taking the OSDs on the machine being moved from “out” and waiting for health
> to be restored, then took them down.
>
> This worked great up to the point I had the mon/manager/rgw where they
> started, all the OSDs/PGs on the other machine that had been rebuilt. The
> next step was to rebuild the master machine, copy /etc/ceph and /var/lib/ceph
> with cpio, then re-add new OSDs on the master machine as it were.
>
> This didn’t work so well. The master has come up just fine, but it’s not
> connecting to the OSDs. Of the four OSDs, only two came up, and the other two
> did not (IDs 1 and 3). For its part, the OSD machine is reporting lines like
> the following in its logs:
>
> > [2019-01-20 16:22:10,106][systemd][WARNING] failed activating OSD, retries
> > left: 2
> > [2019-01-20 16:22:15,111][ceph_volume.process][INFO ] Running command:
> > /usr/sbin/ceph-volume lvm trigger 1-e3bfc69e-a145-4e19-aac2-5f888e1ed2ce
> > [2019-01-20 16:22:15,271][ceph_volume.process][INFO ] stderr -->
> > RuntimeError: could not find osd.1 with fsid
> > e3bfc69e-a145-4e19-aac2-5f888e1ed2ce
When creating an OSD, ceph-volume captures the OSD ID and the FSID and
uses them to create a systemd unit. When the system boots, that unit
queries LVM for devices that match the ID/FSID pair.
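The matching logic can be illustrated with a small sketch (my own toy
data structures, not ceph-volume's actual code): the systemd unit name
carries an `<id>-<fsid>` suffix, and activation fails when LVM reports a
different FSID for that OSD ID. The FSIDs below are the ones from your
log line and your `ceph-volume lvm list` output.

```python
def parse_trigger_arg(arg: str) -> tuple[str, str]:
    """Split a unit suffix like '1-e3bf...' into (osd_id, osd_fsid)."""
    osd_id, _, osd_fsid = arg.partition("-")
    return osd_id, osd_fsid

# Suffix from the failing systemd unit in the log:
unit_arg = "1-e3bfc69e-a145-4e19-aac2-5f888e1ed2ce"

# What `ceph-volume lvm list` actually reports for osd.1:
lvm_fsids = {"1": "4672bb90-8cea-4580-85f2-1e692811a05a"}

osd_id, wanted_fsid = parse_trigger_arg(unit_arg)
if lvm_fsids.get(osd_id) != wanted_fsid:
    # Mirrors the error in the log: the unit asks for an FSID
    # that no LVM-backed OSD on the box carries.
    print(f"could not find osd.{osd_id} with fsid {wanted_fsid}")
```
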
Is it possible you attempted to create an OSD, it failed, and you tried
again? That would explain a leftover systemd unit with an FSID that
doesn't match. From the output, it does look like you have an osd.1, but
with a different FSID (467... instead of e3b...). You could try to
disable the failing systemd unit with:

    systemctl disable [email protected]

(follow up with osd.3) and then run:

    ceph-volume lvm activate --all

Hopefully that gets you back to activated OSDs.
>
>
> I see this for the volumes:
>
> > [root@gw02 ceph]# ceph-volume lvm list
> >
> > ====== osd.1 =======
> >
> > [block]
> > /dev/ceph-c7640f3e-0bf5-4d75-8dd4-00b6434c84d9/osd-block-4672bb90-8cea-4580-85f2-1e692811a05a
> >
> > type block
> > osd id 1
> > cluster fsid 1cf94ce9-1323-4c43-865f-68f4ae9e6af3
> > cluster name ceph
> > osd fsid 4672bb90-8cea-4580-85f2-1e692811a05a
> > encrypted 0
> > cephx lockbox secret
> > block uuid 3M5fen-JgsL-t4vz-bh3m-k3pf-hjBV-4R7Cff
> > block device
> > /dev/ceph-c7640f3e-0bf5-4d75-8dd4-00b6434c84d9/osd-block-4672bb90-8cea-4580-85f2-1e692811a05a
> > vdo 0
> > crush device class None
> > devices /dev/sda3
> >
> > ====== osd.3 =======
> >
> > [block]
> > /dev/ceph-f5f453df-1d41-4883-b0f8-d662c6ba8bea/osd-block-084cf33d-8a38-4c82-884a-7c88e3161479
> >
> > type block
> > osd id 3
> > cluster fsid 1cf94ce9-1323-4c43-865f-68f4ae9e6af3
> > cluster name ceph
> > osd fsid 084cf33d-8a38-4c82-884a-7c88e3161479
> > encrypted 0
> > cephx lockbox secret
> > block uuid PSU2ba-6PbF-qhm7-RMER-lCkR-j58b-G9B6A7
> > block device
> > /dev/ceph-f5f453df-1d41-4883-b0f8-d662c6ba8bea/osd-block-084cf33d-8a38-4c82-884a-7c88e3161479
> > vdo 0
> > crush device class None
> > devices /dev/sdb3
> >
> > ====== osd.5 =======
> >
> > [block]
> > /dev/ceph-033e2bbe-5005-45d9-9ecd-4b541fe010bd/osd-block-e854930d-1617-4fe7-b3cd-98ef284643fd
> >
> > type block
> > osd id 5
> > cluster fsid 1cf94ce9-1323-4c43-865f-68f4ae9e6af3
> > cluster name ceph
> > osd fsid e854930d-1617-4fe7-b3cd-98ef284643fd
> > encrypted 0
> > cephx lockbox secret
> > block uuid F5YIfz-quO4-gbmW-rxyP-qXxe-iN7a-Po1mL9
> > block device
> > /dev/ceph-033e2bbe-5005-45d9-9ecd-4b541fe010bd/osd-block-e854930d-1617-4fe7-b3cd-98ef284643fd
> > vdo 0
> > crush device class None
> > devices /dev/sdc3
> >
> > ====== osd.7 =======
> >
> > [block]
> > /dev/ceph-1f3d4406-af86-4813-8d06-a001c57408fa/osd-block-5c0d0404-390e-4801-94a9-da52c104206f
> >
> > type block
> > osd id 7
> > cluster fsid 1cf94ce9-1323-4c43-865f-68f4ae9e6af3
> > cluster name ceph
> > osd fsid 5c0d0404-390e-4801-94a9-da52c104206f
> > encrypted 0
> > cephx lockbox secret
> > block uuid wgfOqi-iCu0-WIGb-uZPb-0R3n-ClQ3-0IewMe
> > block device
> > /dev/ceph-1f3d4406-af86-4813-8d06-a001c57408fa/osd-block-5c0d0404-390e-4801-94a9-da52c104206f
> > vdo 0
> > crush device class None
> > devices /dev/sdd3
>
> What I am wondering is if device mapper has lost something with a kernel or
> library change:
>
> > [root@gw02 ceph]# ls -l /dev/dm*
> > brw-rw----. 1 root disk 253, 0 Jan 20 16:19 /dev/dm-0
> > brw-rw----. 1 ceph ceph 253, 1 Jan 20 16:19 /dev/dm-1
> > brw-rw----. 1 ceph ceph 253, 2 Jan 20 16:19 /dev/dm-2
> > brw-rw----. 1 ceph ceph 253, 3 Jan 20 16:19 /dev/dm-3
> > brw-rw----. 1 ceph ceph 253, 4 Jan 20 16:19 /dev/dm-4
> > [root@gw02 ~]# dmsetup ls
> > ceph--1f3d4406--af86--4813--8d06--a001c57408fa-osd--block--5c0d0404--390e--4801--94a9--da52c104206f
> > (253:1)
> > ceph--f5f453df--1d41--4883--b0f8--d662c6ba8bea-osd--block--084cf33d--8a38--4c82--884a--7c88e3161479
> > (253:4)
> > ceph--033e2bbe--5005--45d9--9ecd--4b541fe010bd-osd--block--e854930d--1617--4fe7--b3cd--98ef284643fd
> > (253:2)
> > hndc1.centos02-root (253:0)
> > ceph--c7640f3e--0bf5--4d75--8dd4--00b6434c84d9-osd--block--4672bb90--8cea--4580--85f2--1e692811a05a
> > (253:3)
>
> How can I debug this? I suspect this is just some kind of a UID swap that
> happened somewhere, but I don’t know what the chain of truth is through
> the database files to connect the two together and make sure I have the
> correct OSD blocks where the mon expects to find them.
>
> Thanks! Brian
> _______________________________________________
> ceph-users mailing list
> [email protected]
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com