Hi Alison,
I have observed exactly that with OSDs "converted" from ceph-disk to
ceph-volume. Someone thought it would be a great idea to store the /dev-device
name in the config instead of the uuid or any other stable device path:
# cat /etc/ceph/osd/287-2eaf591b-bced-4097-9499-5fda071c6161.json
{
...
"block": {
"path": "/dev/disk/by-partuuid/0c8a9f89-efa7-4c75-87ad-2f0d5aa2d649",
"uuid": "0c8a9f89-efa7-4c75-87ad-2f0d5aa2d649"
},
...
"data": {
"path": "/dev/sdm1",
"uuid": "2eaf591b-bced-4097-9499-5fda071c6161"
},
...
}
Funnily enough, it has the by-uuid path stored as well, but the /dev path is
actually used during activation. My "fix" is to re-generate the OSD-json just
before every ceph-disk OSD start.
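
To see which OSDs are affected before regenerating anything, a rough sketch along these lines may help. It only greps the JSON files for "path" entries that use a kernel device name. The function name find_unstable_paths and the sd/nvme/vd prefixes are my own assumptions, and the default directory is simply the one from the listing above.

```shell
# List "path" entries in the converted OSD JSON files that point at a
# kernel device name (/dev/sdX, /dev/nvme..., /dev/vd...), which may change
# across reboots. Paths under /dev/disk/by-* are not reported.
# Assumption: one "path" key per line, as in the listing above.
find_unstable_paths() {
    json_dir="${1:-/etc/ceph/osd}"
    grep -H -E '"path":[[:space:]]*"/dev/(sd|nvme|vd)[^"/]*"' "$json_dir"/*.json 2>/dev/null
}

# Usage (on the OSD node):
#   find_unstable_paths
#   find_unstable_paths /some/other/dir
```

Each hit is printed as file name plus matching line, so the OSD ID can be read off the file name. No output means no such path was found.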
You seem to be using LVM OSDs already, so this is a bit weird (it can't be the
exact same issue). Still, I would not be surprised if you are bitten by
something similar: some stored config (cache) overriding the actual drive
location. It is a real blessing that the developers implemented a check that a
partition actually points to the data with the correct OSD ID; otherwise our
cluster would be wrecked by now.
I would start by using low-level commands (ceph-volume) directly to see if the
issue is low-level or sits in some higher-level interface. Log in to the OSD
node and check what "ceph-volume inventory" says and whether you can manually
activate/deactivate the OSD on disk (be careful to include the --no-systemd
option everywhere to avoid unintended changes to persistent configuration).
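
As a minimal sketch of that sequence, the helper below only prints the commands for one LVM OSD so they can be reviewed before anything is run. The name print_osd_checks is made up, and the OSD ID and OSD fsid are placeholders you would take from the "ceph-volume lvm list" output. On a containerized deployment the commands would presumably have to be run inside the appropriate container or shell, which I am not covering here.

```shell
# Print (do not execute) the low-level checks for a single LVM OSD.
# $1 = OSD ID, $2 = OSD fsid; both are placeholders supplied by the operator.
print_osd_checks() {
    osd_id="$1"
    osd_fsid="$2"
    # What ceph-volume itself sees on this node
    echo "ceph-volume inventory"
    echo "ceph-volume lvm list"
    # Manual activation without touching systemd units
    echo "ceph-volume lvm activate --no-systemd $osd_id $osd_fsid"
}

# Usage: print_osd_checks 665 <osd-fsid>
```

Printing first and pasting the lines one by one keeps every step visible, which seems safer on a node that already behaves oddly.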
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: [email protected] <[email protected]>
Sent: Friday, August 25, 2023 10:29 PM
To: [email protected]
Subject: [ceph-users] Re: A couple OSDs not starting after host reboot
Hi,
Thank you for your reply. I don’t think the device names changed, but ceph
seems to be confused about which device the OSD is on. It’s reporting that
there are 2 OSDs on the same device although this is not true.
ceph device ls-by-host <osd-node> | grep sdu
ATA_HGST_HUH728080ALN600_VJH4GLUX sdu osd.665
ATA_HGST_HUH728080ALN600_VJH60MAX sdu osd.657
The osd.665 is actually on device sdm. Could this be the cause of the issue? Is
there a way to correct it?
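
A quick way to list every such collision on a host, as a rough sketch only: the filter below reads the "ceph device ls-by-host" output and prints each device name that shows up for more than one OSD. The function name find_shared_devs is hypothetical, and the column layout (device ID, device name, daemon) is assumed from the two lines above.

```shell
# Read "ceph device ls-by-host" output on stdin and print each device name
# (column 2) that is listed for more than one OSD (column 3).
# Lines whose third column is not an osd.N entry (e.g. a header) are skipped.
find_shared_devs() {
    awk '$3 ~ /^osd\./ {
             count[$2]++
             osds[$2] = osds[$2] " " $3
         }
         END {
             for (dev in count)
                 if (count[dev] > 1)
                     print dev ":" osds[dev]
         }'
}

# Usage: ceph device ls-by-host <osd-node> | find_shared_devs
```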
Thanks,
Alison
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]