Re: [ceph-users] 14.2.1 OSDs crash and sometimes fail to start back up, workaround
Slight correction: I removed and added back only the OSDs that were crashing. I noticed it seemed to be only certain OSDs that were crashing; once they were rebuilt, they stopped crashing.

Further info: we originally deployed Luminous, upgraded to Mimic, then upgraded to Nautilus. Perhaps there were issues with the OSDs related to the upgrades? I don't know. Perhaps a clean install of 14.2.1 would not have done this? I don't know.

-Ed

> On Jul 12, 2019, at 11:32 AM, Edward Kalk wrote:
>
> It seems that I have been able to work around my issues. I've attempted to
> reproduce by rebooting nodes and by stopping all OSDs, waiting a bit, and
> starting them again. At this time, no OSDs are crashing like before, and
> the OSDs seem to have no problems starting either. What I did was remove
> the OSDs completely, one at a time, and reissue them, allowing Ceph 14.2.1
> to rebuild them. I have attached the doc I use to accomplish this.
> *Before I do it, I mark the OSD as "out" via the GUI or CLI and allow it
> to reweight to 0%, monitored via ceph -s. I do this so that an actual
> disk failure during the rebuild does not put me into a dual-disk failure.
>
> -Edward Kalk
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
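FWIW, if upgrade leftovers are a suspect, one quick sanity check is that every OSD reports the same release after the Luminous -> Mimic -> Nautilus chain. A minimal sketch, assuming the plain-text output of Nautilus's "ceph osd versions" command (the helper name count_osd_versions is mine):

```shell
# Count how many distinct Ceph releases the OSDs report.
# Anything greater than 1 means a mixed-version cluster.
count_osd_versions() {
    # pull out the "ceph version X.Y.Z" strings and count the unique ones
    grep -o 'ceph version [0-9][0-9.]*' | sort -u | wc -l | tr -d ' '
}

# On a live cluster (not run here):
#   ceph osd versions | count_osd_versions
```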
[ceph-users] 14.2.1 OSDs crash and sometimes fail to start back up, workaround
It seems that I have been able to work around my issues. I've attempted to reproduce by rebooting nodes and by stopping all OSDs, waiting a bit, and starting them again. At this time, no OSDs are crashing like before, and the OSDs seem to have no problems starting either. What I did was remove the OSDs completely, one at a time, and reissue them, allowing Ceph 14.2.1 to rebuild them.

Remove a disk:

1.) See which OSD is which disk: sudo ceph-volume lvm list
2.) Mark the OSD out: ceph osd out X
    EX: synergy@synergy3:~$ ceph osd out 21
        marked out osd.21.
2.a) Mark it down: ceph osd down osd.X
    EX: ceph osd down osd.21
2.aa) Stop the OSD daemon: sudo systemctl stop ceph-osd@X
    EX: sudo systemctl stop ceph-osd@21
2.b) Remove it: ceph osd rm osd.X
    EX: ceph osd rm osd.21
3.) Check status: ceph -s
4.) Observe data migration: ceph -w
5.) Remove from CRUSH: ceph osd crush remove {name}
    EX: ceph osd crush remove osd.21
5.b) Delete auth: ceph auth del osd.21
6.) Find info on the disk: sudo hdparm -I /dev/sdd
7.) See SATA ports: lsscsi --verbose
8.) Go pull the disk and replace it, or keep it and do the following steps to re-use it.

Additional steps to remove and re-use a disk (without ejecting, as ejecting and replacing does this for us; do this last, after following the Ceph docs for removing a disk):

9.) Wipe the partition table: sudo gdisk /dev/sdX (x, z, Y, Y)
9.a) Remove the leftover LVM mapping:
    lsblk
    dmsetup remove ceph--e36dc03d--bf0d--462a--b4e6--8e49819bec0b-osd--block--d5574ac1--f72f--4942--8f4a--ac24891b2ee6
10.) Deploy a /dev/sdX disk: from 216.106.44.209 (ceph-mon0), you must be in the "my_cluster" folder:
    EX: Synergy@Ceph-Mon0:~/my_cluster$ ceph-deploy osd create --data /dev/sdd synergy1

I have attached the doc I use to accomplish this. *Before I do it, I mark the OSD as "out" via the GUI or CLI and allow it to reweight to 0%, monitored via ceph -s. I do this so that an actual disk failure during the rebuild does not put me into a dual-disk failure.

-Edward Kalk
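FWIW, the remove-and-redeploy steps above can be collected into one script. This is just a sketch of the procedure in this thread, not a tested tool: the OSD id, device, and host are the example values from above, "ceph-volume lvm zap" stands in for the manual gdisk/dmsetup wipe in steps 9 and 9.a, the zap and ceph-deploy steps run on different hosts in practice, and a DRY_RUN guard (on by default) only echoes the commands so they can be reviewed first.

```shell
#!/bin/sh
# Sketch of the remove-and-redeploy procedure above. DRY_RUN=1 (the
# default) only prints each command instead of running it.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" -eq 1 ]; then
        echo "WOULD RUN: $*"
    else
        "$@"
    fi
}

rebuild_osd() {
    osd_id=$1   # e.g. 21
    dev=$2      # e.g. /dev/sdd
    host=$3     # e.g. synergy1

    # Drain first, so a real disk failure mid-rebuild is not a dual failure.
    run ceph osd out "$osd_id"
    # ...wait here until 'ceph -s' shows the cluster healthy again...

    # Stop and remove the OSD (steps 2.a through 5.b above).
    run ceph osd down "osd.$osd_id"
    run sudo systemctl stop "ceph-osd@$osd_id"
    run ceph osd rm "osd.$osd_id"
    run ceph osd crush remove "osd.$osd_id"
    run ceph auth del "osd.$osd_id"

    # Wipe and redeploy (zap replaces the gdisk/dmsetup steps and runs on
    # the OSD host; ceph-deploy runs from the admin node's cluster folder).
    run sudo ceph-volume lvm zap "$dev"
    run ceph-deploy osd create --data "$dev" "$host"
}

# Example, using the values from this thread:
# rebuild_osd 21 /dev/sdd synergy1
```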