Re: [ceph-users] 14.2.1 OSDs crash and sometimes fail to start back up, workaround

2019-07-12 Thread Edward Kalk
Slight correction: I removed and re-added only the OSDs that were crashing.
I noticed it seemed to be only certain OSDs that were crashing. Once they were
rebuilt, they stopped crashing.

Further info: we originally deployed Luminous, upgraded to Mimic, then
upgraded to Nautilus.
Perhaps there were issues with the OSDs related to the upgrades? I don’t know.
Perhaps a clean install of 14.2.1 would not have done this? I don’t know.

-Ed

> On Jul 12, 2019, at 11:32 AM, Edward Kalk  wrote:
> 
> It seems that I have been able to work around my issues.
> I’ve attempted to reproduce the problem by rebooting nodes and by stopping all
> OSDs, waiting a bit, and starting them again.
> At this time, no OSDs are crashing like before. OSDs seem to have no problems
> starting either.
> What I did was completely remove the OSDs one at a time and re-deploy them,
> allowing Ceph 14.2.1 to rebuild them.
> I have attached my doc I use to accomplish this. *Before I do it, I mark the
> OSD as “out” via the GUI or CLI and allow it to reweight to 0%, monitored via
> ceph -s. I do this so that there is not an actual disk failure, which would put
> me into a dual-disk failure while I’m rebuilding an OSD.
> 
> -Edward Kalk
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] 14.2.1 OSDs crash and sometimes fail to start back up, workaround

2019-07-12 Thread Edward Kalk
It seems that I have been able to work around my issues.
I’ve attempted to reproduce the problem by rebooting nodes and by stopping all
OSDs, waiting a bit, and starting them again.
At this time, no OSDs are crashing like before. OSDs seem to have no problems
starting either.
What I did was completely remove the OSDs one at a time and re-deploy them,
allowing Ceph 14.2.1 to rebuild them.
Remove a disk:
1.) See which OSD is which disk: sudo ceph-volume lvm list

2.) ceph osd out X
EX:
synergy@synergy3:~$ ceph osd out 21
marked out osd.21.

2.a) ceph osd down osd.X
Ex:
ceph osd down osd.21

2.aa) Stop OSD daemon: sudo systemctl stop ceph-osd@X
EX:
sudo systemctl stop ceph-osd@21

2.b) ceph osd rm osd.X
EX:
ceph osd rm osd.21

3.) Check status: ceph -s

4.) Observe data migration: ceph -w

5.) Remove from CRUSH: ceph osd crush remove {name}
EX: ceph osd crush remove osd.21
5.b) Delete auth: ceph auth del osd.21

6.) Find info on the disk:
sudo hdparm -I /dev/sdd

7.) See SATA ports: lsscsi --verbose

8.) Go pull the disk and replace it, or keep it and do the following steps to
re-use it.

Additional steps to remove and re-use a disk (without ejecting it; ejecting and
replacing the disk handles this for us).
(Do this last, after following the Ceph docs for removing a disk.)
9.) sudo gdisk /dev/sdX (expert menu: x, z, Y, Y to zap the partition tables)
9.a) Remove the old LVM/device-mapper mapping (find the name via lsblk):
lsblk
dmsetup remove ceph--e36dc03d--bf0d--462a--b4e6--8e49819bec0b-osd--block--d5574ac1--f72f--4942--8f4a--ac24891b2ee6

10.) Deploy a /dev/sdX disk: from 216.106.44.209 (ceph-mon0), you must be in
the "my_cluster" folder:
EX: Synergy@Ceph-Mon0:~/my_cluster$ ceph-deploy osd create --data /dev/sdd synergy1
I have attached my doc I use to accomplish this. *Before I do it, I mark the
OSD as “out” via the GUI or CLI and allow it to reweight to 0%, monitored via
ceph -s. I do this so that there is not an actual disk failure, which would put
me into a dual-disk failure while I’m rebuilding an OSD.
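
For reference, a minimal consolidated sketch of the steps above for one OSD
(assumes osd.21 backed by /dev/sdd on host synergy1, ceph-deploy run from the
my_cluster folder on ceph-mon0, and that you wait for rebalance/HEALTH_OK after
marking the OSD out before wiping the disk):

ceph osd out 21                      # start draining data off this OSD
ceph -s                              # wait for rebalance to finish before continuing
ceph osd down osd.21
sudo systemctl stop ceph-osd@21      # run on the OSD's host
ceph osd rm osd.21
ceph osd crush remove osd.21
ceph auth del osd.21
sudo gdisk /dev/sdd                  # expert menu: x, z, Y, Y to zap the partition tables
sudo dmsetup remove <ceph-*-osd-block-* name from lsblk>   # placeholder: use the actual dm name
ceph-deploy osd create --data /dev/sdd synergy1            # from ceph-mon0:~/my_cluster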

-Edward Kalk

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] P1 production down - 4 OSDs down will not start 14.2.1 nautilus

2019-07-11 Thread Edward Kalk
Production has been restored; this time it just took about 26 minutes before
Linux would let me execute the OSD start command. The longest yet.
sudo systemctl start ceph-osd@X
(Yes, this has happened to us about 4 times now.)
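
The long waits are consistent with systemd rate-limiting starts of the
ceph-osd@ unit after repeated failures. A hedged sketch of how one might clear
that state immediately instead of waiting (assumes a stock ceph-osd@.service;
verify the limits on your own hosts):

sudo systemctl reset-failed ceph-osd@X           # clear the failed/start-limit state so a start is allowed now
sudo systemctl start ceph-osd@X
systemctl show ceph-osd@X | grep -i StartLimit   # inspect the burst/interval limits in effect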
-Ed

> On Jul 11, 2019, at 11:38 AM, Edward Kalk  wrote:
> 
> Rebooted node 4; on nodes 1 and 2, two OSDs each crashed and will not start.
> 
> 
> The logs are similar; it seems to be the bug related to 38724.
> Tried to manually start the OSDs; that failed. It’s been about 20 minutes with
> prod down.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] 14.2.1 Nautilus OSDs crash

2019-07-11 Thread Edward Kalk
http://tracker.ceph.com/issues/38724
^ This bug seems to be related; I’ve added notes to it.
Triggers seem to be a node reboot, or removing or adding an OSD.
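
A hedged way to check whether a down OSD actually hit this signature, versus
just being held back by the start limit (log path assumes the default
/var/log/ceph location and OSD id X):

sudo grep -n "_txc_add_transaction" /var/log/ceph/ceph-osd.X.log   # crash signature from issue 38724
sudo systemctl status ceph-osd@X   # "start request repeated too quickly" means systemd is holding it back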

There seem to be backport duplicates for Mimic and Luminous:
Copied to RADOS - Backport #39692: mimic: _txc_add_transaction error (39)
Directory not empty not handled on operation 21 (op 1, counting from 0) - New
Copied to RADOS - Backport #39693: nautilus: _txc_add_transaction error (39)
Directory not empty not handled on operation 21 (op 1, counting from 0) - New
Copied to RADOS - Backport #39694: luminous: _txc_add_transaction error (39)
Directory not empty not handled on operation 21 (op 1, counting from 0)

This may have an impact on production when multiple OSDs repeatedly fail to
start after hitting the bug: Linux (systemd) stops further start attempts after
too many failures. Our production VM becomes unresponsive for about 10 minutes,
then the OSD tries to start again and typically comes up. Sometimes it does not,
and we go another 10 minutes. I have had this happen and the prod VM crashes.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] iSCSI on Ubuntu and HA / Multipathing

2019-07-10 Thread Edward Kalk
The docs say (http://docs.ceph.com/docs/nautilus/rbd/iscsi-targets/):

Red Hat Enterprise Linux/CentOS 7.5 (or newer); Linux kernel v4.16 (or newer)
^^ Is there a version combination of Ceph and Ubuntu that works? Is anyone
running iSCSI on Ubuntu?
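
Before trying it, a hedged way to check whether an Ubuntu host meets the
kernel-side prerequisites the docs list (the package names here are assumptions
and may not be available in your configured repos):

uname -r                                 # docs call for a 4.16+ kernel
modinfo target_core_user                 # TCMU kernel module used by tcmu-runner
apt-cache policy tcmu-runner ceph-iscsi  # see whether the userspace pieces are packaged at all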
-Ed
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ubuntu 18.04 - Mimic - Nautilus

2019-07-10 Thread Edward Kalk
Interesting. So is it not good that I am running Ubuntu 16.04 and 14.2.1?
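
For what it’s worth, a quick way to see which Ceph builds the configured repos
actually offer on a given host (a sketch; assumes the download.ceph.com apt
repo, if any, is already in sources.list):

apt-cache policy ceph ceph-osd   # lists the candidate version and which repo provides it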
-Ed

> On Jul 10, 2019, at 1:46 PM, Reed Dier  wrote:
> 
> It does not appear that that page has been updated in a while.
> 
> The official Ceph deb repos only include Mimic and Nautilus packages for
> 18.04, while the Ubuntu bionic repos include a Luminous build.
> 
> Hope that helps.
> 
> Reed
> 
>> On Jul 10, 2019, at 1:20 PM, Edward Kalk  wrote:
>> 
>> When reviewing http://docs.ceph.com/docs/master/start/os-recommendations/
>> I see there is no mention of “mimic” or “nautilus”.
>> What are the OS recommendations for them, specifically Nautilus, which is
>> the one I’m running?
>> 
>> Is 18.04 advisable at all?
>> 
>> -Ed
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ubuntu 18.04 - Mimic - Nautilus

2019-07-10 Thread Edward Kalk
When reviewing http://docs.ceph.com/docs/master/start/os-recommendations/ I see
there is no mention of “mimic” or “nautilus”.
What are the OS recommendations for them, specifically Nautilus, which is the
one I’m running?

Is 18.04 advisable at all?

-Ed
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Few OSDs crash on partner nodes when a node is rebooted

2019-07-10 Thread Edward Kalk
Wondering if the fix for http://tracker.ceph.com/issues/38724 will be included
in https://tracker.ceph.com/projects/ceph/roadmap#v14.2.2 ?

> On Jul 10, 2019, at 7:55 AM, Edward Kalk  wrote:
> 
> Hello,
> 
> Has anyone else seen this? Basically I reboot a node and 2-3 OSDs on other
> hosts crash.
> Then they fail to start back up and seem to hit a startup bug:
> http://tracker.ceph.com/issues/38724
> 
> What’s weird is that it seemed to be the same OSDs that crash, so as a
> workaround I tried to rip out and rebuild them.
> Will test rebooting a node today and see if they still crash or if any others
> crash this time.
> 
> -Ed Kalk
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Few OSDs crash on partner nodes when a node is rebooted

2019-07-10 Thread Edward Kalk
Hello,

Has anyone else seen this? Basically I reboot a node and 2-3 OSDs on other
hosts crash.
Then they fail to start back up and seem to hit a startup bug:
http://tracker.ceph.com/issues/38724

What’s weird is that it seemed to be the same OSDs that crash, so as a
workaround I tried to rip out and rebuild them.
Will test rebooting a node today and see if they still crash or if any others
crash this time.
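
For what it’s worth, Nautilus can also record OSD crashes cluster-wide, which
helps spot whether the same OSDs keep crashing (a sketch; assumes the mgr crash
module is enabled and the ceph-crash agent is running on the OSD hosts):

ceph crash ls                 # one line per recorded crash, with entity (e.g. osd.N) and timestamp
ceph crash info <crash-id>    # backtrace and metadata for a specific crash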

-Ed Kalk
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com