Re: [ceph-users] Nautilus 14.2.1 / 14.2.2 crash

2019-07-24 Thread Alex Litvak
The only possible hint is that the crash coincides with the start of a scrub time interval.  Why it didn't happen yesterday at the same time, I have no idea.  I restored the default debug settings in the hope of getting a little more info when the next crash happens.  I really would like to debug only specific components rather than turning everything up to 20.  Sorry for hijacking the thread; I will create a new one when I have more information.
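A minimal sketch of raising debug for specific subsystems only, assuming Nautilus's centralized config; the subsystem choices and levels here are illustrative:

# Raise log levels for just the OSD and BlueStore subsystems
ceph config set osd debug_osd 10/10
ceph config set osd debug_bluestore 10/10

# Or inject into the running daemons without a restart
ceph tell osd.* injectargs '--debug_osd 10 --debug_bluestore 10'

# Drop back to defaults afterwards
ceph config rm osd debug_osd
ceph config rm osd debug_bluestore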


Re: [ceph-users] Nautilus 14.2.1 / 14.2.2 crash

2019-07-23 Thread Alex Litvak
I just had an OSD crash with no logs (debug was not enabled).  It happened 24 hours after the actual upgrade from 14.2.1 to 14.2.2.  Nothing else changed as far as environment or load.  The disk is OK; I restarted the OSD and it came back.  The cluster had been up for two months without an issue until the upgrade.


Re: [ceph-users] Nautilus 14.2.1 / 14.2.2 crash

2019-07-23 Thread Nathan Fish
I have not had any more OSDs crash, but the 3 that crashed still crash
on startup. I may purge and recreate them, but there's no hurry. I
have 18 OSDs per host and plenty of free space currently.
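A hedged sketch of that purge-and-recreate path, assuming BlueStore OSDs deployed with ceph-volume; the OSD id and device are placeholders:

# Stop the crashed OSD and remove it from the cluster (CRUSH entry, auth key, osd id)
systemctl stop ceph-osd@53
ceph osd purge 53 --yes-i-really-mean-it

# Wipe the device and recreate an OSD on it (this destroys the old data)
ceph-volume lvm zap /dev/sdX --destroy
ceph-volume lvm create --data /dev/sdX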


Re: [ceph-users] Nautilus 14.2.1 / 14.2.2 crash

2019-07-23 Thread Ashley Merrick
Have they been stable since, or have there been more crashes?

Thanks,


Re: [ceph-users] Nautilus 14.2.1 / 14.2.2 crash

2019-07-19 Thread Alex Litvak
I was planning to upgrade 14.2.1 to 14.2.2 next week.  Since there are a few reports of crashes, does anyone know whether the upgrade somehow triggers the issue?  If not, what does?  Since some people reported this before upgrading, I am just wondering whether the upgrade to 14.2.2 makes the problem worse.


Re: [ceph-users] Nautilus 14.2.1 / 14.2.2 crash

2019-07-19 Thread Nathan Fish
Good to know. I tried reset-failed and restart several times, but it
didn't work on any of them. I also rebooted one of the hosts, which
didn't help. Thankfully it seems they failed far enough apart that our
nearly-empty cluster rebuilt in time. But it's rather worrying.
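To confirm that kind of rebuild has actually finished, the standard status commands are enough; a small sketch, nothing cluster-specific assumed:

# Overall health plus any ongoing recovery/backfill activity
ceph -s

# Per-PG summary; degraded and undersized counts should drain to zero
ceph pg stat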


Re: [ceph-users] Nautilus 14.2.1 / 14.2.2 crash

2019-07-19 Thread Nigel Williams
On Sat, 20 Jul 2019 at 04:28, Nathan Fish wrote:

> On further investigation, it seems to be this bug:
> http://tracker.ceph.com/issues/38724

We just upgraded to 14.2.2 and had a dozen OSDs at 14.2.2 go down with this
bug; we recovered with:

systemctl reset-failed ceph-osd@160
systemctl start ceph-osd@160
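With a dozen OSDs down, a small loop saves retyping that pair of commands; a minimal sketch with illustrative unit ids (reset-failed is needed because systemd gives up on a unit once it hits its start limit):

# Hypothetical ids of the failed OSDs on this host
for id in 151 157 160; do
    systemctl reset-failed "ceph-osd@$id"
    systemctl start "ceph-osd@$id"
done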


Re: [ceph-users] Nautilus 14.2.1 / 14.2.2 crash

2019-07-19 Thread Nathan Fish
On further investigation, it seems to be this bug:
http://tracker.ceph.com/issues/38724

On Fri, Jul 19, 2019 at 1:38 PM Nathan Fish wrote:
>
> I came in this morning and started to upgrade to 14.2.2, only to
> notice that 3 OSDs had crashed overnight - exactly 1 on each of 3
> hosts. Apparently there was no data loss, which implies they crashed
> at different times, far enough apart to rebuild? Still digging through
> logs to find exactly when they first crashed.
>
> Log from restarting ceph-osd@53:
> https://termbin.com/3e0x
>
> If someone can read this log and get anything out of it I would
> appreciate it. All I can tell is that it wasn't a RocksDB ENOSPC,
> which I have seen before.
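For that kind of digging, a couple of hedged one-liners, assuming journald logging; the unit id matches the ceph-osd@53 example above and the grep patterns are illustrative:

# First crash-looking line in the journal for one OSD
journalctl -u ceph-osd@53 | grep -m1 -iE 'abort|assert|segfault'

# Or scan the traditional log file, if file logging is enabled
grep -m1 -iE 'abort|assert' /var/log/ceph/ceph-osd.53.log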