Re: [ceph-users] SSD OSD crashing after upgrade to 12.2.10

Igor Fedotov Thu, 07 Feb 2019 06:05:58 -0800

Eugen,

First, you should upgrade to 12.2.11 (or bring in the mentioned patch by other means) to fix the rename procedure, which will prevent new inconsistent objects from appearing in the DB. This should at least reduce the OSD crash frequency.

Second, previous crashes could theoretically have left persistent inconsistent objects in your DB. I haven't seen that in real life before, but they probably exist. We need to check. If so, OSD crashes might still occur.

So I'd like to have an fsck report to verify that. It doesn't matter whether you run fsck before or after the upgrade.

Once we have the fsck report we can proceed with the repair. That is a somewhat risky procedure, so maybe I should try to simulate the inconsistency in question and check whether the built-in repair handles it properly. We'll see; let's get the fsck report first.

W.r.t. running ceph-bluestore-tool: you might want to specify a log file and increase the log level to 20 using the --log-file and --log-level options.
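A minimal sketch of such an invocation, combining the fsck command from your question with the two logging options. The helper name fsck_cmd and the log file location are only illustrative assumptions; the function just prints the command line so it can be reviewed before anything is run on a production host.

```shell
#!/usr/bin/env bash
# Sketch only: print the fsck command line for one OSD without executing it.
# fsck_cmd and the default log path are hypothetical, chosen for illustration.
fsck_cmd() {
    local osd_id="$1"
    # Assumed log location; any writable path should do.
    local log_file="${2:-/var/log/ceph/bluestore-fsck-osd.${osd_id}.log}"
    printf 'ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-%s --log-file %s --log-level 20\n' \
        "$osd_id" "$log_file"
}

# Example: show the command for osd.10, the OSD named in the crash report.
fsck_cmd 10
```

The OSD has to be stopped while the tool runs, since it needs exclusive access to the store. On a systemd-based deployment that would typically be something like 'systemctl stop ceph-osd@10' before the fsck and 'systemctl start ceph-osd@10' afterwards.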



On 2/7/2019 4:45 PM, Eugen Block wrote:

Hi Igor,

thanks for the quick response!
Just to make sure I don't misunderstand, and because it's a production cluster: before anything else I should run fsck on that OSD? Depending on the result we'll decide how to continue, right? Is there anything else to be enabled for that command, or can I simply run 'ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-<ID>'?
Any other obstacles I should be aware of when running fsck?

Thanks!
Eugen


Quoting Igor Fedotov <[email protected]>:
Hi Eugen,

looks like this isn't [1] but rather

https://tracker.ceph.com/issues/38049

and
https://tracker.ceph.com/issues/36541 (= https://tracker.ceph.com/issues/36638 for luminous)
Hence it's not fixed in 12.2.10; the target release is 12.2.11.
Also please note that the patch prevents new occurrences of the issue. But there is some chance that inconsistencies caused by it earlier are still present in the DB, and the assertion might still happen (hopefully less frequently).
So could you please run fsck for the OSDs that were broken once and share the results?
Then we can decide if it makes sense to proceed with the repair.


Thanks,

Igor

On 2/7/2019 3:37 PM, Eugen Block wrote:
Hi list,
I found this thread [1] about crashing SSD OSDs. Although that was about an upgrade to 12.2.7, we just hit (probably) the same issue after our update to 12.2.10 two days ago in a production cluster. Just half an hour ago I saw one OSD (SSD) crashing (for the first time):
2019-02-07 13:02:07.682178 mon.host1 mon.0 <IP>:6789/0 109754 : cluster [INF] osd.10 failed (root=default,host=host1) (connection refused reported by osd.20)
2019-02-07 13:02:08.623828 mon.host1 mon.0 <IP>:6789/0 109771 : cluster [WRN] Health check failed: 1 osds down (OSD_DOWN)
One minute later, the OSD was back online.
This is the stack trace reported in syslog:

---cut here---
2019-02-07T13:01:51.181027+01:00 host1 ceph-osd[1136505]: *** Caught signal (Aborted) **
2019-02-07T13:01:51.181232+01:00 host1 ceph-osd[1136505]: in thread 7f75ce646700 thread_name:bstore_kv_final
2019-02-07T13:01:51.185873+01:00 host1 ceph-osd[1136505]: ceph version 12.2.10-544-gb10c702661 (b10c702661a31c8563b3421d6d664de93a0cb0e2) luminous (stable)
2019-02-07T13:01:51.186077+01:00 host1 ceph-osd[1136505]: 1: (()+0xa587d9) [0x560b921cc7d9]
2019-02-07T13:01:51.186226+01:00 host1 ceph-osd[1136505]: 2: (()+0x10b10) [0x7f75d8386b10]
2019-02-07T13:01:51.186368+01:00 host1 ceph-osd[1136505]: 3: (gsignal()+0x37) [0x7f75d73508d7]
2019-02-07T13:01:51.186773+01:00 host1 ceph-osd[1136505]: 4: (abort()+0x13a) [0x7f75d7351caa]
2019-02-07T13:01:51.186906+01:00 host1 ceph-osd[1136505]: 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x280) [0x560b922096d0]
2019-02-07T13:01:51.187027+01:00 host1 ceph-osd[1136505]: 6: (interval_set<unsigned long, btree::btree_map<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >::insert(unsigned long, unsigned long, unsigned long*, unsigned long*)+0xef2) [0x560b921bd432]
2019-02-07T13:01:51.187167+01:00 host1 ceph-osd[1136505]: 7: (StupidAllocator::_insert_free(unsigned long, unsigned long)+0x126) [0x560b921b4a06]
2019-02-07T13:01:51.187294+01:00 host1 ceph-osd[1136505]: 8: (StupidAllocator::release(unsigned long, unsigned long)+0x7d) [0x560b921b4f4d]
2019-02-07T13:01:51.187418+01:00 host1 ceph-osd[1136505]: 9: (BlueStore::_txc_release_alloc(BlueStore::TransContext*)+0x72) [0x560b9207fa22]
2019-02-07T13:01:51.187539+01:00 host1 ceph-osd[1136505]: 10: (BlueStore::_txc_finish(BlueStore::TransContext*)+0x5d7) [0x560b92092d77]
2019-02-07T13:01:51.187661+01:00 host1 ceph-osd[1136505]: 11: (BlueStore::_txc_state_proc(BlueStore::TransContext*)+0x1f6) [0x560b920a3fa6]
2019-02-07T13:01:51.187781+01:00 host1 ceph-osd[1136505]: 12: (BlueStore::_kv_finalize_thread()+0x620) [0x560b920a58e0]
2019-02-07T13:01:51.187898+01:00 host1 ceph-osd[1136505]: 13: (BlueStore::KVFinalizeThread::entry()+0xd) [0x560b920fb57d]
2019-02-07T13:01:51.188017+01:00 host1 ceph-osd[1136505]: 14: (()+0x8744) [0x7f75d837e744]
2019-02-07T13:01:51.188138+01:00 host1 ceph-osd[1136505]: 15: (clone()+0x6d) [0x7f75d7405aad]
2019-02-07T13:01:51.188271+01:00 host1 ceph-osd[1136505]: 2019-02-07 13:01:51.185833 7f75ce646700 -1 *** Caught signal (Aborted) **
---cut here---
Is there anything we can do about this? The issue in [1] doesn't seem to be resolved yet. Debug logging is not enabled, so I don't have more detailed information except the full stack trace from the OSD log. Any help is appreciated!
Regards,
Eugen
[1] http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-September/029616.html
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
