[ceph-users] SSD OSD crashing after upgrade to 12.2.10

Eugen Block Thu, 07 Feb 2019 04:38:42 -0800

Hi list,

I found this thread [1] about crashing SSD OSDs. Although that was about an upgrade to 12.2.7, we just hit (probably) the same issue after our update to 12.2.10 two days ago in a production cluster.

Just half an hour ago I saw one OSD (SSD) crashing (for the first time):

2019-02-07 13:02:07.682178 mon.host1 mon.0 <IP>:6789/0 109754 : cluster [INF] osd.10 failed (root=default,host=host1) (connection refused reported by osd.20)
2019-02-07 13:02:08.623828 mon.host1 mon.0 <IP>:6789/0 109771 : cluster [WRN] Health check failed: 1 osds down (OSD_DOWN)


One minute later, the OSD was back online.
This is the stack trace reported in syslog:

---cut here---

2019-02-07T13:01:51.181027+01:00 host1 ceph-osd[1136505]: *** Caught signal (Aborted) **
2019-02-07T13:01:51.181232+01:00 host1 ceph-osd[1136505]: in thread 7f75ce646700 thread_name:bstore_kv_final
2019-02-07T13:01:51.185873+01:00 host1 ceph-osd[1136505]: ceph version 12.2.10-544-gb10c702661 (b10c702661a31c8563b3421d6d664de93a0cb0e2) luminous (stable)
2019-02-07T13:01:51.186077+01:00 host1 ceph-osd[1136505]: 1:(()+0xa587d9) [0x560b921cc7d9]
2019-02-07T13:01:51.186226+01:00 host1 ceph-osd[1136505]: 2:(()+0x10b10) [0x7f75d8386b10]
2019-02-07T13:01:51.186368+01:00 host1 ceph-osd[1136505]: 3:(gsignal()+0x37) [0x7f75d73508d7]
2019-02-07T13:01:51.186773+01:00 host1 ceph-osd[1136505]: 4:(abort()+0x13a) [0x7f75d7351caa]
2019-02-07T13:01:51.186906+01:00 host1 ceph-osd[1136505]: 5:(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x280) [0x560b922096d0]
2019-02-07T13:01:51.187027+01:00 host1 ceph-osd[1136505]: 6:(interval_set<unsigned long, btree::btree_map<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >::insert(unsigned long, unsigned long, unsigned long*, unsigned long*)+0xef2) [0x560b921bd432]
2019-02-07T13:01:51.187167+01:00 host1 ceph-osd[1136505]: 7:(StupidAllocator::_insert_free(unsigned long, unsigned long)+0x126) [0x560b921b4a06]
2019-02-07T13:01:51.187294+01:00 host1 ceph-osd[1136505]: 8:(StupidAllocator::release(unsigned long, unsigned long)+0x7d) [0x560b921b4f4d]
2019-02-07T13:01:51.187418+01:00 host1 ceph-osd[1136505]: 9:(BlueStore::_txc_release_alloc(BlueStore::TransContext*)+0x72) [0x560b9207fa22]
2019-02-07T13:01:51.187539+01:00 host1 ceph-osd[1136505]: 10:(BlueStore::_txc_finish(BlueStore::TransContext*)+0x5d7) [0x560b92092d77]
2019-02-07T13:01:51.187661+01:00 host1 ceph-osd[1136505]: 11:(BlueStore::_txc_state_proc(BlueStore::TransContext*)+0x1f6) [0x560b920a3fa6]
2019-02-07T13:01:51.187781+01:00 host1 ceph-osd[1136505]: 12:(BlueStore::_kv_finalize_thread()+0x620) [0x560b920a58e0]
2019-02-07T13:01:51.187898+01:00 host1 ceph-osd[1136505]: 13:(BlueStore::KVFinalizeThread::entry()+0xd) [0x560b920fb57d]
2019-02-07T13:01:51.188017+01:00 host1 ceph-osd[1136505]: 14:(()+0x8744) [0x7f75d837e744]
2019-02-07T13:01:51.188138+01:00 host1 ceph-osd[1136505]: 15:(clone()+0x6d) [0x7f75d7405aad]
2019-02-07T13:01:51.188271+01:00 host1 ceph-osd[1136505]: 2019-02-07 13:01:51.185833 7f75ce646700 -1 *** Caught signal (Aborted) **

---cut here---

Is there anything we can do about this? The issue in [1] doesn't seem to be resolved yet. Debug logging is not enabled, so I don't have any more detailed information beyond the full stack trace from the OSD log. Any help is appreciated!


Regards,
Eugen

[1] http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-September/029616.html


