Hi list,

I found this thread [1] about crashing SSD OSDs. Although that was about an upgrade to 12.2.7, we just hit (probably) the same issue in a production cluster after our update to 12.2.10 two days ago.
Just half an hour ago I saw one OSD (SSD) crash for the first time:

2019-02-07 13:02:07.682178 mon.host1 mon.0 <IP>:6789/0 109754 : cluster [INF] osd.10 failed (root=default,host=host1) (connection refused reported by osd.20)
2019-02-07 13:02:08.623828 mon.host1 mon.0 <IP>:6789/0 109771 : cluster [WRN] Health check failed: 1 osds down (OSD_DOWN)

One minute later, the OSD was back online.
This is the stack trace reported in syslog:

---cut here---
2019-02-07T13:01:51.181027+01:00 host1 ceph-osd[1136505]: *** Caught signal (Aborted) **
2019-02-07T13:01:51.181232+01:00 host1 ceph-osd[1136505]: in thread 7f75ce646700 thread_name:bstore_kv_final
2019-02-07T13:01:51.185873+01:00 host1 ceph-osd[1136505]: ceph version 12.2.10-544-gb10c702661 (b10c702661a31c8563b3421d6d664de93a0cb0e2) luminous (stable)
2019-02-07T13:01:51.186077+01:00 host1 ceph-osd[1136505]: 1: (()+0xa587d9) [0x560b921cc7d9]
2019-02-07T13:01:51.186226+01:00 host1 ceph-osd[1136505]: 2: (()+0x10b10) [0x7f75d8386b10]
2019-02-07T13:01:51.186368+01:00 host1 ceph-osd[1136505]: 3: (gsignal()+0x37) [0x7f75d73508d7]
2019-02-07T13:01:51.186773+01:00 host1 ceph-osd[1136505]: 4: (abort()+0x13a) [0x7f75d7351caa]
2019-02-07T13:01:51.186906+01:00 host1 ceph-osd[1136505]: 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x280) [0x560b922096d0]
2019-02-07T13:01:51.187027+01:00 host1 ceph-osd[1136505]: 6: (interval_set<unsigned long, btree::btree_map<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >::insert(unsigned long, unsigned long, unsigned long*, unsigned long*)+0xef2) [0x560b921bd432]
2019-02-07T13:01:51.187167+01:00 host1 ceph-osd[1136505]: 7: (StupidAllocator::_insert_free(unsigned long, unsigned long)+0x126) [0x560b921b4a06]
2019-02-07T13:01:51.187294+01:00 host1 ceph-osd[1136505]: 8: (StupidAllocator::release(unsigned long, unsigned long)+0x7d) [0x560b921b4f4d]
2019-02-07T13:01:51.187418+01:00 host1 ceph-osd[1136505]: 9: (BlueStore::_txc_release_alloc(BlueStore::TransContext*)+0x72) [0x560b9207fa22]
2019-02-07T13:01:51.187539+01:00 host1 ceph-osd[1136505]: 10: (BlueStore::_txc_finish(BlueStore::TransContext*)+0x5d7) [0x560b92092d77]
2019-02-07T13:01:51.187661+01:00 host1 ceph-osd[1136505]: 11: (BlueStore::_txc_state_proc(BlueStore::TransContext*)+0x1f6) [0x560b920a3fa6]
2019-02-07T13:01:51.187781+01:00 host1 ceph-osd[1136505]: 12: (BlueStore::_kv_finalize_thread()+0x620) [0x560b920a58e0]
2019-02-07T13:01:51.187898+01:00 host1 ceph-osd[1136505]: 13: (BlueStore::KVFinalizeThread::entry()+0xd) [0x560b920fb57d]
2019-02-07T13:01:51.188017+01:00 host1 ceph-osd[1136505]: 14: (()+0x8744) [0x7f75d837e744]
2019-02-07T13:01:51.188138+01:00 host1 ceph-osd[1136505]: 15: (clone()+0x6d) [0x7f75d7405aad]
2019-02-07T13:01:51.188271+01:00 host1 ceph-osd[1136505]: 2019-02-07 13:01:51.185833 7f75ce646700 -1 *** Caught signal (Aborted) **
---cut here---

Is there anything we can do about this? The issue in [1] doesn't seem to be resolved yet. Debug logging is not enabled, so I don't have any more detailed information beyond the full stack trace from the OSD log. Any help is appreciated!

Regards,
Eugen

[1] http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-September/029616.html
