First, you should upgrade to 12.2.11 (or bring the mentioned patch
in by other means) to fix the rename procedure, which will prevent new
inconsistent objects from appearing in the DB. This should at least
reduce the OSD crash frequency.
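If needed, you can confirm which daemons are still on 12.2.10 with,
e.g. (osd.10 is just an illustrative id):

    ceph versions
    ceph tell osd.10 version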
Second, in theory previous crashes could have left persistent
inconsistent objects in your DB. I haven't seen that in real life before,
but they probably exist; we need to check. If so, OSD crashes might still
happen. So I'd like to have an fsck report to verify that. It doesn't
matter whether you do the fsck before or after the upgrade.
Once we have the fsck report we can proceed with the repair, which is a
somewhat risky procedure, so maybe I should first try to simulate the
inconsistency in question and check whether the built-in repair handles
it properly. We'll see; let's get the fsck report first.
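Just so you know what to expect (please don't run this yet), the repair
would eventually be invoked along these lines, with <ID> standing in for
the OSD id:

    # for illustration only - wait for the fsck results first
    ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-<ID>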
With regard to running ceph-bluestore-tool - you might want to specify a log
file and increase the log level to 20 using the --log-file and --log-level options.
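So the fsck invocation would look roughly like this (the log file
location is just a suggestion):

    ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-<ID> \
        --log-file /var/log/ceph/bluestore-fsck-osd-<ID>.log \
        --log-level 20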
On 2/7/2019 4:45 PM, Eugen Block wrote:
Thanks for the quick response!
Just to make sure I don't misunderstand, and because it's a production
cluster: before anything else I should run fsck on that OSD? Depending
on the result we'll decide how to continue, right?
Is there anything else that needs to be enabled for that command, or can I
simply run 'ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-<ID>'?
Any other obstacles I should be aware of when running fsck?
Quoting Igor Fedotov <ifedo...@suse.de>:
Looks like this isn't the issue from that thread but rather
https://tracker.ceph.com/issues/36638 (for luminous).
Hence it's not fixed in 12.2.10; the target release is 12.2.11.
Also please note that the patch only prevents new occurrences of the
issue. There is some chance that inconsistencies it caused earlier are
still present in the DB, and the assertion might still happen
(hopefully less frequently).
So could you please run fsck for the OSDs that have been broken at least
once and share the results? Then we can decide whether it makes sense to
proceed with the repair.
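Note that ceph-bluestore-tool needs exclusive access to the store, so the
OSD daemon has to be stopped first. A minimal sketch, assuming
systemd-managed OSDs and taking osd.10 from your log:

    ceph osd set noout                # optional: avoid rebalancing while the OSD is down
    systemctl stop ceph-osd@10
    ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-10
    systemctl start ceph-osd@10
    ceph osd unset noout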
On 2/7/2019 3:37 PM, Eugen Block wrote:
I found this thread about crashing SSD OSDs; although that was about an
upgrade to 12.2.7, we (probably) hit the same issue after our update to
12.2.10 two days ago in a production cluster.
Just half an hour ago I saw one OSD (SSD) crash for the first time:
2019-02-07 13:02:07.682178 mon.host1 mon.0 <IP>:6789/0 109754 :
cluster [INF] osd.10 failed (root=default,host=host1) (connection
refused reported by osd.20)
2019-02-07 13:02:08.623828 mon.host1 mon.0 <IP>:6789/0 109771 :
cluster [WRN] Health check failed: 1 osds down (OSD_DOWN)
One minute later, the OSD was back online.
This is the stack trace reported in syslog:
2019-02-07T13:01:51.181027+01:00 host1 ceph-osd: *** Caught
signal (Aborted) **
2019-02-07T13:01:51.181232+01:00 host1 ceph-osd: in thread
2019-02-07T13:01:51.185873+01:00 host1 ceph-osd: ceph
(b10c702661a31c8563b3421d6d664de93a0cb0e2) luminous (stable)
2019-02-07T13:01:51.186077+01:00 host1 ceph-osd: 1:
2019-02-07T13:01:51.186226+01:00 host1 ceph-osd: 2:
2019-02-07T13:01:51.186368+01:00 host1 ceph-osd: 3:
2019-02-07T13:01:51.186773+01:00 host1 ceph-osd: 4:
2019-02-07T13:01:51.186906+01:00 host1 ceph-osd: 5:
(ceph::__ceph_assert_fail(char const*, char const*, int, char
2019-02-07T13:01:51.187027+01:00 host1 ceph-osd: 6:
(interval_set<unsigned long, btree::btree_map<unsigned long,
unsigned long, std::less<unsigned long>,
long const, unsigned long> >, 256> >::insert(unsigned long, unsigned
long, unsigned long*, unsigned long*)+0xef2) [0x560b921bd432]
2019-02-07T13:01:51.187167+01:00 host1 ceph-osd: 7:
(StupidAllocator::_insert_free(unsigned long, unsigned long)+0x126)
2019-02-07T13:01:51.187294+01:00 host1 ceph-osd: 8:
(StupidAllocator::release(unsigned long, unsigned long)+0x7d)
2019-02-07T13:01:51.187418+01:00 host1 ceph-osd: 9:
2019-02-07T13:01:51.187539+01:00 host1 ceph-osd: 10:
2019-02-07T13:01:51.187661+01:00 host1 ceph-osd: 11:
2019-02-07T13:01:51.187781+01:00 host1 ceph-osd: 12:
2019-02-07T13:01:51.187898+01:00 host1 ceph-osd: 13:
2019-02-07T13:01:51.188017+01:00 host1 ceph-osd: 14:
2019-02-07T13:01:51.188138+01:00 host1 ceph-osd: 15:
2019-02-07T13:01:51.188271+01:00 host1 ceph-osd: 2019-02-07
13:01:51.185833 7f75ce646700 -1 *** Caught signal (Aborted) **
Is there anything we can do about this? The issue in that thread doesn't
seem to be resolved yet. Debug logging is not enabled, so I don't have
more detailed information beyond the full stack trace from the OSD log.
Any help is appreciated!