Hello Theo,

Another method is to set a dummy CRUSH device class, e.g. "import", on
the temporary OSD so that no CRUSH rule matches it. Obviously, this
only works if all of your CRUSH rules specify a device class and you
set a non-default CRUSH rule for the .mgr pool.
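For reference, a minimal sketch of that approach; "import" is an arbitrary class name, and osd.10 and the rule/pool names are placeholders:

```shell
# Put the temporary OSD into a dummy device class so that class-scoped
# CRUSH rules never map data to it ("import" and osd.10 are placeholders).
ceph osd crush rm-device-class osd.10          # drop the auto-assigned hdd/ssd class
ceph osd crush set-device-class import osd.10

# This only helps if every pool's rule is scoped to a real class, e.g.:
ceph osd crush rule create-replicated replicated_hdd default host hdd
ceph osd pool set .mgr crush_rule replicated_hdd   # .mgr must not use the default rule
```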

On Tue, Mar 3, 2026 at 3:30 PM Eugen Block via ceph-users
<[email protected]> wrote:
>
> Hi,
>
> I didn't read all the details of this thread, but if you want to
> prevent freshly created OSDs from receiving recovery traffic, you
> might want to set:
>
> ceph config set osd osd_crush_initial_weight 0
>
> This allows you to create and start OSDs with a CRUSH weight of 0, so
> there won't be any traffic to the OSD at all until you raise its CRUSH
> weight. But the OSD will be up and in, which should allow you to
> import PGs anyway.
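For reference, a sketch of that zero-weight workflow; the OSD id and final weight are placeholders:

```shell
# New OSDs join the CRUSH map with weight 0, so no PGs map to them.
ceph config set osd osd_crush_initial_weight 0

# ...create and start the OSD, import PGs with ceph-objectstore-tool...

# Once done, give the OSD its real weight (roughly its size in TiB).
ceph osd crush reweight osd.10 1.82
```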
>
>
> Quoting Theo Cabrerizo Diem via ceph-users <[email protected]>:
>
> > Hello Igor, Hello all,
> >
> > First, I've already accepted that my data is most likely unrecoverable
> > by now due to my own fault. I'm using this as a chance to learn and
> > hopefully document the information gained, as I couldn't find much
> > recent information about the recovery process, and to improve my
> > understanding of Ceph, if someone is willing to chime in further.
> >
> > The current situation is that I've had multiple OSD failures and some of
> > the ceph-osd processes would refuse to start (corruption on rocksdb). I've
> > decided to follow
> > https://www.croit.io/blog/how-to-recover-inactive-pgs-using-ceph-objectstore-tool-on-ceph-clusters
> > as a suggested mechanism to export the PGs and re-import on a fresh OSD.
> >
> > I made an attempt (sort of documented in this thread) to do it, with
> > "unexpected" results. I have written down a lot of information, states,
> > ceph pg queries, etc. (so I can provide their outputs if relevant). I
> > have not marked any of the OSDs as lost at any time, and the monitors
> > have been running without issue since the beginning.
> >
> > I have some questions regarding my observations following the information
> > on that blog (sorry for my lack of experience):
> >
> > - Is the process for setting up a new "temporary" OSD to import PGs
> > correct? (short story: "ceph-volume lvm prepare", start the OSD, and as
> > soon as possible run "ceph osd crush reweight osd.XX 0")
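For reference, the short story above expanded into a sketch, assuming the OSD daemon must be stopped while ceph-objectstore-tool runs; the device path, OSD id and shard file are placeholders:

```shell
# Create the temporary OSD and keep CRUSH from mapping data to it.
ceph-volume lvm prepare --data /dev/sdX
ceph-volume lvm activate --all
ceph osd crush reweight osd.10 0

# Import a previously exported shard (the OSD daemon must be stopped).
systemctl stop ceph-osd@10
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-10 \
    --op import --file pg.11.17s1.dat
systemctl start ceph-osd@10
```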
> >
> > - After creating the OSD as described above, running "ceph-objectstore-tool
> > --op list-pgs" on it showed lots of PGs (which I assume were
> > "pre-allocated" by CRUSH), but "ceph osd df" confirmed almost no data
> > was on the OSD (very little, less than 2 GB). Is there a way to have an
> > OSD "flushed out" so I can import further PGs?
> >
> > - Running "ceph pg XX.XX query" against some of the imported PGs after
> > starting ceph-osd again doesn't seem to reliably reflect my progress. Is
> > there a different way? Is it because the PG is still in the down state
> > because of the dead OSDs?
> > For example, pg 11.17 had only shard 0 available because only one OSD
> > was up. I've imported shard 1 to osd.10, but "ceph pg 11.17 query"
> > shows under "recovery_state":
> >
> >   "intervals": [
> >     {
> >       "first": "2882",
> >       "last": "2883",
> >       "acting": "1(1),3(0)"
> >     },
> >     {
> >       "first": "3021",
> >       "last": "3023",
> >       "acting": "3(0),8(1)"
> >     },
> >     {
> >       "first": "3024",
> >       "last": "3026",
> >       "acting": "3(0),8(1),10(2)"
> >     }
> >   ]
> >
> > Running ceph-objectstore-tool --op list-pgs on osd.10 (stopped) confirms
> > that 11.17s1 is listed, while running it on osd.8 (stopped) doesn't show
> > 11.17 at all (none of its shards).
> >
> > Should I instead keep track of my progress using "ceph-objectstore-tool
> > --op list", looking for the expected "oid" entries?
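For reference, one way to check this with the OSD stopped; the data path and pgid are placeholders taken from the example above:

```shell
# List the objects recorded for one shard on this OSD; if the import
# landed, the expected oids show up here.
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-10 \
    --pgid 11.17s1 --op list | head
```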
> >
> > This might reflect my lack of knowledge about how a Ceph OSD works
> > internally, so feel free to correct me or suggest a better approach. I
> > still have the 3 original OSDs (out of 4; one, as mentioned in the
> > thread, has worse corruption and ceph-objectstore-tool fails on it) and
> > 8x 2 TB disks that I can add as new OSDs to import the old data into (I
> > had less than 6 TB used before the crash).
> >
> > Should I continue exporting all PGs and keep importing them this way?
> >
> > Thanks
> >
> >
> > On Sat, 28 Feb 2026 at 16:50, Theo Cabrerizo Diem via ceph-users <
> > [email protected]> wrote:
> >
> >> Hello all,
> >>
> >> I've managed to get a bunch of 2 TB disks for setting up a few OSDs, but
> >> before I even started adding them to my monitors, I decided to check my
> >> cluster state and noticed another OSD had died. Trying to start it
> >> revealed a rocksdb corruption:
> >>
> >> # /usr/bin/ceph-osd -f --id "1" --osd-data "/var/lib/ceph/osd" --cluster
> >> "ceph" --setuser "ceph" --setgroup "ceph"
> >> 2026-02-28T15:34:16.596+0000 7f793ce718c0 -1 Falling back to public
> >> interface
> >>
> >> 2026-02-28T15:35:04.304+0000 7f792c1c5640 -1 rocksdb: submit_common error:
> >> Corruption: block checksum mismatch: stored = 0, computed = 1265684702,
> >> type = 4  in db/180433.sst offset 1048892 size 1429 code = ☻ Rocksdb
> >> transaction:
> >> PutCF( prefix = O key =
> >> 0x7F8000000000000006D0000000'!!='0xFFFFFFFFFFFFFFFEFFFFFFFFFFFFFFFF6F value
> >> size = 33)
> >> PutCF( prefix = S key = 'nid_max' value size = 8)
> >> PutCF( prefix = S key = 'blobid_max' value size = 8)
> >> /ceph/rpmbuild/BUILD/ceph-20.2.0/src/os/bluestore/BlueStore.cc: In function
> >> 'void BlueStore::_txc_apply_kv(TransContext*, bool)' thread 7f792c1c5640
> >> time 2026-02-28T15:35:04.305926+0000
> >> /ceph/rpmbuild/BUILD/ceph-20.2.0/src/os/bluestore/BlueStore.cc: 14539:
> >> FAILED ceph_assert(r == 0)
> >>  ceph version 20.2.0 (69f84cc2651aa259a15bc192ddaabd3baba07489) tentacle
> >> (stable - RelWithDebInfo)
> >>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> >> const*)+0x11f) [0x557b7cbf6236]
> >>  2: /usr/bin/ceph-osd(+0x44a3ef) [0x557b7cbd03ef]
> >>  3: (BlueStore::_kv_sync_thread()+0xaf1) [0x557b7d276191]
> >>  4: /usr/bin/ceph-osd(+0xa790d1) [0x557b7d1ff0d1]
> >>  5: /lib64/libc.so.6(+0x8b2fa) [0x7f793d3382fa]
> >>  6: /lib64/libc.so.6(+0x110400) [0x7f793d3bd400]
> >> 2026-02-28T15:35:04.309+0000 7f792c1c5640 -1
> >> /ceph/rpmbuild/BUILD/ceph-20.2.0/src/os/bluestore/BlueStore.cc: In function
> >> 'void BlueStore::_txc_apply_kv(TransContext*, bool)' thread 7f792c1c5640
> >> time 2026-02-28T15:35:04.305926+0000
> >> /ceph/rpmbuild/BUILD/ceph-20.2.0/src/os/bluestore/BlueStore.cc: 14539:
> >> FAILED ceph_assert(r == 0)
> >>
> >>  ceph version 20.2.0 (69f84cc2651aa259a15bc192ddaabd3baba07489) tentacle
> >> (stable - RelWithDebInfo)
> >>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> >> const*)+0x11f) [0x557b7cbf6236]
> >>  2: /usr/bin/ceph-osd(+0x44a3ef) [0x557b7cbd03ef]
> >>  3: (BlueStore::_kv_sync_thread()+0xaf1) [0x557b7d276191]
> >>  4: /usr/bin/ceph-osd(+0xa790d1) [0x557b7d1ff0d1]
> >>  5: /lib64/libc.so.6(+0x8b2fa) [0x7f793d3382fa]
> >>  6: /lib64/libc.so.6(+0x110400) [0x7f793d3bd400]
> >>
> >> *** Caught signal (Aborted) **
> >>  in thread 7f792c1c5640 thread_name:bstore_kv_sync
> >>  ceph version 20.2.0 (69f84cc2651aa259a15bc192ddaabd3baba07489) tentacle
> >> (stable - RelWithDebInfo)
> >>  1: /lib64/libc.so.6(+0x3fc30) [0x7f793d2ecc30]
> >>  2: /lib64/libc.so.6(+0x8d03c) [0x7f793d33a03c]
> >>  3: raise()
> >>  4: abort()
> >>  5: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> >> const*)+0x17a) [0x557b7cbf6291]
> >>  6: /usr/bin/ceph-osd(+0x44a3ef) [0x557b7cbd03ef]
> >>  7: (BlueStore::_kv_sync_thread()+0xaf1) [0x557b7d276191]
> >>  8: /usr/bin/ceph-osd(+0xa790d1) [0x557b7d1ff0d1]
> >>  9: /lib64/libc.so.6(+0x8b2fa) [0x7f793d3382fa]
> >>  10: /lib64/libc.so.6(+0x110400) [0x7f793d3bd400]
> >> 2026-02-28T15:35:04.320+0000 7f792c1c5640 -1 *** Caught signal (Aborted) **
> >>  in thread 7f792c1c5640 thread_name:bstore_kv_sync
> >>
> >>  ceph version 20.2.0 (69f84cc2651aa259a15bc192ddaabd3baba07489) tentacle
> >> (stable - RelWithDebInfo)
> >>  1: /lib64/libc.so.6(+0x3fc30) [0x7f793d2ecc30]
> >>  2: /lib64/libc.so.6(+0x8d03c) [0x7f793d33a03c]
> >>  3: raise()
> >>  4: abort()
> >>  5: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> >> const*)+0x17a) [0x557b7cbf6291]
> >>  6: /usr/bin/ceph-osd(+0x44a3ef) [0x557b7cbd03ef]
> >>  7: (BlueStore::_kv_sync_thread()+0xaf1) [0x557b7d276191]
> >>  8: /usr/bin/ceph-osd(+0xa790d1) [0x557b7d1ff0d1]
> >>  9: /lib64/libc.so.6(+0x8b2fa) [0x7f793d3382fa]
> >>  10: /lib64/libc.so.6(+0x110400) [0x7f793d3bd400]
> >>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
> >> to interpret this.
> >>
> >>  -2421> 2026-02-28T15:34:16.596+0000 7f793ce718c0 -1 Falling back to public
> >> interface
> >>     -8> 2026-02-28T15:35:04.304+0000 7f792c1c5640 -1 rocksdb: submit_common
> >> error: Corruption: block checksum mismatch: stored = 0, computed =
> >> 1265684702, type = 4  in db/180433.sst offset 1048892 size 1429 code = ☻
> >> Rocksdb transaction:
> >> PutCF( prefix = O key =
> >> 0x7F8000000000000006D0000000'!!='0xFFFFFFFFFFFFFFFEFFFFFFFFFFFFFFFF6F value
> >> size = 33)
> >> PutCF( prefix = S key = 'nid_max' value size = 8)
> >> PutCF( prefix = S key = 'blobid_max' value size = 8)
> >>     -7> 2026-02-28T15:35:04.309+0000 7f792c1c5640 -1
> >> /ceph/rpmbuild/BUILD/ceph-20.2.0/src/os/bluestore/BlueStore.cc: In function
> >> 'void BlueStore::_txc_apply_kv(TransContext*, bool)' thread 7f792c1c5640
> >> time 2026-02-28T15:35:04.305926+0000
> >> /ceph/rpmbuild/BUILD/ceph-20.2.0/src/os/bluestore/BlueStore.cc: 14539:
> >> FAILED ceph_assert(r == 0)
> >>
> >>  ceph version 20.2.0 (69f84cc2651aa259a15bc192ddaabd3baba07489) tentacle
> >> (stable - RelWithDebInfo)
> >>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> >> const*)+0x11f) [0x557b7cbf6236]
> >>  2: /usr/bin/ceph-osd(+0x44a3ef) [0x557b7cbd03ef]
> >>  3: (BlueStore::_kv_sync_thread()+0xaf1) [0x557b7d276191]
> >>  4: /usr/bin/ceph-osd(+0xa790d1) [0x557b7d1ff0d1]
> >>  5: /lib64/libc.so.6(+0x8b2fa) [0x7f793d3382fa]
> >>  6: /lib64/libc.so.6(+0x110400) [0x7f793d3bd400]
> >>
> >>      0> 2026-02-28T15:35:04.320+0000 7f792c1c5640 -1 *** Caught signal
> >> (Aborted) **
> >>  in thread 7f792c1c5640 thread_name:bstore_kv_sync
> >>
> >>  ceph version 20.2.0 (69f84cc2651aa259a15bc192ddaabd3baba07489) tentacle
> >> (stable - RelWithDebInfo)
> >>  1: /lib64/libc.so.6(+0x3fc30) [0x7f793d2ecc30]
> >>  2: /lib64/libc.so.6(+0x8d03c) [0x7f793d33a03c]
> >>  3: raise()
> >>  4: abort()
> >>  5: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> >> const*)+0x17a) [0x557b7cbf6291]
> >>  6: /usr/bin/ceph-osd(+0x44a3ef) [0x557b7cbd03ef]
> >>  7: (BlueStore::_kv_sync_thread()+0xaf1) [0x557b7d276191]
> >>  8: /usr/bin/ceph-osd(+0xa790d1) [0x557b7d1ff0d1]
> >>  9: /lib64/libc.so.6(+0x8b2fa) [0x7f793d3382fa]
> >>  10: /lib64/libc.so.6(+0x110400) [0x7f793d3bd400]
> >>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
> >> to interpret this.
> >>
> >>  -2445> 2026-02-28T15:34:16.596+0000 7f793ce718c0 -1 Falling back to public
> >> interface
> >>    -32> 2026-02-28T15:35:04.304+0000 7f792c1c5640 -1 rocksdb: submit_common
> >> error: Corruption: block checksum mismatch: stored = 0, computed =
> >> 1265684702, type = 4  in db/180433.sst offset 1048892 size 1429 code = ☻
> >> Rocksdb transaction:
> >> PutCF( prefix = O key =
> >> 0x7F8000000000000006D0000000'!!='0xFFFFFFFFFFFFFFFEFFFFFFFFFFFFFFFF6F value
> >> size = 33)
> >> PutCF( prefix = S key = 'nid_max' value size = 8)
> >> PutCF( prefix = S key = 'blobid_max' value size = 8)
> >>    -31> 2026-02-28T15:35:04.309+0000 7f792c1c5640 -1
> >> /ceph/rpmbuild/BUILD/ceph-20.2.0/src/os/bluestore/BlueStore.cc: In function
> >> 'void BlueStore::_txc_apply_kv(TransContext*, bool)' thread 7f792c1c5640
> >> time 2026-02-28T15:35:04.305926+0000
> >> /ceph/rpmbuild/BUILD/ceph-20.2.0/src/os/bluestore/BlueStore.cc: 14539:
> >> FAILED ceph_assert(r == 0)
> >>
> >>  ceph version 20.2.0 (69f84cc2651aa259a15bc192ddaabd3baba07489) tentacle
> >> (stable - RelWithDebInfo)
> >>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> >> const*)+0x11f) [0x557b7cbf6236]
> >>  2: /usr/bin/ceph-osd(+0x44a3ef) [0x557b7cbd03ef]
> >>  3: (BlueStore::_kv_sync_thread()+0xaf1) [0x557b7d276191]
> >>  4: /usr/bin/ceph-osd(+0xa790d1) [0x557b7d1ff0d1]
> >>  5: /lib64/libc.so.6(+0x8b2fa) [0x7f793d3382fa]
> >>  6: /lib64/libc.so.6(+0x110400) [0x7f793d3bd400]
> >>
> >>    -24> 2026-02-28T15:35:04.320+0000 7f792c1c5640 -1 *** Caught signal
> >> (Aborted) **
> >>  in thread 7f792c1c5640 thread_name:bstore_kv_sync
> >>
> >>  ceph version 20.2.0 (69f84cc2651aa259a15bc192ddaabd3baba07489) tentacle
> >> (stable - RelWithDebInfo)
> >>  1: /lib64/libc.so.6(+0x3fc30) [0x7f793d2ecc30]
> >>  2: /lib64/libc.so.6(+0x8d03c) [0x7f793d33a03c]
> >>  3: raise()
> >>  4: abort()
> >>  5: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> >> const*)+0x17a) [0x557b7cbf6291]
> >>  6: /usr/bin/ceph-osd(+0x44a3ef) [0x557b7cbd03ef]
> >>  7: (BlueStore::_kv_sync_thread()+0xaf1) [0x557b7d276191]
> >>  8: /usr/bin/ceph-osd(+0xa790d1) [0x557b7d1ff0d1]
> >>  9: /lib64/libc.so.6(+0x8b2fa) [0x7f793d3382fa]
> >>  10: /lib64/libc.so.6(+0x110400) [0x7f793d3bd400]
> >>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
> >> to interpret this.
> >>
> >> Aborted (core dumped)
> >>
> >> There was no sign of hardware failure in the kernel logs. I think, for
> >> now, I should move away from the Alpine binaries and use Ceph's official
> >> containers. Is there any tool that can try to fix this rocksdb issue? Or
> >> is the recommended way to export all PGs from this OSD and re-import
> >> them on a new one?
> >>
> >> I agree that at this point I should consider rechecking all the hardware
> >> involved. I plan to decommission this system once I get the data out (if
> >> possible).
> >>
> >> Thanks,
> >> Theo
> >>
> >> On Tue, 24 Feb 2026 at 22:22, Theo Cabrerizo Diem <[email protected]>
> >> wrote:
> >>
> >> > Hello Igor,
> >> >
> >> > > Just in case - didn't you overwrite existing PG replicas at target
> >> > > OSDs when importing PGs back to OSDs 1&3?
> >> >
> >> > Now that you mention that an OSD cannot have two shards of the same
> >> > PG, I think I have put myself in a tricky place.
> >> > I had 4 OSDs and two "died", leaving two online, so when I exported
> >> > shards from osd.2 I ended up importing them into osd.3 (which had
> >> > other shards for the same PGs).
> >> >
> >> > Doesn't look "that bad" ... (*looking up to the sky*) ...
> >> > ## on osd.3
> >> > # ceph-objectstore-tool --data-path /var/lib/ceph/osd/ --op list-pgs |
> >> > grep 16.b
> >> > 16.bs1
> >> > 16.bs0
> >> > (and the s2 should be in osd.1, s1 was imported)
> >> >
> >> > ## ceph health reports
> >> >   pg 16.b is down, acting [3,NONE,1]
> >> >
> >> > I might have some spare HDDs to set up a new OSD (only one, though)
> >> > with enough space for it to balance out, but as I've mentioned, there
> >> > are quite a bunch of PGs in the down state. Should I set up a new OSD
> >> > and wait until it "settles"? The exported shards from the dead osd.2
> >> > that ceph health was complaining about sum to about 400 GB.
> >> >
> >> > Is it possible to set up an OSD and have Ceph not push too many
> >> > shards onto it? (The replicated pools are "small".) I ask because it
> >> > should be easy to get a 1 TB/2 TB disk, but if it has to rebalance
> >> > completely for the PGs to get unstuck, I need a bigger disk, which I
> >> > don't have at the moment (and current market prices for them are not
> >> > the best).
> >> >
> >> > Another option: if osd.0 is gone for good (the one that doesn't even
> >> > allow me to run 'ceph-objectstore-tool --op list-pgs'), I can wipe its
> >> > disk/clone and use that instead; it has the same capacity as the other
> >> > OSDs (I used a spare I had at home). If you believe it is still
> >> > workable, I can try to find some more disks and not touch those for
> >> > the moment.
> >> >
> >> > Thanks for the patience so far.
> >> >
> >> > On Tue, 24 Feb 2026 at 16:39, Igor Fedotov <[email protected]>
> >> wrote:
> >> >
> >> >> Hi Theo,
> >> >>
> >> >> Sorry, I can't tell for sure what marking an OSD lost would do to its
> >> >> encryption keys. Likely, yes, they'll be lost.
> >> >>
> >> >> But instead of going this way, I'd rather suggest that you add another
> >> >> two OSDs and let Ceph recover more PG replicas onto them.
> >> >>
> >> >> Just in case - didn't you overwrite existing PG replicas at target
> >> >> OSDs when importing PGs back to OSDs 1&3? The same PG can't have two
> >> >> replicas/shards on a single OSD, and your OSD count is pretty limited.
> >> >> Just curious for now - that still shouldn't be an issue given you have
> >> >> at least 2 replicas/shards for all the pools anyway.
> >> >>
> >> >> Thanks,
> >> >>
> >> >> Igor
> >> >>
> >> >> On 2/21/2026 11:00 PM, Theo Cabrerizo Diem via ceph-users wrote:
> >> >> > Hello Igor,
> >> >> >
> >> >> > First of all, sorry about the late reply. It took me a while to
> >> >> > export all the shards that weren't available from osd.2 (1 and 3
> >> >> > were fine; 2 didn't start, but I could still use
> >> >> > `ceph-objectstore-tool ... --op list-pgs`, while on osd.0 I
> >> >> > couldn't even list the PGs, it threw an error right away - more
> >> >> > about that later in the email).
> >> >> >
> >> >> > For two of the unavailable shards, ceph-objectstore-tool core
> >> >> > dumped during export with the same rocksdb issue, but I should have
> >> >> > enough chunks not to need them - just mentioning it in case it's
> >> >> > useful:
> >> >> >
> >> >> > sh-5.1# ceph-objectstore-tool --data-path /var/lib/ceph/osd --pgid
> >> >> 11.19s2
> >> >> > --op export --file pg.11.19s2.dat
> >> >> > /ceph/rpmbuild/BUILD/ceph-20.2.0/src/kv/RocksDBStore.cc: In function
> >> >> > 'virtual int RocksDBStore::get(const std::string&, const std::string&,
> >> >> > ceph::bufferlist*)' thread 7ff3be4ca800 time
> >> >> 2026-02-04T09:42:00.743877+0000
> >> >> > /ceph/rpmbuild/BUILD/ceph-20.2.0/src/kv/RocksDBStore.cc: 1961:
> >> >> > ceph_abort_msg("block checksum mismatch: stored = 246217859, computed
> >> =
> >> >> > 2155741315, type = 4  in db/170027.sst offset 28264757 size 1417")
> >> >> >   ceph version 20.2.0 (69f84cc2651aa259a15bc192ddaabd3baba07489)
> >> >> tentacle
> >> >> > (stable - RelWithDebInfo)
> >> >> >   1: (ceph::__ceph_abort(char const*, int, char const*,
> >> >> > std::__cxx11::basic_string<char, std::char_traits<char>,
> >> >> > std::allocator<char> > const&)+0xc9) [0x7ff3bf5391fd]
> >> >> >   2: (RocksDBStore::get(std::__cxx11::basic_string<char,
> >> >> > std::char_traits<char>, std::allocator<char> > const&,
> >> >> > std::__cxx11::basic_string<char, std::char_traits<char>,
> >> >> > std::allocator<char> > const&, ceph::buffer::v15_2_0::list*)+0x3bc)
> >> >> > [0x555667b340bc]
> >> >> >   3:
> >> >> >
> >> >>
> >> (BlueStore::omap_get_values(boost::intrusive_ptr<ObjectStore::CollectionImpl>&,
> >> >> > ghobject_t const&,
> >> >> > std::set<std::__cxx11::basic_string<char,std::char_traits<char>,
> >> >> > std::allocator<char> >, std::less<std::__cxx11::basic_string<char,
> >> >> > std::char_traits<char>, std::allocator<char> > >,
> >> >> > std::allocator<std::__cxx11::basic_string<char,
> >> std::char_traits<char>,
> >> >> > std::allocator<char> > > > const&,
> >> >> > std::map<std::__cxx11::basic_string<char,std::char_traits<char>,
> >> >> > std::allocator<char> >, ceph::buffer::v15_2_0::list,
> >> >> > std::less<std::__cxx11::basic_string<char, std::char_traits<char>,
> >> >> > std::allocator<char> > >,
> >> >> > std::allocator<std::pair<std::__cxx11::basic_string<char,
> >> >> > std::char_traits<char>, std::allocator<char> > const,
> >> >> > ceph::buffer::v15_2_0::list> > >*)+0x401) [0x555667a25fe1]
> >> >> >   4: (PG::peek_map_epoch(ObjectStore*, spg_t, unsigned int*)+0x361)
> >> >> > [0x5556675e0101]
> >> >> >   5: main()
> >> >> >   6: /lib64/libc.so.6(+0x2a610) [0x7ff3be930610]
> >> >> >   7: __libc_start_main()
> >> >> >   8: _start()
> >> >> > *** Caught signal (Aborted) **
> >> >> >   in thread 7ff3be4ca800 thread_name:ceph-objectstor
> >> >> >   ceph version 20.2.0 (69f84cc2651aa259a15bc192ddaabd3baba07489)
> >> >> tentacle
> >> >> > (stable - RelWithDebInfo)
> >> >> >   1: /lib64/libc.so.6(+0x3fc30) [0x7ff3be945c30]
> >> >> >   2: /lib64/libc.so.6(+0x8d03c) [0x7ff3be99303c]
> >> >> >   3: raise()
> >> >> >   4: abort()
> >> >> >   5: (ceph::__ceph_abort(char const*, int, char const*,
> >> >> > std::__cxx11::basic_string<char, std::char_traits<char>,
> >> >> > std::allocator<char> > const&)+0x186) [0x7ff3bf5392ba]
> >> >> >   6: (RocksDBStore::get(std::__cxx11::basic_string<char,
> >> >> > std::char_traits<char>, std::allocator<char> > const&,
> >> >> > std::__cxx11::basic_string<char, std::char_traits<char>,
> >> >> > std::allocator<char> > const&, ceph::buffer::v15_2_0::list*)+0x3bc)
> >> >> > [0x555667b340bc]
> >> >> >   7:
> >> >> >
> >> >>
> >> (BlueStore::omap_get_values(boost::intrusive_ptr<ObjectStore::CollectionImpl>&,
> >> >> > ghobject_t const&,
> >> >> > std::set<std::__cxx11::basic_string<char,std::char_traits<char>,
> >> >> > std::allocator<char> >, std::less<std::__cxx11::basic_string<char,
> >> >> > std::char_traits<char>, std::allocator<char> > >,
> >> >> > std::allocator<std::__cxx11::basic_string<char,
> >> std::char_traits<char>,
> >> >> > std::allocator<char> > > > const&,
> >> >> > std::map<std::__cxx11::basic_string<char,std::char_traits<char>,
> >> >> > std::allocator<char> >, ceph::buffer::v15_2_0::list,
> >> >> > std::less<std::__cxx11::basic_string<char, std::char_traits<char>,
> >> >> > std::allocator<char> > >,
> >> >> > std::allocator<std::pair<std::__cxx11::basic_string<char,
> >> >> > std::char_traits<char>, std::allocator<char> > const,
> >> >> > ceph::buffer::v15_2_0::list> > >*)+0x401) [0x555667a25fe1]
> >> >> >   8: (PG::peek_map_epoch(ObjectStore*, spg_t, unsigned int*)+0x361)
> >> >> > [0x5556675e0101]
> >> >> >   9: main()
> >> >> >   10: /lib64/libc.so.6(+0x2a610) [0x7ff3be930610]
> >> >> >   11: __libc_start_main()
> >> >> >   12: _start()
> >> >> > Aborted (core dumped)
> >> >> >
> >> >> >
> >> >> >
> >> >> > After importing all the previously unavailable shards that I could
> >> >> > recover, I don't have any "unknown" PGs anymore. I still have lots
> >> >> > of PGs in the "down" state, which I assume means I need to flag
> >> >> > both "dead" OSDs as lost to get them unstuck. Since that is an
> >> >> > operation I cannot undo, I would like to confirm that it is indeed
> >> >> > the correct next step to take.
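For the record, the irreversible step under discussion would look like this; the OSD id is a placeholder, and it should only ever be run once no more data can be pulled from that OSD:

```shell
# Mark a dead OSD as permanently lost so peering can give up on it.
# THIS CANNOT BE UNDONE - only run it once the OSD's data is written off.
ceph osd lost 0 --yes-i-really-mean-it
```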
> >> >> >
> >> >> > I have a few questions to understand "what happens" in the next step
> >> >> > (marking osd as lost?):
> >> >> >
> >> >> > Shall I assume that once I flag an OSD as lost, I won't be able to
> >> >> > "activate" it again, since I use encryption when initializing the
> >> >> > bluestore OSDs? Or does flagging them as lost not destroy their
> >> >> > unlocking keys? (The former would mean any hope of further
> >> >> > extracting data is gone, mostly for osd.0, on which I couldn't use
> >> >> > ceph-objectstore-tool at all since the power loss.)
> >> >> >
> >> >> > I think I should have all the shards from the PGs, but just in
> >> >> > case, I've managed to make a clone of osd.0 onto a different
> >> >> > physical disk (the other reason I took so long to answer). But
> >> >> > ceph-objectstore-tool still refuses to run:
> >> >> >
> >> >> > # ceph-objectstore-tool --data-path /var/lib/ceph/osd --op list-pgs
> >> >> > Mount failed with '(5) Input/output error'
> >> >> >
> >> >> > # ls -l /var/lib/ceph/osd
> >> >> > total 28
> >> >> > lrwxrwxrwx 1 ceph ceph  50 Feb  4 08:26 block ->
> >> >> > /dev/mapper/zNPZJR-i0TZ-6NtK-URto-tjfs-iJRb-GCAYEm
> >> >> > -rw------- 1 ceph ceph  37 Feb  4 08:26 ceph_fsid
> >> >> > -rw------- 1 ceph ceph  37 Feb  4 08:26 fsid
> >> >> > -rw------- 1 ceph ceph  55 Feb  4 08:26 keyring
> >> >> > -rw------- 1 ceph ceph 106 Jan 24 00:44 lockbox.keyring
> >> >> > -rw------- 1 ceph ceph   6 Feb  4 08:26 ready
> >> >> > -rw------- 1 ceph ceph  10 Feb  4 08:26 type
> >> >> > -rw------- 1 ceph ceph   2 Feb  4 08:26 whoami
> >> >> >
> >> >> > Just as information: all except 2 pools in my cluster are
> >> >> > "replicated". Pools 11 and 16 are erasure coded (2+1). If I
> >> >> > understood correctly, as long as I have two acting shards (and at
> >> >> > most one "NONE"), the data should be available (at least read-only)
> >> >> > once I mark the down OSDs as lost. Is that understanding correct?
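One detail worth checking on this point: an EC 2+1 PG can reconstruct data from any two of its three shards, but it only serves I/O while at least min_size shards are active. A quick check (pool names from this thread):

```shell
# EC pools typically default to min_size = k+1 = 3; with one shard gone
# for good, a pool may need min_size lowered to k = 2 to serve I/O.
ceph osd pool get ark.data_ec min_size
ceph osd pool get limbo.data_ec min_size
```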
> >> >> >
> >> >> > Also for context: pools 10 and 15 are the "replicated root pools"
> >> >> > from before the erasure coded pools were created.
> >> >> >
> >> >> > Ignoring osd.0 for now, here is the current state of my cluster
> >> >> > (the MDS is intentionally not started while I try to fix the PGs):
> >> >> > ### ceph osd lspools
> >> >> > 3 .rgw.root
> >> >> > 4 default.rgw.log
> >> >> > 5 default.rgw.control
> >> >> > 6 default.rgw.meta
> >> >> > 10 ark.data
> >> >> > 11 ark.data_ec
> >> >> > 12 ark.metadata
> >> >> > 14 .mgr
> >> >> > 15 limbo
> >> >> > 16 limbo.data_ec
> >> >> > 18 default.rgw.buckets.index
> >> >> > 19 default.rgw.buckets.data
> >> >> > ###
> >> >> >
> >> >> > ### ceph health
> >> >> > # ceph -s
> >> >> >    cluster:
> >> >> >      id:     021f058f-dbf3-4a23-adb5-21d83f3f1bb6
> >> >> >      health: HEALTH_ERR
> >> >> >              1 filesystem is degraded
> >> >> >              1 filesystem has a failed mds daemon
> >> >> >              1 filesystem is offline
> >> >> >              insufficient standby MDS daemons available
> >> >> >              Reduced data availability: 143 pgs inactive, 143 pgs down
> >> >> >              Degraded data redundancy: 1303896/7149898 objects
> >> degraded
> >> >> > (18.237%), 218 pgs degraded, 316 pgs undersized
> >> >> >              144 pgs not deep-scrubbed in time
> >> >> >              459 pgs not scrubbed in time
> >> >> >              256 slow ops, oldest one blocked for 1507794 sec, osd.1
> >> has
> >> >> > slow ops
> >> >> >              too many PGs per OSD (657 > max 500)
> >> >> >
> >> >> >    services:
> >> >> >      mon: 2 daemons, quorum ceph-ymir-mon2,ceph-ymir-mon1 (age 2w)
> >> >> >      mgr: ceph-ymir-mgr1(active, since 2w)
> >> >> >      mds: 0/1 daemons up (1 failed)
> >> >> >      osd: 4 osds: 2 up (since 29m), 2 in (since 4w); 24 remapped pgs
> >> >> >
> >> >> >    data:
> >> >> >      volumes: 0/1 healthy, 1 failed
> >> >> >      pools:   12 pools, 529 pgs
> >> >> >      objects: 2.46M objects, 7.4 TiB
> >> >> >      usage:   8.3 TiB used, 13 TiB / 22 TiB avail
> >> >> >      pgs:     27.032% pgs not active
> >> >> >               1303896/7149898 objects degraded (18.237%)
> >> >> >               306628/7149898 objects misplaced (4.289%)
> >> >> >               218 active+undersized+degraded
> >> >> >               143 down
> >> >> >               98  active+undersized
> >> >> >               45  active+clean
> >> >> >               19  active+clean+remapped
> >> >> >               4   active+clean+remapped+scrubbing+deep
> >> >> >               1   active+clean+remapped+scrubbing
> >> >> >               1   active+clean+scrubbing+deep
> >> >> > ### ceph health
> >> >> >
> >> >> > ### ceph health detail
> >> >> > # ceph health detail
> >> >> > HEALTH_ERR 1 filesystem is degraded; 1 filesystem has a failed mds
> >> >> daemon;
> >> >> > 1 filesystem is offline; insufficient standby
> >> >> >   MDS daemons available; Reduced data availability: 143 pgs inactive,
> >> >> 143
> >> >> > pgs down; Degraded data redundancy: 1303896/714
> >> >> > 9898 objects degraded (18.237%), 218 pgs degraded, 316 pgs undersized;
> >> >> 144
> >> >> > pgs not deep-scrubbed in time; 459 pgs not sc
> >> >> > rubbed in time; 256 slow ops, oldest one blocked for 1508207 sec,
> >> osd.1
> >> >> has
> >> >> > slow ops; too many PGs per OSD (657 > max 50
> >> >> > 0)
> >> >> > [WRN] FS_DEGRADED: 1 filesystem is degraded
> >> >> >      fs ark is degraded
> >> >> > [WRN] FS_WITH_FAILED_MDS: 1 filesystem has a failed mds daemon
> >> >> >      fs ark has 1 failed mds
> >> >> > [ERR] MDS_ALL_DOWN: 1 filesystem is offline
> >> >> >      fs ark is offline because no MDS is active for it.
> >> >> > [WRN] MDS_INSUFFICIENT_STANDBY: insufficient standby MDS daemons
> >> >> available
> >> >> >      have 0; want 1 more
> >> >> > [WRN] PG_AVAILABILITY: Reduced data availability: 143 pgs inactive,
> >> 143
> >> >> pgs
> >> >> > down
> >> >> >      pg 10.11 is down, acting [1,3]
> >> >> >      pg 10.18 is down, acting [3,1]
> >> >> >      pg 10.1d is down, acting [1,3]
> >> >> >      pg 10.1f is down, acting [1,3]
> >> >> >      pg 11.10 is down, acting [3,1,NONE]
> >> >> >      pg 11.12 is down, acting [1,NONE,3]
> >> >> >      pg 11.18 is stuck inactive for 4w, current state down, last
> >> acting
> >> >> > [1,3,NONE]
> >> >> >      pg 11.19 is down, acting [3,1,NONE]
> >> >> >      pg 11.1b is down, acting [1,NONE,3]
> >> >> >      pg 11.62 is down, acting [NONE,3,1]
> >> >> >      pg 11.63 is down, acting [3,NONE,1]
> >> >> >      pg 11.64 is down, acting [NONE,1,3]
> >> >> >      pg 11.66 is down, acting [NONE,3,1]
> >> >> >      pg 11.67 is down, acting [1,NONE,3]
> >> >> >      pg 11.68 is down, acting [3,NONE,1]
> >> >> >      pg 11.69 is down, acting [NONE,1,3]
> >> >> >      pg 11.6a is down, acting [1,NONE,3]
> >> >> >      pg 11.6b is down, acting [NONE,1,3]
> >> >> >      pg 11.6f is down, acting [NONE,3,1]
> >> >> >      pg 11.71 is down, acting [1,3,NONE]
> >> >> >      pg 11.72 is down, acting [1,3,NONE]
> >> >> >      pg 11.74 is down, acting [NONE,3,1]
> >> >> >      pg 11.76 is down, acting [1,NONE,3]
> >> >> >      pg 11.78 is down, acting [3,1,NONE]
> >> >> >      pg 11.7d is down, acting [NONE,3,1]
> >> >> >      pg 11.7e is down, acting [NONE,1,3]
> >> >> >      pg 15.15 is down, acting [1,3]
> >> >> >      pg 15.16 is down, acting [3,1]
> >> >> >      pg 15.17 is down, acting [1,3]
> >> >> >      pg 15.1a is down, acting [3,1]
> >> >> >      pg 16.1 is down, acting [1,3,NONE]
> >> >> >      pg 16.4 is down, acting [1,3,NONE]
> >> >> >      pg 16.b is down, acting [3,NONE,1]
> >> >> >      pg 16.60 is down, acting [3,1,NONE]
> >> >> >      pg 16.61 is down, acting [3,1,NONE]
> >> >> >      pg 16.62 is down, acting [3,NONE,1]
> >> >> >      pg 16.63 is down, acting [3,NONE,1]
> >> >> >      pg 16.65 is down, acting [NONE,3,1]
> >> >> >      pg 16.67 is down, acting [1,NONE,3]
> >> >> >      pg 16.68 is down, acting [1,NONE,3]
> >> >> >      pg 16.69 is down, acting [3,1,NONE]
> >> >> >      pg 16.6a is down, acting [1,3,NONE]
> >> >> >      pg 16.6c is down, acting [1,3,NONE]
> >> >> >      pg 16.70 is down, acting [3,NONE,1]
> >> >> >      pg 16.73 is down, acting [3,NONE,1]
> >> >> >      pg 16.74 is down, acting [1,3,NONE]
> >> >> >      pg 16.75 is down, acting [3,1,NONE]
> >> >> >      pg 16.79 is down, acting [3,NONE,1]
> >> >> >      pg 16.7a is down, acting [1,3,NONE]
> >> >> >      pg 16.7e is down, acting [1,3,NONE]
> >> >> >      pg 16.7f is down, acting [3,NONE,1]
> >> >> > [WRN] PG_DEGRADED: Degraded data redundancy: 1303896/7149898 objects
> >> >> > degraded (18.237%), 218 pgs degraded, 316 pgs under
> >> >> > sized
> >> >> >      pg 3.18 is stuck undersized for 36m, current state
> >> >> active+undersized,
> >> >> > last acting [1,3]
> >> >> > ...<snipped for brevity>
> >> >> > ###
> >> >> >
> >> >> > Once again, I cannot thank you enough for looking into my issue.
> >> >> > I have the impression that being able to recover the data I need
> >> >> > is just around the corner. Although the croit.io blog did mention
> >> >> > flagging the OSD as lost, I would like to double-check it to avoid
> >> >> > losing any possibility of recovering the data.
> >> >> >
> >> >> > If there's anything further I could check, or if you need the full
> >> >> > output of the commands, let me know.
> >> >> >
> >> >> > Thanks in advance.
> >> >> >
> >> >> > On Tue, 3 Feb 2026 at 10:26, Igor Fedotov <[email protected]>
> >> >> wrote:
> >> >> >
> >> >> >> Hi Theo,
> >> >> >>
> >> >> >> you might want to try PG export/import using ceph-objectstore-tool.
> >> >> >>
> >> >> >> Please find more details here
> >> >> >>
> >> >>
> >> https://www.croit.io/blog/how-to-recover-inactive-pgs-using-ceph-objectstore-tool-on-ceph-clusters
> >> >> >>
> >> >> >>
> >> >> >> Thanks,
> >> >> >>
> >> >> >> Igor
> >> >> >> On 03/02/2026 02:38, Theo Cabrerizo Diem via ceph-users wrote:
> >> >> >>
> >> >> >> :12:18.895+0000 7f0c543eac00 -1 bluestore(/var/lib/ceph/osd)
> >> >> >> fsck error: free extent 0x1714c521000~978b26df000 intersects
> >> >> allocatedblocks
> >> >> >> fsck status: remaining 1 error(s) and warning(s)
> >> >> >>
> >> >> >>
> >> >> > _______________________________________________
> >> >> > ceph-users mailing list -- [email protected]
> >> >> > To unsubscribe send an email to [email protected]
> >> >>
> >> >
> >>
>
>



-- 
Alexander Patrakov
