Hello Igor,

> Just in case - didn't you overwrite existing PG replicas at target OSDs
> when exporting PGs back to OSDs 1&3?

Now that you mention that an OSD cannot hold two shards of the same PG, I
think I've put myself in a tricky spot.
I had 4 OSDs and two of them "died", leaving two online, so when I exported
the shards from osd.2 I ended up importing them into osd.3 (which already
held other shards of the same PGs).

Doesn't look "that bad" ... (*looking up to the sky*) ...
## on osd.3
# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ --op list-pgs | grep 16.b
16.bs1
16.bs0
(s2 should be on osd.1; s1 is the one I imported)

## ceph health reports
  pg 16.b is down, acting [3,NONE,1]
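
In case it helps to see exactly what I'd be doing, this is roughly how I would inspect, back up, and (only if you advise it) remove the stray imported shard, with osd.3 stopped. The paths are from my setup; the removal step is just a sketch of what I *could* run, not something I've done:

```shell
# With osd.3 stopped: confirm both shards of PG 16.b are on this OSD
ceph-objectstore-tool --data-path /var/lib/ceph/osd --op list-pgs | grep '^16\.b'

# Keep a backup export of the mistakenly imported shard before touching anything
ceph-objectstore-tool --data-path /var/lib/ceph/osd --pgid 16.bs1 \
    --op export --file /root/pg.16.bs1.backup.dat

# Only if removal turns out to be the right call: drop the duplicate shard
ceph-objectstore-tool --data-path /var/lib/ceph/osd --pgid 16.bs1 --op remove --force
```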

I might have some spare HDDs to set up a new OSD (only one, though) with
enough space for things to balance out, but as I've mentioned, there are
quite a few PGs in the down state. Should I set up a new OSD and wait until
things "settle"? The exported shards from the dead osd.2 that ceph health
was complaining about add up to about 400 GB.

Is it possible to set up an OSD and have Ceph not push too many shards onto
it? (The replicated pools are "small".) I ask because it should be easy to
get a 1 TB/2 TB disk, but if the cluster must rebalance completely for the
PGs to get unstuck, I need a bigger disk, which I don't have at the moment
(and current market prices aren't great for buying one).
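
From what I understand of the docs, the CRUSH weight of the new OSD could be set low so Ceph places fewer PGs on it. Sketching from memory - the OSD id 4 and the weight value are made up for illustration:

```shell
# Give the new OSD a deliberately low CRUSH weight (weight is roughly TiB of
# capacity, so 0.5 advertises about half a TiB even if the disk is larger)
ceph osd crush reweight osd.4 0.5

# Check where it landed in the CRUSH tree and how full each OSD is
ceph osd tree
ceph osd df
```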

Another option: if osd.0 is gone for good (it's the one that doesn't even
allow me to run 'ceph-objectstore-tool --op list-pgs'), I can wipe its
disk/clone and reuse it; it has the same capacity as the other OSDs (I used
a spare I had at home). If you believe it is still workable, I can try to
find some more disks and not touch those for the moment.
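
If wiping osd.0's disk turns out to be the way to go, I assume it would be something like the usual purge/zap/recreate cycle - the device path below is a placeholder, and I haven't run any of this yet:

```shell
# Remove the dead OSD from the cluster (it is already down and out)
ceph osd purge 0 --yes-i-really-mean-it

# Wipe the old data on the disk (placeholder device path)
ceph-volume lvm zap --destroy /dev/sdX

# Re-create an encrypted OSD on the freshly wiped disk
ceph-volume lvm create --data /dev/sdX --dmcrypt
```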

Thanks for the patience so far.

On Tue, 24 Feb 2026 at 16:39, Igor Fedotov <[email protected]> wrote:

> Hi Theo,
>
> Sorry, I can't tell for sure what marking the OSD lost would do with its
> encryption keys. Likely - yes, they'll be lost.
>
> But instead of going this way I'd rather suggest you add another two
> OSDs and let Ceph recover more PG replicas onto them.
>
> Just in case - didn't you overwrite existing PG replicas at target OSDs
> when exporting PGs back to OSDs 1&3? The same PG can't have two
> replicas/shards on a single OSD, and your OSD count is pretty limited.
> Just curious for now - it still shouldn't be an issue given you have at
> least 2 replicas/shards for all the pools anyway.
>
> Thanks,
>
> Igor
>
> On 2/21/2026 11:00 PM, Theo Cabrerizo Diem via ceph-users wrote:
> > Hello Igor,
> >
> > First of all, sorry about the late reply. It took me a while to export
> > all the missing shards from osd.2 (osd.1 and osd.3 were fine; osd.2
> > didn't start, but I could still use `ceph-objectstore-tool ... --op
> > list-pgs` on it, while on osd.0 I couldn't even list the PGs - it threw
> > an error right away; more about that later in this email).
> >
> > For two of the unavailable shards, ceph-objectstore-tool core dumped
> > during export with the same RocksDB issue, but I should have enough
> > chunks not to need them - just mentioning it in case it is useful:
> >
> > sh-5.1# ceph-objectstore-tool --data-path /var/lib/ceph/osd --pgid 11.19s2 --op export --file pg.11.19s2.dat
> > /ceph/rpmbuild/BUILD/ceph-20.2.0/src/kv/RocksDBStore.cc: In function 'virtual int RocksDBStore::get(const std::string&, const std::string&, ceph::bufferlist*)' thread 7ff3be4ca800 time 2026-02-04T09:42:00.743877+0000
> > /ceph/rpmbuild/BUILD/ceph-20.2.0/src/kv/RocksDBStore.cc: 1961: ceph_abort_msg("block checksum mismatch: stored = 246217859, computed = 2155741315, type = 4  in db/170027.sst offset 28264757 size 1417")
> >   ceph version 20.2.0 (69f84cc2651aa259a15bc192ddaabd3baba07489) tentacle
> > (stable - RelWithDebInfo)
> >   1: (ceph::__ceph_abort(char const*, int, char const*,
> > std::__cxx11::basic_string<char, std::char_traits<char>,
> > std::allocator<char> > const&)+0xc9) [0x7ff3bf5391fd]
> >   2: (RocksDBStore::get(std::__cxx11::basic_string<char,
> > std::char_traits<char>, std::allocator<char> > const&,
> > std::__cxx11::basic_string<char, std::char_traits<char>,
> > std::allocator<char> > const&, ceph::buffer::v15_2_0::list*)+0x3bc)
> > [0x555667b340bc]
> >   3:
> >
> (BlueStore::omap_get_values(boost::intrusive_ptr<ObjectStore::CollectionImpl>&,
> > ghobject_t const&,
> > std::set<std::__cxx11::basic_string<char,std::char_traits<char>,
> > std::allocator<char> >, std::less<std::__cxx11::basic_string<char,
> > std::char_traits<char>, std::allocator<char> > >,
> > std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>,
> > std::allocator<char> > > > const&,
> > std::map<std::__cxx11::basic_string<char,std::char_traits<char>,
> > std::allocator<char> >, ceph::buffer::v15_2_0::list,
> > std::less<std::__cxx11::basic_string<char, std::char_traits<char>,
> > std::allocator<char> > >,
> > std::allocator<std::pair<std::__cxx11::basic_string<char,
> > std::char_traits<char>, std::allocator<char> > const,
> > ceph::buffer::v15_2_0::list> > >*)+0x401) [0x555667a25fe1]
> >   4: (PG::peek_map_epoch(ObjectStore*, spg_t, unsigned int*)+0x361)
> > [0x5556675e0101]
> >   5: main()
> >   6: /lib64/libc.so.6(+0x2a610) [0x7ff3be930610]
> >   7: __libc_start_main()
> >   8: _start()
> > *** Caught signal (Aborted) **
> >   in thread 7ff3be4ca800 thread_name:ceph-objectstor
> >   ceph version 20.2.0 (69f84cc2651aa259a15bc192ddaabd3baba07489) tentacle
> > (stable - RelWithDebInfo)
> >   1: /lib64/libc.so.6(+0x3fc30) [0x7ff3be945c30]
> >   2: /lib64/libc.so.6(+0x8d03c) [0x7ff3be99303c]
> >   3: raise()
> >   4: abort()
> >   5: (ceph::__ceph_abort(char const*, int, char const*,
> > std::__cxx11::basic_string<char, std::char_traits<char>,
> > std::allocator<char> > const&)+0x186) [0x7ff3bf5392ba]
> >   6: (RocksDBStore::get(std::__cxx11::basic_string<char,
> > std::char_traits<char>, std::allocator<char> > const&,
> > std::__cxx11::basic_string<char, std::char_traits<char>,
> > std::allocator<char> > const&, ceph::buffer::v15_2_0::list*)+0x3bc)
> > [0x555667b340bc]
> >   7:
> >
> (BlueStore::omap_get_values(boost::intrusive_ptr<ObjectStore::CollectionImpl>&,
> > ghobject_t const&,
> > std::set<std::__cxx11::basic_string<char,std::char_traits<char>,
> > std::allocator<char> >, std::less<std::__cxx11::basic_string<char,
> > std::char_traits<char>, std::allocator<char> > >,
> > std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>,
> > std::allocator<char> > > > const&,
> > std::map<std::__cxx11::basic_string<char,std::char_traits<char>,
> > std::allocator<char> >, ceph::buffer::v15_2_0::list,
> > std::less<std::__cxx11::basic_string<char, std::char_traits<char>,
> > std::allocator<char> > >,
> > std::allocator<std::pair<std::__cxx11::basic_string<char,
> > std::char_traits<char>, std::allocator<char> > const,
> > ceph::buffer::v15_2_0::list> > >*)+0x401) [0x555667a25fe1]
> >   8: (PG::peek_map_epoch(ObjectStore*, spg_t, unsigned int*)+0x361)
> > [0x5556675e0101]
> >   9: main()
> >   10: /lib64/libc.so.6(+0x2a610) [0x7ff3be930610]
> >   11: __libc_start_main()
> >   12: _start()
> > Aborted (core dumped)
> >
> >
> >
> > After importing all the recoverable shards that were missing, I don't
> > have any "unknown" PGs anymore. I still have lots of PGs in the "down"
> > state, which I assume means I need to mark both "dead" OSDs as lost to
> > get them unstuck. Since that operation cannot be undone, I would like to
> > confirm that it is indeed the correct next step to take.
> >
> > I have a few questions to understand what happens in the next step
> > (marking the OSDs as lost):
> >
> > Shall I assume that once I flag an OSD as lost, I won't be able to
> > "activate" it again, since I use encryption when initializing the
> > BlueStore OSDs? Or does flagging them as lost not destroy their
> > unlocking keys? (If the keys are gone, any hope of further extracting
> > data is gone too - mostly for osd.0, on which I couldn't use
> > ceph-objectstore-tool at all since the power loss.)
> >
> > I think I should have all the shards from the PGs, but just in case,
> > I've managed to clone osd.0 onto a different physical disk (the other
> > reason I took so long to answer). But ceph-objectstore-tool still
> > refuses to run:
> >
> > # ceph-objectstore-tool --data-path /var/lib/ceph/osd --op list-pgs
> > Mount failed with '(5) Input/output error'
> >
> > # ls -l /var/lib/ceph/osd
> > total 28
> > lrwxrwxrwx 1 ceph ceph  50 Feb  4 08:26 block ->
> > /dev/mapper/zNPZJR-i0TZ-6NtK-URto-tjfs-iJRb-GCAYEm
> > -rw------- 1 ceph ceph  37 Feb  4 08:26 ceph_fsid
> > -rw------- 1 ceph ceph  37 Feb  4 08:26 fsid
> > -rw------- 1 ceph ceph  55 Feb  4 08:26 keyring
> > -rw------- 1 ceph ceph 106 Jan 24 00:44 lockbox.keyring
> > -rw------- 1 ceph ceph   6 Feb  4 08:26 ready
> > -rw------- 1 ceph ceph  10 Feb  4 08:26 type
> > -rw------- 1 ceph ceph   2 Feb  4 08:26 whoami
> >
> > Just for information: all but two pools in my cluster are "replicated".
> > Pools 11 and 16 are erasure coded (2+1). If I understood correctly, as
> > long as I have two acting shards (and at most one "NONE"), the data
> > should be available (at least read-only) once I mark the down OSDs as
> > lost. Is that understanding correct?
> >
> > One more piece of information: pools 10 and 15 are the "replicated root
> > pools" from before the erasure-coded pools were created.
> >
> > Ignoring osd.0 for now, here is the current state of my cluster (the
> > MDS is intentionally not started while I try to fix the PGs):
> > ### ceph osd lspools
> > 3 .rgw.root
> > 4 default.rgw.log
> > 5 default.rgw.control
> > 6 default.rgw.meta
> > 10 ark.data
> > 11 ark.data_ec
> > 12 ark.metadata
> > 14 .mgr
> > 15 limbo
> > 16 limbo.data_ec
> > 18 default.rgw.buckets.index
> > 19 default.rgw.buckets.data
> > ###
> >
> > ### ceph -s
> > # ceph -s
> >    cluster:
> >      id:     021f058f-dbf3-4a23-adb5-21d83f3f1bb6
> >      health: HEALTH_ERR
> >              1 filesystem is degraded
> >              1 filesystem has a failed mds daemon
> >              1 filesystem is offline
> >              insufficient standby MDS daemons available
> >              Reduced data availability: 143 pgs inactive, 143 pgs down
> >              Degraded data redundancy: 1303896/7149898 objects degraded
> > (18.237%), 218 pgs degraded, 316 pgs undersized
> >              144 pgs not deep-scrubbed in time
> >              459 pgs not scrubbed in time
> >              256 slow ops, oldest one blocked for 1507794 sec, osd.1 has
> > slow ops
> >              too many PGs per OSD (657 > max 500)
> >
> >    services:
> >      mon: 2 daemons, quorum ceph-ymir-mon2,ceph-ymir-mon1 (age 2w)
> >      mgr: ceph-ymir-mgr1(active, since 2w)
> >      mds: 0/1 daemons up (1 failed)
> >      osd: 4 osds: 2 up (since 29m), 2 in (since 4w); 24 remapped pgs
> >
> >    data:
> >      volumes: 0/1 healthy, 1 failed
> >      pools:   12 pools, 529 pgs
> >      objects: 2.46M objects, 7.4 TiB
> >      usage:   8.3 TiB used, 13 TiB / 22 TiB avail
> >      pgs:     27.032% pgs not active
> >               1303896/7149898 objects degraded (18.237%)
> >               306628/7149898 objects misplaced (4.289%)
> >               218 active+undersized+degraded
> >               143 down
> >               98  active+undersized
> >               45  active+clean
> >               19  active+clean+remapped
> >               4   active+clean+remapped+scrubbing+deep
> >               1   active+clean+remapped+scrubbing
> >               1   active+clean+scrubbing+deep
> > ### ceph -s
> >
> > ### ceph health detail
> > # ceph health detail
> > HEALTH_ERR 1 filesystem is degraded; 1 filesystem has a failed mds daemon; 1 filesystem is offline; insufficient standby MDS daemons available; Reduced data availability: 143 pgs inactive, 143 pgs down; Degraded data redundancy: 1303896/7149898 objects degraded (18.237%), 218 pgs degraded, 316 pgs undersized; 144 pgs not deep-scrubbed in time; 459 pgs not scrubbed in time; 256 slow ops, oldest one blocked for 1508207 sec, osd.1 has slow ops; too many PGs per OSD (657 > max 500)
> > [WRN] FS_DEGRADED: 1 filesystem is degraded
> >      fs ark is degraded
> > [WRN] FS_WITH_FAILED_MDS: 1 filesystem has a failed mds daemon
> >      fs ark has 1 failed mds
> > [ERR] MDS_ALL_DOWN: 1 filesystem is offline
> >      fs ark is offline because no MDS is active for it.
> > [WRN] MDS_INSUFFICIENT_STANDBY: insufficient standby MDS daemons available
> >      have 0; want 1 more
> > [WRN] PG_AVAILABILITY: Reduced data availability: 143 pgs inactive, 143 pgs down
> >      pg 10.11 is down, acting [1,3]
> >      pg 10.18 is down, acting [3,1]
> >      pg 10.1d is down, acting [1,3]
> >      pg 10.1f is down, acting [1,3]
> >      pg 11.10 is down, acting [3,1,NONE]
> >      pg 11.12 is down, acting [1,NONE,3]
> >      pg 11.18 is stuck inactive for 4w, current state down, last acting [1,3,NONE]
> >      pg 11.19 is down, acting [3,1,NONE]
> >      pg 11.1b is down, acting [1,NONE,3]
> >      pg 11.62 is down, acting [NONE,3,1]
> >      pg 11.63 is down, acting [3,NONE,1]
> >      pg 11.64 is down, acting [NONE,1,3]
> >      pg 11.66 is down, acting [NONE,3,1]
> >      pg 11.67 is down, acting [1,NONE,3]
> >      pg 11.68 is down, acting [3,NONE,1]
> >      pg 11.69 is down, acting [NONE,1,3]
> >      pg 11.6a is down, acting [1,NONE,3]
> >      pg 11.6b is down, acting [NONE,1,3]
> >      pg 11.6f is down, acting [NONE,3,1]
> >      pg 11.71 is down, acting [1,3,NONE]
> >      pg 11.72 is down, acting [1,3,NONE]
> >      pg 11.74 is down, acting [NONE,3,1]
> >      pg 11.76 is down, acting [1,NONE,3]
> >      pg 11.78 is down, acting [3,1,NONE]
> >      pg 11.7d is down, acting [NONE,3,1]
> >      pg 11.7e is down, acting [NONE,1,3]
> >      pg 15.15 is down, acting [1,3]
> >      pg 15.16 is down, acting [3,1]
> >      pg 15.17 is down, acting [1,3]
> >      pg 15.1a is down, acting [3,1]
> >      pg 16.1 is down, acting [1,3,NONE]
> >      pg 16.4 is down, acting [1,3,NONE]
> >      pg 16.b is down, acting [3,NONE,1]
> >      pg 16.60 is down, acting [3,1,NONE]
> >      pg 16.61 is down, acting [3,1,NONE]
> >      pg 16.62 is down, acting [3,NONE,1]
> >      pg 16.63 is down, acting [3,NONE,1]
> >      pg 16.65 is down, acting [NONE,3,1]
> >      pg 16.67 is down, acting [1,NONE,3]
> >      pg 16.68 is down, acting [1,NONE,3]
> >      pg 16.69 is down, acting [3,1,NONE]
> >      pg 16.6a is down, acting [1,3,NONE]
> >      pg 16.6c is down, acting [1,3,NONE]
> >      pg 16.70 is down, acting [3,NONE,1]
> >      pg 16.73 is down, acting [3,NONE,1]
> >      pg 16.74 is down, acting [1,3,NONE]
> >      pg 16.75 is down, acting [3,1,NONE]
> >      pg 16.79 is down, acting [3,NONE,1]
> >      pg 16.7a is down, acting [1,3,NONE]
> >      pg 16.7e is down, acting [1,3,NONE]
> >      pg 16.7f is down, acting [3,NONE,1]
> > [WRN] PG_DEGRADED: Degraded data redundancy: 1303896/7149898 objects degraded (18.237%), 218 pgs degraded, 316 pgs undersized
> >      pg 3.18 is stuck undersized for 36m, current state active+undersized, last acting [1,3]
> > ...<snipped for brevity>
> > ###
> >
> > Once again, I cannot thank you enough for looking into my issue.
> > I have the impression that recovering the data I need is just around
> > the corner. Although the croit.io blog did mention marking the OSD as
> > lost, I would like to double-check it to avoid losing any chance of
> > recovering the data.
> >
> > If there's anything further I can check, or if you need the full output
> > of any of the commands, let me know.
> >
> > Thanks in advance.
> >
> > On Tue, 3 Feb 2026 at 10:26, Igor Fedotov <[email protected]> wrote:
> >
> >> Hi Theo,
> >>
> >> you might want to try PG export/import using ceph-objectstore-tool.
> >>
> >> Please find more details here:
> >> https://www.croit.io/blog/how-to-recover-inactive-pgs-using-ceph-objectstore-tool-on-ceph-clusters
> >>
> >>
> >> Thanks,
> >>
> >> Igor
> >> On 03/02/2026 02:38, Theo Cabrerizo Diem via ceph-users wrote:
> >>
> >> :12:18.895+0000 7f0c543eac00 -1 bluestore(/var/lib/ceph/osd)
> >> fsck error: free extent 0x1714c521000~978b26df000 intersects allocated blocks
> >> fsck status: remaining 1 error(s) and warning(s)
> >>
> >>
> > _______________________________________________
> > ceph-users mailing list -- [email protected]
> > To unsubscribe send an email to [email protected]
>