Probably I did not use the right words to explain the current situation...

---

The cluster starts to recover/backfill once I set the crush reweight of a
failed OSD to 0; this is the case where S3 stops responding after a while. If
I instead lower the reweight only a little below its previous value, S3 keeps
working.
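
To be concrete, the "gentle" variant I mean is roughly the following sketch
(osd.239 and the step sizes are just examples):

  # lower the crush weight in steps instead of going straight to 0,
  # waiting for backfill to settle between steps
  for w in 4 3 2 1 0; do
      ceph osd crush reweight osd.239 $w
      # wait until no PGs report backfilling any more
      while ceph pg stat | grep -q backfill; do sleep 60; done
  done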

For example, right now the io section of 'ceph -s' shows recovery traffic
(that is why I said it is in a recovering state):
Every 2.0s: ceph -s
            ctplmon1.arnes.si: Tue Feb  3 13:55:19 2026

  cluster:
    id:     0a6e5422-ac75-4093-af20-528ee00cc847
    health: HEALTH_ERR
            2 OSD(s) experiencing slow operations in BlueStore
            2 OSD(s) experiencing stalled read in db device of BlueFS
            5 scrub errors
            Possible data damage: 3 pgs inconsistent

  services:
    mon: 3 daemons, quorum ctplmon1,ctplmon3,ctplmon2 (age 10d)
    mgr: ctplmon1(active, since 10d), standbys: ctplmon3
    mds: 1/1 daemons up
    osd: 240 osds: 240 up (since 42h), 240 in (since 42h); 13 remapped pgs
    rgw: 6 daemons active (2 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   10 pools, 6353 pgs
    objects: 360.61M objects, 421 TiB
    usage:   708 TiB used, 623 TiB / 1.3 PiB avail
    pgs:     1544002/1802985000 objects misplaced (0.086%)
             6328 active+clean
             13   active+remapped+backfilling
             9    active+clean+scrubbing+deep
             3    active+clean+inconsistent

  io:
    client:   97 KiB/s rd, 1.2 KiB/s wr, 18 op/s rd, 3 op/s wr
    recovery: 110 MiB/s, 90 objects/s

---

I do not use the orchestrator.

The CRUSH failure domain is set to host and EC 3+2 is used (for historical
reasons), although we would like to change this to 8+3 in the near future.
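
As far as I understand, k/m of an existing EC pool cannot be changed, so the
move to 8+3 would mean a new profile and a new pool plus data migration,
roughly along these lines (profile and pool names below are just placeholders):

  # inspect the profile currently in use
  ceph osd erasure-code-profile ls
  ceph osd erasure-code-profile get <current-profile>

  # create a new 8+3 profile and a new pool backed by it
  ceph osd erasure-code-profile set ec-8-3 k=8 m=3 crush-failure-domain=host
  ceph osd pool create new-ec-pool 4096 4096 erasure ec-8-3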

For every newly failed disk, different PGs are probably affected, not the same ones?
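
If it helps to answer that, I can compare which PGs map to the OSDs in
question, e.g. with something like (osd id just an example):

  # list the PGs whose acting set currently includes this OSD
  ceph pg ls-by-osd 239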

Rok

On Tue, Feb 3, 2026 at 1:35 PM Eugen Block via ceph-users <
[email protected]> wrote:

> So the cluster is currently recovering and then you wanted to remove
> another OSD? I'm just trying to get the full picture here. Let's assume the
> cluster is backfilling because you removed one OSD (and it's not done yet),
> then you decide to remove another one. If we assume a replicated pool (size
> 3, min_size 2), this could cause inactive PGs until backfill has finished.
> Depending on the failure domain (host?) you need to be careful when
> removing OSDs from multiple failure domains at the same time. That's why
> some background here would help to better understand what really happens.
> And does it happen every time you replace a disk? Does it always impact the
> same PGs? Does the cluster report inactive PGs for more than a few seconds?
>
> I was asking for RGW logs during that outage, not the "normal" behaviour
> (starting new request), so some error logs. The reshardlogs messages are
> also nothing unusual.
>
> Regarding the replacement procedure, it looks like you don't use the
> orchestrator, right?
>
>
> Am Di., 3. Feb. 2026 um 10:30 Uhr schrieb Rok Jaklič via ceph-users <
> [email protected]>:
>
> > ceph -s is in recovery state, moving around 200-300MB/s of data.
> >
> > [root@ctplmon1 ~]# ceph osd ok-to-stop 239
> >
> >
> {"ok_to_stop":true,"osds":[239],"num_ok_pgs":94,"num_not_ok_pgs":0,"ok_become_degraded":["9.e","9.40","9.4a","9.4e","9.69","9.7a","9.87","9.bc","9.c5","
> > 9.cf
> >
> >
> ","9.dd","9.128","9.172","9.1b6","9.1c7","9.227","9.244","9.246","9.24e","9.2b3","9.2d3","9.2ee","9.2f1","9.31c","9.335","9.361","9.364","9.38b","9.3b0","9.3d9","9.408","9.41b","9.483","9.49a","9.4c1","9.4dc","9.4ea","9.4fd","9.553","9.554","9.55b","9.57a","9.586","9.5c7","9.61a","9.627","9.68b","9.69d","9.711","9.71d","9.737","9.784","10.12","10.82","10.18c","10.1c7","10.290","10.310","10.390","10.434","10.4aa","10.4e1","10.525","10.558","10.57f","10.5e2","10.5e7","10.62f","10.736","10.842","10.8d9","10.904","10.964","10.977","10.9eb","10.a1a","10.a40","10.b80","10.bbf","10.be2","10.c7c","10.c9b","10.cb0","10.cb3","10.d7a","10.dc8","10.dcf","10.dde","10.e5b","10.e6d","10.e78","10.ece","10.ef1","10.f60"]}
> >
> > rgw logs is usually something like:
> > 2026-02-03T09:48:32.050+0100 7f34ff13a640  1 ====== starting new request
> > req=0x7f34b8f0c640 =====
> > 2026-02-03T09:48:35.901+0100 7f34cb0d2640  1 ====== starting new request
> > req=0x7f34b8e8b640 =====
> > 2026-02-03T09:49:41.408+0100 7f34e00fc640  1 ====== starting new request
> > req=0x7f34b8d08640 =====
> > 2026-02-03T09:49:46.267+0100 7f35511de640  1 ====== starting new request
> > req=0x7f34b8c87640 =====
> >
> > while I can see also:
> > 2026-02-03T09:49:50.912+0100 7f35bf2db640  0 INFO: RGWReshardLock::lock
> > found lock on reshard.0000000000 to be held by another RGW process;
> > skipping for now
> > 2026-02-03T09:49:50.919+0100 7f35bf2db640  0 INFO: RGWReshardLock::lock
> > found lock on reshard.0000000002 to be held by another RGW process;
> > skipping for now
> > 2026-02-03T09:49:50.926+0100 7f35bf2db640  0 INFO: RGWReshardLock::lock
> > found lock on reshard.0000000004 to be held by another RGW process;
> > skipping for now
> > 2026-02-03T09:49:50.934+0100 7f35bf2db640  0 INFO: RGWReshardLock::lock
> > found lock on reshard.0000000006 to be held by another RGW process;
> > skipping for now
> > 2026-02-03T09:49:50.953+0100 7f35bf2db640  0 INFO: RGWReshardLock::lock
> > found lock on reshard.0000000008 to be held by another RGW process;
> > skipping for now
> > 2026-02-03T09:49:50.960+0100 7f35bf2db640  0 INFO: RGWReshardLock::lock
> > found lock on reshard.0000000010 to be held by another RGW process;
> > skipping for now
> > 2026-02-03T09:49:50.968+0100 7f35bf2db640  0 INFO: RGWReshardLock::lock
> > found lock on reshard.0000000012 to be held by another RGW process;
> > skipping for now
> > 2026-02-03T09:49:50.975+0100 7f35bf2db640  0 INFO: RGWReshardLock::lock
> > found lock on reshard.0000000014 to be held by another RGW process;
> > skipping for now
> >
> > but do not know if it is related.
> >
> > ...
> >
> > Should I use some different strategy to "replace" failed disks?
> >
> > Right now I am doing something like:
> > ceph osd crush reweight osd.{osdid} 4 3 2 1 ... in steps
> >
> > while ! ceph osd safe-to-destroy osd.{osdid} ; do sleep 10 ; done
> >
> > systemctl stop ceph-osd@{id}
> > ceph osd down osd.{id}
> > ceph osd destroy {id} --yes-i-really-mean-it
> >
> > ceph-volume lvm prepare --osd-id {id} --data /dev/sdX
> > ceph-volume lvm list
> > ceph-volume lvm activate {id} {osd fsid}
> >
> >
> >
> > On Tue, Feb 3, 2026 at 10:21 AM Eugen Block <[email protected]> wrote:
> >
> > > Hi,
> > >
> > > what does 'ceph osd ok-to-stop <OSD_ID>' show? What do the rgw logs
> > > reveal? What's the 'ceph status' during that time?
> > >
> > > Am Di., 3. Feb. 2026 um 10:18 Uhr schrieb Rok Jaklič via ceph-users <
> > > [email protected]>:
> > >
> > >> Hi,
> > >>
> > >> we currently have 240 OSDs in the cluster, but when I set the reweight
> > of
> > >> just one failed OSD to 0 (from 5.45798), radosgw stops working for
> > >> some reason.
> > >>
> > >> If I do "gentle reweight" in steps of 1 for example (from 5.4 to 4.4),
> > it
> > >> works ok.
> > >>
> > >> Any ideas why?
> > >>
> > >> Running on reef ceph version 18.2.7.
> > >>
> > >> Kind regards,
> > >> Rok
> > >
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
