So the cluster is currently recovering, and then you wanted to remove another OSD? I'm just trying to get the full picture here. Let's assume the cluster is backfilling because you removed one OSD (and it's not done yet), and then you decide to remove another one. With a replicated pool (size 3, min_size 2) this could cause inactive PGs until the backfill has finished. Depending on the failure domain (host?), you need to be careful when removing OSDs from multiple failure domains at the same time. That's why some more background would help to understand what is really happening. Does it happen every time you replace a disk? Does it always impact the same PGs? Does the cluster report inactive PGs for more than a few seconds?
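
Before stopping or draining the next OSD it's worth a quick check that no PGs would become inactive. Just a rough sketch (the OSD id is a placeholder):

ceph osd ok-to-stop <OSD_ID>     # would stopping this OSD make any PGs inactive?
ceph pg dump_stuck inactive      # any PGs currently not serving I/O?
ceph pg dump_stuck degraded      # any PGs still missing replicas?

If ok-to-stop reports false or PGs show up as inactive, let the current backfill finish first.
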
I was asking for RGW logs from during that outage, not the "normal" behaviour (starting new request), so ideally some error logs. The reshard lock messages are nothing unusual either. Regarding the replacement procedure, it looks like you don't use the orchestrator, right? (See the sketch at the very bottom of this mail for how a replacement would look with cephadm.)

On Tue, Feb 3, 2026 at 10:30 AM Rok Jaklič via ceph-users <[email protected]> wrote:

> ceph -s is in recovery state, moving around 200-300MB/s of data.
>
> [root@ctplmon1 ~]# ceph osd ok-to-stop 239
>
> {"ok_to_stop":true,"osds":[239],"num_ok_pgs":94,"num_not_ok_pgs":0,"ok_become_degraded":["9.e","9.40","9.4a","9.4e","9.69","9.7a","9.87","9.bc","9.c5","9.cf","9.dd","9.128","9.172","9.1b6","9.1c7","9.227","9.244","9.246","9.24e","9.2b3","9.2d3","9.2ee","9.2f1","9.31c","9.335","9.361","9.364","9.38b","9.3b0","9.3d9","9.408","9.41b","9.483","9.49a","9.4c1","9.4dc","9.4ea","9.4fd","9.553","9.554","9.55b","9.57a","9.586","9.5c7","9.61a","9.627","9.68b","9.69d","9.711","9.71d","9.737","9.784","10.12","10.82","10.18c","10.1c7","10.290","10.310","10.390","10.434","10.4aa","10.4e1","10.525","10.558","10.57f","10.5e2","10.5e7","10.62f","10.736","10.842","10.8d9","10.904","10.964","10.977","10.9eb","10.a1a","10.a40","10.b80","10.bbf","10.be2","10.c7c","10.c9b","10.cb0","10.cb3","10.d7a","10.dc8","10.dcf","10.dde","10.e5b","10.e6d","10.e78","10.ece","10.ef1","10.f60"]}
>
> rgw logs is usually something like:
> 2026-02-03T09:48:32.050+0100 7f34ff13a640 1 ====== starting new request req=0x7f34b8f0c640 =====
> 2026-02-03T09:48:35.901+0100 7f34cb0d2640 1 ====== starting new request req=0x7f34b8e8b640 =====
> 2026-02-03T09:49:41.408+0100 7f34e00fc640 1 ====== starting new request req=0x7f34b8d08640 =====
> 2026-02-03T09:49:46.267+0100 7f35511de640 1 ====== starting new request req=0x7f34b8c87640 =====
>
> while I can see also:
> 2026-02-03T09:49:50.912+0100 7f35bf2db640 0 INFO: RGWReshardLock::lock found lock on reshard.0000000000 to be held by another RGW process; skipping for now
> 2026-02-03T09:49:50.919+0100 7f35bf2db640 0 INFO: RGWReshardLock::lock found lock on reshard.0000000002 to be held by another RGW process; skipping for now
> 2026-02-03T09:49:50.926+0100 7f35bf2db640 0 INFO: RGWReshardLock::lock found lock on reshard.0000000004 to be held by another RGW process; skipping for now
> 2026-02-03T09:49:50.934+0100 7f35bf2db640 0 INFO: RGWReshardLock::lock found lock on reshard.0000000006 to be held by another RGW process; skipping for now
> 2026-02-03T09:49:50.953+0100 7f35bf2db640 0 INFO: RGWReshardLock::lock found lock on reshard.0000000008 to be held by another RGW process; skipping for now
> 2026-02-03T09:49:50.960+0100 7f35bf2db640 0 INFO: RGWReshardLock::lock found lock on reshard.0000000010 to be held by another RGW process; skipping for now
> 2026-02-03T09:49:50.968+0100 7f35bf2db640 0 INFO: RGWReshardLock::lock found lock on reshard.0000000012 to be held by another RGW process; skipping for now
> 2026-02-03T09:49:50.975+0100 7f35bf2db640 0 INFO: RGWReshardLock::lock found lock on reshard.0000000014 to be held by another RGW process; skipping for now
>
> but do not know if it is related.
>
> ...
>
> Should I use some different strategy to "replace" failed disks?
>
> Right now I am doing something like:
> ceph osd crush reweight osd.{osdid} 4 3 2 1 ... in steps
>
> while ! ceph osd safe-to-destroy osd.{osdid} ; do sleep 10 ; done
>
> systemctl stop ceph-osd@{id}
> ceph osd down osd.{id}
> ceph osd destroy {id} --yes-i-really-mean-it
>
> ceph-volume lvm prepare --osd-id {id} --data /dev/sdX
> ceph-volume lvm list
> ceph-volume lvm activate {id} {osd fsid}
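
If you want to stick with that manual approach, the waiting between the reweight steps can be scripted. Only a rough sketch (untested here; OSD_ID and the step sizes are placeholders, and the HEALTH_OK check assumes there are no unrelated warnings in your cluster):

OSD_ID=239                                   # placeholder: the OSD to drain
for w in 4 3 2 1 0; do
    ceph osd crush reweight osd.${OSD_ID} ${w}
    # wait until the recovery/backfill triggered by this step has finished
    until ceph health | grep -q HEALTH_OK; do
        sleep 60
    done
done
# only then continue with safe-to-destroy / stop / destroy as you already do

The important part is simply not to take the next step (or the next OSD) while PGs are still degraded or backfilling.
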
> On Tue, Feb 3, 2026 at 10:21 AM Eugen Block <[email protected]> wrote:
>
> > Hi,
> >
> > what does 'ceph osd ok-to-stop <OSD_ID>' show? What do the rgw logs
> > reveal? What's the 'ceph status' during that time?
> >
> > On Tue, Feb 3, 2026 at 10:18 AM Rok Jaklič via ceph-users <[email protected]> wrote:
> >
> >> Hi,
> >>
> >> we currently have 240 OSDs in the cluster, but when I set the reweight of
> >> just one failed OSD to 0 (from 5.45798), radosgw stops working for
> >> some reason.
> >>
> >> If I do "gentle reweight" in steps of 1 for example (from 5.4 to 4.4), it
> >> works ok.
> >>
> >> Any ideas why?
> >>
> >> Running on reef ceph version 18.2.7.
> >>
> >> Kind regards,
> >> Rok
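
And regarding the orchestrator question above: in a cephadm-managed cluster a disk replacement would roughly look like the sketch below. Again only a sketch, host and device names are placeholders, and it assumes either an OSD service spec or a manual "daemon add" to re-create the OSD once the new disk is in place:

ceph orch osd rm <OSD_ID> --replace            # drain the OSD and mark it destroyed
ceph orch osd rm status                        # watch the draining progress
# after the physical disk has been swapped:
ceph orch device zap <host> /dev/sdX --force   # wipe the new disk if necessary
ceph orch daemon add osd <host>:/dev/sdX       # or let the service spec pick it up

That keeps the OSD id and replaces most of the manual ceph-volume steps.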
