ceph -s shows the cluster in a recovery state, moving around 200-300 MB/s of data.

[root@ctplmon1 ~]# ceph osd ok-to-stop 239
{"ok_to_stop":true,"osds":[239],"num_ok_pgs":94,"num_not_ok_pgs":0,"ok_become_degraded":["9.e","9.40","9.4a","9.4e","9.69","9.7a","9.87","9.bc","9.c5","
9.cf
","9.dd","9.128","9.172","9.1b6","9.1c7","9.227","9.244","9.246","9.24e","9.2b3","9.2d3","9.2ee","9.2f1","9.31c","9.335","9.361","9.364","9.38b","9.3b0","9.3d9","9.408","9.41b","9.483","9.49a","9.4c1","9.4dc","9.4ea","9.4fd","9.553","9.554","9.55b","9.57a","9.586","9.5c7","9.61a","9.627","9.68b","9.69d","9.711","9.71d","9.737","9.784","10.12","10.82","10.18c","10.1c7","10.290","10.310","10.390","10.434","10.4aa","10.4e1","10.525","10.558","10.57f","10.5e2","10.5e7","10.62f","10.736","10.842","10.8d9","10.904","10.964","10.977","10.9eb","10.a1a","10.a40","10.b80","10.bbf","10.be2","10.c7c","10.c9b","10.cb0","10.cb3","10.d7a","10.dc8","10.dcf","10.dde","10.e5b","10.e6d","10.e78","10.ece","10.ef1","10.f60"]}

The rgw logs usually show something like:
2026-02-03T09:48:32.050+0100 7f34ff13a640  1 ====== starting new request req=0x7f34b8f0c640 =====
2026-02-03T09:48:35.901+0100 7f34cb0d2640  1 ====== starting new request req=0x7f34b8e8b640 =====
2026-02-03T09:49:41.408+0100 7f34e00fc640  1 ====== starting new request req=0x7f34b8d08640 =====
2026-02-03T09:49:46.267+0100 7f35511de640  1 ====== starting new request req=0x7f34b8c87640 =====

while I can also see:
2026-02-03T09:49:50.912+0100 7f35bf2db640  0 INFO: RGWReshardLock::lock found lock on reshard.0000000000 to be held by another RGW process; skipping for now
2026-02-03T09:49:50.919+0100 7f35bf2db640  0 INFO: RGWReshardLock::lock found lock on reshard.0000000002 to be held by another RGW process; skipping for now
2026-02-03T09:49:50.926+0100 7f35bf2db640  0 INFO: RGWReshardLock::lock found lock on reshard.0000000004 to be held by another RGW process; skipping for now
2026-02-03T09:49:50.934+0100 7f35bf2db640  0 INFO: RGWReshardLock::lock found lock on reshard.0000000006 to be held by another RGW process; skipping for now
2026-02-03T09:49:50.953+0100 7f35bf2db640  0 INFO: RGWReshardLock::lock found lock on reshard.0000000008 to be held by another RGW process; skipping for now
2026-02-03T09:49:50.960+0100 7f35bf2db640  0 INFO: RGWReshardLock::lock found lock on reshard.0000000010 to be held by another RGW process; skipping for now
2026-02-03T09:49:50.968+0100 7f35bf2db640  0 INFO: RGWReshardLock::lock found lock on reshard.0000000012 to be held by another RGW process; skipping for now
2026-02-03T09:49:50.975+0100 7f35bf2db640  0 INFO: RGWReshardLock::lock found lock on reshard.0000000014 to be held by another RGW process; skipping for now

but I do not know if it is related.
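
To check whether dynamic resharding is actually queued or running, I assume I could inspect the reshard queue with something like:

radosgw-admin reshard list

but I have not looked into that yet.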

...

Should I use a different strategy to "replace" failed disks?

Right now I am doing something like:
ceph osd crush reweight osd.{osdid} 4, then 3, then 2, then 1, ... in steps

while ! ceph osd safe-to-destroy osd.{osdid} ; do sleep 10 ; done

systemctl stop ceph-osd@{id}
ceph osd down osd.{id}
ceph osd destroy {id} --yes-i-really-mean-it

ceph-volume lvm prepare --osd-id {id} --data /dev/sdX
ceph-volume lvm list
ceph-volume lvm activate {id} {osd fsid}
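
For reference, this is roughly the whole sequence as one script. The step weights, the /dev/sdX device path, and the wait conditions are placeholders I would adapt; the grep on "ceph status" and the awk over the "ceph-volume lvm list" output are rough assumptions on my side about the output format:

#!/bin/bash
# Placeholders - adjust to the actual OSD id and replacement device.
OSD_ID=239
NEW_DEV=/dev/sdX

# Drain the OSD gradually instead of reweighting straight to 0.
for WEIGHT in 4 3 2 1 0; do
    ceph osd crush reweight osd.${OSD_ID} ${WEIGHT}
    # Crude wait: poll until ceph status no longer mentions recovery/backfill.
    while ceph status | grep -Eq 'backfill|recover'; do
        sleep 60
    done
done

# Only destroy the OSD once Ceph confirms it is safe.
while ! ceph osd safe-to-destroy osd.${OSD_ID}; do
    sleep 10
done

systemctl stop ceph-osd@${OSD_ID}
ceph osd down osd.${OSD_ID}
ceph osd destroy ${OSD_ID} --yes-i-really-mean-it

# Re-create the OSD on the replacement disk, reusing the same id,
# then look up its fsid and activate it.
ceph-volume lvm prepare --osd-id ${OSD_ID} --data ${NEW_DEV}
OSD_FSID=$(ceph-volume lvm list ${NEW_DEV} | awk '/osd fsid/ {print $3}' | head -1)
ceph-volume lvm activate ${OSD_ID} ${OSD_FSID}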



On Tue, Feb 3, 2026 at 10:21 AM Eugen Block <[email protected]> wrote:

> Hi,
>
> what does 'ceph osd ok-to-stop <OSD_ID>' show? What do the rgw logs
> reveal? What's the 'ceph status' during that time?
>
> On Tue, Feb 3, 2026 at 10:18 AM, Rok Jaklič via ceph-users <
> [email protected]> wrote:
>
>> Hi,
>>
>> we currently have 240 OSDs in the cluster, but when I set the reweight of
>> just one failed OSD to 0 (from 5.45798), radosgw stops working for
>> some reason.
>>
>> If I do "gentle reweight" in steps of 1 for example (from 5.4 to 4.4), it
>> works ok.
>>
>> Any ideas why?
>>
>> Running on Reef, ceph version 18.2.7.
>>
>> Kind regards,
>> Rok
>
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
