If one HDD fails and smartctl shows errors, and I then decide to drain it by
setting its crush reweight to 0, would Ceph try to copy/move data/PGs from
the failed disk anyway?

I ask because we have noticed on some non-Ceph clusters (RAID setups,
actually mail servers) that one failed drive can hang the application/OS:
the application fails to read/write to the failed disk, some queue fills up,
and the application/OS ends up blocked waiting on that I/O.

Could something similar happen in Ceph?
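
If it could, I would expect it to show up in the OSD queues somewhere; this
is a rough sketch of where I would look (osd.12 is again a placeholder, and
the daemon command has to be run on the host carrying that OSD):

    # cluster-wide view of slow/blocked requests
    ceph health detail

    # per-OSD commit/apply latency
    ceph osd perf

    # ops currently sitting in one OSD's queue
    ceph daemon osd.12 dump_ops_in_flight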

Rok

On Wed, Feb 4, 2026 at 9:52 AM Rok Jaklič <[email protected]> wrote:

>
> On Wed, Feb 4, 2026 at 2:59 AM Anthony D'Atri <[email protected]>
> wrote:
>
>> Are these rear bay drives, hence the limit of 2? Or you might consider an
>> M.2 AIC adapter card with bifurcation. M.2 enterprise SSDs are sunsetting,
>> but for retrofits you should be able to find Micron 6450 units.
>>
>> What’s your workload like?
>>
>
> On average 10-50 MB/s of writes, with spikes up to a few hundred MB/s in
> the evening/night; it went up to 1 GB/s during tests without a problem.
> All of these are S3 workloads/tests.
>
> I would have to check that on site; RM does not show it. However, we are
> just about to migrate to new machines, which have 4 NVMe slots ... so I am
> seriously considering moving WAL/DB to NVMe. I am still a bit hesitant,
> though, since I am not sure this will solve the underlying problem of why
> radosgw/S3 stops after some time when we set the crush reweight of one
> failed disk to 0. We do the same thing on an HPC cluster where radosgw/S3
> is not used, and we do not see this problem there. Also, if we move WAL/DB
> to NVMe and one NVMe fails, we would have to recover, say, 10 OSDs instead
> of 1, which would take much longer (with users unable to access S3 in the
> meantime).
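>
> If we do move DB/WAL, the rough plan would be something like the following
> per OSD (just a sketch I have not tested on these machines; the OSD id,
> paths and /dev/nvme0n1p1 are placeholders):
>
>     systemctl stop ceph-osd@12
>     # attach a new DB device to the existing BlueStore OSD ...
>     ceph-bluestore-tool bluefs-bdev-new-db \
>         --path /var/lib/ceph/osd/ceph-12 --dev-target /dev/nvme0n1p1
>     # ... and move the existing RocksDB data over to it
>     ceph-bluestore-tool bluefs-bdev-migrate \
>         --path /var/lib/ceph/osd/ceph-12 \
>         --devs-source /var/lib/ceph/osd/ceph-12/block \
>         --dev-target /var/lib/ceph/osd/ceph-12/block.db
>     systemctl start ceph-osd@12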
>
> ---
>
> My suspicion is that when we set the crush reweight of the failed disk to
> 0, the resulting recovery blocks or slows down writes on the other disks
> in that pool, and some queue fills up, which then stops/hangs radosgw...
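>
> If that is what is happening, I suppose the thing to try before
> reweighting would be to throttle recovery in favour of client ops, roughly
> like this (a sketch; the mclock profile only exists on releases that use
> mClock, the other two are the classic throttles):
>
>     ceph config set osd osd_mclock_profile high_client_ops
>     ceph config set osd osd_max_backfills 1
>     ceph config set osd osd_recovery_max_active 1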
>
> Rok
>
>
