When you reweight the drive down to 0, does the behavior change if you set it 'out' first? The only other special thing that happens with a reweight of 0 is that upmaps get removed for said OSD.

If you think it is at the pool level, you could run a simple rados bench against a pool while you reweight an OSD and see if your issue still holds. It could also be some issue with osd_op_queue if you have it set to mclock; it might be worth A/B testing with one OSD to see if you still hit it. A rough sketch of both tests is below.
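Something along these lines (osd.12 and the pool name "testbench" are just placeholders, adjust for your cluster; and as far as I recall the osd_op_queue change only takes effect after restarting that OSD):

    # A: mark the OSD out first and watch behavior; B: go straight to crush reweight 0
    ceph osd out osd.12
    ceph osd crush reweight osd.12 0

    # generate client-style load against the pool while the OSD drains
    rados bench -p testbench 60 write --no-cleanup
    rados bench -p testbench 60 rand

    # check which op queue the OSD is using, then flip just that one to wpq for the A/B test
    ceph config show osd.12 osd_op_queue
    ceph config set osd.12 osd_op_queue wpq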
On Wed, Feb 4, 2026 at 12:31 PM Rok Jaklič via ceph-users <[email protected]> wrote:

> If one HDD drive fails and smartctl shows errors and I then decide to drain
> it with crush reweight 0, would Ceph try to copy/move data/pgs from the
> failed disk anyway?
>
> Because we noticed on some non-Ceph clusters (RAID setups, actually mail
> servers) that one failed drive may hog the "app/OS", because the "app" fails
> to read/write to the failed disk and "some queue" fills up (since the app/OS
> is unable to read/write data).
>
> Could something similar happen in Ceph?
>
> Rok
>
> On Wed, Feb 4, 2026 at 9:52 AM Rok Jaklič <[email protected]> wrote:
>
> >
> > On Wed, Feb 4, 2026 at 2:59 AM Anthony D'Atri <[email protected]>
> > wrote:
> >
> >> Are these rear bay drives, hence the limit of 2? Or you might consider
> >> an M.2 AIC adapter card with bifurcation. M.2 enterprise SSDs are
> >> sunsetting, but for retrofits you should be able to find Micron 6450
> >> units.
> >>
> >> What’s your workload like?
> >>
> >
> > On average 10-50 MB/s of writes, with spikes up to a few hundred MB/s
> > during evening/night time; it went up to 1 GB/s during tests without a
> > problem. All of these are S3 workloads/tests.
> >
> > I would have to check that on site, RM does not show it, however we are
> > just about to migrate to new machines, which have 4 NVMe slots ... so I
> > am really considering moving WAL/DB to NVMe. However, I am still a little
> > bit hesitant, since I am not really sure this will solve the problem of
> > why radosgw/s3 stops after some time when setting crush reweight to 0 on
> > one failed disk. We are doing the same thing on HPC, where radosgw/s3 is
> > not used, and we are not experiencing this problem there. If we move
> > WAL/DB to NVMe, and one NVMe fails and we have to recover 10 OSDs for
> > example, it would take much longer than if just 1 OSD has to be recovered
> > (while users are unable to access s3).
> >
> > ---
> >
> > My suspicion is that when we set the crush reweight of the failed disk to
> > 0, all the other affected disks in that pool disable some writes (because
> > of recovery) and some queue fills up, which then stops/hangs radosgw...
> >
> > Rok
> >
>
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
