>> Not sure where you get your number from (10 DB devices max per NVMe),
>> but the docs [0] state to not have more than 15 OSDs per NVMe:

Huh, I thought it said 10. That was the rule of thumb 10 years ago when NVMe NAND SSDs were on the order of the P3600 and PCIe was what, gen 2 or 3? These days 15 probably is fine with an appropriate SSD model and PCIe gen 4+. These guidelines are there because we’re expected to provide some, but in the end there are variables including the workload.

>> But you're correct about the SPOF, if one NVMe dies, all OSDs that have
>> their DB/WAL on that NVMe die as well.

In a sufficiently large cluster that isn’t a big deal. Even so, any cluster should be able to survive the loss of an entire host, so is this blast radius any scarier? Some admins choose to mirror offload SSDs because of this FUD, but consider:

* Ceph is already handling distributed redundancy. Does it make sense to double the space amp for WAL + DB?
* If HDDs are used for the perception † that they cost less, consider that doubling them up costs more, narrowing the capex gap.

My sense is that most admins elect to distribute OSDs across unmirrored offload SSDs. Costs less and burns less endurance.

† With thoughtful sourcing, model selection appropriate to the workload, and TCO analysis you might be surprised.

>> [0]
>> https://docs.ceph.com/en/latest/start/hardware-recommendations/#minimum-hardware-recommendations
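FWIW, on a cephadm-managed cluster that unmirrored-offload layout is usually expressed with an OSD drive group spec along these lines. This is a sketch, not a recommendation: the service id, host pattern, and slot count are made up, so adjust to your hardware and dry-run first.

# Drive group sketch: HDDs carry the data, each NVMe hosts the RocksDB
# DB for up to 14 OSDs (the WAL colocates with the DB when no separate
# wal_devices are specified).
cat > osd-spec.yaml <<'EOF'
service_type: osd
service_id: hdd_with_nvme_db
placement:
  host_pattern: 'osd-*'   # hypothetical host naming
spec:
  data_devices:
    rotational: 1         # HDDs as data devices
  db_devices:
    rotational: 0         # NVMe for DB/WAL offload
  db_slots: 14            # OSDs sharing each NVMe DB device
EOF

ceph orch apply -i osd-spec.yaml --dry-run   # preview which disks would be consumed
ceph orch apply -i osd-spec.yaml

On non-cephadm deployments, ceph-volume lvm batch with --db-devices does roughly the same thing.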
>> On Tue, Feb 3, 2026 at 15:00, Rok Jaklič via ceph-users
>> <[email protected]> wrote:
>>
>>> We have 28 OSDs per host and we can only have 2 NVMe per host (one being
>>> used for OS).

Are these rear bay drives, hence the limit of 2? Or you might consider an M.2 AIC adapter card with bifurcation. M.2 enterprise SSDs are sunsetting but for retrofits you should be able to find Micron 6450 units. What’s your workload like?

>>> .. and if I remember correctly there is max 10 OSDs/NVMe recommended,
>>> that's why we decided to go just for HDD based clusters at the beginning.
>>>
>>> We have 2 clusters this way, one being for HPC (no radosgw/s3) and other
>>> for "users" (radosgw/s3), running over 4 years now ... works ok,
>>> performance is ok, just we have this problem where we have to do gentle
>>> reweight of failed OSDs.

With modern releases the upmap / balancer strategy is better; you might be using the name of the old script to describe it.
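Something along these lines replaces the old gentle-reweight loop. A sketch, assuming a Quincy-or-later cluster with upmap-capable clients; osd.12 is a made-up example:

ceph osd set-require-min-compat-client luminous   # upmap needs luminous or newer clients
ceph balancer mode upmap
ceph balancer on
ceph balancer status

# If the HDDs struggle during backfill, bias the mclock scheduler toward
# client I/O (check the profile names for your release):
ceph config set osd osd_mclock_profile high_client_ops

# Then take the failed OSD out and let backfill handle the data movement:
ceph osd out 12

If you want finer control over when the data actually moves, tools like the upmap-remapped script can stage the backfill, but that goes beyond a quick sketch.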
>>> Thanks for the info, we will consider NVMe ... although then there is
>>> SPOF for those OSDs which have DB on NVMe?

Ceph can handle it. Enterprise SSDs have lower AFR than spinners. Keep the firmware updated.

>>> Rok
>>>
>>> On Tue, Feb 3, 2026 at 2:35 PM Robert Sander via ceph-users
>>> <[email protected]> wrote:
>>>
>>>> On 03.02.26 at 2:31 PM, Rok Jaklič wrote:
>>>>> On Tue, Feb 3, 2026 at 2:26 PM Robert Sander via ceph-users
>>>>> <[email protected]> wrote:
>>>>>
>>>>>> On 03.02.26 at 2:18 PM, Rok Jaklič via ceph-users wrote:
>>>>>>
>>>>>>> 2 OSD(s) experiencing slow operations in BlueStore
>>>>>>> 2 OSD(s) experiencing stalled read in db device of BlueFS

I see this when updating firmware live. The default lifetime of the warning is 24 hours, so usually this is far more ephemeral than one might think and less of a big deal.
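For reference, you can confirm that on your own cluster. The option names below are from memory on recent releases, so verify them with ceph config help before relying on them:

ceph health detail

# Assumed option names, verify first:
ceph config help bluestore_slow_ops_warn_lifetime
ceph config get osd bluestore_slow_ops_warn_lifetime    # default 86400 s, i.e. 24 h
ceph config get osd bluestore_slow_ops_warn_threshold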
>>>>>>
>>>>>> Are your OSDs HDD only?
>>>>>
>>>>> Yes.
>>>>>
>>>>> Does not affect users much. Usually those messages appear when we are
>>>>> reweighting and changing failed disks.
>>>>
>>>> These HDDs will be maxed out with the recovery work and cannot serve
>>>> anything else any more.
>>>>
>>>> I have seen HDD only clusters going into the "spiral of death" because
>>>> the HDDs cannot answer fast enough. OSDs randomly dropping out making
>>>> the whole system unstable.
>>>>
>>>> The RocksDB is such a random IO application that it is not suitable for
>>>> HDDs. It should always be put on flash storage (SSD/NVMe).
>>>>
>>>> Regards
>>>> --
>>>> Robert Sander
>>>> Linux Consultant
>>>>
>>>> Heinlein Consulting GmbH
>>>> Schwedter Str. 8/9b, 10119 Berlin
>>>>
>>>> https://www.heinlein-support.de
>>>>
>>>> Tel: +49 30 405051 - 0
>>>> Fax: +49 30 405051 - 19
>>>>
>>>> Amtsgericht Berlin-Charlottenburg - HRB 220009 B
>>>> Geschäftsführer: Peer Heinlein - Sitz: Berlin
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]