> On May 16, 2025, at 7:52 AM, Frédéric Nass <frederic.n...@univ-lorraine.fr> 
> wrote:
> 
> Hi Kasper, 
> 
> ----- Le 16 Mai 25, à 12:56, Kasper Rasmussen <kasper_steenga...@hotmail.com> 
> a écrit : 
> 
>> Thanks Frédéric
> 
>> I found there is a difference in bluestore_min_alloc_size among the OSDs
>> depending on the version they were created under.
> 
>> However, as I understand it, there is no way to change it other than destroying
>> the OSDs and bringing them back in:
> 
>> " This BlueStore attribute takes effect only at OSD creation; if the 
>> attribute
>> is changed later, a specific OSD’s behavior will not change unless and until
>> the OSD is destroyed and redeployed with the appropriate option value(s).
>> Upgrading to a later Ceph release will not change the value used by OSDs that
>> were deployed under older releases or with other settings. "
>> Ref.:
>> https://docs.ceph.com/en/reef/rados/configuration/bluestore-config-ref/#minimum-allocation-size
> 
> True. You'll have to redeploy these OSDs. 
> 
>> I'm not sure that's an option, unless there is a huge gain in doing that 
>> change.

Remember that this is at OSD granularity, and you would realize the space 
reclaim incrementally with each OSD.

Set the norecover and norebalance flags
destroy the OSD
redeploy it
unset the flags

Wait for recovery before proceeding to the next OSD.  You can do multiple OSDs 
within a single failure domain if you can tolerate the recovery traffic.
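
For reference, a minimal sketch of that cycle with the standard CLI (OSD id 12 is
just a placeholder; adapt the redeploy step to your own tooling, e.g. cephadm or
ceph-volume):

    ceph osd set norecover
    ceph osd set norebalance

    # the new value is only picked up at OSD creation; 4K is already the
    # default for OSDs built on recent releases
    ceph config set osd bluestore_min_alloc_size_hdd 4096
    ceph config set osd bluestore_min_alloc_size_ssd 4096

    ceph osd destroy 12 --yes-i-really-mean-it
    # ...rebuild the OSD on the same device with cephadm / ceph-volume...

    ceph osd unset norecover
    ceph osd unset norebalance

    ceph -s    # wait for active+clean before moving on
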

> 
>> The impact of having this discrepancy between the OSDs, as I understand it, is a
>> potential "unusually high ratio of raw to stored data" on the 65K OSDs.
> 
>> In my case the ratio between raw and stored data is approx 3:1.
>> I'd guess that is what to expect when all pools are set up with three
>> replicas.
> 
>> Feel free to correct me if I'm wrong/have misunderstood the DOCS

If you have, that’s my fault, since I wrote the above.

> 
> Well, technically you'll be losing raw space on the DB device, data device, 
> or both devices when not using 4K. How much depends on your workloads. It's 
> not that problematic with RBD workloads with minimal metadata, 4M objects (> 
> 64K), and replicated data placement schemes, but would become an issue with 
> S3/CephFS workloads with small objects and/or erasure coding data placement 
> schemes.

Exactly.  The min_alloc_size change and the code work that was necessary to 
enable it were driven by RGW implementations with very small objects, which can 
result in a lot of waste due to padding:

This sheet

https://docs.google.com/spreadsheets/d/1rpGfScgG-GLoIGMJWDixEkqs-On9w8nAUToPQjN8bDI/edit?usp=sharing

visually shows this dynamic.  I’ve seen RGW bucket data pools regain 20% of 
raw capacity by rebuilding OSDs with a smaller value.  EC amplifies this effect.
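
As a back-of-the-envelope illustration (hypothetical 16 KiB object on a 4+2 EC
pool; real layouts depend on stripe settings):

    16 KiB object, EC 4+2  ->  4 data chunks + 2 coding chunks of 4 KiB each
    64 KiB min_alloc_size:  6 x 64 KiB = 384 KiB raw  (~24x the logical size)
     4 KiB min_alloc_size:  6 x  4 KiB =  24 KiB raw  (the nominal 1.5x overhead)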

If you have a substantial fraction of RGW objects or CephFS files smaller than 
128-256KB then rebuilding the OSDs (one failure domain at a time, please!) may 
net you additional raw capacity.  RBD pools or CephFS / RGW data pools with 
larger files/objects will experience much less space difference.  The EC revamp 
in Tentacle promises to further reduce this padding effect, but that will 
only be for new pools, and like any shiny new feature will be best delayed in 
production until it goes through some shakeout.
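
If you want a rough sense of where you stand, average object size per pool
(STORED divided by OBJECTS) is a crude first pass, and per-bucket stats give a
better picture for RGW; neither shows the full size distribution:

    ceph df detail                                # per-pool STORED and OBJECTS
    radosgw-admin bucket stats --bucket=<name>    # per-bucket size and object count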

>> Since some of your OSDs seem to have been created prior to Pacific, you might
>> want to check their bluestore_min_alloc_size. They should all use a
>> bluestore_min_alloc_size of 4k.
> 

The exception is coarse-IU QLC / pTLC, like the Solidigm P5316 and Micron 6550 
ION, for which you should enable bluestore_use_optimal_io_size_for_min_alloc_size 
before redeploying those OSDs, and ensure you’re on a reasonably recent kernel.
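
To see what a given OSD was actually built with (the docs section linked above
shows the same kind of check), and to flip that switch cluster-wide before
rebuilding (osd.12 is again just a placeholder; the option only takes effect for
newly created OSDs):

    ceph osd metadata osd.12 | grep -E 'rotational|alloc'
    ceph config set osd bluestore_use_optimal_io_size_for_min_alloc_size true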




