> On May 16, 2025, at 7:52 AM, Frédéric Nass <frederic.n...@univ-lorraine.fr> wrote:
>
> Hi Kasper,
>
> ----- On May 16, 2025, at 12:56, Kasper Rasmussen <kasper_steenga...@hotmail.com> wrote:
>
>> Thanks Frédéric
>>
>> I found there is a difference in bluestore_min_alloc_size among the OSDs
>> depending on the version they were created on.
>>
>> However, as I understand it, there is no way to change it other than to
>> destroy the OSDs and bring them back in:
>>
>> "This BlueStore attribute takes effect only at OSD creation; if the attribute
>> is changed later, a specific OSD's behavior will not change unless and until
>> the OSD is destroyed and redeployed with the appropriate option value(s).
>> Upgrading to a later Ceph release will not change the value used by OSDs that
>> were deployed under older releases or with other settings."
>> Ref.: https://docs.ceph.com/en/reef/rados/configuration/bluestore-config-ref/#minimum-allocation-size
>
> True. You'll have to redeploy these OSDs.
>
>> I'm not sure that's an option, unless there is a huge gain in doing that
>> change.
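If you want to see exactly which OSDs were built with which value, the OSD
metadata should tell you. A rough sketch, assuming your release reports
bluestore_min_alloc_size in the OSD metadata (recent releases do; if yours
doesn't, the grep simply comes back empty):

    # Value a single OSD was actually created with
    ceph osd metadata 0 | grep bluestore_min_alloc_size

    # Quick survey across the whole cluster
    for id in $(ceph osd ls); do
      printf 'osd.%s ' "$id"
      ceph osd metadata "$id" | grep bluestore_min_alloc_size
    done

65536 is the old 64K HDD default; 4096 is what current releases use.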
Remember that this is at OSD granularity, and you would realize the space
reclaim incrementally with each OSD: set the norecover and norebalance flags,
destroy, redeploy, unset, and wait for recovery before proceeding to the next
OSD (a rough command sketch is at the end of this message). You can do
multiple OSDs within a single failure domain if you can tolerate the recovery
traffic.

>> The impact of having this discrepancy between the OSDs, as I understand it,
>> is a potential "unusually high ratio of raw to stored data" on the 65K OSDs.
>>
>> In my case the ratio between raw and stored data is approx 3:1.
>> I'd guess that is what to expect when all pools are set up with three
>> replicas.
>>
>> Feel free to correct me if I'm wrong/have misunderstood the DOCS

If you have, that's my fault, since I wrote the above.

> Well, technically you'll be losing raw space on the DB device, data device,
> or both devices when not using 4K. How much depends on your workloads. It's
> not that problematic with RBD workloads with minimal metadata, 4M objects
> (> 64K), and replicated data placement schemes, but would become an issue
> with S3/CephFS workloads with small objects and/or erasure coding data
> placement schemes.

Exactly. The min_alloc_size change and the code work that was necessary to
enable it were driven by RGW deployments with very small objects, which can
result in a lot of waste due to padding. This sheet shows the dynamic
visually:
https://docs.google.com/spreadsheets/d/1rpGfScgG-GLoIGMJWDixEkqs-On9w8nAUToPQjN8bDI/edit?usp=sharing

I've seen RGW bucket data pools regain 20% of raw capacity by rebuilding OSDs
with a smaller value. EC amplifies the effect: with a 64K min_alloc_size, a
10 KB object in a 4+2 EC pool is split into six shards that each occupy at
least 64K on disk, so roughly 384K of raw space holds 10 KB of data, versus
on the order of 24K with a 4K min_alloc_size. If you have a substantial
fraction of RGW objects or CephFS files smaller than 128-256 KB, rebuilding
the OSDs (one failure domain at a time, please!) may net you additional raw
capacity. RBD pools, or CephFS / RGW data pools with larger files and
objects, will see much less of a difference.

The EC revamp in Tentacle promises to further reduce this padding overhead,
but that will only apply to new pools, and like any shiny new feature it is
best kept out of production until it has seen some shakeout.

>> Since some of your OSDs seem to have been created prior to Pacific, you
>> might want to check their bluestore_min_alloc_size. They should all use a
>> bluestore_min_alloc_size of 4k.

The exception is coarse-IU QLC / pTLC, like the Solidigm P5316 and Micron
6550 ION, for which you should enable
bluestore_use_optimal_io_size_for_min_alloc_size before redeploying those
OSDs, and ensure you're on a reasonably recent kernel. An example of setting
that follows below.
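To be concrete about that last point, here is a minimal sketch of enabling
the option ahead of redeployment. The osd/class:nvme mask and osd.7 are
placeholders for illustration only; scope the command to whatever device
class or specific OSDs actually sit on the coarse-IU media:

    # New OSDs created after this is set will derive min_alloc_size from the
    # drive's reported optimal IO size (e.g. the QLC indirection unit).
    # Existing OSDs keep whatever they were built with until redeployed.
    ceph config set osd/class:nvme bluestore_use_optimal_io_size_for_min_alloc_size true

    # Confirm what a given OSD will pick up
    ceph config get osd.7 bluestore_use_optimal_io_size_for_min_alloc_size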
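And for completeness, the per-OSD redeploy loop mentioned at the top, as a
rough sketch only. It assumes cephadm / ceph orch manages the OSDs and that
an OSD service spec will recreate the OSD once the device is zapped; osd 7 is
again a placeholder. Adapt to your own tooling:

    ceph osd set norecover
    ceph osd set norebalance

    # Take out one OSD (or one failure domain's worth, if you can tolerate
    # the recovery traffic). --replace keeps the OSD ID reserved and --zap
    # wipes the device so the service spec can recreate it, at which point
    # the new OSD picks up the current 4K default.
    ceph orch osd rm 7 --replace --zap

    ceph osd unset norecover
    ceph osd unset norebalance

    # Wait until all PGs are active+clean before moving on to the next OSD.
    ceph -s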