> we are planning to expand our ceph cluster (reef) from 8 OSD nodes and 3
> monitors to 12 OSD nodes and 3 monitors.

I strongly recommend running 5 mons. Given how you phrase the above, it seems as though you might mean 3 dedicated mon nodes? If so, I'd still run 5 mons and add the cephadm `mon` host label to two of the OSD nodes.

> Currently each OSD node has a JBOD with 28 6TB HDD disks and we are using
> EC 3+2, since at first there were just 5 OSD nodes.

Situations like that are why I often suggest starting with at least 7 nodes, even if they're half populated; that lets you do 4+2 EC without CRUSH gymnastics.

> No NVMEs are used since we do not have enough of them on one particular node.

Do you mean as OSDs, or to offload WAL+DB? Either way, it isn't an all-or-nothing proposition.

> Bare metal, no containers

I suggest considering migrating to cephadm. The orchestrator works pretty well these days.

> we build binaries for Alma Linux ourselves. We are using RGW S3 frontend
> and have on average around 50MB of upload/write during the day with spikes
> to up to 700MB.
>
> It works well.

Do you have your index pool on SSDs?

How old are those existing OSDs? If they were deployed in the 6 TB spinner era, they may have the old bluestore_min_alloc_size value baked in. Check `ceph osd metadata` and see if any are not 4KiB. If you have, or will have, a significant population of small RGW objects, serially redeploying the old OSDs, especially if they're Filestore, would have advantages.

> New osd nodes will have JBOD with 30 12TB disks. Also no NVMEs for OSDs,
> just for OS.
>
> We will add each new node to the cluster then adding new OSD one by one
> (until one rebalances), then draining old OSDs one by one (until one
> rebalances). Will be a slow process, but we do not want to overload
> anything.

That will work, but it will be inefficient. I suggest instead the process described here:

https://ceph.io/assets/pdfs/events/2024/ceph-days-nyc/Mastering%20Ceph%20Operations%20with%20Upmap.pdf

Basically: disable rebalancing and add all the OSDs; no data moves. Then upmap PGs to where they currently are, and let the balancer incrementally do the work. You can modulate that process in the usual fashion.

> I am thinking of creating a new EC pool with 8+3 (leaving one node for
> "backup") and migrating user per user to the new EC pool.

Glad to see that you're planning for >= k+m+1 failure domains; if you only had 11 hosts total, you would effectively strand a lot of the raw capacity on the new systems.

Through Tentacle you can add a second RGW storage class on the new data pool, use LC policies to migrate the data transparently, and set the default storage class for the given users to the new one. In Tentacle, turn on EC optimizations on the EC pools before moving data, so that new writes will save you the prior zero-padding.

With Umbrella, due out ... sometime next year, it is planned to be able to do such migrations transparently at the RADOS level. AIUI the old pool will need to remain permanently, but it'll be worth it.

I've appended rough, untested command sketches for several of the points above.
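
On the mon count: a minimal sketch assuming a cephadm-managed cluster (which yours is not yet) and placeholder hostnames. Label all five intended mon hosts, including the existing dedicated ones, before applying placement by label:

    # Tag the three existing mon hosts plus two OSD hosts, then deploy mons by label.
    ceph orch host label add mon-01 mon
    ceph orch host label add mon-02 mon
    ceph orch host label add mon-03 mon
    ceph orch host label add osd-09 mon
    ceph orch host label add osd-10 mon
    ceph orch apply mon --placement="label:mon"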
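
On WAL+DB offload: even one or two NVMe devices per node can carry the DB/WAL for a subset of OSDs. A hedged example with made-up device names, using ceph-volume's dry-run report:

    # --report only prints what would be created; drop it to actually build the OSDs.
    ceph-volume lvm batch --report /dev/sd[b-i] --db-devices /dev/nvme0n1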
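
On the index pool: if you do have some SSDs, pinning the index pool to them is a one-rule change. A sketch assuming the default zone's pool names and that the SSDs carry the `ssd` device class; changing the rule will backfill the index PGs:

    ceph osd crush rule create-replicated rgw-index-ssd default host ssd
    ceph osd pool set default.rgw.buckets.index crush_rule rgw-index-ssd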
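
For the min_alloc_size check, something like this (assumes `jq` is available and that your release reports `bluestore_min_alloc_size` in the OSD metadata):

    # Print OSD id and min_alloc_size, keep only those that are not 4096 bytes.
    ceph osd metadata | jq -r '.[] | [.id, .bluestore_min_alloc_size] | @tsv' | awk '$2 != 4096'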
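
The upmap-based expansion from the slides above, roughly; the script is upmap-remapped.py from CERN's ceph-scripts repo, and the misplaced ratio is a placeholder you'd tune:

    ceph osd set norebalance                  # freeze data movement
    # ... add all new hosts and OSDs; PGs become remapped but nothing moves ...
    ./upmap-remapped.py | sh                  # pin remapped PGs back to their current OSDs; may need a couple of passes
    ceph osd unset norebalance
    ceph config set mgr target_max_misplaced_ratio 0.01   # let the balancer move ~1% of PGs at a time
    ceph balancer mode upmap
    ceph balancer on

From there the balancer walks the data onto the new OSDs at whatever rate you allow.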
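
Creating the 8+3 pool would look something like this; the profile name, pool name, and pg_num are placeholders you'd size for your cluster:

    ceph osd erasure-code-profile set ec-8-3 k=8 m=3 crush-failure-domain=host crush-device-class=hdd
    ceph osd pool create default.rgw.buckets.data.ec83 1024 1024 erasure ec-8-3
    ceph osd pool application enable default.rgw.buckets.data.ec83 rgw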
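
And wiring that pool in as a second storage class plus an LC transition, a hedged sketch assuming the default zonegroup/zone and placement id; restart the RGWs after the placement change, and the storage class name, endpoint, and bucket are placeholders:

    radosgw-admin zonegroup placement add --rgw-zonegroup default \
        --placement-id default-placement --storage-class EC83
    radosgw-admin zone placement add --rgw-zone default \
        --placement-id default-placement --storage-class EC83 \
        --data-pool default.rgw.buckets.data.ec83

    # Per-bucket LC rule transitioning objects older than a day to the new class.
    aws --endpoint-url https://rgw.example.com s3api put-bucket-lifecycle-configuration \
        --bucket somebucket --lifecycle-configuration '{
          "Rules": [{"ID": "to-ec83", "Status": "Enabled", "Filter": {"Prefix": ""},
                     "Transitions": [{"Days": 1, "StorageClass": "EC83"}]}]}'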
