> we are planning to expand our ceph cluster (reef) from 8 OSD nodes and 3
> monitors to 12 OSD nodes and 3 monitors.

I strongly recommend running 5 mons.  Given how you phrase the above, it seems 
as though you might mean 3 dedicated mon nodes?  If so, I’d still run 5 mons; 
add the mon cephadm host label to two of the OSD nodes.
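
If/when you’re on cephadm (more on that below), that’s just host labels plus a 
placement spec; a minimal sketch, with hypothetical hostnames:

    ceph orch host label add osd-node-01 mon
    ceph orch host label add osd-node-02 mon
    ceph orch apply mon label:mon    # place a mon on every host carrying the "mon" label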

> Currently each OSD node has a JBOD
> with 28 6TB HDD disks and we are using EC 3+2, since at first there were
> just 5 OSD nodes.

Situations like that are why I often suggest starting with at least 7 nodes, 
even if they’re half populated; that lets you do 4+2 EC without CRUSH 
gymnastics.

> No NVMEs are used since we do not have enough of them on
> one particular node.

?  Do you mean as OSDs, or to offload WAL+DB?  Either way, it isn’t an 
all-or-nothing proposition.
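
For example, a single NVMe per node can carry WAL+DB for a subset of the 
spinners.  A sketch with plain ceph-volume; device paths are hypothetical:

    # offload WAL+DB for four spinners onto one NVMe; ceph-volume carves the DB LVs
    ceph-volume lvm batch --bluestore /dev/sdc /dev/sdd /dev/sde /dev/sdf \
        --db-devices /dev/nvme0n1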

> Bare metal, no containers

I suggest considering migrating to cephadm.  The orchestrator works pretty well 
these days.
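
The documented adoption path is per-daemon and can be done host by host; 
roughly, with example daemon names:

    ceph mgr module enable cephadm
    ceph orch set backend cephadm
    # then, on each host, after installing the cephadm binary:
    cephadm adopt --style legacy --name mon.mon-01
    cephadm adopt --style legacy --name mgr.mon-01
    cephadm adopt --style legacy --name osd.42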

> we build binaries for Alma
> Linux ourselves. We are using RGW S3 frontend and have on average around
> 50MB of upload/write during the day with spikes to up to 700MB.
> 
> It works well.

Do you have your index pool on SSDs?  How old are those existing OSDs?  If they 
were deployed in the 6 TB spinner era, they may have the old 
bluestore_min_alloc_size value baked in.  Check `ceph osd metadata` to see if 
any are not 4KiB.  If you have, or will have, a significant population of small 
RGW objects, serially redeploying the old OSDs (especially if they’re 
Filestore) would have advantages.
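
A quick way to spot the stragglers, assuming you have jq handy (Filestore OSDs 
don’t report the field at all, so they’ll show up too):

    ceph osd metadata | jq -r '.[] | select(.bluestore_min_alloc_size != "4096") | .id'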

> New osd nodes will have JBOD with 30 12TB disks. Also no NVMEs for
> OSDs, just for OS.
> 
> We will add each new node to the cluster then adding new OSD one by one
> (until one rebalances), then draining old OSDs one by one (until one
> rebalances). Will be a slow process, but we do not want to overload
> anything.

That will work, but it will be inefficient.  I suggest instead the process 
described here: 
https://ceph.io/assets/pdfs/events/2024/ceph-days-nyc/Mastering%20Ceph%20Operations%20with%20Upmap.pdf

Basically: disable rebalancing, add all of the new OSDs (no data moves), upmap 
the PGs back to where they currently sit, and let the balancer incrementally do 
the work.  You can throttle that process in the usual fashion.
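
A condensed sketch of that workflow, leaning on the upmap-remapped.py script 
covered in the deck above (eyeball its output before piping it to a shell):

    ceph osd set norebalance          # new OSDs come up, but PGs stay put
    ceph balancer off
    # ... deploy all of the new OSDs ...
    ./upmap-remapped.py | sh          # pin remapped PGs back onto their current OSDs
    ./upmap-remapped.py | sh          # run again if any PGs are still remapped
    ceph osd unset norebalance
    ceph config set mgr target_max_misplaced_ratio 0.05   # cap how much moves at once
    ceph balancer mode upmap
    ceph balancer on                  # the balancer peels the upmaps off incrementally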


> I am thinking of creating a new EC pool with 8+3 (leaving one node for
> "backup") and migrating user per user to the new EC pool.

Glad to see that you’re planning for >= k+m+1 failure domains; if you only had 
11 hosts total, you would effectively strand a lot of the raw capacity on the 
new systems.
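
Creating the pool itself is the easy part; a sketch with hypothetical names and 
PG counts, adjust to taste:

    ceph osd erasure-code-profile set ec-8-3 k=8 m=3 crush-failure-domain=host
    ceph osd pool create default.rgw.buckets.data.ec83 256 256 erasure ec-8-3
    ceph osd pool application enable default.rgw.buckets.data.ec83 rgw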

Up through Tentacle, you can add a second RGW storage class on the new data 
pool, use LC policies to migrate the data transparently, and set the default 
storage class for given users to the new one.  In Tentacle, turn on the EC 
optimizations on the EC pools before moving data, so that new writes avoid the 
zero-padding of prior releases.
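
The plumbing for that looks roughly like the below.  Zonegroup, zone, and 
placement names assume a stock single-site setup, and the storage class, pool, 
bucket, and endpoint names are placeholders; the lifecycle rule is ordinary S3 
LC, shown here via the AWS CLI:

    radosgw-admin zonegroup placement add --rgw-zonegroup default \
        --placement-id default-placement --storage-class EC83
    radosgw-admin zone placement add --rgw-zone default \
        --placement-id default-placement --storage-class EC83 \
        --data-pool default.rgw.buckets.data.ec83
    radosgw-admin period update --commit   # or just restart the RGWs if you have no realm

    # transition a bucket's objects to the new storage class after one day
    aws --endpoint-url https://rgw.example.com s3api put-bucket-lifecycle-configuration \
        --bucket somebucket --lifecycle-configuration '{"Rules":[{"ID":"to-ec83",
        "Status":"Enabled","Filter":{"Prefix":""},
        "Transitions":[{"Days":1,"StorageClass":"EC83"}]}]}'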

With Umbrella, due out … sometime next year, the plan is for such migrations to 
happen transparently at the RADOS level.  AIUI the old pool will need to remain 
permanently, but it’ll be worth it.




