On Sat, Dec 27, 2025 at 10:22 PM Anthony D'Atri <[email protected]>
wrote:
>
>
> > we are planning to expand our ceph cluster (reef) from 8 OSD nodes and 3
> > monitors to 12 OSD nodes and 3 monitors.
>
> I strongly recommend running 5 mons. Given how you phrase the above it
> seems as though you might mean 3 dedicated mon nodes? If so I’d still run
> 5 mons, add the mon cephadm host label to two of the OSD nodes.
>
Right now, only one mon is "active" and used all of the time. Would you
still run 5 mons anyway?
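For reference, if we do end up on cephadm as you suggest, I assume the
label-based placement would look roughly like this (hostnames here are
placeholders, and the three existing mon hosts would also need the label):
# label two of the OSD hosts as mon hosts
ceph orch host label add osd-node-09 mon
ceph orch host label add osd-node-10 mon
# deploy a mon on every host carrying the label
ceph orch apply mon --placement="label:mon"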
>
> > Currently each OSD node has a JBOD
> > with 28 6TB HDD disks and we are using EC 3+2, since at first there were
> > just 5 OSD nodes.
>
> Situations like that are why I often suggest starting with at least 7
> nodes, even if they’re half populated; that lets you do 4+2 EC without
> CRUSH gymnastics.
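For reference, I assume a 4+2 profile with a host failure domain would be
set up roughly like this (profile and pool names are placeholders; as far as
I know an existing 3+2 pool cannot be converted in place, so this would mean
a new pool plus data migration):
ceph osd erasure-code-profile set ec-4-2 k=4 m=2 crush-failure-domain=host
ceph osd pool create newpool erasure ec-4-2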
>
> > No NVMEs are used since we do not have enough of them on
> > one particular node.
>
> ? Do you mean as OSDs, or to offload WAL+DB? Either way, it isn’t an
> all-or-nothing proposition.
NVMe is used only for the OS. I was considering offloading WAL+DB to NVMe,
but that would mean 3-5 NVMe devices per OSD/JBOD node, and we do not have
a configuration like that right now.
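If we do get NVMe for that later, my understanding is that a batch
deployment with external DB devices would look roughly like this (device
names are just examples):
# dry run: show how ceph-volume would carve the NVMe into DB LVs
# (WAL lives on the DB device unless separate --wal-devices are given)
ceph-volume lvm batch --bluestore /dev/sdb /dev/sdc /dev/sdd --db-devices /dev/nvme0n1 --report
# the same command without --report actually creates the OSDs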
>
>
> > Bare metal, no containers
>
> I suggest considering migrating to cephadm. The orchestrator works pretty
> well these days.
>
> > we build binaries for Alma
> > Linux ourselves. We are using the RGW S3 frontend and have on average
> > around 50 MB of upload/write during the day with spikes of up to 700 MB.
> >
> > It works well.
>
> Do you have your index pool on SSDs? How old are those existing OSDs? If
> they were deployed in the 6 TB spinner era, they may have the old
> bluestore_min_alloc_size value baked in. Check `ceph osd metadata` and see
> if any are not 4KiB. If you have or will have a significant population of
> small RGW objects, serially redeploying the old OSDs — especially if
> they’re Filestore — would have advantages.
>
The index pool is on HDD. The HDDs (mostly Seagate) are actually quite old,
from 2016, and lately we have been seeing a higher failure rate; however, we
have some spares on the shelf.
I checked bluestore_min_alloc_size and indeed there is a 4K value set for
all of them, even on the new ones.
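In case it is useful to anyone else, one way to survey that across all OSDs
(a plain loop over the ceph CLI; assumes an admin keyring):
for id in $(ceph osd ls); do
  echo -n "osd.$id "
  ceph osd metadata $id | grep bluestore_min_alloc_size
done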
We are replacing failed disks while keeping the same OSD id, with something like:
# destroy failed osd
systemctl stop ceph-osd@{id}
ceph osd down osd.{id}
ceph osd destroy {id} --yes-i-really-mean-it
# prepare new one on same osd id
ceph-volume lvm prepare --osd-id {id} --data /dev/sdX
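# look up the new OSD's fsid, needed for the activate step below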
ceph-volume lvm list
ceph-volume lvm activate {id} {osd fsid}
Is this "a problem"?
>
> > New OSD nodes will have a JBOD with 30 12TB disks. Also no NVMe devices
> > for OSDs, just for the OS.
> >
> > We will add each new node to the cluster, then add new OSDs one by one
> > (waiting until each one rebalances), then drain old OSDs one by one
> > (waiting until each one rebalances). It will be a slow process, but we do
> > not want to overload anything.
>
> That will work, but it will be inefficient. I suggest instead the process
> described here:
> https://ceph.io/assets/pdfs/events/2024/ceph-days-nyc/Mastering%20Ceph%20Operations%20with%20Upmap.pdf
>
> Basically, disable rebalancing and add all the OSDs. No data moves. Then
> upmap PGs to where they currently are, and let the balancer incrementally
> do the work. You can modulate that process in the usual fashion.
>
Thanks, will take a look at this.
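From a quick skim of the slides, the rough shape seems to be the following
(a sketch only; the "pin PGs" step is done with a script along the lines of
the one described there):
ceph balancer off
ceph osd set norebalance
# add all the new OSDs; PGs get remapped in the OSD map but no data moves yet
# insert pg-upmap-items entries pinning each remapped PG to its current OSDs
ceph osd unset norebalance
ceph balancer mode upmap
ceph balancer on
# pace can be tuned, e.g.: ceph config set mgr target_max_misplaced_ratio 0.05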
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]