On 19/05/2026 14:52, Anthony D'Atri wrote:

We have 4 potential variants for OSD nodes.

Variant A1
~100 HDD JBOD + x86_64 server with block.db NVMe


Variant A2
~100 HDD JBOD + 2x x86_64 servers with block.db NVMe
JBOD split in half, each node gets 50 HDDs


Variant B1
~60 HDD JBOD + x86_64 server with block.db NVMe


Variant B2
~60 HDD JBOD + 2x x86_64 servers with block.db NVMe
JBOD split in half, each node gets 30 HDDs

How big of a cluster are you planning? Nodes that dense can present various 
problems, such that of your variants I would reluctantly pick B2.

At least 16 or 32 nodes in 16 racks.
Erasure coding 8+3 with failure domain on rack level.

We initially selected 8+3 over 4+2 because we expect rebuilds to take a very long time with nodes this big, and we don't want to lose redundancy.
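
For reference, here is a rough sketch of the arithmetic behind the two profiles. It only uses the k+m values and the 16 racks mentioned in this thread; the class and function names are made up for illustration, and it says nothing about rebuild times, which depend on the hardware and workload.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ECProfile:
    """Illustrative erasure-coding profile: k data chunks + m coding chunks."""
    k: int
    m: int

    @property
    def usable_fraction(self) -> float:
        # Share of raw capacity that holds user data.
        return self.k / (self.k + self.m)

    @property
    def min_failure_domains(self) -> int:
        # One chunk per failure domain, so k+m domains (racks here) are needed.
        return self.k + self.m


def spare_domains(profile: ECProfile, racks: int) -> int:
    """Racks left over beyond the minimum needed to place every chunk."""
    return racks - profile.min_failure_domains


RACKS = 16  # from the thread

for profile in (ECProfile(8, 3), ECProfile(4, 2)):
    print(
        f"{profile.k}+{profile.m}: "
        f"usable {profile.usable_fraction:.1%}, "
        f"survives loss of {profile.m} racks, "
        f"needs {profile.min_failure_domains} racks, "
        f"{spare_domains(profile, RACKS)} spare of {RACKS}"
    )
```

With a rack-level failure domain, 8+3 needs at least 11 racks, so 16 racks leave some room to recover a failed rack's chunks elsewhere.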



Splitting the JBOD logically between 2 servers isn't an issue for us because we will 
replicate data at the rack level, not the host level.


Common specifications for all variants

5-6 GB of RAM per HDD

Plus more for mons and other daemons? Especially MDS?

Other daemons will be on some dedicated non-storage servers.
We aim for low RAM/HDD on storage nodes. Other daemons won't fit there.


2% of HDD capacity in NVMe devices for block.db (or none)
2x 50Gb or 2x 100Gb Ethernet per server (active-backup bonded interfaces)
(CPU per OSD to be determined)
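
To put numbers on the specifications above, here is a small sizing sketch per variant. The HDD counts, the 5-6 GB RAM figure and the 2% block.db figure come from this thread; the 20 TB drive size is purely an assumption for illustration (no drive capacity was stated), and the function name is hypothetical.

```python
# ASSUMPTION: drive capacity is not given in the thread; 20 TB is a placeholder.
HDD_TB = 20

# HDDs per server for each variant, as described in the thread.
VARIANTS = {"A1": 100, "A2": 50, "B1": 60, "B2": 30}


def node_sizing(hdds, hdd_tb=HDD_TB, ram_gb_per_hdd=(5, 6), db_fraction=0.02):
    """Return RAM range (GB) and block.db NVMe capacity for one server."""
    ram_low = hdds * ram_gb_per_hdd[0]
    ram_high = hdds * ram_gb_per_hdd[1]
    return {
        "ram_gb": (ram_low, ram_high),
        # Total NVMe needed for block.db on this server, in TB.
        "block_db_tb": hdds * hdd_tb * db_fraction,
        # block.db slice per OSD, in GB (1 TB = 1000 GB).
        "block_db_gb_per_osd": hdd_tb * 1000 * db_fraction,
    }


for name, hdds in VARIANTS.items():
    s = node_sizing(hdds)
    print(
        f"{name}: {hdds} HDDs, RAM {s['ram_gb'][0]}-{s['ram_gb'][1]} GB, "
        f"block.db {s['block_db_tb']:.1f} TB total "
        f"({s['block_db_gb_per_osd']:.0f} GB per OSD)"
    )
```

Changing `HDD_TB` to the real drive size is the only edit needed to reuse it.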


Variant A1 is very unlikely to happen, but we are curious what network interface 
speeds you would suggest for so many HDDs in one node.

100GE bonded at the least.  Depends on your workload.


Variant A2 is most likely the one we will choose for large deployments.

Variant B1/B2 for smaller deployments.

Does any of you run Ceph on similar setups? Did you find any pitfalls with it?

What are your minimum recommendations for network speed per HDD, CPU per HDD, 
etc.?

In our experience, most of our servers, even in large clusters, never max out 
their network interfaces or CPUs. We almost never rebuild or rebalance whole 
servers. The 27-HDD nodes of our biggest CephFS cluster with EC usually see only 
2-3 Gbps of network traffic.
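
As a back-of-the-envelope check, the observed 2-3 Gbps on 27-HDD nodes can be scaled linearly to the denser variants and compared with a single link of the bond (in active-backup mode only one interface carries traffic at a time). The linear scaling is an assumption, and so is the 250 MB/s sequential figure used for the worst-case burst; neither was measured on the proposed hardware, and the function names are illustrative.

```python
# ASSUMPTION: per-drive sequential throughput, only used for the burst estimate.
HDD_SEQ_MB_PER_S = 250


def projected_node_gbps(hdds, observed_node_gbps=(2.0, 3.0), observed_hdds=27):
    """Scale the observed per-node traffic range linearly with HDD count."""
    return tuple(hdds * g / observed_hdds for g in observed_node_gbps)


def burst_gbps(hdds, mb_per_s=HDD_SEQ_MB_PER_S):
    """Traffic if every HDD streamed sequentially at once (MB/s -> Gbps)."""
    return hdds * mb_per_s * 8 / 1000


def utilisation(node_gbps, link_gbps):
    """Fraction of one active link consumed by the given traffic."""
    return node_gbps / link_gbps


for hdds in (30, 50, 60, 100):
    low, high = projected_node_gbps(hdds)
    for link in (50, 100):
        print(
            f"{hdds} HDDs on {link} Gb link: typical {low:.1f}-{high:.1f} Gbps "
            f"({utilisation(high, link):.0%} of link at the high end), "
            f"all-drive burst {burst_gbps(hdds):.0f} Gbps"
        )
```

The typical figures stay well under one link, while the all-drive burst does not, which is where recovery and backfill traffic would matter.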

Your workload is archival?


Yes, mostly archival.
We have big demand for S3 and CephFS.
But we may move to a pure S3 cluster in the future.
