> Now I have a task to move them into a clustered Proxmox and use shared
> storage.
Gotcha, I’ve seen that before and suspect that’s what was going on.

> Deploying external Ceph to avoid dependency on a single node in a
> hyperconverged setup; also, we want to use Ceph for other services apart
> from Proxmox, so we think it’s better to go with external Ceph.

Agreed.

> Sorry, I think I didn’t explain my Ceph cluster correctly.
>
> *Total 5 nodes*, where I want to colocate services:
> 3 nodes will have MON, MGR & OSD services colocated
> 2 nodes will be primarily for the OSD service, but if needed we can
> expand them for other services, as we are planning to go with
> similar-spec hardware for all the nodes.

Ah. That isn’t what your original message implied; I wanted to help you
avert disaster.

> What do you think we should do for core and RAM reservation per OSD and
> per service, given that we want to populate OSDs up to full capacity in
> the longer run (24-NVMe chassis)?

Unless I’m missing something, you won’t have reservations as such with a
standalone Ceph cluster. If you mean what you equip the node with, people
have varying rules of thumb. I would suggest 6 vcores/hyperthreads and
8GB of RAM per NVMe OSD.

> Sure, I will explore the Dell R7615 with the 9454, or the 32-core AMD
> EPYC 9334 (2.70GHz, 32C/64T) because of cost

Just be clear about cores vs. threads; it’s super easy to mix them up.
With Ceph we mostly think in terms of vcores aka hyperthreads, which on
most CPUs are 2x per physical core.

> *RAM per node:*
> We are going with 32GB DIMMs, which will allow more capacity increase
> in the future (for now 32*4 = 128GB)

Nice. Back in the depths of time I was tasked with ordering a Sun 4/110.
The minimum orderable RAM was 8MB. The system had 32 slots, so I figured
they would send 8x 1MB modules. Nope. They sent 32x 256KB, filling all
the slots, so expansion would have meant pulling low-density modules
that would not have been useful elsewhere. Today’s modules are a
thousand times larger, but the potential is still there.
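To make that per-OSD rule of thumb concrete, here is a quick
back-of-the-envelope sketch. The 6-vcore/8GB-per-OSD figures are the
suggestion above; the OS overheads in the defaults are my own
illustrative assumptions, not hard requirements:

```python
# Rough per-node sizing for an all-NVMe OSD node, using the rule of
# thumb above: ~6 vcores (hyperthreads) and ~8 GB RAM per NVMe OSD.
# The os_vcores/os_ram_gb overheads are illustrative assumptions.

def node_requirements(num_osds, vcores_per_osd=6, ram_per_osd_gb=8,
                      os_vcores=4, os_ram_gb=16):
    """Return (vcores, ram_gb) suggested for a node with num_osds NVMe OSDs."""
    return (num_osds * vcores_per_osd + os_vcores,
            num_osds * ram_per_osd_gb + os_ram_gb)

# A 24-bay NVMe chassis fully populated:
print(node_requirements(24))  # -> (148, 208)
```

So a fully populated 24-bay node wants on the order of 148 hyperthreads
and 208GB of RAM under these assumptions, which is why the CPU and DIMM
choices below matter for long-run expansion.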
> *OSDs per node:*
> 5x 7.68TB Data Center NVMe Read Intensive AG Drive U.2 with carrier
>
> OR
>
> 10x 3.84TB Data Center NVMe Read Intensive AG Drive U.2 with carrier
>
> Which one is better?

At small scale there’s a certain advantage to having more OSDs, but the
3.84TB SSDs are prone to the same phenomenon as low-density memory
modules, assuming future expansion. With the 7.68TB SSDs your cluster
would have 25 OSDs, right? I suspect that would be okay for your
use-case, so I’d probably prefer the larger drives so you have more
potential for expansion without having to buy servers. You almost
certainly don’t need “mixed use” (MU) models, so the above are fine.

With spinners there is longstanding conventional wisdom that more
spindles are better, and even the practice of short-stroking to limit
long seeks. To a somewhat lesser extent this applies to SAS/SATA SSDs
as well. NVMe SSDs are much less prone to such bottlenecks, especially
at PCIe Gen 4+, so unless you get into 60+TB SKU territory my sense is
that conserving expansion slots is usually the thing to solve for,
unless your initial footprint is *really* tiny, like only 5 OSDs.

> *Networking:*
> Public network: 2x 25G ports as bond0
> Cluster network: 2x 25G ports as bond1
> (Proxmox will also have 2x 25G ports in a bond)

Ah, so you’ll have 4 or 6 physical ports per server? I personally prefer
not to have a cluster network, but if you have the capex to deploy it,
that’s fine.

> On Wed, Jul 9, 2025 at 8:26 PM Alex Gorbachev <a...@iss-integration.com>
> wrote:
>
>> Completely agreeing with what Anthony wrote, and we see very good
>> results with at least 4 physical OSD nodes, managed and deployed by
>> cephadm: you will have 3 MONs and MGRs “hyperconverged” in the
>> cephadm sense, and run 3x replication for OSDs with an extra OSD host
>> for n+1 redundancy.
>>
>> Proxmox just needs a network and keyring to talk to this cluster.
>> You can run deployment and automation functions from a VM in Proxmox
>> that runs on local storage.
>>
>> --
>> Alex Gorbachev
>> https://alextelescope.blogspot.com
>>
>> On Wed, Jul 9, 2025 at 10:28 AM Anthony D'Atri <a...@dreamsnake.net>
>> wrote:
>>
>>>> I am new to this thread and would like to get some suggestions on
>>>> building a new external Ceph cluster
>>>
>>> Why external? Many Proxmox deployments are converged. Is this an
>>> existing Proxmox cluster that currently does not use shared storage?
>>>
>>>> which will be the backend for Proxmox VMs
>>>>
>>>> I am planning to start with 5 nodes (3 MON & 2 OSD)
>>>
>>> This is not the best plan.
>>>
>>> If your data is not disposable you will want to maintain the default
>>> 3 copies, which you cannot safely do on 2 OSD nodes.
>>>
>>> When deploying a very small cluster, solve first for the number of
>>> nodes. You need at least 3 OSD nodes; 4 has advantages.
>>>
>>> So in your case, go converged: OSDs on all 5 nodes, and add the
>>> mon/mgr/etc. ceph orch labels to all 5 so that when a node is down a
>>> replacement may be spun up.
>>>
>>> This would also let you deploy 5 mon instances instead of 3, which
>>> is advantageous in that you can ride out 2 failures without
>>> disruption.
>>>
>>>> and I am expecting to start with ~60+ TB usable space.
>>>
>>> That would mean (3 * 60) / 0.85 = 211.765 ≈ 212 TB of raw capacity;
>>> let’s see how that matches your numbers below.
>>>
>>>> Estimated storage specs calculator:
>>>>
>>>> RAM: 8GB/OSD daemon, 16GB for OS, 4GB for MON & MGR, 16GB for MDS
>>>
>>> I would allot more than 4GB for mon/mgr.
>>>
>>>> CPU: 2 cores/OSD, 2 cores for OS, 2 cores per service
>>>
>>> Cores or hyperthreads? Either way these numbers are low.
>>>
>>>> *Dell R7625, 5 nodes to start with*
>>>
>>> Dramatic overkill for a mon/mgr/MDS node.
>>>
>>>> - RAM: 128GB (plan to increase later as needed)
>>>
>>> I suggest 32GB DIMMs to maximize potential for future expansion.
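A quick sketch of the raw-capacity arithmetic quoted above, generalized.
The 0.85 factor stands in for Ceph's default nearfull ratio; treat it as
a planning factor, not gospel:

```python
# Raw capacity needed to hit a usable-space target with replicated pools.
# headroom=0.85 mirrors Ceph's default nearfull ratio (planning factor).

def raw_needed_tb(usable_tb, replicas=3, headroom=0.85):
    return usable_tb * replicas / headroom

print(round(raw_needed_tb(60), 1))  # -> 211.8, i.e. ~212 TB raw for 60 TB usable
```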
>>>> - CPU: 2x AMD EPYC 9224 2.50GHz, 24C/48T, 64M cache (200W) DDR5-4800
>>>
>>> 96 threads total per server.
>>>
>>>> - Chassis configuration: 24x 2.5" NVMe
>>>
>>> You’ll be tempted to fill those slots; each OSD past, say, 12 will
>>> decrease performance due to having to share the vcores/threads.
>>> With the above CPU choice I would go with the R7615 to save rack
>>> space, or bump up the CPU. The 9224 is the default choice on Dell’s
>>> configurator, but there are lots of others available. The 9454, for
>>> example, would give you enough cores to more comfortably service an
>>> eventual 24 OSDs.
>>>
>>> Alternately, consider the R7615 with, say, the 9654P. The P CPUs
>>> can’t be used in a dual-socket motherboard, so they’re usually a bit
>>> cheaper for the same specs.
>>>
>>> With EPYC CPUs you can get better performance by disabling the IOMMU
>>> on the kernel command line via the GRUB defaults.
>>>
>>>> - 2x 1.92TB Data Center NVMe Read Intensive AG Drive U.2 Gen4 with
>>>> carrier (OS disk, I need extra space)
>>>
>>> Okay, so that will limit you to 22 OSDs with the 24-bay chassis. You
>>> could provision BOSS-N1 for M.2 boot, though.
>>>
>>>> - 5x 7.68TB Data Center NVMe Read Intensive AG Drive U.2 Gen4 with
>>>> carrier, 24Gbps 512e 2.5in Hot-Plug 1DWPD, AG Drive
>>>
>>> I think you have a copy/paste error there; the second line above
>>> sounds like a SAS SSD.
>>>
>>> So from what you wrote above, this would mean a total of 10x 7.68TB
>>> OSD drives. With 3x replication and the default headroom ratios
>>> these will give you about 22 TB of usable space, which is just 20
>>> TiB.
>>>
>>>> - 2x Nvidia ConnectX-6 Lx Dual Port 10/25GbE SFP28, No Crypto, PCIe
>>>> Low Profile
>>>
>>> I suggest bonding them rather than having a separate replication
>>> network. Some people will use one port for public and the other for
>>> replication, but for multiple reasons that wouldn’t be ideal.
>>>
>>>> - 1G for IPMI
>>>>
>>>> Please help me finalize these specs.
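To double-check the ~22 TB / ~20 TiB usable figure above, a small
sketch under the same assumptions (10 OSD drives, 3x replication, 0.85
headroom factor), including the decimal-TB to binary-TiB conversion:

```python
# Usable space from 10x 7.68 TB NVMe OSDs at 3x replication with ~0.85
# headroom (illustrative planning factors from the discussion above).

raw_tb = 10 * 7.68                     # 76.8 TB raw
usable_tb = raw_tb / 3 * 0.85          # ~21.8 TB usable
usable_tib = usable_tb * 1e12 / 2**40  # decimal TB -> binary TiB, ~19.8

print(f"{usable_tb:.1f} TB usable = {usable_tib:.1f} TiB")
```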
>>>> Thanks
>>>> _______________________________________________
>>>> ceph-users mailing list -- ceph-users@ceph.io
>>>> To unsubscribe send an email to ceph-users-le...@ceph.io
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io