[ceph-users] Re: Ceph deployment for HPC and research workflows

Anthony D'Atri via ceph-users Sat, 09 May 2026 13:29:51 -0700


> On May 6, 2026, at 9:55 AM, Max Breitmeyer via ceph-users 
> <[email protected]> wrote:
> 
> Hello all,
> At UMBC we were given a grant to run a ceph cluster as our primary research
> storage for an HPC facility. The ceph cluster consists of the following
> hardware:
> “Mon” nodes x3:
> • 2x 25Gb interfaces
> • 192GB RAM
> • Storage:
> -- 0 (none) OSDs/drives


Why nodes dedicated only to mons? I suggest considering a deployment where mons 
are colocated on OSD or MDS nodes. To keep that manageable you can abstract 
client config through DNS, constrain placement so that the orchestrator is 
failure-domain aware when scheduling mons, or possibly even just specify all of 
the mons on the client mounts for resilience.  That last option for sure 
doesn't scale well and is prone to not keeping up with changes. Clients need to 
be able to contact at least one mon to download the current monmap.


> “HDD” nodes x16:
> • 2x 25Gb interfaces
> • 192GB RAM
> • Storage:
> -- 12 OSDs w/ 20TB HDDs
> -- 2 local Write-Access-Logs (journal) w/ 8TB NVMe drives

WAL+DB?

> “NVMe” nodes x3:
> • 2x 100Gb interfaces
> • 384GB RAM
> • Storage:
> -- 16x OSDs w/ 8TB NVMe drives

These could easily have colocated mons, sparing the dedicated mon nodes.

> “MDS” nodes x3:
> • 2x 100Gb interfaces
> • 192GB RAM
> • Storage:
> -- 8x OSDs w/ 1TB NVMe drives

A few thoughts:

* I recommend 5 mon daemons vs only 3

* Since the MDS node spec above mentions 1TB (960 GB) SSDs and the NVMe node 
spec mentions 8 TB (7.6 TB) SSDs, I might infer that you constrain -- or 
intended to constrain -- the CephFS metadata pool to only these 24x OSDs?  And 
that the 48x 7.6 TB SSDs in the NVMe nodes host OSDs used for a separate CephFS 
filesystem or for specific files/directories via layout xattrs?
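
To put a number on the 5-vs-3 mon suggestion above: mons need a strict 
majority to form quorum, so the count determines how many can be down at once. 
A minimal sketch of that arithmetic (the function name is just for 
illustration, not anything from Ceph itself):

```python
def tolerated_mon_failures(mon_count: int) -> int:
    """Mons that can be down while a strict majority (quorum) still remains."""
    quorum = mon_count // 2 + 1  # smallest strict majority
    return mon_count - quorum


for n in (3, 4, 5):
    print(f"{n} mons: quorum needs {n // 2 + 1}, "
          f"tolerates {tolerated_mon_failures(n)} down")
```

With 5 mons you can have one down for maintenance and still survive an 
unplanned failure of another; with 3 you cannot. An even count buys nothing 
over the odd count below it.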

I might suggest a rearchitecture, which could be done incrementally without 
downtime:

* Colocate 5 mon daemons on other nodes and either use these 3 systems for 
something else or locate some RGWs (if in use) there.  If those nodes were 
ordered as, or could be made, NVMe-capable, you might repurpose them as 
additional OSD nodes. If they have SATA bays, you could deploy HDD OSDs on them 
too.

* It's best for OSDs within a given device class not to be of very different 
sizes, but Ceph can handle it.  Setting mgr/balancer/upmap_max_deviation to 1 
will help the balancer do well with the smaller OSDs.

* If the 1T and 7.6T SSD OSDs aren't in the same device class, consider 
consolidating them. Your workload -- and availability -- will benefit.
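
A rough illustration of why a tight upmap_max_deviation matters for the small 
OSDs. This is a sketch under assumptions, not a model of the balancer: it 
assumes the 24x 0.96 TB and 48x 7.6 TB OSDs share one device class, that PG 
replicas are spread in proportion to capacity, and a made-up total of 4096 PGs 
at 3 replicas. It compares a deviation of 5 PGs (the default, as far as I 
recall) against 1.

```python
def ideal_pgs(osd_tb: float, total_tb: float, total_pg_replicas: int) -> float:
    """PG replicas an OSD would hold if placement were exactly capacity-proportional."""
    return total_pg_replicas * osd_tb / total_tb


SMALL_TB, LARGE_TB = 0.96, 7.6     # OSD sizes mentioned in the thread
N_SMALL, N_LARGE = 24, 48          # OSD counts from the node specs
TOTAL_PG_REPLICAS = 4096 * 3       # hypothetical: 4096 PGs, 3 replicas each

total_tb = N_SMALL * SMALL_TB + N_LARGE * LARGE_TB

for label, size_tb in (("small", SMALL_TB), ("large", LARGE_TB)):
    target = ideal_pgs(size_tb, total_tb, TOTAL_PG_REPLICAS)
    for deviation in (5, 1):
        pct = 100 * deviation / target
        print(f"{label} OSD: ~{target:.0f} PGs ideal, "
              f"+/-{deviation} PGs is {pct:.1f}% of its share")
```

The same absolute PG deviation is a much bigger fraction of a small OSD's 
share, so the small OSDs are the ones that end up noticeably over- or 
under-full.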

The fine folks at OSNexus could help you with these ideas.

 "to use the full 192GB"

There ya go.  Note that the MDS cache size config value (default is 4GiB I 
think?) is like osd_memory_target: advisory, not a hard limit.  Under heavy 
load an MDS can and will attempt to use more than the configured amount.  You 
also need to allow some margin for other MDS internal uses, other daemons 
(node_exporter, the OSDs I presume you're running, etc), and the usual OS 
tasks.  With 192 GiB of physmem I might suggest never configuring more than 
128 GiB for MDS cache.  Individual MDSes with cache sizes > 64 GiB may also be 
very slow to fail over.
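
A back-of-the-envelope version of that budgeting, as a sketch only. The 
assumptions are mine: 8 colocated OSDs at a 4 GiB osd_memory_target each, 
16 GiB reserved for the OS and other daemons, and a guessed 1.5x factor for 
how far an MDS may run past its cache setting. The function name is 
hypothetical, and the 128 GiB ceiling is the one suggested above.

```python
GIB = 1024 ** 3


def mds_cache_budget_gib(physmem_gib: float, osd_count: int,
                         osd_memory_target_gib: float, reserve_gib: float,
                         mds_overshoot: float, hard_cap_gib: float) -> float:
    """Cache size (GiB) that leaves room for OSDs, the OS, and MDS overshoot."""
    available = physmem_gib - osd_count * osd_memory_target_gib - reserve_gib
    if available <= 0:
        return 0.0
    # Divide by the overshoot factor because the cache setting is advisory.
    return min(hard_cap_gib, available / mds_overshoot)


budget = mds_cache_budget_gib(
    physmem_gib=192,          # MDS node RAM from the spec
    osd_count=8,              # OSDs listed on each MDS node
    osd_memory_target_gib=4,  # assumed per-OSD target
    reserve_gib=16,           # assumed OS / node_exporter / misc margin
    mds_overshoot=1.5,        # assumed; tune from observed MDS RSS
    hard_cap_gib=128,
)
print(f"cache budget: {budget:.0f} GiB = {int(budget * GIB)} bytes")
```

Under those assumptions the budget lands well under both 192 GiB and the 
128 GiB ceiling. Plug in your real OSD count and memory targets before 
trusting the number.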

So without seeing your actual deployment details this is speculation, but one 
thought is that the too-large MDS cache size DoSed or OOMed the colocated OSDs.

> 
> In the six months since moving from a different storage solution (isilon),
> we've had multiple crashes on the system that have completely taken down
> our system as a result of multiple thousands of jobs being run at the same
> time, causing an overload on the active MDS nodes. Has ceph been used as
> primary research storage in an HPC environment with many dozens of
> different workflows and filetypes in use at once? Or is it better to design
> a tiered approach in combination with a second storage solution?
> 
> -- 
> V/R,
> Maxwell Breitmeyer
> UMBC HPCF Specialist
> Graduate Student
> (443) 835-8250
> _______________________________________________
> ceph-users mailing list -- [email protected]
> To unsubscribe send an email to [email protected]
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]