> On May 6, 2026, at 9:55 AM, Max Breitmeyer via ceph-users
> <[email protected]> wrote:
>
> Hello all,
> At UMBC we were given a grant to run a ceph cluster as our primary research
> storage for an HPC facility. The ceph cluster consists of the following
> hardware:
> “Mon” nodes x3:
> • 2x 25Gb interfaces
> • 192GB RAM
> • Storage:
> -- 0 (none) OSDs/drives

Why nodes dedicated only to mons? I suggest considering a deployment where mons are colocated on OSD or MDS nodes, whether by abstracting client config through DNS, by constraining placement so that the orchestrator is failure-domain aware when scheduling, or possibly even by just specifying all of the mons on the client mounts for resilience. The last of these certainly doesn't scale well and is prone to falling out of date as the cluster changes. Clients need to be able to contact at least one mon to download the current monmap.

> “HDD” nodes x16:
> • 2x 25Gb interfaces
> • 192GB RAM
> • Storage:
> -- 12 OSDs w/ 20TB HDDs
> -- 2 local Write-Access-Logs (journal) w/ 8TB NVMe drives

WAL+DB?

> “NVMe” nodes x3:
> • 2x 100Gb interfaces
> • 384GB RAM
> • Storage:
> -- 16x OSDs w/ 8TB NVMe drives

These could easily have colocated mons, sparing the dedicated mon nodes.

> “MDS” nodes x3:
> • 2x 100Gb interfaces
> • 192GB RAM
> • Storage:
> -- 8x OSDs w/ 1TB NVMe drives

A few thoughts:

* I recommend 5 mon daemons vs only 3.
* Since the MDS node spec above mentions 1TB (960 GB) SSDs and the NVMe node spec mentions 8 TB (7.6 TB) SSDs, I might infer that you constrain -- or intended to constrain -- the MDS metadata pool to only these 24x OSDs? And that the 48x 7.6 TB SSD OSDs on the NVMe nodes are used for a separate CephFS filesystem, or for specific files/directories via layout xattrs?

I might suggest a rearchitecture, which could be done incrementally without downtime:

* Colocate 5 mon daemons on other nodes, and either use these 3 systems for something else or locate some RGWs (if in use) there. If those nodes were ordered as, or could be made, NVMe-capable, you might repurpose them as additional OSD nodes. If they have SATA bays, you could deploy HDD OSDs on them too.
* It's best for OSDs within a given device class not to be very different sizes, but Ceph can handle it. Setting mgr/balancer/upmap_max_deviation to 1 will help the balancer do well with the smaller OSDs.
* If the 1T and 7.6T SSD OSDs aren't in the same device class, consider consolidating them. Your workload -- and availability -- will benefit.

The fine folks at OSNexus could help you with these ideas.

"to use the full 192GB"

There ya go. Note that the MDS cache size config value (default is 4GiB I think?) is like osd_memory_target: advisory, not a hard limit. Under heavy load an MDS can and will attempt to use more than the configured amount. You also need to allow some margin for other MDS internal uses, other daemons (node_exporter, the OSDs I presume you're running, etc.), and the usual OS tasks. With 192 GiB of physmem I would suggest never configuring more than 128 GiB for MDS cache. Individual MDSes with cache sizes > 64 GiB may also be very slow to fail over.

So without seeing your actual deployment details this is speculation, but one thought is that the too-large MDS cache size DoSed or OOMed the colocated OSDs.

>
> In the six months since moving from a different storage solution (isilon),
> we've had multiple crashes on the system that have completely taken down
> our system as a result of multiple thousands of jobs being run at the same
> time, causing an overload on the active MDS nodes. Has the use of ceph as a
> primary research storage been used in an HPC with many dozens of different
> workflows and filetypes being used at once? Or is it better to come up
> with a tiered approach in combination with a second storage solution?
>
> --
> V/R,
> Maxwell Breitmeyer
> UMBC HPCF Specialist
> Graduate Student
> (443) 835-8250
> _______________________________________________
> ceph-users mailing list -- [email protected]
> To unsubscribe send an email to [email protected]
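
As a rough illustration of the MDS cache sizing arithmetic in the reply above, here is a minimal Python sketch. The function name, the 16 GiB OS/daemon reserve, and the 1.5x MDS overshoot factor are assumptions made up for illustration, not Ceph defaults or measured values. The 4 GiB per-OSD figure assumes the stock osd_memory_target, the 8 OSDs per node come from the quoted MDS node spec, and the 64 GiB and 128 GiB caps come from the failover and ceiling remarks above. Treat the result as a starting point to test, not a recommendation.

```python
GIB = 1024 ** 3


def mds_cache_budget_gib(
    physmem_gib: float,
    osd_count: int,
    osd_memory_target_gib: float = 4.0,  # assumes the stock osd_memory_target
    os_reserve_gib: float = 16.0,        # assumption: OS, node_exporter, etc.
    mds_overshoot: float = 1.5,          # assumption: the limit is advisory, MDS may exceed it
    cap_gib: float = 64.0,               # from the reply: > 64 GiB may be slow to fail over
) -> float:
    """Return a conservative MDS cache size in GiB (hypothetical helper)."""
    # Memory left after colocated OSDs and the OS reserve are accounted for.
    available = physmem_gib - osd_count * osd_memory_target_gib - os_reserve_gib
    if available <= 0:
        return 0.0
    # Leave headroom for the MDS running past its configured cache size,
    # then apply the cap.
    return min(available / mds_overshoot, cap_gib)


if __name__ == "__main__":
    # MDS node from the quoted spec: 192 GiB RAM, 8 colocated OSDs.
    budget = mds_cache_budget_gib(192, 8)
    print(f"suggested MDS cache: {budget:.0f} GiB")
    # Value is in bytes; verify the option name against your release's docs.
    print(f"ceph config set mds mds_cache_memory_limit {int(budget * GIB)}")
```

Raising cap_gib to 128 (the ceiling suggested above) yields 96 GiB under these same assumptions, which shows that the colocated OSDs and the headroom, not the cap, become the binding constraint.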
