Hi Max,

Probably the first order of business is to understand the nature of the crashes you are seeing. Just as a guess, if you have thousands of jobs being run at the same time, you may have a lot of memory pressure on the MDSes. If the crashes are just the MDSes going OOM, you might want to verify how much of the 192GB of RAM on each node the MDSes are allowed to use for cache.
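
As a rough sketch of that check: the MDS cache size is governed by the `mds_cache_memory_limit` option, which can be read with `ceph config get`. The Python below is illustrative only. The helper names are made up for this example, it assumes the `ceph` CLI is on the PATH with a usable keyring, and it assumes the command prints a plain byte count. Keep in mind that the limit is a target rather than a hard cap, so actual MDS memory use can exceed it, and that the MDS nodes listed in the original message also host 8 OSDs each, which compete for the same RAM.

```python
import subprocess

GIB = 1024 ** 3


def mds_cache_limit_bytes(who="mds"):
    """Read the configured MDS cache limit (bytes) from the cluster.

    Assumes `ceph config get <who> mds_cache_memory_limit` prints a
    plain integer byte count on stdout.
    """
    result = subprocess.run(
        ["ceph", "config", "get", who, "mds_cache_memory_limit"],
        check=True,
        capture_output=True,
        text=True,
    )
    return int(result.stdout.strip())


def cache_share(limit_bytes, node_ram_bytes):
    """Fraction of a node's RAM that the MDS cache limit represents."""
    return limit_bytes / node_ram_bytes


def describe(limit_bytes, node_ram_bytes):
    """Human-readable summary of the cache limit relative to node RAM."""
    share = cache_share(limit_bytes, node_ram_bytes)
    return (
        f"{limit_bytes / GIB:.1f} GiB of "
        f"{node_ram_bytes / GIB:.0f} GiB ({share:.1%})"
    )
```

On a node with cluster access, `print(describe(mds_cache_limit_bytes(), 192 * GIB))` would show how much of the 192GB the cache is configured to use. A very small share suggests there is room to raise the limit, while a large one leaves little headroom for the co-located OSDs.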

Beyond that, it might take some digging to understand the specifics of your issues. There are other institutions that do use CephFS for HPC use cases though. I gave a talk at the IO500 BoF at SC24 in collaboration with GWDG in Germany regarding their Ceph deployment.

Shameless plug: Clyso does offer academic support packages for upstream Ceph if that is of interest to you.

Thanks,
Mark

On 5/6/26 6:54 AM, Max Breitmeyer via ceph-users wrote:
Hello all,
At UMBC we were given a grant to run a ceph cluster as our primary research
storage for an HPC facility. The ceph cluster consists of the following
hardware:
“Mon” nodes x3:
• 2x 25Gb interfaces
• 192GB RAM
• Storage:
-- 0 (none) OSDs/drives
“HDD” nodes x16:
• 2x 25Gb interfaces
• 192GB RAM
• Storage:
-- 12 OSDs w/ 20TB HDDs
-- 2 local write-ahead logs (WAL/journal) w/ 8TB NVMe drives
“NVMe” nodes x3:
• 2x 100Gb interfaces
• 384GB RAM
• Storage:
-- 16x OSDs w/ 8TB NVMe drives
“MDS” nodes x3:
• 2x 100Gb interfaces
• 192GB RAM
• Storage:
-- 8x OSDs w/ 1TB NVMe drives


In the six months since moving from a different storage solution (Isilon),
we've had multiple crashes that have completely taken down our system.
They occurred when multiple thousands of jobs were run at the same time,
overloading the active MDS nodes. Has Ceph been used as primary research
storage in an HPC environment with many dozens of different workflows and
file types in use at once? Or is it better to design a tiered approach in
combination with a second storage solution?

--
Best Regards,
Mark Nelson
Co-Founder and Head of R&D

Clyso Inc.
p: +49 89 21552391 12 | a: North Vancouver, B.C.
w: https://clyso.com | e: [email protected]

We are hiring: https://www.clyso.com/jobs/
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
