Glad to hear that helped! OSNexus are great guys.
Thanks,
Mark
On 5/8/26 6:50 AM, Max Breitmeyer wrote:
Hey Mark,
Initially we had it set to use the full 192GB, but after an issue
where starting up the MDS service was going straight to OOM, we
brought it down to 64GB. This has prevented any further crashes. We
have a support vendor that we're working with (OSNexus) at the moment.
On Wed, May 6, 2026 at 10:40 AM Mark Nelson via ceph-users
<[email protected]> wrote:
Hi Max,
Probably the first order of business is to understand the nature of the
crashes you are seeing. Just as a guess, if you have thousands of jobs
being run at the same time, you may have a lot of memory pressure on the
MDSes. If the crashes are just the MDSes going OOM, you might want to
verify how much of the 192GB of RAM on each node the MDSes are allowed
to use for cache.
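A minimal sketch of that check, assuming the cache size is governed by Ceph's `mds_cache_memory_limit` option (a value in bytes) and that the current value is read with `ceph config get mds mds_cache_memory_limit`. The helper names below are hypothetical and only illustrate the arithmetic. The 64 and 192 figures are the ones mentioned in this thread.

```python
# Sketch only: helper names are hypothetical, not part of any Ceph tooling.
# Assumption: the MDS cache target is controlled by mds_cache_memory_limit (bytes).

GIB = 1024 ** 3  # bytes per GiB


def gib_to_bytes(gib: int) -> int:
    """Convert a size in GiB to bytes."""
    return gib * GIB


def cache_share(limit_gib: int, node_ram_gib: int) -> float:
    """Fraction of node RAM that the cache target would occupy."""
    return limit_gib / node_ram_gib


def set_limit_command(limit_gib: int) -> list[str]:
    """Build (but do not run) the command that would apply the limit to all MDSes."""
    return [
        "ceph", "config", "set", "mds",
        "mds_cache_memory_limit", str(gib_to_bytes(limit_gib)),
    ]


if __name__ == "__main__":
    # A 64 GiB cache target on a node with 192 GiB of RAM.
    print(" ".join(set_limit_command(64)))
    print(f"share of node RAM: {cache_share(64, 192):.2f}")
```

As far as I know the limit is a cache target rather than a hard cap on process memory, so leaving headroom between it and the node's total RAM is probably the safer choice.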
Beyond that, it might take some digging to understand the specifics of
your issues. There are other institutions that do use CephFS for HPC use
cases though. I gave a talk at the IO500 BoF at SC24 in collaboration
with GWDG in Germany regarding their Ceph deployment.
Shameless plug: Clyso does offer academic support packages for upstream
Ceph if that is of interest to you.
Thanks,
Mark
On 5/6/26 6:54 AM, Max Breitmeyer via ceph-users wrote:
> Hello all,
> At UMBC we were given a grant to run a ceph cluster as our primary research
> storage for an HPC facility. The ceph cluster consists of the following
> hardware:
> “Mon” nodes x3:
> • 2x 25Gb interfaces
> • 192GB RAM
> • Storage:
> -- 0 (none) OSDs/drives
> “HDD” nodes x16:
> • 2x 25Gb interfaces
> • 192GB RAM
> • Storage:
> -- 12 OSDs w/ 20TB HDDs
> -- 2 local write-ahead logs (WAL/journal) w/ 8TB NVMe drives
> “NVMe” nodes x3:
> • 2x 100Gb interfaces
> • 384GB RAM
> • Storage:
> -- 16x OSDs w/ 8TB NVMe drives
> “MDS” nodes x3:
> • 2x 100Gb interfaces
> • 192GB RAM
> • Storage:
> -- 8x OSDs w/ 1TB NVMe drives
>
>
> In the six months since moving from a different storage solution (Isilon),
> we've had multiple crashes that have completely taken down our system as a
> result of multiple thousands of jobs being run at the same time, causing an
> overload on the active MDS nodes. Has ceph been used as primary research
> storage in an HPC environment with many dozens of different workflows and
> filetypes in use at once? Or is it better to design a tiered approach in
> combination with a second storage solution?
>
--
Best Regards,
Mark Nelson
Co-Founder and Head of R&D
Clyso Inc.
p: +49 89 21552391 12 | a: North Vancouver, B.C.
w: https://clyso.com | e: [email protected]
We are hiring: https://www.clyso.com/jobs/
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
--
V/R,
Maxwell Breitmeyer
UMBC HPCF Specialist
Graduate Student
(443) 835-8250