Hello all,

At UMBC we were given a grant to run a Ceph cluster as the primary research storage for an HPC facility. The Ceph cluster consists of the following hardware:

“Mon” nodes x3:
• 2x 25Gb interfaces
• 192GB RAM
• Storage:
  -- 0 (none) OSDs/drives

“HDD” nodes x16:
• 2x 25Gb interfaces
• 192GB RAM
• Storage:
  -- 12x OSDs w/ 20TB HDDs
  -- 2x local write-ahead log (journal) devices w/ 8TB NVMe drives

“NVMe” nodes x3:
• 2x 100Gb interfaces
• 384GB RAM
• Storage:
  -- 16x OSDs w/ 8TB NVMe drives

“MDS” nodes x3:
• 2x 100Gb interfaces
• 192GB RAM
• Storage:
  -- 8x OSDs w/ 1TB NVMe drives

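For a sense of scale, the raw capacity implied by the hardware list can be tallied with a short sketch. It assumes decimal TB, counts only OSD drives (the NVMe journal devices on the HDD nodes are left out), and ignores replication or erasure-coding overhead, so usable space will be lower. The tier labels and dictionary layout are illustrative names, not Ceph terminology.

```python
# Node counts and drive sizes are taken from the hardware list.
# Tier labels and this dict layout are illustrative, not Ceph terminology.
TIERS = {
    "hdd":  {"nodes": 16, "osds_per_node": 12, "tb_per_osd": 20},
    "nvme": {"nodes": 3,  "osds_per_node": 16, "tb_per_osd": 8},
    "mds":  {"nodes": 3,  "osds_per_node": 8,  "tb_per_osd": 1},
}


def raw_capacity_tb(tier):
    """Raw (pre-replication) capacity of one tier in TB."""
    return tier["nodes"] * tier["osds_per_node"] * tier["tb_per_osd"]


# Per-tier raw capacity; journal/WAL devices are intentionally excluded.
raw = {name: raw_capacity_tb(tier) for name, tier in TIERS.items()}

for name, tb in raw.items():
    print(f"{name}: {tb} TB raw")
print(f"total: {sum(raw.values())} TB raw")
```

This gives roughly 3840 TB raw on the HDD nodes, 384 TB on the NVMe nodes, and 24 TB on the MDS nodes.
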
In the six months since moving from a different storage solution (Isilon), we've had multiple crashes that have taken the entire system down. They happened when thousands of jobs were running at the same time, which overloaded the active MDS nodes.

Has Ceph been used as primary research storage in an HPC environment where many dozens of different workflows and file types are in use at once? Or is it better to design a tiered approach in combination with a second storage solution?

--
V/R,
Maxwell Breitmeyer
UMBC HPCF Specialist
Graduate Student
(443) 835-8250
