Hi Stefan

We use CephFS for our 7200-CPU/224-GPU HPC cluster; for our use case (large-ish image files) it works well.

We have 36 Ceph nodes, each with 12 x 12TB HDDs, 2 x 1.92TB NVMe drives, plus a 240GB system disk. Four dedicated nodes have NVMe for the metadata pool and provide the mon, mgr and MDS services.

I'm not sure you need 4% of the OSD size for WAL/DB; search this mailing list's archive for a definitive answer, but my personal notes are as follows:

"If you expect lots of small files: go for a DB that's > ~300 GB
For mostly large files you are probably fine with a 60 GB DB.
A 266 GB DB is effectively the same as a 60 GB one: RocksDB level sizes multiply by roughly 10x at each level, so any level that no longer fits on the DB device spills over to the HDD during compaction."
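To show where those thresholds come from, here's a rough back-of-envelope sketch (my own illustration in Python, assuming BlueStore's default RocksDB layout of a 256 MB base level and a 10x multiplier; the exact behaviour depends on your Ceph release and settings):

    # Cumulative RocksDB level sizes with a 256 MB base and 10x multiplier
    # (assumed defaults; illustrative only).
    base_gb = 0.256
    multiplier = 10

    cumulative = 0.0
    for level in range(1, 5):                       # L1..L4
        size = base_gb * multiplier ** (level - 1)
        cumulative += size
        print(f"L{level}: {size:6.1f} GB, cumulative: {cumulative:6.1f} GB")

    # L1..L3 fit in roughly 28 GB, while L1..L4 need roughly 285 GB.
    # A DB partition anywhere in between only ever holds up to L3, so the
    # extra space goes unused and L4 spills to the HDD during compaction.

That's why ~60 GB (L1..L3 plus WAL and headroom) and ~300 GB are the two useful targets, and sizes in between buy you little.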

We use a single enterprise-quality 1.9TB NVMe drive for every 6 OSDs to good effect; you probably need 1 DWPD to be safe. I suspect you might be able to increase the ratio of HDDs per NVMe with PCIe gen4 NVMe drives.
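For what it's worth, the arithmetic behind our layout is straightforward; a quick sketch (numbers are illustrative, and the 5-year warranty figure is my assumption):

    # One 1.92 TB NVMe shared by 6 HDD OSDs for their WAL/DB partitions.
    nvme_tb = 1.92
    osds_per_nvme = 6
    print(f"DB/WAL per OSD: ~{nvme_tb * 1000 / osds_per_nvme:.0f} GB")   # ~320 GB

    # Endurance: a 1 DWPD drive is rated for roughly its own capacity in
    # writes per day over its warranty period (commonly 5 years).
    dwpd, warranty_years = 1.0, 5
    print(f"Rated lifetime writes: ~{nvme_tb * dwpd * 365 * warranty_years / 1000:.1f} PB")

That gives each OSD a DB partition of roughly 320 GB, i.e. just above the ~300 GB figure in the notes above, and about 3.5 PB of rated write endurance per NVMe.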

best regards,

Jake

On 20/06/2022 08:22, Stefan Kooman wrote:
On 6/19/22 23:23, Christian Wuerdig wrote:
On Sun, 19 Jun 2022 at 02:29, Satish Patel <satish....@gmail.com> wrote:

Greeting folks,

We are planning to build Ceph storage, mostly for CephFS for an HPC workload; in the future we are planning to expand to S3-style storage, but that is yet to be decided. Because we need mass storage, we bought the following HW.

15 total servers, each with 12 x 18TB HDDs (spinning disks). We understand SSD/NVMe would be the best fit, but it's way out of budget.

I hope you have extra HW on hand for monitor and MDS servers.

^^ this. It also depends on the uptime guarantees you have to provide (if any). Are the HPC users going to write large files, or loads of small files? The more metadata operations, the busier the MDSes will be; if it's mainly large files, the load on them will be much lower.

Ceph recommends using a faster disk for WAL/DB if the data disk is slow, and in my case I do have slower disks for data.

Question:
1. Let's say I want to put in an NVMe disk for WAL/DB; what size should I buy?


The official recommendation is to budget 4% of the OSD size for WAL/DB, so in your case that would be 720 GB per OSD. Especially if you want to go to S3 later, you should stick closer to that figure, since RGW is a heavy metadata user.
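(For anyone following along, the arithmetic behind that figure, as a quick sketch rather than a hard requirement:)

    # 4% rule of thumb applied to 18 TB OSDs, 12 OSDs per node.
    osd_tb, pct, osds_per_node = 18, 0.04, 12
    db_gb = osd_tb * 1000 * pct
    print(f"WAL/DB budget per OSD: {db_gb:.0f} GB")                        # 720 GB
    print(f"NVMe needed per node: {db_gb * osds_per_node / 1000:.2f} TB")  # 8.64 TB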

CephFS can be metadata-heavy too, depending on the workload. You can co-locate the S3 service on this cluster later on, but from an operational perspective that might not be preferred: with a separate cluster you can tune the hardware/configuration for each use case, it's easier to troubleshoot, upgrade cycles are independent, etc.

Gr. Stefan


For help, read https://www.mrc-lmb.cam.ac.uk/scicomp/
then contact unixad...@mrc-lmb.cam.ac.uk
--
Dr Jake Grimmett
Head Of Scientific Computing
MRC Laboratory of Molecular Biology
Francis Crick Avenue,
Cambridge CB2 0QH, UK.
Phone 01223 267019
Mobile 0776 9886539

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
