Hi,

Thanks, everyone, for your answers.

On 20/11/2023 at 09:24:41+0000, Frank Schilder wrote:
> we are using something similar for ceph-fs. For a backup system your
> setup can work, depending on how you back up. While HDD pools have poor
> IOP/s performance, they are very good for streaming workloads. If you are
> using something like Borg backup that writes huge files sequentially, a
> HDD back-end should be OK.

Ok. Good to know.

> Here are some things to consider and try out:
>
> 1. You really need to get a bunch of enterprise SSDs with power-loss
> protection for the FS meta-data pool (disable the write cache if enabled;
> this will disable the volatile write cache and switch to protected
> caching). We are using (formerly Intel) 1.8T SATA drives that we
> subdivide into 4 OSDs each to raise performance. Place the meta-data pool
> and the primary data pool on these disks. Create a secondary data pool on
> the HDDs and assign it to the root *before* creating anything on the FS
> (see the recommended 3-pool layout for ceph file systems in the docs). I
> would not even consider running this without SSDs. 1 such SSD per host is
> the minimum, 2 is better. If Borg or whatever can make use of a small
> fast storage directory, assign a sub-dir of the root to the primary data
> pool.

OK. I will see what I can do. I have tried to sketch the commands below to
check my understanding.

> 2. Calculate with sufficient extra disk space. As long as utilization
> stays below 60-70%, bluestore will try to make large object writes
> sequential, which is really important for HDDs. On our cluster we
> currently have 40% utilization and I get full HDD bandwidth out for large
> sequential reads/writes. Make sure your backup application makes large
> sequential IO requests.
>
> 3. As Anthony said, add RAM. You should go for 512G on 50-HDD nodes. You
> can run the MDS daemons on the OSD nodes. Set a reasonable cache limit
> and use ephemeral pinning. Depending on the CPUs you are using, 48 cores
> can be plenty. The latest generation of Intel Xeon Scalable Processors is
> so efficient with ceph that 1 HT per HDD is more than enough.

Yes, I get 512G on each node and 64 cores on each server. (I have also
sketched the MDS settings below.)

> 4. 3 MON+MGR nodes are sufficient. You can do something else with the
> remaining 2 nodes. Of course, you can use them as additional MON+MGR
> nodes. We also use 5 and it improves maintainability a lot.

Ok, thanks.

> Something more exotic if you have time:
>
> 5. To improve sequential performance further, you can experiment with
> larger min_alloc_sizes for OSDs (at creation time; you will need to scrap
> and re-deploy the cluster to test different values). Every HDD has a
> preferred IO size for which random IO achieves nearly the same bandwidth
> as sequential writes. (But see 7.)
>
> 6. On your set-up you will probably go for a 4+2 EC data pool on HDD.
> With object size 4M, the max. chunk size per OSD will be 1M. For many
> HDDs this is the preferred IO size (usually between 256K and 1M). (But
> see 7.)
>
> 7. Important: large min_alloc_sizes are only good if your workload
> *never* modifies files, but only replaces them. A bit like a pool without
> EC overwrites enabled. The implementation of EC overwrites has a
> "feature" that can lead to massive allocation amplification. If your
> backup workload modifies files instead of adding new + deleting old, do
> *not* experiment with options 5.-7. Instead, use the defaults and make
> sure you have sufficient unused capacity to increase the chances for
> large bluestore writes (keep utilization below 60-70% and just buy extra
> disks). A workload with large min_alloc_sizes has to be S3-like: only
> upload, download and delete are allowed.
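To check my understanding of point 1 (together with the 4+2 pool from
point 6), here is roughly how I would try to build the 3-pool layout. All
the names (rep-ssd, ec42hdd, cephfs.meta, cephfs.data, cephfs.data.hdd,
backupfs, /mnt/backupfs) and the PG counts are placeholders I made up, and
this assumes the ssd/hdd device classes are already set on the OSDs:

  # CRUSH rule + replicated meta-data and primary data pools on the SSDs
  ceph osd crush rule create-replicated rep-ssd default host ssd
  ceph osd pool create cephfs.meta 64 64 replicated rep-ssd
  ceph osd pool create cephfs.data 64 64 replicated rep-ssd

  # 4+2 EC profile and the secondary bulk data pool on the HDDs
  ceph osd erasure-code-profile set ec42hdd k=4 m=2 \
      crush-failure-domain=host crush-device-class=hdd
  ceph osd pool create cephfs.data.hdd 256 256 erasure ec42hdd
  # CephFS requires overwrites to be enabled on EC data pools
  ceph osd pool set cephfs.data.hdd allow_ec_overwrites true

  # Create the FS with the replicated pool as primary data pool,
  # then add the EC pool as a secondary data pool
  ceph fs new backupfs cephfs.meta cephfs.data
  ceph fs add_data_pool backupfs cephfs.data.hdd

  # On a mounted client, assign the HDD pool to the root *before*
  # writing anything; sub-dirs then inherit the HDD layout, and a fast
  # sub-dir can later be pointed back at the primary SSD pool
  setfattr -n ceph.dir.layout.pool -v cephfs.data.hdd /mnt/backupfs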
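For point 3, I understand the MDS side to be roughly the following. The
16 GiB cache limit is only an example to be tuned against the 512G of RAM,
and /mnt/backupfs is again my placeholder mount point; ceph.dir.pin.distributed
spreads the immediate sub-directories of a directory across the active MDS
ranks:

  # Bounded MDS cache
  ceph config set mds mds_cache_memory_limit 17179869184   # 16 GiB

  # Distributed ephemeral pinning of the top-level directories
  ceph config set mds mds_export_ephemeral_distributed true
  setfattr -n ceph.dir.pin.distributed -v 1 /mnt/backupfs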
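And for points 5.-7., if our workload really only adds new and deletes old
files, I understand the min_alloc_size experiment as a build-time setting,
e.g. 1M to match the 4M object / 4 data chunks = 1M chunk size from point 6
(the value is an example, and it only affects OSDs created after it is set):

  # Must be set *before* the HDD OSDs are created: bluestore bakes
  # min_alloc_size in at OSD mkfs time, so testing a different value
  # means scrapping and re-deploying the OSDs
  ceph config set osd bluestore_min_alloc_size_hdd 1048576   # 1M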
Thanks a lot for those tips. I'm a newbie with ceph, so it's going to take
some time before I understand everything you said.

Best regards

--
Albert SHIH 🦫 🐸
France
Heure locale/Local time: jeu. 23 nov. 2023 08:32:20 CET