On 16/1/24 11:39, Anthony D'Atri wrote:
by “RBD for cloud”, do you mean VM / container general-purposes volumes on which a filesystem is usually built?  Or large archive / backup volumes that are read and written sequentially without much concern for latency or throughput?

General purpose volumes for cloud instance filesystems. Performance is not high, but requirements are a moving target, and it performs better than it used to, so decision makers and users are satisfied. If more targeted requirements start to arise, of course architecture and costs will change.


How many of those ultra-dense chassis in a cluster?  Are all 60 off a single HBA?

When we deploy prod RGW there, it may be 10-20 per cluster. Yes, there is a single HBA with four miniSAS ports per head node, one for each chassis.


I’ve experienced RGW clusters built from 4x 90-slot ultra-dense chassis, each of which had 2x server trays, so effectively 2x 45-slot chassis bound together.  The bucket pool was EC 3,2 or 4,2.  The motherboard was odd, the sort a certain chassis vendor had a thing for at a certain point in time.  With only 12 DIMM slots each, they were chronically short on RAM, and the single HBA was a bottleneck.  Performance was acceptable for the use-case, at first.  As the cluster filled up and got busier, that was no longer the case.  And these were drives capped at 8TB; not all slots were filled, at least initially.

The index pool was on separate 1U servers with SATA SSDs.

This sounds similar to our plans, albeit with denser nodes and an NVMe index pool. Also in our favour: the users of the cluster we currently intend for this have an established practice of storing large objects.


There were hotspots, usually relatively small objects that clients hammered on.  A single OSD restarting and recovering would tank the API; we found it better to destroy and redeploy it.   Expanding faster than data was coming in was a challenge, as we had to throttle the heck out of the backfill to avoid rampant slow requests and API impact.
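
(For anyone following along, the usual knobs for that sort of backfill throttling are the OSD recovery settings; a rough sketch, with purely illustrative and release-dependent values:

    ceph config set osd osd_max_backfills 1
    ceph config set osd osd_recovery_max_active 1
    ceph config set osd osd_recovery_sleep_hdd 0.5

On recent releases using the mClock scheduler these may be ignored unless osd_mclock_override_recovery_settings is also enabled.)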

QLC with a larger number of OSD node failure domains was a net win in that RAS was dramatically increased, and expensive engineer-hours weren’t soaked up fighting performance and availability issues.

Thank you, this is helpful information. We haven't had that kind of performance concern with our RGW on 24x 14TB nodes, but it remains to be seen how 60x 22TB behaves in practice. Rebalancing is a big consideration, particularly if we lose a whole node. We are also contemplating a PG split, and the extra IO that comes with it, since growing data volume and subsequent node additions have left us with a low PG/OSD ratio that makes rebalancing harder.
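
The usual back-of-the-envelope for the split target is (OSD count x ~100 PGs per OSD) / (replica count or k+m), rounded to a power of two. With purely illustrative numbers: 10 nodes x 60 OSDs = 600 OSDs and EC 4+2 gives 600 x 100 / 6 = 10000, so pg_num 8192 (or 16384 for headroom), applied as something like:

    ceph osd pool set <bucket-data-pool> pg_num 8192
    ceph config set mgr target_max_misplaced_ratio 0.05

where <bucket-data-pool> is a placeholder, the misplaced ratio shown is the default, and the second line is only a sketch of how the gradual split pacing is controlled.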

What is QLC?

Fascinating to hear about destroy-redeploy being safer than a simple restart-recover!


ymmv

Agreed. I guess I wanted to add the data point that these kinds of clusters can and do make full sense in certain contexts, and push a little away from "friends don't let friends use HDDs" dogma.


If spinners work for your purposes and you don’t need IOPs or the ability to provision SSDs down the road, more power to you.

I expect our road to be long, and SSD usage will grow as the capital dollars, performance and TCO metrics change over time. For now, we limit individual cloud volumes to 300 IOPs, doubled for those who need it.
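
(The per-volume cap can live in the cloud platform or the hypervisor; if one wanted to express it at the RBD layer instead, librbd QoS is one option, e.g. as a sketch:

    rbd config pool set <volumes-pool> rbd_qos_iops_limit 300
    rbd config image set <volumes-pool>/<image> rbd_qos_iops_limit 600

with <volumes-pool> and <image> as placeholders for the pool-wide default and the doubled per-image tier. Note that librbd QoS throttles in the client library, so it only applies to librbd clients such as QEMU, not krbd.)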