>> Note: larger HDDs have really low IOPS-per-TB; SSDs avoid that
>> issue but cheap SSDs do not have PLP so write IOPS are much
>> lower than read IOPS.
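To put rough numbers on the IOPS-per-TB point, a back-of-the-envelope
sketch in Python (my assumption here: a 7200RPM spindle does roughly
the same ~150 random IOPS whether it holds 1TB or 18TB):

    # IOPS-per-TB collapses as HDD capacity grows, because random IOPS
    # per spindle stays roughly constant. 150 is an assumed ballpark.
    HDD_RANDOM_IOPS = 150
    for capacity_tb in (1, 2, 4, 8, 18):
        print(f"{capacity_tb:>2}TB HDD: ~{HDD_RANDOM_IOPS / capacity_tb:.0f} IOPS per TB")

An 18TB drive ends up with well under 10 IOPS per TB, which is why
small HDDs are tolerable for data and large ones are not.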
> That is something I've seen mentioned a lot, so we've only got
> PLP drives on the shopping list.

For Ceph PLP drives are really necessary for the log/database drives,
and ideally also for Ceph and CephFS metadata, but non-PLP SSDs can be
used for data, if there are many of them and the workload does not
require lots of small writes. After all, one can also use HDDs for
that (as long as they are small, so IOPS-per-TB is not too low).

> The tentative current shopping list is 24x 7.68TB Samsung
> PM893 or Kingston DC600M drives.

I have several dozen DC600M (in some Lustre setups) and they seem very
reliable with good speed, and the PM893 have a good reputation even if
I have never tried them. But it will be hard to source them now. I
have recently tried 7.68TB and 30.72TB SSDs from a relatively new
Chinese brand, "Memblaze", and they were pretty good; there still
seems to be some availability of those outside the USA, and they have
just set up a European subsidiary.

>> Whether the drive is SSD or HDD larger ones also usually mean
>> large PGs which is not so good. With SSDs at least it is
>> possible (and in some cases advisable) to split them into
>> multiple OSDs though.

> Could we just increase the number of PGs to avoid this?

Yes, rather more than the "usual" guidelines; that has downsides too,
but I think less severe ones than having too-large PGs. However,
too-large PGs are less of a problem on SSDs, as transfer rates are
much higher and they have lots of IOPS. The usual Ceph guidelines say
something like 100-150 PGs per OSD, but I have unfortunately seen 18TB
HDDs with around that many and it was pretty bad; my guess is that the
guideline was meant for HDDs in the 1TB-2TB range (I sketch some rough
per-PG numbers further below). The basic issue is that Ceph can only
do one level of aggregation, so one has to balance: too few PGs means
huge numbers of objects per PG and amplified recovery traffic, while
more PGs means larger per-pool and per-OSD PG tables. Several thousand
PGs per pool is rather common nowadays and I suspect it can be more.

>> That is indeed a good suggestion: the fewer the drives per
>> server the better. Ideally just one drive per server :-).

To explain a bit more: ideally Ceph was designed to run on something
like a lot of NUCs, each fronting a small HDD or an SSD. The basic
idea of Ceph was/should be to turn each HDD/SSD from a block storage
device into an object storage device by fronting it with an OSD
daemon and accessing it over Ethernet. In theory one could put the
OSD daemon code in firmware and have OSD-protocol instead of
SCSI-protocol HDDs or SSDs, with an Ethernet socket instead of a
SAS/SATA/USB/NVMe socket. The reason why many setups have many-OSD
servers is that the capacity/price curve has a (shallow) sweet spot
at around a couple dozen OSDs per server, but often "penny wise is
pound foolish".

> Would I actually see a major advantage going from 6 nodes to
> 8, from 8 to 12, or from 12 to 24? (Given 24 disks in each
> case.)

I guess it is not too bad to have up to half a dozen OSDs per server
(depending on network bandwidth, RAM, CPU usage, size, ...), but in
your case 6 per server would mean just 4 servers, which seems fragile
to me.
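To make "fragile" more concrete, a minimal sketch (my own
illustration, assuming failure domain = host and an even CRUSH
distribution): with 3-way replication roughly 3/N of all PGs have a
copy on any given one of N hosts, and with an EC profile of width
k+m it is roughly (k+m)/N, so that is about the fraction of PGs that
a single host failure puts into recovery:

    # Fraction of PGs touched by losing one host, assuming failure
    # domain = host and an even spread of PGs: width/N capped at 100%,
    # where width is 3 for 3-way replication or k+m for EC.
    def affected_fraction(pg_width, num_hosts):
        return min(1.0, pg_width / num_hosts)

    for hosts in (4, 6, 8, 12, 24):
        rep = affected_fraction(3, hosts)     # 3-way replication
        ec = affected_fraction(4 + 2, hosts)  # e.g. a 4+2 EC profile
        print(f"{hosts:>2} hosts: ~{rep:.0%} of PGs (3x rep), ~{ec:.0%} (EC 4+2)")

So the jump from 4 to 6 or 8 hosts matters quite a bit, and with a
wide EC profile on few hosts a single failure touches nearly
everything (with only 4 hosts a 4+2 profile cannot even be placed
one-chunk-per-host); the 4+2 profile above is only an example.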
More and smaller OSDs and more servers increase parallelism and
available bandwidth, which may or may not be required; but even if
the parallelism and bandwidth requirements are not high, the main
issue with too many OSDs per server (apart from resource usage) is
that there are then too few "buckets" (if server == bucket), and
almost any bucket loss triggers recovery/rebalance of almost every
PG/object. The issue is particularly severe with EC "pools"
(especially if the EC profile is wide, which is a bad idea anyhow in
most cases) and less bad with 3-way replication, as then having just
6 buckets means that on average the loss of a bucket will trigger
recovery on "only" half of the objects/PGs, and so on. Recovery eats
*a lot* of write IOPS (and secondarily network bandwidth), so it is
less costly with SSDs (those with PLP in particular).
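As for the per-PG numbers mentioned above, a rough sketch (my own
figures; the 70% fill level and the PGs-per-OSD counts are just
assumptions) of how much data one PG carries, the PG being roughly
the unit of recovery/backfill:

    # Rough per-PG data size: OSD capacity * fill ratio / PGs per OSD.
    # Bigger drives at the same PGs-per-OSD mean proportionally bigger
    # PGs, i.e. bigger units of recovery and backfill.
    def pg_size_gb(osd_capacity_tb, fill_ratio, pgs_per_osd):
        return osd_capacity_tb * 1000 * fill_ratio / pgs_per_osd

    for label, cap_tb, pgs in (
        ("1TB HDD, 100 PGs/OSD", 1.0, 100),
        ("18TB HDD, 100 PGs/OSD", 18.0, 100),
        ("7.68TB SSD, 150 PGs/OSD", 7.68, 150),
        ("7.68TB SSD split into 2 OSDs, 150 PGs/OSD each", 3.84, 150),
    ):
        print(f"{label}: ~{pg_size_gb(cap_tb, 0.7, pgs):.0f}GB per PG at 70% full")

Around 7GB per PG on a 1TB-2TB HDD is manageable; well over 100GB per
PG on an 18TB HDD doing ~150 IOPS is not, while an SSD (especially one
split into two OSDs) keeps PGs small and has the transfer rates and
IOPS to recover them quickly.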
