>> Note: larger HDDs have really low IOPS-per-TB; SSDs avoid that
>> issue but cheap SSDs do not have PLP so write IOPS are much
>> lower than read IOPS.

> That is something I've seen mentioned a lot, so we've only got
> PLP drives on the shopping list.

For Ceph they are really necessary for the log/database (WAL/DB)
drives, and ideally also for Ceph and CephFS metadata, but
non-PLP SSDs can be used for data, provided there are many of
them and the workload does not require lots of small writes.
After all, one can also use HDDs for that (as long as they are
small, so that IOPS-per-TB is not too low).
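
As a quick way to see whether a given SSD has the problem: small
synchronous (O_DSYNC) writes are exactly the pattern where the
lack of PLP shows up, and that is what WAL/DB and metadata do.
fio is the proper tool, but a rough Python sketch along these
lines gives an idea; the test path is just a placeholder for a
file on the drive under test, and it assumes Linux:

    # Rough probe of small synchronous-write rates, the access
    # pattern where PLP makes the difference. Not a benchmark;
    # use fio for real numbers.
    import os, time

    PATH = "/mnt/testdrive/plp_probe.bin"  # placeholder path on the drive under test
    BLOCK = b"\0" * 4096                   # 4KiB, a typical small WAL/metadata write
    N = 1000

    fd = os.open(PATH, os.O_WRONLY | os.O_CREAT | os.O_DSYNC, 0o600)
    t0 = time.monotonic()
    for _ in range(N):
        os.write(fd, BLOCK)
    elapsed = time.monotonic() - t0
    os.close(fd)

    print(f"{N / elapsed:.0f} sync 4KiB writes/s, "
          f"{elapsed / N * 1e6:.0f} us average latency")

A drive with PLP can acknowledge those writes from its protected
cache; a consumer drive has to flush to flash every time, which
is where the big gap between read and write IOPS comes from.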

> The tentative current shopping list is 24x 7.68TB Samsung
> PM893 or Kingston DC600M drives.

I have several dozen DC600M drives (in some Lustre setups) and
they seem very reliable with good speed, and the PM893 has a
good reputation even if I have never tried it. But it will be
hard to source them now. I have recently tried 7.68TB and
30.72TB SSDs from a relatively new Chinese brand, "Memblaze",
and they were pretty good; there still seems to be some
availability of those outside the USA, and they have just set up
a European subsidiary.

>> Whether the drive is SSD or HDD larger ones also usually mean
>> large PGs which is not so good. With SSDs at least it is
>> possible (and in some cases advisable) to split them into
>> multiple OSDs though.

> Could we just increase the number of PGs to avoid this?

Yes, rather more than the "usual" guidelines. That has downsides
too, but I think they are less severe than having too-large PGs.
However, too-large PGs are less of a problem on SSDs, as
transfer rates are much higher and they have lots of IOPS.

The usual Ceph guidelines say something like 100-150 PGs per
OSD, but I have unfortunately seen 18TB HDDs with around that
many and it was pretty bad. My guess is that the guideline was
meant for HDDs in the 1TB-2TB range. The basic issue is that
Ceph can only do one level of aggregation (objects into PGs), so
one has to strike a balance: too few PGs result in huge numbers
of objects per PG and amplified recovery traffic, while more PGs
mean larger per-pool and per-OSD PG tables. Several thousand PGs
per pool is rather common nowadays and I suspect it can be more.
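
For reference, the usual rule of thumb works out roughly as in
this sketch (using your 24x 7.68TB drives with 3-way replication
as the example, and 100 PGs/OSD as the target; adjust to taste):

    # (OSDs * target PGs per OSD) / replicas, rounded up to a
    # power of two, plus data-per-PG as a sanity check.
    def suggested_pg_num(osds, target_per_osd=100, replicas=3):
        raw = osds * target_per_osd / replicas
        pg = 1
        while pg < raw:
            pg *= 2
        return pg

    osds, drive_tb, replicas = 24, 7.68, 3
    pg_num = suggested_pg_num(osds, replicas=replicas)
    usable_tb = osds * drive_tb / replicas
    print(f"pg_num ~ {pg_num}, about {usable_tb * 1000 / pg_num:.0f} GB "
          f"of data per PG when the pool is full")

That gives pg_num around 1024, i.e. roughly 128 PGs per OSD and
only some tens of GB of data per PG, which is a lot more
comfortable than what the same guideline gives on 18TB HDDs.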

>> That is indeed a good suggestion: the fewer the drives per
>> server the better. Ideally just one drive per server :-).

To explain a bit more: Ceph was designed for an ideal world in
which it would run on something like a lot of NUCs, each
fronting a single small HDD or SSD.

The basic idea of Ceph was/should be to turn each HDD/SSD from a
block storage device into an object storage device by fronting
it with an OSD daemon and accessing it over Ethernet.

In theory one could put the OSD daemon code in firmware and ship
HDDs or SSDs that speak the OSD protocol instead of the SCSI
protocol, with an Ethernet socket instead of a SAS/SATA/USB/NVMe
socket.

The reason many setups have many-OSD servers is that the
capacity/price curve has a (shallow) sweet spot at around a
couple dozen OSDs per server, but often "penny wise is pound
foolish".

> Would I actually see a major advantage going from 6 nodes to
> 8, from 8 to 12, or from 12 to 24? (Given 24 disks in each
> case.)

I guess it is not too bad to have up to half a dozen OSDs per
server (depending on network bandwidth, RAM, CPU usage, drive
size, ...), but in your case 6 per server would mean just 4
servers, which seems fragile to me.

More, smaller OSDs and more servers increase parallelism and
available bandwidth, which may or may not be required. But even
if the parallelism and bandwidth requirements are not high, the
main issue with too many OSDs per server (apart from resource
usage) is that there are then too few "buckets" (if server ==
bucket), and almost any bucket loss triggers recovery/rebalance
of almost every PG/object.

The issue is particularly severe with EC "pools" (especially if
the EC profile is wide, which is a bad idea anyhow in most
cases) and less bad with 3-way replication, as then having just
6 buckets means that on average the loss of a bucket will
trigger recovery on "only" half of the objects/PGs, and so on.
Recovery eats *a lot* of write IOPS (and secondarily network
bandwidth), so it is less costly with SSDs (those with PLP in
particular).
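
To put numbers on that "half", and on the 6-vs-8-vs-12-vs-24
question above: assuming PGs are spread uniformly with one
replica or EC chunk per host bucket, the expected fraction of
PGs hit by the loss of one host is roughly copies/hosts. A
minimal sketch (the 4+2 EC profile is just an example):

    # Expected fraction of PGs needing recovery when one host
    # (failure-domain bucket) is lost, assuming uniform placement
    # with one replica/EC-chunk per host.
    def affected_fraction(hosts, copies_per_pg):
        # k copies over n hosts -> a given host carries ~k/n of
        # all PGs (capped: with fewer hosts than copies the pool
        # cannot even be placed with a host failure domain).
        return min(1.0, copies_per_pg / hosts)

    for hosts in (4, 6, 8, 12, 24):
        rep = affected_fraction(hosts, 3)      # 3-way replication
        ec = affected_fraction(hosts, 4 + 2)   # example 4+2 EC profile
        print(f"{hosts:2d} hosts: {rep:4.0%} of PGs hit (3x rep), "
              f"{ec:4.0%} (EC 4+2)")

That is only the placement arithmetic, of course; the actual
recovery cost also depends on how full the OSDs are and on the
write IOPS they can sustain, as above.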