> For Ceph they are really necessary for log/database drives and
> ideally also for Ceph and CephFS metadata but non-PLP SSDs can
> be used if they are many for data if the workload does not
> require lots of small data writes. After all one can use for
> that also HDDs (as long as they are small so IOPS-per-TB are not
> too low).
It’s not about IOPS, it's about not losing data in flight when the
host/rack/DC loses power.

>> The tentative current shopping list is 24x 7.68TB Samsung
>> PM893 or Kingston DC600M drives.
>
> I have got several dozen DC600M (in some Lustre setups) and they
> seem very reliable with good speed, and the PM893 have a good
> reputation even if I have never tried them. But it will be hard
> to source them now. I have recently tried 7.68TB and 30.72TB
> SSDs from a relatively new chinese brand "Memblaze"

I’m wary of firmware availability. YMMV.

>> Could we just increase the number of PGs to avoid this?
>
> Yes, rather more than "usual" guidelines but that has downsides
> too but I think less severe than having too large PGs.

In 2025, I would agree. More memory use and more peering.

> However too large PGs are less of a problem on SSDs as transfer
> rates are much higher and they have lots of IOPS.

Very large PGs are still a problem for the balancer.

> The usual Ceph guidelines say something like 100-150 PGs per OSD

To be clear, PG *replicas*, not PGs proper. So in an R3 pool that
means 3 * pg_num; for an EC 4+2 pool, it means 6 * pg_num. When you
have a mix of multiple pools, the manual calculations become
tricky, though one can use the venerable pgcalc to help with the
math.

> but I have unfortunately seen 18TB HDDs with around that much
> and it was pretty bad. My guess is that the guideline was
> applicable to HDDs in the 1TB-2TB range.

The guidance was ~200 at one point, then it was retconned to 100 to
help keep people from OOMing themselves. In 2025, and with
BlueStore rather than Filestore OSDs, my sense is that a target of
100 is too low, especially as interpreted by the PG autoscaler, for
which that default is actually a max, so you’ll generally see an
even lower number. For modern clusters (BlueStore, > 2TB OSDs) I
suggest

  global  advanced  mon_max_pg_per_osd                600
  global  advanced  mon_target_pg_per_osd             300
  mgr     advanced  mgr/balancer/upmap_max_deviation  1
  mgr     basic     target_max_misplaced_ratio        0.020000

with the PG autoscaler enabled for all pools. RGW index and
RBD/CephFS metadata pools should have a BIAS of 4, and no OSDs
should have a legacy reweight value other than 1.000000. With these
non-default values, the PG autoscaler usually does okay.

I’ve seen a cluster with defaults and something like a dozen RBD
pools. Each pool usually has a minimum pg_num of 32, so in a
modestly-sized cluster a dozen of those takes a significant bite
out of the autoscaler’s budget, and its figurative hands are tied.
A cluster with only a single RBD pool (and yes, the trivial .mgr
pool) is easier for the autoscaler to deal with, though you’ll
still usually end up with an inadequate PGS value (the right column
of `ceph osd df`, aka the PG ratio, is what we’re targeting). I’ve
seen more than one cluster as low as ~20 PG replicas per OSD due to
this, and one (/me puckers and shakes) with single digits.

> The basic issue is that as to aggregating, Ceph can only do one
> level, so one has to balance: too few PGs resulting in huge
> numbers of objects per PG and amplified recovery traffic

This is not wrong: in some cases you effectively move data that
doesn’t strictly need to be moved; it comes along for the ride.
Very large PGs also complicate scrubs. cf.
https://www.evernote.com/shard/s189/sh/3a5bde91-e94b-4dbb-adbe-9d8a1379f3d5/2rAqPjY6pugisknCqhpoaRwfjiDNcN1Yte04ixUts4pHke2hn34hhZy5kA

With a wildly inadequate PG count, each PG represents a significant
fraction of the storage on each OSD, and the balancer can’t do a
very good job.
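Putting the above into practice, a minimal sketch, assuming a
recent release with the `ceph config` framework. The pool names are
invented for illustration; substitute your own and sanity-check the
result with `ceph osd pool autoscale-status` afterwards.

  # Raise the per-OSD PG ceiling and the autoscaler's target ratio
  ceph config set global mon_max_pg_per_osd 600
  ceph config set global mon_target_pg_per_osd 300

  # Tighter upmap balancing, smaller misplaced-object budget
  ceph config set mgr mgr/balancer/upmap_max_deviation 1
  ceph config set mgr target_max_misplaced_ratio 0.02

  # Autoscaler on, with a bias of 4 for index/metadata pools
  # ("default.rgw.buckets.index" and "cephfs.meta" are example names)
  ceph osd pool set default.rgw.buckets.index pg_autoscale_mode on
  ceph osd pool set default.rgw.buckets.index pg_autoscale_bias 4
  ceph osd pool set cephfs.meta pg_autoscale_bias 4

  # Check the resulting PG ratio (PGS, rightmost column) and that
  # REWEIGHT is 1.00000 everywhere
  ceph osd df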
Another, rough analogy: consider how smooth a wood board gets with
40 grit sandpaper vs 400 grit. Rough analogy, see what I did
there? ;)

As an aside, for posterity I must note that PGs do not have a fixed
size; they represent a *fraction* of the total data stored in a
pool.

> while more PGs means larger per-pool
> and per-OSD PG tables. Several thousand PGs per pool is rather
> common nowadays and I suspect it can be more.

There’s a default max of 64K per pool as a safeguard. Who among us
has never fat-fingered an extra zero or two? With very large
clusters, this can AFAIK be raised. I’ve worked on clusters with
~3100 OSDs, and there are definitely larger ones in the wild. The
orchestrator currently has issues with clusters larger than ~3500
OSDs, but that’s being worked on.

>>> That is indeed a good suggestion: the fewer the drives per
>>> server the better. Ideally just one drive per server :-).
>
> To explain a bit more: in an ideal world Ceph was designed to
> run on something like a lot of NUCs each fronting a small HDD or
> an SSD.

Well, I wouldn’t necessarily say that,
https://ceph.io/en/news/blog/2016/500-osd-ceph-cluster/
notwithstanding. For sure there are certain aspects and defaults
that are holdovers from the days of 1TB spinners, 1GE networking,
and less-evolved CRUSH code and tunables. Like the optional
replication network, but I digress...

I often recommend a minimum of 7 chassis so that 4+2 EC can be done
without gymnastics, even if they are not fully populated with
drives. Plan for enough CPU cores, though. RAM can be upgraded
later, so long as you plan for empty slots: if you fill all the
slots with 16 GiB modules and need to expand, you’re stuck. Back in
the day the Sun 4/110 could be ordered with a minimum of 8 MiB;
they shipped 32x 256KB SIMMs, filling all the slots and requiring
you to toss those useless modules in order to add more.

So for SSDs, a 1U server with 10-15 capacity drives (depending on
U.2, E1.S, E3.S) is a starting point, with M.2 or rear bays for
boot. For spinners, I personally suggest a max of 24 SATA slots.
Sure, you can get a 100-slot toploader, but beware the pitfalls of
a cluster comprising only 3 of those, the weight your rack/floor
can support, and how far that monster sticks out into the aisle,
potentially preventing doors from closing, assuming you can afford
fancy-pants doors in the first place. If you can, that’s money
better spent on getting SSDs instead ;) Modest drive counts also
help avoid the need for a replication network or higher link speeds
(100 GE+).

> In theory one could put the OSD daemon code in firmware and have
> OSD-protocol instead of SCSI-protocol HDDs or SSDs with an
> Ethernet socket instead of a SAS/SATA/USB/NVME socket.

See the above experiment.

> The reason why many setups have many-OSD servers is because the
> capacity/price curve has a (shallow) sweet spot at around a
> couple dozen OSDs per server but often "penny wise is pound
> foolish".

Yep. Consider what happens when 100 OSDs are offline at once during
a failure, and that’s fully 1/3 of your cluster.

>> Would I actually see a major advantage going from 6 nodes to
>> 8, from 8 to 12, or from 12 to 24? (Given 24 disks in each
>> case.)

Diminishing returns, but 8 gives you the ability to run EC with
k+m <= 7, and 12 gives you the ability to run 8+3.

> I guess it is not too bad to have up to half a dozen OSDs per
> server (depending on network bandwidth, RAM, CPU usage, size,
> ...) but in your case 6 per server would mean just 4 servers
> which seems fragile to me.

4 hosts is the prudent minimum for R3 pools. For EC, k+m+1 hosts at
a minimum.
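To make the k+m+1 arithmetic concrete, a sketch of a 4+2 profile
with a host failure domain. The profile and pool names here are
made up, and the initial pg_num is just a placeholder for the
autoscaler to grow: 6 hosts are enough to place the shards, and a
7th gives recovery somewhere to rebuild a failed host’s shards.

  # 4+2 erasure coding, one shard per host
  ceph osd erasure-code-profile set ec42-host \
      k=4 m=2 crush-failure-domain=host
  ceph osd erasure-code-profile get ec42-host

  # Hypothetical data pool using that profile
  ceph osd pool create ec42.data 32 32 erasure ec42-host
  ceph osd pool set ec42.data pg_autoscale_mode on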
Now if you’re converging with compute, the tradeoffs are different.

> The issue is particularly severe with EC "pools" (especially if
> the EC profile is wide which is a bad idea anyhow in most cases)

I usually recommend k + m < 12. Above, say, 6+2 the space savings
from going wider diminish, and the write performance and other
factors often aren’t worth the increasingly thin delta.
https://docs.ceph.com/en/reef/rados/operations/erasure-code/#id2

> and less bad with 3-way replication as then having just 6
> buckets means that on average the loss of a bucket will trigger
> recovery on "only" half of the objects/PGs and so on. Recovery
> eats *a lot* of write IOPS (and secondarily network bandwidth)
> so it is less costly with SSDs (those with PLP in particular).

I’ve seen a cluster with dense toploaders full of spinners where
weighting up a single 8TB SSD had to be done over 4 weeks so as not
to tank the S3 API. Consider how long it takes to recover from the
loss of a 30+ TB spinner, or a node with a hundred of them, during
which time you have reduced redundancy and thus increased risk. For
large spinners, strongly consider m = 3 for this reason. At least
one commercial vendor considers a cluster that requires more than 8
hours to recover from the loss of a single OSD to be unsupported.
https://www.youtube.com/watch?v=Wy3EetoIqng

>> I hate to see new spend on SATA/SAS SSDs, since those
>> interfaces are progressively disappearing from product
>> roadmaps. I would take 4-6 NVMe nodes any day.
>
> Unfortunately many of us still have old SATA-only or SATA+SAS
> servers and it is often better to have more low-density old
> servers than fewer higher-density new servers (even if one can
> afford the latter).

If DC space is free, sure. Until it isn’t. You can also mix and
match with Ceph, within reason. You can get a used R640 with a
bunch of cores really inexpensively these days; the chassis might
not outlive the SSDs, but the SSDs are the bulk of the cost. Just
don’t waste money on 3 DWPD SKUs or RAID HBAs.

> For the future the ideal for smaller organizations would be U.3
> servers so one can buy new U.3 NVME devices and recycle existing
> SATA or SAS devices along them, but given the current constraint
> on RAM availability I guess getting more servers at all will be
> rather hard in 2026.

Universal slots, sure, but note that a U.2 / U.3 chassis slot
doesn’t necessarily accept SATA, and SAS usually means paying for
an annoying HBA. Since HDDs these days are mostly LFF, a universal
slot also means suboptimal RU density. When you skip the RAID HBA
expense and the enhanced BMC license you don’t really need, and
procure wisely, an all-NVMe server including drives can actually
cost less. I’ve seen it with my own eyes.

_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
