>>> b) a minimum of k+m+2 nodes?

> [...] If up-front CapEx is your concern

Then perhaps Ceph is not such a great idea:

* Ceph is designed around the idea of many servers, each with a
  few small drives, as it needs lots of IOPS-per-TB and a low
  impact per system failure.
* Minimizing up-front costs means fewer, larger drives and
  higher-density servers.
* The common result is extreme congestion during rebalancing and
  recovery, and high latencies during parallel access (a rough
  sketch follows this list). This mailing list is full of
  "cost-optimized" horror stories.
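
A rough back-of-the-envelope sketch of the recovery point, in
Python; the drive counts, sizes, fill ratio and the assumed
sustained recovery rate are illustrative assumptions, not
measurements from any cluster:

    # Hours needed to re-create the data that lived on one failed
    # node, at an assumed sustained cluster-wide recovery rate in
    # MB/s. All numbers are illustrative assumptions, not benchmarks.
    def recovery_hours(drives_per_node, drive_tb, fill_ratio, recovery_mb_s):
        data_mb = drives_per_node * drive_tb * fill_ratio * 1_000_000
        return data_mb / recovery_mb_s / 3600

    # Dense "cost-optimized" node: 24 x 20 TB HDDs, 60% full,
    # assuming ~1 GB/s of recovery traffic from the rest of the cluster.
    print(round(recovery_hours(24, 20, 0.6, 1000), 1), "h")  # -> 80.0 h
    # Small node: 4 x 4 TB drives, same fill ratio and recovery rate.
    print(round(recovery_hours(4, 4, 0.6, 1000), 1), "h")    # -> 2.7 h

The point being that the dense node keeps the cluster degraded
and congested for days rather than hours.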

Note: larger HDDs have really low IOPS-per-TB; SSDs avoid that
issue, but cheap SSDs do not have PLP, so their write IOPS are
much lower than their read IOPS. Whether the drive is an SSD or
an HDD, larger drives also usually mean large PGs, which is not
so good. With SSDs it is at least possible (and in some cases
advisable) to split them into multiple OSDs, though.
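
To put some (generic, ballpark) numbers on that; the per-drive
IOPS figures below are assumptions typical of 7.2k RPM HDDs and
of SSDs with and without PLP, not specs of any particular
product:

    # IOPS-per-TB for a few illustrative drives. A 7.2k HDD delivers
    # roughly the same ~100-200 random IOPS whether it is 4 TB or
    # 20 TB, so IOPS-per-TB collapses as capacity grows; cheap SSDs
    # without PLP lose most of their sync-write IOPS.
    def iops_per_tb(iops, capacity_tb):
        return iops / capacity_tb

    print(iops_per_tb(150, 4))     # 4 TB HDD              -> ~38 IOPS/TB
    print(iops_per_tb(150, 20))    # 20 TB HDD             -> ~8 IOPS/TB
    print(iops_per_tb(80_000, 8))  # 8 TB SSD, reads       -> 10000 IOPS/TB
    print(iops_per_tb(5_000, 8))   # same SSD, sync writes
                                   # without PLP           -> 625 IOPS/TB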

> remember that you don't have to fully populate nodes with
> drives, at least not initially.

That is indeed a good suggestion: the fewer the drives per
server the better. Ideally just one drive per server :-).

> I sometimes recommend a minimum of 7 nodes so that 4+2 or 3+3
> EC can be done safely.

For me 7 buckets with 6-wide EC stripes is less desirable, as it
means that *any* bucket failure will degrade *nearly all*
objects, and if one uses CephFS, which chunks files across
objects, too bad. I have seen long, long, long recovery storms
in configurations like that.
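
The arithmetic behind that (a sketch, assuming CRUSH places each
PG on k+m distinct buckets chosen roughly uniformly; the host
counts are just the ones being discussed here):

    from math import comb

    # Fraction of PGs (and hence objects) that lose a shard/replica
    # when one host fails, if each PG is placed on `width` distinct
    # hosts out of n_hosts: C(n-1, w-1) / C(n, w) == width / n_hosts.
    def degraded_fraction(n_hosts, width):
        return comb(n_hosts - 1, width - 1) / comb(n_hosts, width)

    print(degraded_fraction(7, 6))  # 4+2 or 3+3 EC over 7 hosts -> ~0.86
    print(degraded_fraction(7, 3))  # 3-way replication, 7 hosts -> ~0.43

So with 6-wide stripes over 7 buckets roughly 6 out of every 7
PGs are degraded by a single bucket failure, and recovery has to
rewrite shards for nearly the whole cluster.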

> Seven nodes half-full is better in multiple ways than 4 nodes
> fully populated.

Indeed. But with 6-7 buckets I would only use 3-way
replication.