>> For Ceph they are really necessary for log/database drives
>> and ideally also for Ceph and CephFS metadata but non-PLP
>> SSDs can be used if they are many for data if the workload
>> does not require lots of small data writes.

> It’s not about IOPS, it's about not losing data in flight when
> the host/rack/DC loses power.

* The filesystem (BlueStore or whatever) or DBMS (RocksDB or
  whatever) is required to issue "write buffer commit" commands
  to the device whenever necessary, so that metadata and data
  are not lost from the cache on power loss and stay consistent
  (a minimal sketch of such a commit is below, after the note).

* The main effect of PLP is to allow the device to ignore "write
  buffer commit" commands and do commits less often, and that is
  what improves IOPS (and reduces write amplification in the
  case of SSDs).

* _Some_ loss is inevitable: there can be data still in the
  middle of being written if the power loss happens during a
  "write to device" operation, and efficient filesystems can do
  pretty large ones.

Note: as to the latter point, some non-PLP SSDs still have a
minimal capacitor backup, enough to complete writes from the RAM
write buffer to the flash chips. Those that do not are less
desirable :-).
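
To make the first point concrete, here is a minimal sketch (in
Python, not actual BlueStore or RocksDB code; the path and the
function name are just examples) of what a "write buffer commit"
looks like from the software side: an ordinary write followed by
a flush that asks the device to make it durable.

  # Minimal sketch, not Ceph code: a durable append as the
  # filesystem or DBMS sees it.
  import os

  def durable_append(path: str, payload: bytes) -> None:
      fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o600)
      try:
          os.write(fd, payload)
          # This is the "write buffer commit": ask the device to
          # move its volatile write buffer to stable media. A PLP
          # drive can acknowledge it almost immediately because
          # capacitors guarantee the buffered data survives a
          # power cut; a non-PLP drive must actually flush, which
          # is what costs IOPS and adds write amplification.
          os.fsync(fd)
      finally:
          os.close(fd)

  durable_append("/tmp/wal.log", b"commit record\n")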

> [...] 4 hosts is the prudent minimum for R3 pools. For EC,
> k+m+1 at a minimum. [...]

But the prudent minimum number of buckets is usually not a great
idea, because of the large percentage of object replicas damaged
by a single bucket failure. I would prefer, as a minimum, twice
the width of the replication/stripe for buckets, especially for
EC pools, where recovery is particularly expensive.
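
Some back-of-envelope arithmetic shows why (this assumes CRUSH
places at most one shard of a PG per bucket and spreads PGs
evenly; the little function is only mine, for illustration):

  # Rough sketch: fraction of PGs that lose a shard when one of
  # N buckets fails, for a replication/stripe width W.
  def fraction_degraded(width: int, buckets: int) -> float:
      # a PG touches `width` of the `buckets`, so a random
      # bucket failure hits it with probability width / buckets
      return min(1.0, width / buckets)

  for width, label in [(3, "R3"), (6, "EC 4+2"), (8, "EC 6+2")]:
      minimal = width + 1   # the "prudent minimum" quoted above
      roomy = 2 * width     # twice the stripe width
      print(f"{label}: ~{fraction_degraded(width, minimal):.0%} of PGs"
            f" degraded with {minimal} buckets,"
            f" ~{fraction_degraded(width, roomy):.0%} with {roomy}")

With the bare minimum most PGs are degraded by a single bucket
failure; with twice the width only about half of them are.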

> I usually recommend k + m < 12. Above, say, 6+2 the space
> savings by going wider diminish, and the write performance and
> other factors often aren’t worth the increasingly thin delta.
> [...] I’ve seen a cluster with dense toploaders full of
> spinners require that weighting up a single 8TB SSD had to be
> done over 4 weeks so as to not tank the S3 API.

Good anecdote (I have seen several similar cases). So I think
that 4+2 (only in desperate cases 6+2) is the widest that is
cost-effective; wider profiles may cost less up-front, but the
lifetime cost in terms of huge recovery times and the impact on
the client workload can be quite bad.
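
Some illustrative arithmetic on why the savings diminish (the
numbers are just the profiles themselves, nothing measured):

  # Raw space needed per byte of user data for a few EC
  # profiles, plus how many shards a repair must read.
  for k, m in [(2, 2), (4, 2), (6, 2), (8, 2), (10, 2)]:
      overhead = (k + m) / k   # raw bytes stored per user byte
      print(f"EC {k}+{m}: {overhead:.2f}x raw space,"
            f" repair reads {k} shards per lost shard")

Going from 4+2 to 6+2 saves about 11% of raw space, from 6+2 to
8+2 only about 6%, while every repair has to read proportionally
more shards.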

> At least one commercial vendor considers a cluster that
> requires > 8 hours to recover from the loss of a single OSD to
> be unsupported.

That is a good argument from that vendor.
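
It is also easy to check against with made-up but plausible
numbers (both figures below are hypothetical, adjust to taste):

  # Rough estimate: hours to re-replicate one failed OSD when
  # recovery is throttled to keep the client workload usable.
  def recovery_hours(used_tib: float, recovery_mib_s: float) -> float:
      used_mib = used_tib * 1024 * 1024   # TiB -> MiB
      return used_mib / recovery_mib_s / 3600

  # a 70%-full 20 TiB spinner behind ~100 MiB/s of effective
  # recovery bandwidth: roughly 40 hours, far beyond 8
  print(f"{recovery_hours(0.7 * 20, 100):.0f} hours")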
 
> If DC space is free, sure. Until it isn’t. [...]

There are false economies, and a major one is accepting a very
high cost for "maintenance" operations like recovery and
scrubbing, as in the anecdote above.

That said, fiber infrastructure makes distance rather less
important than in the past, so data centres can be located in
cheaper areas (also I am a fan of "colocation" data centres).

Still, High Frequency Trading people like to put their servers
as near as possible to the physical location of exchange data
centres to really minimize latency, and space there is usually
tight and expensive, so high-density servers and racks make more
sense.

> Just don’t waste money on 3 DWPD SKUs or RAID HBAs.

Usually, but I have seen some cases where the DB/WAL SSDs really
should have been 3 DWPD, especially if one SSD is shared among
several data-only OSDs, which is almost inevitable in the case
of servers with many HDDs. In those cases it is better to be
safe than sorry.
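
A quick, entirely hypothetical sizing check shows how that can
happen (all figures below are made up, but not implausible):

  # Does a shared DB/WAL SSD need more than 1 DWPD?
  def dwpd_needed(hdd_osds: int, gib_per_osd_day: float,
                  wal_db_amplification: float,
                  ssd_capacity_gib: float) -> float:
      daily = hdd_osds * gib_per_osd_day * wal_db_amplification
      return daily / ssd_capacity_gib

  # 12 HDD OSDs sharing one 960 GiB SSD, each ingesting about
  # 150 GiB/day, with ~2x extra writes from the WAL plus RocksDB
  # compaction: close to 4 drive writes per day
  print(f"{dwpd_needed(12, 150, 2.0, 960):.1f} DWPD")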