>> For Ceph they are really necessary for log/database drives
>> and ideally also for Ceph and CephFS metadata but non-PLP
>> SSDs can be used if they are many for data if the workload
>> does not require lots of small data writes.
> It’s not about IOPS, it's about not losing data in flight when
> the host/rack/DC loses power.

* The filesystem (BlueStore or whatever) or DBMS (RocksDB or
  whatever) is required to issue "write buffer commit" commands to
  the device whenever necessary to ensure that metadata or data is
  not lost to cache power loss and remains consistent.

* The main effect of PLP is to allow the device to ignore "write
  buffer commit" commands and do commits less often, and that is
  what improves IOPS (and write amplification in the case of SSDs).

* _Some_ loss is inevitable: there can be data still in the middle
  of being written if the power loss happens during a "write to
  device" operation, and efficient filesystems can do pretty large
  ones.

Note: as to the latter point, some non-PLP SSDs still have a
minimal capacitor backup to be able to complete writes from the RAM
write buffer to the flash chips. Those that do not are less
desirable :-).

> [...] 4 hosts is the prudent minimum for R3 pools. For EC,
> k+m+1 at a minimum. [...]

But the prudent minimum number of buckets is usually not a great
idea because of the large percentage of object replicas damaged by
a single bucket failure. As a minimum I would prefer twice the
width of the replication/EC stripe for buckets (for a 4+2 pool that
means 12 failure-domain buckets rather than the minimal 7),
especially for EC pools, where recovery is particularly expensive.

> I usually recommend k + m < 12. Above, say, 6+2 the space
> savings by going wider diminish, and the write performance and
> other factors often aren’t worth the increasingly thin delta.
> [...] I’ve seen a cluster with dense toploaders full of
> spinners require that weighting up a single 8TB SSD had to be
> done over 4 weeks so as to not tank the S3 API.

Good anecdote (I have seen several similar cases). So I think that
4+2 (only in desperate cases 6+2) is the widest profile that is
cost-effective; wider profiles may have a lower up-front cost, but
the lifetime cost in terms of huge recovery times and of the impact
on client workloads can be quite bad.

> At least one commercial vendor considers a cluster that
> requires > 8 hours to recover from the loss of a single OSD to
> be unsupported.

That is a good argument from that vendor.

> If DC space is free, sure. Until it isn’t. [...]

There are false economies, and a major one is to accept a very high
cost for "maintenance" operations like recovery and scrubbing, as
in the anecdote above. That said, fiber infrastructure makes
distance rather less important than in the past, so data centres
can be located in cheaper areas (I am also a fan of "colocation"
data centres). Still, High Frequency Trading people like to put
their servers as near as possible to the physical location of
exchange data centres to really minimize latency, and space there
is usually tight and expensive, so high-density servers and racks
make more sense for them.

> Just don’t waste money on 3 DWPD SKUs or RAID HBAs.

Usually, but I have seen some cases where the DB/WAL SSDs really
should have been 3 DWPD, especially if one SSD is shared among many
data-only OSDs, which is almost inevitable in servers with many
HDDs. In those cases it is better to be safe than sorry.
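
To put rough numbers on that, here is a back-of-the-envelope sketch
in Python; every figure in it (SSD capacity, number of HDD OSDs
sharing the device, per-OSD WAL/DB traffic, write amplification) is
an invented assumption for illustration, not a measurement:

    # Rough endurance check for a DB/WAL SSD shared by many HDD OSDs.
    # All numbers are illustrative assumptions, not measurements.

    ssd_capacity_tb = 1.6      # shared DB/WAL SSD capacity (TB)
    dwpd = 1                   # rated drive writes per day
    warranty_years = 5

    # Rated endurance over the warranty period, in TB written (TBW).
    rated_tbw = ssd_capacity_tb * dwpd * 365 * warranty_years

    hdd_osds_sharing = 12      # HDD OSDs with their DB/WAL on this SSD
    per_osd_tb_per_day = 0.2   # small-write WAL/metadata traffic per OSD
    db_amplification = 2.0     # extra writes from RocksDB compaction etc.

    daily_writes_tb = hdd_osds_sharing * per_osd_tb_per_day * db_amplification
    years_to_wear_out = rated_tbw / (daily_writes_tb * 365)

    print(f"rated endurance : {rated_tbw:.0f} TBW")
    print(f"daily writes    : {daily_writes_tb:.1f} TB/day")
    print(f"wears out after : {years_to_wear_out:.1f} years")

With those assumptions a 1 DWPD device is worn out in well under two
years, while the same arithmetic with a 3 DWPD rating gives roughly
a five-year life under the same load.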
