I don't understand why min_size = 2 would kill latency. Regardless of your min_size, a write to Ceph does not ack until it completes on all copies. That means that even with min_size = 1 the write will not be acknowledged until it's written to the NVMe, the SSD, and the HDD (given your proposed setup). Every single write will have to hit the HDD before it acks. The performance boost you gain from using primary affinity to make the SSDs/NVMes primary and the HDDs secondary is in the reads, not the writes.
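For reference, the read-side primary affinity setup would look something like this (the OSD IDs here are just examples, and on older releases the mon option has to be enabled first):

```shell
# Allow primary affinity to be adjusted (needed on some older releases)
ceph tell mon.* injectargs '--mon_osd_allow_primary_affinity=true'

# Bias primary selection toward the fast devices (OSD IDs are examples)
ceph osd primary-affinity osd.0 1.0   # NVMe - preferred primary
ceph osd primary-affinity osd.3 0.5   # SATA SSD
ceph osd primary-affinity osd.6 0.0   # HDD - never primary
```

Reads are served by the primary, so this moves read traffic off the HDDs; writes still touch every replica regardless.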
Having journals in front of the HDDs will still have the write happen initially to the SSD. If you can't configure that in Bluestore for the HDDs, then don't use Bluestore... There's no reason you can't use Filestore with SSD/NVMe journals in front of your HDDs if it performs faster for your configuration. Bluestore is not the fastest solution for all use cases, and Filestore is not getting deprecated... yet.

Another note: with 100GB DC S3510 and DC S3500 SSDs each journaling for 4 HDDs, they ran out of write endurance in under 18 months in a large RBD cluster. The DC S3520 is not drastically more durable than those, so I wouldn't recommend using them. The DC S3610s are much more durable and not much more expensive.

On Mon, Aug 21, 2017 at 4:25 PM Xavier Trilla <[email protected]> wrote:

> Hi,
>
> I'm working on improving the costs of our current Ceph cluster. We currently keep 3 replicas, all of them on SSDs (that cluster hosts several hundred VMs' RBD disks), and lately I've been wondering whether the following setup would make sense in order to improve cost/performance.
>
> The idea would be to move PG primaries to high-performance nodes using NVMe, keep the secondary replica on SSDs, and move the third replica to HDDs.
>
> Most probably the hardware will be:
>
> 1st replica: Intel P4500 NVMe (2TB)
> 2nd replica: Intel S3520 SATA SSD (1.6TB)
> 3rd replica: WD Gold hard drives (2TB) (I'm considering either the 1TB or 2TB model, as I want to have as many spindles as possible)
>
> Also, the hosts running the OSDs would have quite different HW configurations (in our experience NVMe needs serious CPU power in order to get the best out of it).
>
> I know the NVMe and SATA SSD replicas will work, no problem about that (we'll just adjust the primary affinity and crushmap in order to get the desired data layout + primary OSDs); what I'm worried about is the HDD replica.
> Also, the pool will have min_size 1 (I would love to use min_size 2, but it would kill latency), so even if we have to do some maintenance on the NVMe nodes, writes to the HDDs will always be "lazy".
>
> Before Bluestore (we are planning to move to Luminous, most probably by the end of the year or the beginning of 2018, once it is released and tested properly) I would just use SSD/NVMe journals for the HDDs. So all writes would go to the SSD journal and then be moved to the HDD. But now, with Bluestore, I don't think that's an option anymore.
>
> What I'm worried about is how having a quite slow third replica would affect the NVMe primary OSDs. WD Gold hard drives seem quite decent (for SATA drives), but obviously their performance is nowhere near SSDs or NVMe.
>
> So, what do you think? Does anybody have opinions or experience they would like to share?
>
> Thanks!
> Xavier.
>
> _______________________________________________
> ceph-users mailing list
> [email protected]
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
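For completeness, the Filestore-with-external-journal setup mentioned above is provisioned at OSD creation time, not in Bluestore. With the ceph-disk tooling of that era it looks roughly like this (device paths are examples, and the --filestore flag is only needed once Bluestore becomes the default in Luminous):

```shell
# Prepare an HDD OSD with its journal on a partition of the fast device
# (/dev/sdb = HDD data disk, /dev/nvme0n1 = shared journal device; example paths)
ceph-disk prepare --filestore /dev/sdb /dev/nvme0n1
ceph-disk activate /dev/sdb1
```

The journal size is set in ceph.conf before preparing the OSDs, e.g.:

```shell
# [osd] section of ceph.conf (10 GB journal; sizing is workload-dependent)
# osd_journal_size = 10240
```

With this layout every write acks once it hits the SSD/NVMe journal and is flushed to the HDD afterwards, which is the "lazy" write behavior being asked about.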
