I don't understand why min_size = 2 would kill latency. Regardless of your min_size, a write to Ceph does not ack until it completes on all copies. That means that even with min_size = 1 the write will not be acknowledged until it's written to the NVMe, the SSD, and the HDD (given your proposed setup). Every single write will have to hit the HDD before it acks. The performance boost you gain from using primary affinity to make the SSDs/NVMes primary and the HDDs secondary is in the reads, not the writes.
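For reference, the read-side primary affinity setup would look something like this (the OSD IDs here are just examples, and on older releases the mon option has to be enabled first):

```shell
# Allow primary affinity to be adjusted (needed on some older releases)
ceph tell mon.* injectargs '--mon_osd_allow_primary_affinity=true'

# Bias primary selection toward the fast devices (OSD IDs are examples)
ceph osd primary-affinity osd.0 1.0   # NVMe - preferred primary
ceph osd primary-affinity osd.3 0.5   # SATA SSD
ceph osd primary-affinity osd.6 0.0   # HDD - never primary
```

Reads are served by the primary, so this moves read traffic off the HDDs; writes still touch every replica regardless.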
Having journals in front of the HDDs will still have the write happen initially to the SSD. If you can't configure that in Bluestore for the HDDs, then don't use Bluestore... There's no reason you can't use Filestore with SSD/NVMe journals in front of your HDDs if it performs faster for your configuration. Bluestore is not the fastest solution for all use cases, and Filestore is not getting deprecated... yet.

Another note: with 100GB DC S3510 and DC S3500 SSDs each journaling for 4 HDDs, they ran out of write endurance in under 18 months in a large RBD cluster. The DC S3520 is not drastically more durable than those, so I wouldn't recommend using them. The DC S3610s are much more durable and not much more expensive.

On Mon, Aug 21, 2017 at 4:25 PM Xavier Trilla <[email protected]> wrote:

> Hi,
>
> I'm working on improving the costs of our current Ceph cluster. We currently keep 3 replicas, all of them on SSDs (that cluster hosts several hundred VMs' RBD disks), and lately I've been wondering whether the following setup would make sense in order to improve cost/performance.
>
> The idea would be to move PG primaries to high-performance nodes using NVMe, keep the secondary replica on SSDs, and move the third replica to HDDs.
>
> Most probably the hardware will be:
>
> 1st replica: Intel P4500 NVMe (2TB)
> 2nd replica: Intel S3520 SATA SSD (1.6TB)
> 3rd replica: WD Gold hard drives (2TB) (I'm considering either the 1TB or 2TB model, as I want to have as many spindles as possible)
>
> Also, the hosts running the OSDs would have quite different HW configurations (in our experience NVMe needs serious CPU power in order to get the best out of it).
>
> I know the NVMe and SATA SSD replicas will work, no problem about that (we'll just adjust the primary affinity and crushmap in order to get the desired data layout + primary OSDs); what I'm worried about is the HDD replica.
> Also, the pool will have min_size 1 (I would love to use min_size 2, but it would kill latency), so even if we have to do some maintenance on the NVMe nodes, writes to the HDDs will always be "lazy".
>
> Before Bluestore (we are planning to move to Luminous, most probably by the end of the year or the beginning of 2018, once it is released and tested properly) I would just use SSD/NVMe journals for the HDDs. So all writes would go to the SSD journal and then be moved to the HDD. But now, with Bluestore, I don't think that's an option anymore.
>
> What I'm worried about is how having a quite slow third replica would affect the NVMe primary OSDs. WD Gold hard drives seem quite decent (for SATA drives), but obviously their performance is nowhere near SSDs or NVMe.
>
> So, what do you think? Does anybody have opinions or experience they would like to share?
>
> Thanks!
> Xavier.
>
> _______________________________________________
> ceph-users mailing list
> [email protected]
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
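For completeness, the Filestore-with-external-journal setup mentioned above is provisioned at OSD creation time, not in Bluestore. With the ceph-disk tooling of that era it looks roughly like this (device paths are examples, and the --filestore flag is only needed once Bluestore becomes the default in Luminous):

```shell
# Prepare an HDD OSD with its journal on a partition of the fast device
# (/dev/sdb = HDD data disk, /dev/nvme0n1 = shared journal device; example paths)
ceph-disk prepare --filestore /dev/sdb /dev/nvme0n1
ceph-disk activate /dev/sdb1
```

The journal size is set in ceph.conf before preparing the OSDs, e.g.:

```shell
# [osd] section of ceph.conf (10 GB journal; sizing is workload-dependent)
# osd_journal_size = 10240
```

With this layout every write acks once it hits the SSD/NVMe journal and is flushed to the HDD afterwards, which is the "lazy" write behavior being asked about.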
