On 08/23/2017 07:17 PM, Mark Nelson wrote:


On 08/23/2017 06:18 PM, Xavier Trilla wrote:
Oh man, what do you know!... I'm quite amazed. I've been reviewing
more documentation about min_size and it seems it doesn't work the
way I thought (although I remember specifically reading it somewhere
some years ago :/ ).

And, as all replicas need to be written before the primary OSD
acknowledges the write to the client, we cannot have the third
replica on HDDs, no way. It would kill latency.

Well, we'll just keep adding NVMe drives to our cluster (the S4500 and
P4500 price difference is negligible anyway) and we'll decrease the
primary affinity weight for the SATA SSDs, just to be sure we get the
most out of the NVMe.
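
For reference, in case it's useful to anyone else, primary affinity can be
lowered per OSD at runtime; a rough sketch (the OSD IDs are placeholders,
and on older releases you may also need "mon osd allow primary affinity =
true" on the mons):

    # Make the SATA SSD OSDs less likely to be chosen as primary
    ceph osd primary-affinity osd.10 0.25
    ceph osd primary-affinity osd.11 0.25
    # NVMe OSDs keep the default primary affinity of 1.0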

BTW, does anybody have any experience so far with erasure coding and
rbd? A 2/3 profile would really save space on SSDs, but I'm worried
about the extra computation needed and how it will affect
performance... Well, maybe I'll look into it and start a new
thread :)
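
In case it matters, what I have in mind is roughly the following (a
Luminous-only sketch, since RBD on EC needs overwrite support; pool and
profile names are just examples, and the image metadata still lives in a
replicated pool):

    # "2/3" profile, i.e. k=2 data chunks + m=1 coding chunk
    ceph osd erasure-code-profile set ec21 k=2 m=1 crush-failure-domain=host
    ceph osd pool create rbd-ec-data 128 128 erasure ec21
    ceph osd pool set rbd-ec-data allow_ec_overwrites true
    # metadata in the replicated "rbd" pool, data in the EC pool
    rbd create --size 100G --data-pool rbd-ec-data rbd/test-image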

There's a decent chance you'll get higher performance with something
like EC 6+2 vs 3X replication for large writes, simply due to having less
data to write (we see somewhere between 2x and 3x rep performance in the
lab for 4MB writes to RBD). Small random writes will almost certainly be
slower due to increased latency. Reads in general will be slower as
well: with replication the read comes entirely from the primary, but with
EC you have to fetch chunks from the secondaries and reconstruct the
object before sending it back to the client.

So basically, compared to 3X rep you'll likely gain some performance on
large writes, lose some performance on large reads, and lose more
performance on small writes/reads (depending on CPU speed and various
other factors).
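
For concreteness, the raw write/space arithmetic behind that (this is for
large, full-stripe writes; small writes pay extra read-modify-write costs
as noted above):

    3x replication: 1 GB from the client -> 3 GB written/stored (3.0x overhead)
    EC 6+2:         1 GB from the client -> (6+2)/6 ~= 1.33 GB written/stored
    EC 2+1:         1 GB from the client -> (2+1)/2  = 1.5 GB written/stored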

I should follow up and mention though that you gain space vs 3X as well, so it's very much a question of what trade-offs you are willing to make.


Mark


Anyway, thanks for the info!
Xavier.

-----Original Message-----
From: Christian Balzer [mailto:ch...@gol.com]
Sent: Tuesday, August 22, 2017 2:40
To: ceph-users@lists.ceph.com
CC: Xavier Trilla <xavier.tri...@silicontower.net>
Subject: Re: [ceph-users] NVMe + SSD + HDD RBD Replicas with Bluestore...


Hello,


Firstly, what David said.

On Mon, 21 Aug 2017 20:25:07 +0000 Xavier Trilla wrote:

Hi,

I'm working on improving the costs of our current Ceph cluster. We
currently keep 3 replicas, all of them on SSDs (that cluster hosts the
RBD disks of several hundred VMs), and lately I've been wondering if the
following setup would make sense, in order to improve cost /
performance.


Have you done a full analysis of your current cluster, as in
utilization of your SSDs (IOPS), CPU, etc., with
atop/iostat/collectd/grafana?
During peak utilization times?

If so, you should have a decent enough idea of what level of IOPS you
need and can design from there.
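
If you want a quick spot check during peak hours, something along these
lines already tells you a lot (device names will obviously differ):

    # extended per-device stats every 5 seconds: watch r/s, w/s, await, %util
    iostat -x 5
    # per-process and per-disk view at the same interval
    atop 5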

The ideal would be to move PG primaries to high-performance nodes
using NVMe, keep the secondary replica on SSDs and move the third
replica to HDDs.

Most probably the hardware will be:

1st Replica: Intel P4500 NVMe (2TB)
2nd Replica: Intel S3520 SATA SSD (1.6TB)
Unless you have:
a) a lot of these and/or
b) very few writes,
what David said.

Aside from that, the whole replica idea does not work the way you think.

3rd Replica: WD Gold hard drives (2 TB) (I'm considering either the 1TB
or the 2TB model, as I want to have as many spindles as possible)

Also, the hosts running these OSDs would have quite different HW
configurations (in our experience NVMe drives need serious CPU power in
order to get the best out of them).

Correct, one might run into that with pure NVMe/SSD nodes.

I know the NVMe and SATA SSD replicas will work, no problem about
that (we'll just adjust the primary affinity and the crushmap in order
to get the desired data layout + primary OSDs); what I'm worried about
is the HDD replica.
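
For completeness, the kind of crushmap rule I have in mind is roughly the
following (an untested sketch based on the Luminous device classes; the
class and bucket names have to match what the OSDs actually report, and
the first OSD chosen acts as primary unless primary affinity says
otherwise):

    rule nvme-ssd-hdd {
        id 1
        type replicated
        min_size 1
        max_size 3
        step take default class nvme
        step chooseleaf firstn 1 type host
        step emit
        step take default class ssd
        step chooseleaf firstn 1 type host
        step emit
        step take default class hdd
        step chooseleaf firstn 1 type host
        step emit
    }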

Also, the pool will have min_size 1 (I would love to use min_size 2, but
it would kill latency) so, even if we have to do some maintenance on the
NVMe nodes, writes to the HDDs will always be "lazy".
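
That part is just a pool setting; with our actual pool name in place of
"rbd" it would be:

    ceph osd pool set rbd min_size 1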

Before Bluestore (we are planning to move to Luminous most probably
by the end of the year or the beginning of 2018, once it is released and
properly tested) I would just use SSD/NVMe journals for the HDDs.
So all writes would go to the SSD journal and then be flushed to the
HDD. But now, with Bluestore, I don't think that's an option anymore.

Bluestore bits are still a bit of dark magic in terms of concise and
complete documentation, but the essentials have been mentioned here
before.

Essentially, if you can get the needed IOPS with SSD/NVMe journals and
HDDs, Bluestore won't be worse than that, if done correctly.

With Bluestore, use NVMe for the WAL (small space, high IOPS per amount
of data), SSDs for the actual RocksDB and the (surprise, surprise!)
journal for small writes (larger space, nobody knows for sure how large
is large enough), and finally the HDDs for the data.

If you're trying to optimize costs, decent SSDs (good luck finding any,
with the Intel 37xx and 36xx basically unavailable), maybe the S4600 or
P4600, to hold both the WAL and the DB should do the trick.
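
If it helps, a rough sketch of what that looks like when creating such an
OSD with the Luminous-era tooling (device paths and sizes are just
examples, and my ceph-disk syntax may be slightly off):

    # ceph.conf [osd] section: size the DB/WAL partitions ceph-disk creates
    bluestore_block_db_size = 32212254720    # ~30 GB per OSD, a guess
    bluestore_block_wal_size = 2147483648    # 2 GB per OSD

    # HDD holds the data, the NVMe/SSD gets the DB and WAL partitions
    ceph-disk prepare --bluestore --block.db /dev/nvme0n1 \
        --block.wal /dev/nvme0n1 /dev/sdb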

Christian

What I'm worried about is how having a quite slow third replica would
affect the NVMe primary OSDs. WD Gold hard drives seem quite decent
(for a SATA drive), but obviously their performance is nowhere near SSDs
or NVMe.

So, what do you think? Does anybody have opinions or experience they
would like to share?

Thanks!
Xavier.





_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com