Hi Sergio

On 04/07/2016 07:00 AM, Sergio A. de Carvalho Jr. wrote:
Hi all,

I've setup a testing/development Ceph cluster consisting of 5 Dell
PowerEdge R720xd servers (256GB RAM, 2x 8-core Xeon E5-2650 @ 2.60 GHz,
dual-port 10Gb Ethernet, 2x 900GB + 12x 4TB disks) running CentOS 6.5
and Ceph Hammer 0.94.6. All servers use one 900GB disk for the root
partition and the other 13 disks are assigned to OSDs, so we have 5 x 13
= 65 OSDs in total. We also run 1 monitor on every host. Journals are
5GB partitions on each disk (this is something we obviously will need to
revisit later). The purpose of this cluster will be to serve as a
backend storage for Cinder volumes and Glance images in an OpenStack cloud.

With this setup, I'm getting what I'm considering an "okay" performance:

# rados -p images bench 5 write
  Maintaining 16 concurrent writes of 4194304 bytes for up to 5 seconds
or 0 objects

Total writes made:      394
Write size:             4194304
Bandwidth (MB/sec):     299.968

Stddev Bandwidth:       127.334
Max bandwidth (MB/sec): 348
Min bandwidth (MB/sec): 0
Average Latency:        0.212524
Stddev Latency:         0.13317
Max latency:            0.828946
Min latency:            0.07073

Does that look acceptable? How much more can I expect to achieve by
fine-tunning and perhaps using a more efficient setup?

I'll assume 3x replication for these tests. Under reasonable conditions you should be able to get roughly 70MB/s of raw throughput per standard 7200rpm spinning disk with Ceph using filestore on XFS. For 65 OSDs, that's about 4.5GB/s. Divide that by 3 for replication and you get 1.5GB/s. Now add the journal double-write penalty and you are down to about 750MB/s. So I'd say your aggregate throughput here is lower than what you might ideally see.
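
Spelled out, the back-of-envelope math is roughly:

  65 OSDs x ~70 MB/s          = ~4550 MB/s raw
  / 3 (replication)           = ~1500 MB/s
  / 2 (journal double write)  = ~750 MB/s expected aggregate

versus the ~300MB/s you're seeing with 16 concurrent 4MB writes.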

The first step would probably be to increase the concurrency and see how much that helps.
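Something along these lines would be a reasonable starting point (the thread count and runtime here are just illustrative; a longer run also smooths out the numbers):

# rados -p images bench 60 write -t 32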


I do understand the bandwidth above is a product of running 16
concurrent writes with rather small objects (4MB). Bandwidth drops
significantly with 64MB objects and 1 thread:

# rados -p images bench 5 write -b 67108864 -t 1
  Maintaining 1 concurrent writes of 67108864 bytes for up to 5 seconds
or 0 objects

Total writes made:      7
Write size:             67108864
Bandwidth (MB/sec):     71.520

Stddev Bandwidth:       24.1897
Max bandwidth (MB/sec): 64
Min bandwidth (MB/sec): 0
Average Latency:        0.894792
Stddev Latency:         0.0547502
Max latency:            0.99311
Min latency:            0.832765

Is such a drop expected?

Yep! Concurrency is really important for distributed systems and Ceph is no exception. If you only keep 1 write in flight, you can't really expect better than the performance of a single OSD. Ceph writes a full copy of the data to the journal before sending a write acknowledgement to the client, and every replica write also has to be fully written to the journal on the secondary OSDs. These writes happen in parallel, but they add latency, and overall you'll only be as fast as the slowest of those journal writes. In your case you also have the filestore's filesystem writes contending with the journal writes, since the journals are co-located on the same disks.
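
If you want to see how much those journal commits are costing on your spinners, the per-OSD latency dump is worth a look (I believe this is available in Hammer); it reports fs_commit_latency and fs_apply_latency for each OSD:

# ceph osd perf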

In this case you're probably only getting 71MB/s because the test is so short; with co-located journals I'd expect a longer-running test to actually come in lower than that.
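
You can see the effect directly in your own numbers: with only one write in flight, bandwidth is just object size divided by per-write latency, i.e. 64MB / 0.895s average latency = ~71.5MB/s, which is exactly what the benchmark reported.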


Now, what I'm really concerned is about upload times. Uploading a
randomly-generated 1GB file takes a bit too long:

# time rados -p images put random_1GB /tmp/random_1GB

real    0m35.328s
user    0m0.560s
sys     0m3.665s

Is this normal? If so, if I setup this cluster as a backend for Glance,
does that mean uploading a 1GB image will require 35 seconds (plus
whatever time Glance requires to do its own thing)?

And here's where you are getting less than you'd like. I'd hope for a little faster than 29MB/s, but given how your cluster is set up, 30-40MB/s is probably about right. If you need this use case to be faster, you have a couple of options.
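
(That 29MB/s is just your 1GB upload divided by the wall-clock time: 1024MB / 35.3s = ~29MB/s.)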

1) Wait for bluestore to become production ready. This is the new OSD backend that specifically avoids full-data journal writes for large sequential write IO. Expect per-OSD speed to be around 1.5-2X faster in this case for spinning-disk-only clusters.

2) Move the journals off the disks. A common way to do this is to buy a couple of very fast, high-write-endurance NVMe drives or SSDs. Some of the newer NVMe drives are fast enough to support journals for 15-20 spinning disks each; just make sure they have enough write endurance to meet your needs. Assuming no other bottlenecks, this usually gets you close to a 2X improvement for large write IO.

3) If that's not good enough, you might consider buying a small set of SSDs/NVMes for a dedicated SSD pool for specific cases like this. Even in that setup, you'll likely see higher performance with more concurrency. Here's an example I just ran on a 4-node cluster using a single fast NVMe drive per node (a rough sketch of carving out such a pool follows the numbers):

rados -p cbt-librbdfio bench 30 write -t 1
Bandwidth (MB/sec):     180.205

rados -p cbt-librbdfio bench 30 write -t 16
Bandwidth (MB/sec):     1197.11
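
Carving out a dedicated SSD pool is mostly a CRUSH exercise. A minimal sketch, assuming your SSD OSDs live under their own CRUSH root, with placeholder bucket/pool names throughout (double-check the exact syntax against the Hammer docs):

# ceph osd crush add-bucket ssd root
# ceph osd crush rule create-simple ssd_rule ssd host
# ceph osd pool create fastpool 128 128
# ceph osd pool set fastpool crush_ruleset <rule id from "ceph osd crush rule dump">

You'd also need to move the SSD OSDs under the new "ssd" root (e.g. with "ceph osd crush set" or a crush location hook) and then point Glance/Cinder at that pool.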



Thanks,

Sergio




_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
