Hi Sergio

On 04/07/2016 07:00 AM, Sergio A. de Carvalho Jr. wrote:
Hi all,

I've setup a testing/development Ceph cluster consisting of 5 Dell
PowerEdge R720xd servers (256GB RAM, 2x 8-core Xeon E5-2650 @ 2.60 GHz,
dual-port 10Gb Ethernet, 2x 900GB + 12x 4TB disks) running CentOS 6.5
and Ceph Hammer 0.94.6. All servers use one 900GB disk for the root
partition and the other 13 disks are assigned to OSDs, so we have 5 x 13
= 65 OSDs in total. We also run 1 monitor on every host. Journals are
5GB partitions on each disk (this is something we obviously will need to
revisit later). The purpose of this cluster will be to serve as a
backend storage for Cinder volumes and Glance images in an OpenStack cloud.

With this setup, I'm getting what I'm considering an "okay" performance:

# rados -p images bench 5 write
  Maintaining 16 concurrent writes of 4194304 bytes for up to 5 seconds
or 0 objects

Total writes made:      394
Write size:             4194304
Bandwidth (MB/sec):     299.968

Stddev Bandwidth:       127.334
Max bandwidth (MB/sec): 348
Min bandwidth (MB/sec): 0
Average Latency:        0.212524
Stddev Latency:         0.13317
Max latency:            0.828946
Min latency:            0.07073

Does that look acceptable? How much more can I expect to achieve by
fine-tunning and perhaps using a more efficient setup?

I'll assume 3x replication for these tests. Under reasonable conditions you should be able to get roughly 70MB/s of raw throughput per standard 7200rpm spinning disk with Ceph using filestore on XFS. For 65 OSDs, that's about 4.5GB/s. Divide that by 3 for replication and you get 1.5GB/s. Now add the journal double-write penalty and you are down to about 750MB/s. So I'd say your aggregate throughput here is lower than what you might ideally see.
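
Spelled out, the back-of-envelope math is roughly:

  65 OSDs x ~70 MB/s          = ~4550 MB/s raw
  / 3 (replication)           = ~1500 MB/s
  / 2 (journal double write)  = ~750 MB/s expected aggregate

versus the ~300MB/s you're seeing with 16 concurrent 4MB writes.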

The first step would probably be to increase the concurrency and see how much that helps.
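Something along these lines would be a reasonable starting point (the thread count and runtime here are just illustrative; a longer run also smooths out the numbers):

# rados -p images bench 60 write -t 32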


I do understand the bandwidth above is a product of running 16
concurrent writes with rather small objects (4MB). Bandwidth drops
significantly with 64MB objects and 1 thread:

# rados -p images bench 5 write -b 67108864 -t 1
  Maintaining 1 concurrent writes of 67108864 bytes for up to 5 seconds
or 0 objects

Total writes made:      7
Write size:             67108864
Bandwidth (MB/sec):     71.520

Stddev Bandwidth:       24.1897
Max bandwidth (MB/sec): 64
Min bandwidth (MB/sec): 0
Average Latency:        0.894792
Stddev Latency:         0.0547502
Max latency:            0.99311
Min latency:            0.832765

Is such a drop expected?

Yep! Concurrency is really important for distributed systems and Ceph is no exception. If you only keep 1 write in flight, you can't really expect better than the performance of a single OSD. Ceph writes a full copy of the data to the journal before sending a write acknowledgement to the client, and every replica write also has to be fully written to the journal on the secondary OSDs. These writes happen in parallel, but they add latency, and overall you'll only be as fast as the slowest of those journal writes. In your case you also have the filestore's filesystem writes contending with the journal writes, since the journals are co-located on the same disks.
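
If you want to see how much those journal commits are costing on your spinners, the per-OSD latency dump is worth a look (I believe this is available in Hammer); it reports fs_commit_latency and fs_apply_latency for each OSD:

# ceph osd perf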

In this case you're probably only getting 71MB/s because the test is so short; with co-located journals I'd expect a longer-running test to actually come in lower than that.
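
You can see the effect directly in your own numbers: with only one write in flight, bandwidth is just object size divided by per-write latency, i.e. 64MB / 0.895s average latency = ~71.5MB/s, which is exactly what the benchmark reported.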


Now, what I'm really concerned is about upload times. Uploading a
randomly-generated 1GB file takes a bit too long:

# time rados -p images put random_1GB /tmp/random_1GB

real    0m35.328s
user    0m0.560s
sys     0m3.665s

Is this normal? If so, if I setup this cluster as a backend for Glance,
does that mean uploading a 1GB image will require 35 seconds (plus
whatever time Glance requires to do its own thing)?

And here's where you are getting less than you'd like. I'd hope for a little faster than 29MB/s, but given how your cluster is set up, 30-40MB/s is probably about right. If you need this use case to be faster, you have a couple of options.
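
(That 29MB/s is just your 1GB upload divided by the wall-clock time: 1024MB / 35.3s = ~29MB/s.)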

1) Wait for bluestore to become production ready. This is the new OSD backend that specifically avoids full-data journal writes for large sequential write IO. Expect per-OSD speed to be around 1.5-2X faster in this case for spinning-disk-only clusters.

2) Move the journals off the disks. A common way to do this is to buy a couple of very fast, high-write-endurance NVMe drives or SSDs. Some of the newer NVMe drives are fast enough to support journals for 15-20 spinning disks each; just make sure they have enough write endurance to meet your needs. Assuming no other bottlenecks, this usually gets you close to a 2X improvement for large write IO.

3) If that's not good enough, you might consider buying a small set of SSDs/NVMes for a dedicated SSD pool for specific cases like this. Even in that setup, you'll likely see higher performance with more concurrency. Here's an example I just ran on a 4-node cluster using a single fast NVMe drive per node (a rough sketch of carving out such a pool follows the numbers):

rados -p cbt-librbdfio bench 30 write -t 1
Bandwidth (MB/sec):     180.205

rados -p cbt-librbdfio bench 30 write -t 16
Bandwidth (MB/sec):     1197.11
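
Carving out a dedicated SSD pool is mostly a CRUSH exercise. A minimal sketch, assuming your SSD OSDs live under their own CRUSH root, with placeholder bucket/pool names throughout (double-check the exact syntax against the Hammer docs):

# ceph osd crush add-bucket ssd root
# ceph osd crush rule create-simple ssd_rule ssd host
# ceph osd pool create fastpool 128 128
# ceph osd pool set fastpool crush_ruleset <rule id from "ceph osd crush rule dump">

You'd also need to move the SSD OSDs under the new "ssd" root (e.g. with "ceph osd crush set" or a crush location hook) and then point Glance/Cinder at that pool.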



Thanks,

Sergio




_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
