Hi Christian,

Good day to you, and thank you for your reply. :) Please see my replies inline.
On Mon, Apr 21, 2014 at 10:20 PM, Christian Balzer <[email protected]> wrote:
>
> Hello,
>
> On Mon, 21 Apr 2014 20:47:21 +0800 Indra Pramana wrote:
>
> > Dear all,
> >
> > I have a Ceph RBD cluster with around 31 OSDs running SSD drives, and I
> > tried to use the benchmark tools recommended by Sebastien on his blog
> > here:
>
> How many OSDs per storage node and what is in those storage nodes in terms
> of controller, CPU, RAM?
>

Most storage nodes have 4 OSDs each, although one node has 6. Each OSD is a
480 GB or 500 GB SSD drive (depending on the brand). Most nodes have SATA 2.0
controllers (the newer ones use SATA 3.0), a 4-core 3.3 GHz CPU and 16 GB of
RAM.

> > http://www.sebastien-han.fr/blog/2012/08/26/ceph-benchmarks/
> >
> Sebastien has done a great job with those, however with Ceph being such a
> fast-moving target quite a bit of that information is somewhat dated.
>
> > Our configuration:
> >
> > - Ceph version 0.67.7
>
> That's also a bit dated.
>

Yes, we decided to stick with the latest stable version of Dumpling. Do you
think upgrading to Emperor might help improve performance?

> > - 31 OSDs of 500 GB SSD drives each
> > - Journal for each OSD is configured on the same SSD drive itself
> > - Journal size 10 GB
> >
> > After doing some tests recommended on the article, I find out that
> > generally:
> >
> > - Local disk benchmark tests using dd are fast, around 245 MB/s since we
> > are using SSDs.
> > - Network benchmark tests using iperf and netcat are also fast, I can get
> > around 9.9 Mbit/sec since we are using 10G network.
>
> I think you mean 9.9Gb/s there. ^o^
>

Yes, I meant 9.9 Gbit/sec. Sorry for the typo.

> How many network ports per node, cluster network or not?
>

Each OSD node has 2 x 10 Gbps connections to our 10-gigabit switch, one for
the client network and the other for the replication network between the
OSDs.

> > However:
> >
> > - RADOS bench test (rados bench -p my_pool 300 write) on the whole
> > cluster is slow, averaging around 112 MB/s for write.
>
> That command fires off a single thread, which is unlikely to be able to
> saturate things.
>
> Try that with a "-t 32" before the time (300) and if that improves
> things increase that value until it doesn't (probably around 128).
>

Using 32 concurrent writes, the result is below. The speed fluctuates quite
a lot.

Total time run:          64.317049
Total writes made:       1095
Write size:              4194304
Bandwidth (MB/sec):      68.100

Stddev Bandwidth:        44.6773
Max bandwidth (MB/sec):  184
Min bandwidth (MB/sec):  0
Average Latency:         1.87761
Stddev Latency:          1.90906
Max latency:             9.99347
Min latency:             0.075849

> Are you testing this from just one client?
>

Yes. One KVM hypervisor host.

> How is that client connected to the Ceph network?
>

It's connected through the same 10Gb network. The iperf results show no
bandwidth issues between the client and the MONs/OSDs.

> Another thing comes to mind, how many pg_num and pgp_num are in your
> "my_pool"?
> You could have some quite unevenly distributed data.
>

pg_num/pgp_num for the pool is currently set to 850.
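In case it is useful, below is roughly how I check those values and how I
would raise them. The pool name is ours, but the target of 2048 is only an
example based on the usual rule of thumb (number of OSDs x 100 / replicas,
rounded up to the next power of two):

---
# check the current values
ceph osd pool get my_pool pg_num
ceph osd pool get my_pool pgp_num

# example: 31 OSDs x 100 / 2 replicas = 1550, next power of two = 2048
# pg_num has to be raised first, then pgp_num to match
ceph osd pool set my_pool pg_num 2048
ceph osd pool set my_pool pgp_num 2048
---

I understand raising pg_num will trigger some data movement, so I would only
do it during a quiet period.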
> > - Individual test using "ceph tell osd.X bench" gives different results
> > per OSD but also averaging around 110-130 MB/s only.
> >
> That at least is easily explained by what I'm mentioning below about the
> remaining performance of your SSD when journal and OSD data are on it at
> the same time.
>
> > Anyone can advise what could be the reason of why our RADOS/Ceph
> > benchmark test result is slow compared to a direct physical drive test
> > on the OSDs directly? Anything on Ceph configuration that we need to
> > optimise further?
> >
> For starters, since your journals (I frequently wonder if journals ought
> to be something that can be turned off) are on the same device as the OSD
> data, the total throughput and IOPS of that device have now been halved.
>
> And what replication level are you using? That again will cut into your
> cluster-wide throughput and IOPS.
>

I maintain 2 replicas on the pool.

> I've read a number of times that Ceph will in general be half as fast as
> the speed you would expect from the cluster hardware you're deploying, but
> that of course is something based on many factors and needs verification
> in each specific case.
>
> For me, I have OSDs (11-disk RAID6 on an Areca 1882 with 1GB cache, 2
> OSDs each on 2 nodes total) that can handle the fio run below directly on
> the OSD at 37k IOPS (since it fits into the cache nicely).
> ---
> fio --size=400m --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1
> --rw=randwrite --name=fiojob --blocksize_range=4k-4K --iodepth=16
> ---
> The journal SSD is about the same.
>
> However, that same benchmark delivers a mere 3100 IOPS when run from a
> VM (userspace RBD, caching enabled, but that makes no difference at all)
> and the journal SSDs are busier (25%) than the actual OSDs (5%), but still
> nowhere near their capacity.
> This leads me to believe that aside from network latencies (4x QDR
> Infiniband here, which has less latency than 10GbE) there is a lot of
> room for improvement when it comes to how Ceph handles things
> (bottlenecks in the code) and tuning in general.
>

Thanks for sharing. Any further tuning suggestions would be greatly
appreciated.

Cheers.
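P.S. For completeness, the 32-thread run above was simply the original
command with "-t 32" added, and the per-OSD numbers came from running
"ceph tell osd.X bench" against each OSD. A sketch of both (the pool name
is ours; the loop and the OSD IDs 0-30 are just an illustration for our 31
OSDs):

---
# 300 seconds of 4 MB writes with 32 concurrent operations
rados bench -p my_pool 300 write -t 32

# per-OSD write benchmark (writes 1 GB in 4 MB chunks by default)
for i in $(seq 0 30); do
    echo "osd.$i:"
    ceph tell osd.$i bench
done
---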
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
