Hi Christian,

Good day to you, and thank you for your reply.
On Tue, Apr 22, 2014 at 12:53 PM, Christian Balzer <[email protected]> wrote:

> On Tue, 22 Apr 2014 02:45:24 +0800 Indra Pramana wrote:
>
> > Hi Christian,
> >
> > Good day to you, and thank you for your reply. :) See my reply inline.
> >
> > On Mon, Apr 21, 2014 at 10:20 PM, Christian Balzer <[email protected]> wrote:
> >
> > > Hello,
> > >
> > > On Mon, 21 Apr 2014 20:47:21 +0800 Indra Pramana wrote:
> > >
> > > > Dear all,
> > > >
> > > > I have a Ceph RBD cluster with around 31 OSDs running SSD drives,
> > > > and I tried to use the benchmark tools recommended by Sebastien on
> > > > his blog here:
> > > >
> > > How many OSDs per storage node and what is in those storage nodes in
> > > terms of controller, CPU, RAM?
> > >
> > Each storage node has mainly 4 OSDs, although I have one node with 6.
> > Each OSD consists of a 480 GB / 500 GB SSD drive (depending on the brand).
> >
> So I make that 7 or 8 nodes then?
>
Sorry, I miscalculated earlier. I have a total of 26 OSDs in 6 hosts. All
hosts have 4 OSDs, except one host which has 6.

> > Each node has mainly SATA 2.0 controllers (the newer one uses SATA 3.0),
> > a 4-core 3.3 GHz CPU and 16 GB of RAM.
> >
> That sounds good enough as far as memory and CPU are concerned.
> The SATA-2 speed will limit you; I have some journal SSDs hanging off
> SATA-2 and they can't get over 250MB/s, while they can get to 350MB/s on
> SATA-3.
>
> > > > http://www.sebastien-han.fr/blog/2012/08/26/ceph-benchmarks/
> > > >
> > > Sebastien has done a great job with those, however with Ceph being
> > > such a fast moving target quite a bit of that information is somewhat
> > > dated.
> > >
> > > > Our configuration:
> > > >
> > > > - Ceph version 0.67.7
> > > That's also a bit dated.
> > >
> > Yes, we decided to stick with the latest stable version of Dumpling. Do
> > you think upgrading to Emperor might help to improve performance?
> >
> Given that older versions of Ceph tend to get little support (bug fixes
> backported) and that Firefly is around the corner, I would suggest moving
> to Emperor to rule out any problems with Dumpling, get experience with
> inevitable cluster upgrades and have a smoother path to Firefly when it
> comes out.
>
Noted -- will consider upgrading to Emperor.

> > > > - 31 OSDs of 500 GB SSD drives each
> > > > - Journal for each OSD is configured on the same SSD drive itself
> > > > - Journal size 10 GB
> > > >
> > > > After doing some tests recommended on the article, I found that
> > > > generally:
> > > >
> > > > - Local disk benchmark tests using dd are fast, around 245 MB/s,
> > > > since we are using SSDs.
> > > > - Network benchmark tests using iperf and netcat are also fast, I
> > > > can get around 9.9 Mbit/sec since we are using 10G network.
> > > >
> > > I think you mean 9.9Gb/s there. ^o^
> > >
> > Yes, I meant 9.9 Gbit/sec. Sorry for the typo.
> >
> > > How many network ports per node, cluster network or not?
> > >
> > Each OSD node has 2 x 10 Gbps connections to our 10-gigabit switch, one
> > for the client network and another for the replication network between
> > OSDs.
> >
> All very good and by the book.
>
> > > > However:
> > > >
> > > > - RADOS bench test (rados bench -p my_pool 300 write) on the whole
> > > > cluster is slow, averaging around 112 MB/s for write.
> > > >
> > > That command fires off a single thread, which is unlikely to be able
> > > to saturate things.
> > >
> > > Try that with a "-t 32" before the time (300) and if that improves
> > > things increase that value until it doesn't (probably around 128).
> > >
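For reference, this is the exact invocation I used for the run below -- our
pool name, with the thread count placed before the duration as you suggested
(please correct me if I got the syntax wrong):

---
rados bench -p my_pool -t 32 300 write
---

I will also repeat it with -t 64 and -t 128 to see whether the bandwidth
keeps scaling.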
> > Using 32 concurrent writes, the result is below. The speed really
> > fluctuates.
> >
> > Total time run:         64.317049
> > Total writes made:      1095
> > Write size:             4194304
> > Bandwidth (MB/sec):     68.100
> >
> > Stddev Bandwidth:       44.6773
> > Max bandwidth (MB/sec): 184
> > Min bandwidth (MB/sec): 0
> > Average Latency:        1.87761
> > Stddev Latency:         1.90906
> > Max latency:            9.99347
> > Min latency:            0.075849
> >
> That is really weird, it should get faster, not slower. ^o^
> I assume you've run this a number of times?
>
> Also my apologies, the default is 16 threads, not 1, but that still isn't
> enough to get my cluster to full speed:
> ---
> Bandwidth (MB/sec):     349.044
>
> Stddev Bandwidth:       107.582
> Max bandwidth (MB/sec): 408
> ---
> at 64 threads it will ramp up from a slow start to:
> ---
> Bandwidth (MB/sec):     406.967
>
> Stddev Bandwidth:       114.015
> Max bandwidth (MB/sec): 452
> ---
>
> But what stands out is your latency. I don't have a 10GbE network to
> compare with, but my Infiniband based cluster (going through at least one
> switch) gives me values like this:
> ---
> Average Latency:        0.335519
> Stddev Latency:         0.177663
> Max latency:            1.37517
> Min latency:            0.1017
> ---
>
> Of course that latency is not just the network.
>
What else can contribute to this latency? Storage node load, disk speed,
anything else?

> I would suggest running atop (it gives you more information at one glance)
> or "iostat -x 3" on all your storage nodes during these tests to identify
> any node or OSD that is overloaded in some way.
>
Will try.

> > > Are you testing this from just one client?
> > >
> > Yes. One KVM hypervisor host.
> >
> > > How is that client connected to the Ceph network?
> > >
> > It's connected through the same 10Gb network. iperf results show no
> > issue with the bandwidth between the client and the MONs/OSDs.
> >
> > > Another thing comes to mind, how many pg_num and pgp_num are in your
> > > "my_pool"?
> > > You could have some quite unevenly distributed data.
> > >
> > pg_num/pgp_num for the pool is currently set to 850.
> >
> If this isn't production yet, I would strongly suggest upping that to 2048
> for a much smoother distribution and adhering to the recommended values
> for this.
>
That's the problem -- it's already in production. Any advice on how I can
increase the number of PGs without causing inconvenience to the users? Can I
increase the PGs one step at a time to prevent excessive I/O load and slow
requests, e.g. by 100 at a time? With 26 OSDs, the recommended value would be
1300 PGs, correct? Would 2048 be too high?
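Something along these lines is what I had in mind -- the step sizes below are
just an illustration, so please correct me if this is the wrong approach:

---
# increase pg_num in a small step, wait for the cluster to settle
# (active+clean) and then raise pgp_num to match
ceph osd pool set my_pool pg_num 1024
ceph osd pool set my_pool pgp_num 1024

# then repeat with the next step, e.g. 1536, and finally 2048 (or 1300)
ceph osd pool set my_pool pg_num 1536
ceph osd pool set my_pool pgp_num 1536
---

Would that keep the backfilling load manageable, or is there a better way?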
> > > > - Individual tests using "ceph tell osd.X bench" give different
> > > > results per OSD, but also averaging around 110-130 MB/s only.
> > > >
> > > That at least is easily explained by what I'm mentioning below about
> > > the remaining performance of your SSD when journal and OSD data are on
> > > it at the same time.
> > >
> > > > Can anyone advise what could be the reason why our RADOS/Ceph
> > > > benchmark test results are slow compared to a direct physical drive
> > > > test on the OSDs directly? Anything on the Ceph configuration that
> > > > we need to optimise further?
> > > >
> > > For starters, since your journals (I frequently wonder if journals
> > > ought to be something that can be turned off) are on the same device
> > > as the OSD data, the total throughput and IOPS of that device have now
> > > been halved.
> > >
> > > And what replication level are you using? That again will cut into
> > > your cluster wide throughput and IOPS.
> > >
> > I maintain 2 replicas on the pool.
> >
> So to simplify things I will assume 8 nodes with 4 OSDs each and all SSDs
> on SATA-2, giving a raw speed of 250MB/s per SSD.
> The speed per OSD will be just half that, though, since it has to share
> that with the journal.
> So just 500MB/s of potential speed per node, or 4GB/s for the whole
> cluster.
>
> Now here is where it gets tricky.
> With just one thread and one client you will write to one PG, first to the
> journal of the primary OSD, then that will be written to the journal of
> the secondary OSD (on another node) and your transaction will be ACK'ed.
> This of course doesn't take any advantage of the parallelism of Ceph and
> will never get close to achieving maximum bandwidth per client. But it
> also won't be impacted by which OSDs the PGs reside on, as there is no
> competition from other clients/threads.
>
> With 16 threads (and more) the PG distribution becomes very crucial.
> Ideally each thread would be writing to different primary OSDs and all the
> secondary OSDs would be ones that aren't primary ones (32 assumed OSDs/2).
>
> But if the PGs are clumpy and, for example, osd.0 happens to be the
> primary for one PG being written to by one thread and the secondary for
> another thread at the same time, its bandwidth just dropped again.
>
Noted, thanks for this. Cheers.

> Regards,
>
> Christian
>
> > > I've read a number of times that Ceph will in general be half as fast
> > > as the speed you would expect from the cluster hardware you're
> > > deploying, but that of course is something based on many factors and
> > > needs verification in each specific case.
> > >
> > > For me, I have OSDs (11-disk RAID6 on an Areca 1882 with 1GB cache, 2
> > > OSDs each on 2 nodes total) that can handle the fio run below directly
> > > on the OSD at 37k IOPS (since it fits into the cache nicely).
> > > ---
> > > fio --size=400m --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1 \
> > >     --rw=randwrite --name=fiojob --blocksize_range=4k-4K --iodepth=16
> > > ---
> > > The journal SSD is about the same.
> > >
> > > However that same benchmark delivers a mere 3100 IOPS when run from a
> > > VM (userspace RBD, caching enabled, but that makes no difference at
> > > all) and the journal SSDs are busier (25%) than the actual OSDs (5%),
> > > but still nowhere near their capacity.
> > > This leads me to believe that aside from network latencies (4xQDDR
> > > Infiniband here, which has less latency than 10GbE) there is a lot of
> > > space for improvement when it comes to how Ceph handles things
> > > (bottlenecks in the code) and tuning in general.
> > >
> > Thanks for sharing.
> >
> > Any further tuning configuration which can be suggested is greatly
> > appreciated.
> >
> > Cheers.
> >
> --
> Christian Balzer        Network/Systems Engineer
> [email protected]        Global OnLine Japan/Fusion Communications
> http://www.gol.com/
