Hi Christian,

Good day to you, and thank you for your reply.

On Tue, Apr 22, 2014 at 12:53 PM, Christian Balzer <[email protected]> wrote:

> On Tue, 22 Apr 2014 02:45:24 +0800 Indra Pramana wrote:
>
> > Hi Christian,
> >
> > Good day to you, and thank you for your reply. :)  See my reply inline.
> >
> > On Mon, Apr 21, 2014 at 10:20 PM, Christian Balzer <[email protected]>
> wrote:
> >
> > >
> > > Hello,
> > >
> > > On Mon, 21 Apr 2014 20:47:21 +0800 Indra Pramana wrote:
> > >
> > > > Dear all,
> > > >
> > > > I have a Ceph RBD cluster with around 31 OSDs running SSD drives,
> > > > and I tried to use the benchmark tools recommended by Sebastien on
> > > > his blog here:
> > > >
> > > How many OSDs per storage node and what is in those storage nodes in
> > > terms of controller, CPU, RAM?
> > >
> >
> > Each storage node has 4 OSDs, although I have one node with 6. Each OSD
> > is a 480 GB or 500 GB SSD drive (depending on the brand).
> >
> So I make that 7 or 8 nodes then?
>

Sorry, I miscalculated earlier. I have a total of 26 OSDs across 6 hosts.
All hosts have 4 OSDs, except one host, which has 6.


> > Each node has mainly SATA 2.0 controllers (the newer ones use SATA 3.0),
> > 4-core 3.3 GHz CPU, 16 GB of RAM.
> >
> That sounds good enough as far as memory and CPU are concerned.
> The SATA-2 speed will limit you. I have some journal SSDs hanging off
> SATA-2 and they can't get over 250MB/s, while they can get to 350MB/s on
> SATA-3.
>
> >
> > > > http://www.sebastien-han.fr/blog/2012/08/26/ceph-benchmarks/
> > > >
> > > Sebastien has done a great job with those; however, with Ceph being
> > > such a fast-moving target, quite a bit of that information is somewhat
> > > dated.
> > >
> > > > Our configuration:
> > > >
> > > > - Ceph version 0.67.7
> > > That's also a bit dated.
> > >
> >
> > Yes, we decided to stick with the latest stable version of Dumpling. Do you
> > think upgrading to Emperor might help to improve performance?
> >
> Given that older versions of Ceph tend to get little support (bug fixes
> backported) and that Firefly is around the corner I would suggest moving
> to Emperor to rule out any problems with Dumpling, get experience with
> inevitable cluster upgrades and have a smoother path to Firefly when it
> comes out.
>

Noted -- will consider upgrading to Emperor.
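
For my own reference, a rough upgrade sketch (just a sketch, assuming our
Ubuntu deployment with the upstart jobs and a ceph-deploy style repo file;
paths and job names may differ): mons first, then OSD nodes, one at a time.
---
# point apt at the Emperor repo (repo file path is an assumption)
sudo sed -i 's/debian-dumpling/debian-emperor/' /etc/apt/sources.list.d/ceph.list
sudo apt-get update && sudo apt-get install -y ceph

# restart daemons, monitor nodes first, then OSD nodes one at a time
sudo restart ceph-mon-all      # on monitor nodes
sudo restart ceph-osd-all      # on OSD nodes
ceph -s                        # wait for HEALTH_OK before the next node
---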

> > > > - 31 OSDs of 500 GB SSD drives each
> > > > - Journal for each OSD is configured on the same SSD drive itself
> > > > - Journal size 10 GB
> > > >
> > > > After doing some tests recommended on the article, I find out that
> > > > generally:
> > > >
> > > > - Local disk benchmark tests using dd is fast, around 245 MB/s since
> > > > we are using SSDs.
> > > > - Network benchmark tests using iperf and netcat is also fast, I can
> > > > get around 9.9 Mbit/sec since we are using 10G network.
> > >
> > > I think you mean 9.9Gb/s there. ^o^
> > >
> >
> > Yes, I meant 9.9 Gbit/sec. Sorry for the typo.
> >
> > > How many network ports per node, cluster network or not?
> > >
> >
> > Each OSD node has 2 x 10 Gbps connections to our 10 gigabit switch, one for
> > the client network and the other for the replication network between OSDs.
> >
> All very good and by the book.
>
> >
> > > > However:
> > > >
> > > > - RADOS bench test (rados bench -p my_pool 300 write) on the whole
> > > > cluster is slow, averaging around 112 MB/s for write.
> > >
> > > That command fires off a single thread, which is unlikely to be able to
> > > saturate things.
> > >
> > > Try that with a "-t 32" before the time (300) and if that improves
> > > things increase that value until it doesn't (probably around 128).
> > >
> >
> > Using 32 concurrent writes, the result is below. The speed really fluctuates.
> >
> >  Total time run:         64.317049
> > Total writes made:      1095
> > Write size:             4194304
> > Bandwidth (MB/sec):     68.100
> >
> > Stddev Bandwidth:       44.6773
> > Max bandwidth (MB/sec): 184
> > Min bandwidth (MB/sec): 0
> > Average Latency:        1.87761
> > Stddev Latency:         1.90906
> > Max latency:            9.99347
> > Min latency:            0.075849
> >
> That is really weird, it should get faster, not slower. ^o^
> I assume you've run this a number of times?
>
> Also my apologies, the default is 16 threads, not 1, but that still isn't
> enough to get my cluster to full speed:
> ---
> Bandwidth (MB/sec):     349.044
>
> Stddev Bandwidth:       107.582
> Max bandwidth (MB/sec): 408
> ---
> at 64 threads it will ramp up from a slow start to:
> ---
> Bandwidth (MB/sec):     406.967
>
> Stddev Bandwidth:       114.015
> Max bandwidth (MB/sec): 452
> ---
>
> But what stands out is your latency. I don't have a 10GBE network to
> compare, but my Infiniband based cluster (going through at least one
> switch) gives me values like this:
> ---
> Average Latency:        0.335519
> Stddev Latency:         0.177663
> Max latency:            1.37517
> Min latency:            0.1017
> ---
>
> Of course that latency is not just the network.
>

What else can contribute to this latency? Storage node load, disk speed,
anything else?


> I would suggest running atop (it gives you more information at a glance) or
> "iostat -x 3" on all your storage nodes during these tests to identify any
> node or OSD that is overloaded in some way.
>

Will try.
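
Something like this is what I plan to run (just a sketch; the hostnames are
placeholders for our six storage nodes):
---
# start per-disk stats collection on every storage node first
# (100 samples at 3s intervals roughly covers a 300s bench run)
for h in node1 node2 node3 node4 node5 node6; do
    ssh "$h" 'iostat -x 3 100 > /tmp/iostat-$(hostname).log' &
done

# then, from the client, the longer bench run with more threads
rados bench -p my_pool -t 64 300 write
wait
---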


> > > Are you testing this from just one client?
> > >
> >
> > Yes. One KVM hypervisor host.
> >
> > > How is that client connected to the Ceph network?
> > >
> >
> > It's connected through the same 10Gb network. The iperf result shows no issue
> > with the bandwidth between the client and the MONs/OSDs.
> >
> >
> > > Another thing comes to mind, how many pg_num and pgp_num are in your
> > > "my_pool"?
> > > You could have some quite unevenly distributed data.
> > >
> >
> > pg_num/pgp_num for the pool is currently set to 850.
> >
> If this isn't production yet, I would strongly suggest upping that to 2048
> for a much smoother distribution and adhering to the recommended values
> for this.
>

That's the problem -- it's already in production. Any advice on how I can
increase the PG count without causing inconvenience to the users? Can I
increase it one step at a time to prevent excessive I/O load and slow
requests, e.g. increase by 100 at a time (see the sketch below)?

With 26 OSDs and 2 replicas, the recommended value would be around 1300 PGs,
correct? Would 2048 be too high?
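
For reference, this is roughly what I had in mind, if raising it in steps is
safe (only a sketch; pg_num can only go up, and pgp_num has to follow each
step):
---
# check the current values
ceph osd pool get my_pool pg_num
ceph osd pool get my_pool pgp_num

# raise in steps, letting the cluster settle in between
ceph osd pool set my_pool pg_num 1024
ceph osd pool set my_pool pgp_num 1024
ceph -s        # wait for backfilling to finish / HEALTH_OK

ceph osd pool set my_pool pg_num 2048
ceph osd pool set my_pool pgp_num 2048
---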


> > > > - Individual test using "ceph tell osd.X bench" gives different results
> > > > per OSD, but also averaging around 110-130 MB/s only.
> > > >
> > > That at least is easily explained by what I'm mentioning below about
> > > the remaining performance of your SSD when journal and OSD data are on
> > > it at the same time.
> > > > Can anyone advise what could be the reason why our RADOS/Ceph
> > > > benchmark results are slow compared to a direct physical drive
> > > > test on the OSDs? Anything in the Ceph configuration that we
> > > > need to optimise further?
> > > >
> > > For starters, since your journals (I frequently wonder if journals
> > > ought to be something that can be turned off) are on the same device as
> > > the OSD data, your total throughput and IOPS of that device have now
> > > been halved.
> > >
> > > And what replication level are you using? That again will cut into your
> > > cluster wide throughput and IOPS.
> > >
> >
> > I maintain 2 replicas on the pool.
> >
>
> So to simplify things I will assume 8 nodes with 4 OSDs each and all SSDs on
> SATA-2, giving a raw speed of 250MB/s per SSD.
> The speed per OSD will be just half that, though, since it has to share
> that with the journal.
> So just 500MB/s of potential speed per node or 4GB/s for the whole cluster.
>
> Now here is where it gets tricky.
> With just one thread and one client you will write to one PG, first to
> journal of the primary OSD, then that will be written to the journal of
> the secondary OSD (on another node) and your transaction will be ACK'ed.
> This of course doesn't take any advantage of the parallelism of Ceph and
> will never get close to achieving maximum bandwidth per client. But it
> also won't be impacted by on which OSDs the PGs reside, as there is no
> competition from other clients/threads.
>
> With 16 threads (and more) the PG distribution becomes very crucial.
> Ideally each thread would be writing to different primary OSDs and all the
> secondary OSDs would be ones that aren't primary ones (32 assumed OSDs/2).
>
> But if the PGs are clumpy and, for example, osd.0 happens to be the primary
> for one PG being written to by one thread and the secondary for another
> thread at the same time, the bandwidth just drops again.
>

Noted, thanks for this.
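
If I redo that back-of-envelope math for our actual setup (26 OSDs in 6
nodes, assuming ~250 MB/s raw per SATA-2 SSD as you described):
---
~250 MB/s per SSD / 2 (journal on same disk)  = ~125 MB/s per OSD
26 OSDs x ~125 MB/s                           = ~3.2 GB/s raw cluster writes
/ 2 replicas                                  = ~1.6 GB/s client-visible ceiling
---
That is before any Ceph or network overhead, so a single client on one 10G
link (~1.25 GB/s) would hit its own limit before the cluster does, if I am
doing this right.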

Cheers.




>
> Regards,
>
> Christian
> >
> > >
> > > I've read a number of times that Ceph will be in general half as fast
> > > as the expected speed of the cluster hardware you're deploying, but
> > > that of course is something based on many factors and needs
> > > verification in each specific case.
> > >
> > > For me, I have OSDs (11 disk RAID6 on an Areca 1882 with 1GB cache, 2
> > > OSDs each on 2 nodes total) that can handle the fio run below directly
> > > on the OSD at 37k IOPS (since it fits into the cache nicely).
> > > ---
> > > fio --size=400m --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1
> > > --rw=randwrite --name=fiojob --blocksize_range=4k-4K --iodepth=16
> > > ---
> > > The journal SSD is about the same.
> > >
> > > However, that same benchmark delivers a mere 3100 IOPS when run
> > > from a VM (userspace RBD, caching enabled, but that makes no difference
> > > at all) and the journal SSDs are busier (25%) than the actual OSDs
> > > (5%), but still nowhere near their capacity.
> > > This leads me to believe that aside from network latencies (4xQDDR
> > > Infiniband here, which has less latency than 10GBE) there is a lot
> > > of room for improvement when it comes to how Ceph handles things
> > > (bottlenecks in the code) and tuning in general.
> > >
> >
> > Thanks for sharing.
> >
> > Any further tuning configuration which can be suggested is greatly
> > appreciated.
> >
> > Cheers.
>
>
> --
> Christian Balzer        Network/Systems Engineer
> [email protected]           Global OnLine Japan/Fusion Communications
> http://www.gol.com/
>

_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
