Timur, 

As far as I know, the latest master has a number of improvements for SSD disks.
If you check the mailing list discussion from a couple of weeks back, you can
see that the latest stable Firefly is not that well optimised for SSD drives
and IO is limited. However, changes are being made to address that.

I am quite surprised that you can get 10K IOPS, as in my tests I was not getting
over 3K IOPS on SSD disks which are capable of doing 90K IOPS.
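
For reference, the raw 4K random-write capability of a drive can be measured
with something like the fio run below; this is only a sketch (it assumes fio is
installed, /dev/sdX is a placeholder, and writing to the raw device destroys
any data on it):

    # hypothetical fio run against a spare SSD - destructive, placeholder device name
    fio --name=raw-ssd-4k --filename=/dev/sdX --ioengine=libaio --direct=1 \
        --rw=randwrite --bs=4k --iodepth=32 --numjobs=1 \
        --runtime=60 --time_based --group_reporting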

P.S. Does anyone know if the SSD optimisation code will be added to the next
maintenance release of Firefly?

Andrei 
----- Original Message -----

> From: "Timur Nurlygayanov" <tnurlygaya...@mirantis.com>
> To: "Christian Balzer" <ch...@gol.com>
> Cc: ceph-us...@ceph.com
> Sent: Wednesday, 1 October, 2014 1:11:25 PM
> Subject: Re: [ceph-users] Why performance of benchmarks with small
> blocks is extremely small?

> Hello Christian,

> Thank you for your detailed answer!

> I have another pre-production environment with 4 Ceph servers and 4 SSD
> disks per Ceph server (each Ceph OSD on a separate SSD disk).
> Should I move the journals to other disks, or is that not required
> in my case?

> [root@ceph-node ~]# mount | grep ceph
> /dev/sdb4 on /var/lib/ceph/osd/ceph-0 type xfs
> (rw,noexec,nodev,noatime,nodiratime,inode64,logbsize=256k,delaylog,user_xattr,data=writeback)
> /dev/sde4 on /var/lib/ceph/osd/ceph-5 type xfs
> (rw,noexec,nodev,noatime,nodiratime,inode64,logbsize=256k,delaylog,user_xattr,data=writeback)
> /dev/sdd4 on /var/lib/ceph/osd/ceph-2 type xfs
> (rw,noexec,nodev,noatime,nodiratime,inode64,logbsize=256k,delaylog,user_xattr,data=writeback)
> /dev/sdc4 on /var/lib/ceph/osd/ceph-1 type xfs
> (rw,noexec,nodev,noatime,nodiratime,inode64,logbsize=256k,delaylog,user_xattr,data=writeback)

> [root@ceph-node ~]# find /var/lib/ceph/osd/ | grep journal
> /var/lib/ceph/osd/ceph-0/journal
> /var/lib/ceph/osd/ceph-5/journal
> /var/lib/ceph/osd/ceph-1/journal
> /var/lib/ceph/osd/ceph-2/journal
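> 
> If I understand correctly, a quick way to tell whether a journal sits on a
> separate device is to check whether it is a plain file or a symlink to a
> block device, for example:
> 
> [root@ceph-node ~]# ls -lh /var/lib/ceph/osd/ceph-0/journal
> 
> If it is a regular file rather than a symlink, the journal shares the same
> SSD partition as the OSD data, so every write lands on that disk twice.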

> My SSD disks have ~ 40k IOPS per disk, but on the VM I can see only ~
> 10k - 14k IOPS for disk operations.
> To check this I execute the following command on a VM whose root
> partition is mounted on a disk in Ceph storage:

> root@test-io:/home/ubuntu# rm -rf /tmp/test && spew -d --write -r -b
> 4096 10M /tmp/test
> WTR: 56506.22 KiB/s Transfer time: 00:00:00 IOPS: 14126.55
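> 
> As a cross-check I could also run something like the fio command below
> inside the VM (only a sketch): spew writes a small 10M file that may largely
> be absorbed by the page cache, while fio with --direct=1 bypasses it.
> 
> fio --name=vm-4k --directory=/tmp --size=1G --ioengine=libaio --direct=1 --rw=randwrite --bs=4k --iodepth=32 --runtime=60 --time_based --group_reporting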

> Is this the expected result, or can I improve the performance and get at
> least 30k-40k IOPS on the VM disks? (I have 2x 10Gb/s network
> interfaces in LACP bonding for the storage network, so it looks like the
> network can't be the bottleneck.)

> Thank you!

> On Wed, Oct 1, 2014 at 6:50 AM, Christian Balzer < ch...@gol.com >
> wrote:

> > Hello,
> 

> > [reduced to ceph-users]
> 

> > On Sat, 27 Sep 2014 19:17:22 +0400 Timur Nurlygayanov wrote:
> 

> > > Hello all,
> 
> > >
> 
> > > I installed OpenStack with Glance + Ceph OSD with replication
> > > factor 2
> 
> > > and now I can see the write operations are extremely slow.
> 
> > > For example, I can see only 0.04 MB/s write speed when I run
> > > rados
> > > bench
> 
> > > with 512b blocks:
> 
> > >
> 
> > > rados bench -p test 60 write --no-cleanup -t 1 -b 512
> 
> > >
> 
> > There are 2 things wrong with this test:
> 

> > 1. You're using rados bench, when in fact you should be testing
> > from
> 
> > within VMs. For starters a VM could make use of the rbd cache you
> > enabled,
> 
> > rados bench won't.
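> > 
> > For instance, assuming libvirt/qemu, the rbd cache generally only takes
> > effect if the guest disk uses a writeback cache mode; one way to check is
> > something like
> > 
> > virsh dumpxml <instance-name> | grep -A2 "driver name='qemu'"
> > 
> > and looking for cache='writeback' on the rbd disk (the instance name is a
> > placeholder).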
> 

> > 2. Given the parameters of this test you're testing network latency
> > more
> 
> > than anything else. If you monitor the Ceph nodes (atop is a good
> > tool for
> 
> > that), you will probably see that neither CPU nor disks resources
> > are
> 
> > being exhausted. With a single thread rados puts that tiny block of
> > 512
> 
> > bytes on the wire, the primary OSD for the PG has to write this to
> > the
> 
> > journal (on your slow, non-SSD disks) and send it to the secondary
> > OSD,
> 
> > which has to ACK the write to its journal back to the primary one,
> > which
> 
> > in turn then ACKs it to the client (rados bench) and then rados
> > bench
> > can
> 
> > send the next packet.
> 
> > You get the drift.
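> > 
> > Rough sanity check on the numbers above: with one op in flight and an
> > average latency of about 0.012 s, that is roughly 1/0.012 = ~83 ops/s, and
> > 83 ops/s x 512 bytes = ~42 KB/s, i.e. ~0.04 MB/s, which is almost exactly
> > what rados bench reports.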
> 

> > Using your parameters I can get 0.17MB/s on a pre-production cluster
> > that uses 4xQDR Infiniband (IPoIB) connections; on my shitty test cluster
> > with 1Gb/s links I get similar results to you, unsurprisingly.
> 

> > Ceph excels only with lots of parallelism, so an individual thread
> > might
> 
> > be slow (and in your case HAS to be slow, which has nothing to do
> > with
> 
> > Ceph per se) but many parallel ones will utilize the resources
> > available.
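> > 
> > As a rough illustration, re-running the same bench with more threads and
> > the default 4MB object size should paint a very different picture (pool
> > name and thread count below are just examples):
> > 
> > rados bench -p test 60 write --no-cleanup -t 32 -b 4194304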
> 

> > Having data blocks that are adequately sized (4MB, the default
> > rados
> > size)
> 
> > will help for bandwidth and the rbd cache inside a properly
> > configured VM
> 
> > should make that happen.
> 

> > Of course in most real life scenarios you will run out of IOPS long
> > before
> 
> > you run out of bandwidth.
> 

> > > Maintaining 1 concurrent writes of 512 bytes for up to 60 seconds or 0 objects
> > > Object prefix: benchmark_data_node-17.domain.tld_15862
> > >   sec Cur ops   started  finished   avg MB/s   cur MB/s   last lat    avg lat
> > >     0       0         0         0          0          0          -          0
> > >     1       1        83        82  0.0400341  0.0400391   0.008465  0.0120985
> > >     2       1       169       168  0.0410111  0.0419922   0.080433  0.0118995
> > >     3       1       240       239  0.0388959   0.034668   0.008052  0.0125385
> > >     4       1       356       355  0.0433309  0.0566406    0.00837  0.0112662
> > >     5       1       472       471  0.0459919  0.0566406   0.008343  0.0106034
> > >     6       1       550       549  0.0446735  0.0380859   0.036639  0.0108791
> > >     7       1       581       580  0.0404538  0.0151367   0.008614  0.0120654
> 
> > >
> 
> > >
> 
> > > My test environment configuration:
> 
> > > Hardware servers with 1Gb/s network interfaces, 64GB RAM and 16 CPU cores
> > > per node, HDDs WDC WD5003ABYX-01WERA0.
> 
> > For anything production, consider faster network connections and
> > SSD
> 
> > journals.
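> > 
> > For illustration only: with ceph-deploy, a journal on a separate SSD is
> > typically specified at OSD creation time, along the lines of
> > 
> > ceph-deploy osd prepare ceph-node1:sdb:/dev/sdc1
> > 
> > where sdb holds the OSD data and /dev/sdc1 is a partition on the SSD; the
> > host and device names here are placeholders.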
> 

> > > OpenStack with 1 controller, 1 compute and 2 ceph nodes (ceph on
> > > separate
> 
> > > nodes).
> 
> > > CentOS 6.5, kernel 2.6.32-431.el6.x86_64.
> 
> > >
> 
> > You will probably want a 3.14 or 3.16 kernel for various reasons.
> 

> > Regards,
> 

> > Christian
> 

> > > I tested several config options for optimizations, like in
> 
> > > /etc/ceph/ceph.conf:
> 
> > >
> 
> > > [default]
> 
> > > ...
> 
> > > osd_pool_default_pg_num = 1024
> 
> > > osd_pool_default_pgp_num = 1024
> 
> > > osd_pool_default_flag_hashpspool = true
> 
> > > ...
> 
> > > [osd]
> 
> > > osd recovery max active = 1
> 
> > > osd max backfills = 1
> 
> > > filestore max sync interval = 30
> 
> > > filestore min sync interval = 29
> 
> > > filestore flusher = false
> 
> > > filestore queue max ops = 10000
> 
> > > filestore op threads = 16
> 
> > > osd op threads = 16
> 
> > > ...
> 
> > > [client]
> 
> > > rbd_cache = true
> 
> > > rbd_cache_writethrough_until_flush = true
> 
> > >
> 
> > > and in /etc/cinder/cinder.conf:
> 
> > >
> 
> > > [DEFAULT]
> 
> > > volume_tmp_dir=/tmp
> 
> > >
> 
> > > but as a result performance increased by only ~30%, which does not
> > > look like a huge success.
> 
> > >
> 
> > > Non-default mount options and TCP optimization increase the speed by
> > > about 1%:
> 
> > >
> 
> > > [root@node-17 ~]# mount | grep ceph
> 
> > > /dev/sda4 on /var/lib/ceph/osd/ceph-0 type xfs
> 
> > > (rw,noexec,nodev,noatime,nodiratime,user_xattr,data=writeback,barrier=0)
> 
> > >
> 
> > > [root@node-17 ~]# cat /etc/sysctl.conf
> 
> > > net.core.rmem_max = 16777216
> 
> > > net.core.wmem_max = 16777216
> 
> > > net.ipv4.tcp_rmem = 4096 87380 16777216
> 
> > > net.ipv4.tcp_wmem = 4096 65536 16777216
> 
> > > net.ipv4.tcp_window_scaling = 1
> 
> > > net.ipv4.tcp_timestamps = 1
> 
> > > net.ipv4.tcp_sack = 1
> 
> > >
> 
> > >
> 
> > > Do we have other ways to significantly improve Ceph storage
> > > performance?
> 
> > > Any feedback and comments are welcome!
> 
> > >
> 
> > > Thank you!
> 
> > >
> 
> > >
> 

> > --
> 
> > Christian Balzer Network/Systems Engineer
> 
> > ch...@gol.com Global OnLine Japan/Fusion Communications
> 
> > http://www.gol.com/
> 

> --

> Timur,
> QA Engineer
> OpenStack Projects
> Mirantis Inc
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com