Hello Christian,

Thank you for your detailed answer!

I have another pre-production environment with 4 Ceph servers and 4 SSD
disks per Ceph server (each Ceph OSD on a separate SSD disk). Should I
move the journals to other disks, or is that not required in my case?

[root@ceph-node ~]# mount | grep ceph
/dev/sdb4 on /var/lib/ceph/osd/ceph-0 type xfs (rw,noexec,nodev,noatime,nodiratime,inode64,logbsize=256k,delaylog,user_xattr,data=writeback)
/dev/sde4 on /var/lib/ceph/osd/ceph-5 type xfs (rw,noexec,nodev,noatime,nodiratime,inode64,logbsize=256k,delaylog,user_xattr,data=writeback)
/dev/sdd4 on /var/lib/ceph/osd/ceph-2 type xfs (rw,noexec,nodev,noatime,nodiratime,inode64,logbsize=256k,delaylog,user_xattr,data=writeback)
/dev/sdc4 on /var/lib/ceph/osd/ceph-1 type xfs (rw,noexec,nodev,noatime,nodiratime,inode64,logbsize=256k,delaylog,user_xattr,data=writeback)

[root@ceph-node ~]# find /var/lib/ceph/osd/ | grep journal
/var/lib/ceph/osd/ceph-0/journal
/var/lib/ceph/osd/ceph-5/journal
/var/lib/ceph/osd/ceph-1/journal
/var/lib/ceph/osd/ceph-2/journal

My SSD disks can do ~40k IOPS each, but on the VM I see only ~10k-14k
IOPS for disk operations. To check this, I executed the following command
on a VM whose root partition is mounted on a disk in the Ceph storage:

root@test-io:/home/ubuntu# rm -rf /tmp/test && spew -d --write -r -b 4096 10M /tmp/test
WTR:    56506.22 KiB/s   Transfer time: 00:00:00    IOPS:    14126.55

Is this the expected result, or can I improve performance and get at
least 30k-40k IOPS on the VM disks? (I have 2x 10Gb/s network interfaces
in LACP bonding for the storage network, so it looks like the network
can't be the bottleneck.)

Thank you!

On Wed, Oct 1, 2014 at 6:50 AM, Christian Balzer <[email protected]> wrote:
>
> Hello,
>
> [reduced to ceph-users]
>
> On Sat, 27 Sep 2014 19:17:22 +0400 Timur Nurlygayanov wrote:
>
> > Hello all,
> >
> > I installed OpenStack with Glance + Ceph OSD with replication factor 2
> > and now I can see the write operations are extremely slow.
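A quick sanity check on the spew figures above: with 4 KiB writes, IOPS is
simply throughput divided by block size, so the two reported numbers should
agree:

```shell
# spew reported WTR: 56506.22 KiB/s with -b 4096 (4 KiB) writes.
# IOPS = throughput / block size; truncated to an integer here:
awk 'BEGIN { printf "%d\n", 56506.22 / 4 }'
# prints 14126
```

That matches spew's reported 14126.55 IOPS, so the tool is internally
consistent. Note that spew issues one I/O at a time, so this is effectively a
queue-depth-1 result; a tool such as fio with a higher iodepth may show more
of the SSDs' rated ~40k IOPS (an assumption worth verifying, not a guarantee).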
> > For example, I can see only 0.04 MB/s write speed when I run rados bench
> > with 512b blocks:
> >
> > rados bench -p test 60 write --no-cleanup -t 1 -b 512
> >
> There are 2 things wrong with this test:
>
> 1. You're using rados bench, when in fact you should be testing from
> within VMs. For starters, a VM could make use of the rbd cache you
> enabled; rados bench won't.
>
> 2. Given the parameters of this test, you're testing network latency more
> than anything else. If you monitor the Ceph nodes (atop is a good tool
> for that), you will probably see that neither CPU nor disk resources are
> being exhausted. With a single thread, rados bench puts that tiny block
> of 512 bytes on the wire, the primary OSD for the PG has to write it to
> the journal (on your slow, non-SSD disks) and send it to the secondary
> OSD, which has to ACK the write to its journal back to the primary one,
> which in turn ACKs it to the client (rados bench), and only then can
> rados bench send the next packet.
> You get the drift.
>
> Using your parameters I can get 0.17MB/s on a pre-production cluster
> that uses 4x QDR InfiniBand (IPoIB) connections; on my shitty test
> cluster with 1Gb/s links I get similar results to you, unsurprisingly.
>
> Ceph excels only with lots of parallelism, so an individual thread might
> be slow (and in your case HAS to be slow, which has nothing to do with
> Ceph per se) but many parallel ones will utilize the resources available.
>
> Having data blocks that are adequately sized (4MB, the default rados
> object size) will help with bandwidth, and the rbd cache inside a
> properly configured VM should make that happen.
>
> Of course, in most real-life scenarios you will run out of IOPS long
> before you run out of bandwidth.
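The parallelism effect described above can be tried directly by repeating the
same rados bench run with more threads and larger objects. A rough sketch,
with the pool name `test` taken from the original test and the thread count
and object size as assumptions:

```shell
# Original, latency-bound run: 1 thread, 512-byte writes
rados bench -p test 60 write --no-cleanup -t 1 -b 512

# Parallel, bandwidth-oriented run: 32 threads, 4 MB objects
# (4194304 bytes, the default rados object size)
rados bench -p test 60 write --no-cleanup -t 32 -b 4194304

# Remove the benchmark objects afterwards (on older releases the
# cleanup subcommand may need the benchmark object prefix)
rados -p test cleanup
```

The first run measures round-trip latency per object; the second should come
much closer to the disks' and network's throughput limits.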
> >
> >  Maintaining 1 concurrent writes of 512 bytes for up to 60 seconds or 0
> > objects
> >  Object prefix: benchmark_data_node-17.domain.tld_15862
> >    sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
> >      0       0         0         0         0         0         -         0
> >      1       1        83        82 0.0400341 0.0400391  0.008465 0.0120985
> >      2       1       169       168 0.0410111 0.0419922  0.080433 0.0118995
> >      3       1       240       239 0.0388959  0.034668  0.008052 0.0125385
> >      4       1       356       355 0.0433309 0.0566406   0.00837 0.0112662
> >      5       1       472       471 0.0459919 0.0566406  0.008343 0.0106034
> >      6       1       550       549 0.0446735 0.0380859  0.036639 0.0108791
> >      7       1       581       580 0.0404538 0.0151367  0.008614 0.0120654
> >
> > My test environment configuration:
> > Hardware servers with 1Gb network interfaces, 64Gb RAM and 16 CPU cores
> > per node, HDDs WDC WD5003ABYX-01WERA0.
>
> For anything production, consider faster network connections and SSD
> journals.
>
> > OpenStack with 1 controller, 1 compute and 2 ceph nodes (ceph on
> > separate nodes).
> > CentOS 6.5, kernel 2.6.32-431.el6.x86_64.
>
> You will probably want a 3.14 or 3.16 kernel for various reasons.
>
> Regards,
>
> Christian
>
> > I tested several config options for optimization, like in
> > /etc/ceph/ceph.conf:
> >
> > [default]
> > ...
> > osd_pool_default_pg_num = 1024
> > osd_pool_default_pgp_num = 1024
> > osd_pool_default_flag_hashpspool = true
> > ...
> > [osd]
> > osd recovery max active = 1
> > osd max backfills = 1
> > filestore max sync interval = 30
> > filestore min sync interval = 29
> > filestore flusher = false
> > filestore queue max ops = 10000
> > filestore op threads = 16
> > osd op threads = 16
> > ...
> > [client]
> > rbd_cache = true
> > rbd_cache_writethrough_until_flush = true
> >
> > and in /etc/cinder/cinder.conf:
> >
> > [DEFAULT]
> > volume_tmp_dir=/tmp
> >
> > but as a result performance increased by only ~30%, which doesn't look
> > like a huge success.
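On the SSD-journal suggestion above: relocating a FileStore journal to a
separate SSD partition can be sketched roughly as below. The partition name
/dev/sdf1 and OSD id 0 are assumptions for illustration; this is untested,
so please double-check the exact steps against your Ceph release before
running it.

```shell
# Prevent rebalancing while the OSD is down
ceph osd set noout

# Stop the OSD and flush its journal to the data disk
service ceph stop osd.0
ceph-osd -i 0 --flush-journal

# Point the journal at the new SSD partition (hypothetical /dev/sdf1)
rm /var/lib/ceph/osd/ceph-0/journal
ln -s /dev/sdf1 /var/lib/ceph/osd/ceph-0/journal
ceph-osd -i 0 --mkjournal

# Bring the OSD back and re-enable rebalancing
service ceph start osd.0
ceph osd unset noout
```

Done one OSD at a time, this keeps the cluster serving I/O throughout the
migration.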
> >
> > Non-default mount options and TCP optimization increased the speed by
> > about 1%:
> >
> > [root@node-17 ~]# mount | grep ceph
> > /dev/sda4 on /var/lib/ceph/osd/ceph-0 type xfs
> > (rw,noexec,nodev,noatime,nodiratime,user_xattr,data=writeback,barrier=0)
> >
> > [root@node-17 ~]# cat /etc/sysctl.conf
> > net.core.rmem_max = 16777216
> > net.core.wmem_max = 16777216
> > net.ipv4.tcp_rmem = 4096 87380 16777216
> > net.ipv4.tcp_wmem = 4096 65536 16777216
> > net.ipv4.tcp_window_scaling = 1
> > net.ipv4.tcp_timestamps = 1
> > net.ipv4.tcp_sack = 1
> >
> > Do we have other ways to significantly improve Ceph storage
> > performance? Any feedback and comments are welcome!
> >
> > Thank you!
>
>
> --
> Christian Balzer        Network/Systems Engineer
> [email protected]          Global OnLine Japan/Fusion Communications
> http://www.gol.com/

--
Timur,
QA Engineer
OpenStack Projects
Mirantis Inc
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
