Timur, read this thread:
https://www.mail-archive.com/[email protected]/msg12486.html


2014-10-01 16:24 GMT+04:00 Andrei Mikhailovsky <[email protected]>:

> Timur,
>
> As far as I know, the latest master has a number of improvements for SSD
> disks. If you check the mailing list discussion from a couple of weeks
> back, you can see that the latest stable firefly is not that well optimised
> for SSD drives and IO is limited. However, changes are being made to
> address that.
>
> I am quite surprised that you can get 10K IOPS, as in my tests I was not
> getting over 3K IOPS on SSD disks which are capable of doing 90K IOPS.
>
> P.S. Does anyone know if the SSD optimisation code will be added to the
> next maintenance release of firefly?
>
> Andrei
> ------------------------------
>
> *From: *"Timur Nurlygayanov" <[email protected]>
> *To: *"Christian Balzer" <[email protected]>
> *Cc: *[email protected]
> *Sent: *Wednesday, 1 October, 2014 1:11:25 PM
> *Subject: *Re: [ceph-users] Why performance of benchmarks with small
> blocks is extremely small?
>
>
> Hello Christian,
>
> Thank you for your detailed answer!
>
> I have another pre-production environment with 4 Ceph servers and 4 SSD
> disks per Ceph server (each Ceph OSD on a separate SSD disk).
> Should I move the journals to other disks, or is that not required in my
> case?
>
> [root@ceph-node ~]# mount | grep ceph
> /dev/sdb4 on /var/lib/ceph/osd/ceph-0 type xfs
> (rw,noexec,nodev,noatime,nodiratime,inode64,logbsize=256k,delaylog,user_xattr,data=writeback)
> /dev/sde4 on /var/lib/ceph/osd/ceph-5 type xfs
> (rw,noexec,nodev,noatime,nodiratime,inode64,logbsize=256k,delaylog,user_xattr,data=writeback)
> /dev/sdd4 on /var/lib/ceph/osd/ceph-2 type xfs
> (rw,noexec,nodev,noatime,nodiratime,inode64,logbsize=256k,delaylog,user_xattr,data=writeback)
> /dev/sdc4 on /var/lib/ceph/osd/ceph-1 type xfs
> (rw,noexec,nodev,noatime,nodiratime,inode64,logbsize=256k,delaylog,user_xattr,data=writeback)
>
> [root@ceph-node ~]# find /var/lib/ceph/osd/ | grep journal
> /var/lib/ceph/osd/ceph-0/journal
> /var/lib/ceph/osd/ceph-5/journal
> /var/lib/ceph/osd/ceph-1/journal
> /var/lib/ceph/osd/ceph-2/journal
>
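> If moving them is the right call, I assume the filestore procedure is
> roughly the following (a sketch for osd.0 with a hypothetical spare
> partition /dev/sdf1 -- please correct me if I have it wrong):
>
> # stop the OSD and flush its journal before touching it
> service ceph stop osd.0
> ceph-osd -i 0 --flush-journal
> # point the journal symlink at the new device and recreate the journal
> rm -f /var/lib/ceph/osd/ceph-0/journal
> ln -s /dev/sdf1 /var/lib/ceph/osd/ceph-0/journal
> ceph-osd -i 0 --mkjournal
> service ceph start osd.0
>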
> My SSD disks can do ~40k IOPS per disk, but in the VM I see only ~10k-14k
> IOPS for disk operations.
> To check this I run the following command on a VM whose root partition is
> mounted on a disk in the Ceph storage:
>
> root@test-io:/home/ubuntu# rm -rf /tmp/test && spew -d --write -r -b 4096
> 10M /tmp/test
> WTR:    56506.22 KiB/s   Transfer time: 00:00:00    IOPS:    14126.55
>
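> For comparison, I assume a roughly equivalent direct-I/O test with fio (if
> it is installed in the VM) would look something like:
>
> fio --name=rbd-4k-test --filename=/tmp/fio-test --rw=write --bs=4k \
>     --size=10M --ioengine=libaio --direct=1 --iodepth=1
>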
> Is this the expected result, or can I improve the performance and get at
> least 30k-40k IOPS on the VM disks? (I have 2x 10Gb/s network interfaces in
> LACP bonding for the storage network, so the network is unlikely to be the
> bottleneck.)
>
> Thank you!
>
>
> On Wed, Oct 1, 2014 at 6:50 AM, Christian Balzer <[email protected]> wrote:
>
>>
>> Hello,
>>
>> [reduced to ceph-users]
>>
>> On Sat, 27 Sep 2014 19:17:22 +0400 Timur Nurlygayanov wrote:
>>
>> > Hello all,
>> >
>> > I installed OpenStack with Glance + Ceph OSD with replication factor 2,
>> > and now I can see that write operations are extremely slow.
>> > For example, I can see only 0.04 MB/s write speed when I run rados bench
>> > with 512b blocks:
>> >
>> > rados bench -p test 60 write --no-cleanup -t 1 -b 512
>> >
>> There are two things wrong with this test:
>>
>> 1. You're using rados bench, when in fact you should be testing from
>> within VMs. For starters, a VM can make use of the rbd cache you enabled;
>> rados bench won't.
>>
>> 2. Given the parameters of this test, you're testing network latency more
>> than anything else. If you monitor the Ceph nodes (atop is a good tool for
>> that), you will probably see that neither CPU nor disk resources are
>> being exhausted. With a single thread, rados bench puts that tiny block of 512
>> bytes on the wire, the primary OSD for the PG has to write this to the
>> journal (on your slow, non-SSD disks) and send it to the secondary OSD,
>> which has to ACK the write to its journal back to the primary one, which
>> in turn then ACKs it to the client (rados bench) and then rados bench can
>> send the next packet.
>> You get the drift.
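>> To put rough numbers on it: your output below shows an average latency of
>> about 0.012s per request, so a single thread completes at most about
>> 1/0.012 ≈ 83 writes/s, and 83 × 512 bytes ≈ 42 KB/s ≈ 0.04 MB/s -- exactly
>> the figure rados bench reports.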
>>
>> Using your parameters I can get 0.17MB/s on a pre-production cluster
>> that uses 4xQDR Infiniband (IPoIB) connections; on my shitty test cluster
>> with 1Gb/s links I get results similar to yours, unsurprisingly.
>>
>> Ceph excels only with lots of parallelism, so an individual thread might
>> be slow (and in your case HAS to be slow, which has nothing to do with
>> Ceph per se), but many parallel ones will utilize the resources available.
>>
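>> For example, rerunning your bench with many threads and the default 4MB
>> objects -- something like: rados bench -p test 60 write -t 32 -- should
>> paint a very different picture.
>>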
>> Having data blocks that are adequately sized (4MB, the default rados size)
>> will help with bandwidth, and the rbd cache inside a properly configured VM
>> should make that happen.
>>
>> Of course in most real life scenarios you will run out of IOPS long before
>> you run out of bandwidth.
>>
>>
>> >  Maintaining 1 concurrent writes of 512 bytes for up to 60 seconds or 0 objects
>> >  Object prefix: benchmark_data_node-17.domain.tld_15862
>> >    sec Cur ops   started  finished   avg MB/s   cur MB/s   last lat    avg lat
>> >      0       0         0         0          0          0          -          0
>> >      1       1        83        82  0.0400341  0.0400391   0.008465  0.0120985
>> >      2       1       169       168  0.0410111  0.0419922   0.080433  0.0118995
>> >      3       1       240       239  0.0388959   0.034668   0.008052  0.0125385
>> >      4       1       356       355  0.0433309  0.0566406    0.00837  0.0112662
>> >      5       1       472       471  0.0459919  0.0566406   0.008343  0.0106034
>> >      6       1       550       549  0.0446735  0.0380859   0.036639  0.0108791
>> >      7       1       581       580  0.0404538  0.0151367   0.008614  0.0120654
>> >
>> >
>> > My test environment configuration:
>> > Hardware servers with 1Gb network interfaces, 64GB RAM, and 16 CPU cores
>> > per node, with WDC WD5003ABYX-01WERA0 HDDs.
>> For anything production, consider faster network connections and SSD
>> journals.
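>> (With an SSD in each node, pointing every OSD's journal at a partition on
>> it -- e.g. a hypothetical osd journal = /dev/sdf1 in that OSD's section of
>> ceph.conf -- is usually the single biggest win for small writes.)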
>>
>> > OpenStack with 1 controller, 1 compute node, and 2 Ceph nodes (Ceph on
>> > separate nodes).
>> > CentOS 6.5, kernel 2.6.32-431.el6.x86_64.
>> >
>> You will probably want a 3.14 or 3.16 kernel for various reasons.
>>
>> Regards,
>>
>> Christian
>>
>> > I tested several config options for optimizations, like in
>> > /etc/ceph/ceph.conf:
>> >
>> > [default]
>> > ...
>> > osd_pool_default_pg_num = 1024
>> > osd_pool_default_pgp_num = 1024
>> > osd_pool_default_flag_hashpspool = true
>> > ...
>> > [osd]
>> > osd recovery max active = 1
>> > osd max backfills = 1
>> > filestore max sync interval = 30
>> > filestore min sync interval = 29
>> > filestore flusher = false
>> > filestore queue max ops = 10000
>> > filestore op threads = 16
>> > osd op threads = 16
>> > ...
>> > [client]
>> > rbd_cache = true
>> > rbd_cache_writethrough_until_flush = true
>> >
>> > and in /etc/cinder/cinder.conf:
>> >
>> > [DEFAULT]
>> > volume_tmp_dir=/tmp
>> >
>> > but as a result performance increased by only ~30%, which does not look
>> > like a huge success.
>> >
>> > Non-default mount options and TCP optimizations increased the speed by
>> > only about 1%:
>> >
>> > [root@node-17 ~]# mount | grep ceph
>> > /dev/sda4 on /var/lib/ceph/osd/ceph-0 type xfs
>> > (rw,noexec,nodev,noatime,nodiratime,user_xattr,data=writeback,barrier=0)
>> >
>> > [root@node-17 ~]# cat /etc/sysctl.conf
>> > net.core.rmem_max = 16777216
>> > net.core.wmem_max = 16777216
>> > net.ipv4.tcp_rmem = 4096 87380 16777216
>> > net.ipv4.tcp_wmem = 4096 65536 16777216
>> > net.ipv4.tcp_window_scaling = 1
>> > net.ipv4.tcp_timestamps = 1
>> > net.ipv4.tcp_sack = 1
>> >
>> >
>> > Are there other ways to significantly improve Ceph storage performance?
>> > Any feedback and comments are welcome!
>> >
>> > Thank you!
>> >
>> >
>>
>>
>> --
>> Christian Balzer        Network/Systems Engineer
>> [email protected]           Global OnLine Japan/Fusion Communications
>> http://www.gol.com/
>>
>
>
>
> --
>
> Timur,
> QA Engineer
> OpenStack Projects
> Mirantis Inc
>
> _______________________________________________
> ceph-users mailing list
> [email protected]
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


-- 
Best regards, Irek Fasikhov
Mobile: +79229045757
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
