Hello Christian,

Thank you for your detailed answer!

I have another pre-production environment with 4 Ceph servers and 4 SSD
disks per Ceph server (each Ceph OSD on a separate SSD disk). Should I
move the journals to other disks, or is that not required in my case?

[root@ceph-node ~]# mount | grep ceph
/dev/sdb4 on /var/lib/ceph/osd/ceph-0 type xfs (rw,noexec,nodev,noatime,nodiratime,inode64,logbsize=256k,delaylog,user_xattr,data=writeback)
/dev/sde4 on /var/lib/ceph/osd/ceph-5 type xfs (rw,noexec,nodev,noatime,nodiratime,inode64,logbsize=256k,delaylog,user_xattr,data=writeback)
/dev/sdd4 on /var/lib/ceph/osd/ceph-2 type xfs (rw,noexec,nodev,noatime,nodiratime,inode64,logbsize=256k,delaylog,user_xattr,data=writeback)
/dev/sdc4 on /var/lib/ceph/osd/ceph-1 type xfs (rw,noexec,nodev,noatime,nodiratime,inode64,logbsize=256k,delaylog,user_xattr,data=writeback)

[root@ceph-node ~]# find /var/lib/ceph/osd/ | grep journal
/var/lib/ceph/osd/ceph-0/journal
/var/lib/ceph/osd/ceph-5/journal
/var/lib/ceph/osd/ceph-1/journal
/var/lib/ceph/osd/ceph-2/journal

My SSD disks can do ~40k IOPS each, but on the VM I see only ~10k-14k
IOPS for disk operations. To check this, I executed the following command
on a VM whose root partition is mounted on a disk in the Ceph storage:

root@test-io:/home/ubuntu# rm -rf /tmp/test && spew -d --write -r -b 4096 10M /tmp/test
WTR:    56506.22 KiB/s   Transfer time: 00:00:00    IOPS:    14126.55

Is this the expected result, or can I improve performance and get at
least 30k-40k IOPS on the VM disks? (I have 2x 10Gb/s network interfaces
in LACP bonding for the storage network, so it looks like the network
can't be the bottleneck.)

Thank you!

On Wed, Oct 1, 2014 at 6:50 AM, Christian Balzer <[email protected]> wrote:
>
> Hello,
>
> [reduced to ceph-users]
>
> On Sat, 27 Sep 2014 19:17:22 +0400 Timur Nurlygayanov wrote:
>
> > Hello all,
> >
> > I installed OpenStack with Glance + Ceph OSD with replication factor 2
> > and now I can see the write operations are extremely slow.
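A quick sanity check on the spew figures above: with 4 KiB writes, IOPS is
simply throughput divided by block size, so the two reported numbers should
agree:

```shell
# spew reported WTR: 56506.22 KiB/s with -b 4096 (4 KiB) writes.
# IOPS = throughput / block size; truncated to an integer here:
awk 'BEGIN { printf "%d\n", 56506.22 / 4 }'
# prints 14126
```

That matches spew's reported 14126.55 IOPS, so the tool is internally
consistent. Note that spew issues one I/O at a time, so this is effectively a
queue-depth-1 result; a tool such as fio with a higher iodepth may show more
of the SSDs' rated ~40k IOPS (an assumption worth verifying, not a guarantee).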
> > For example, I can see only 0.04 MB/s write speed when I run rados bench
> > with 512b blocks:
> >
> > rados bench -p test 60 write --no-cleanup -t 1 -b 512
> >
> There are 2 things wrong with this test:
>
> 1. You're using rados bench, when in fact you should be testing from
> within VMs. For starters, a VM could make use of the rbd cache you
> enabled; rados bench won't.
>
> 2. Given the parameters of this test, you're testing network latency more
> than anything else. If you monitor the Ceph nodes (atop is a good tool
> for that), you will probably see that neither CPU nor disk resources are
> being exhausted. With a single thread, rados bench puts that tiny block
> of 512 bytes on the wire, the primary OSD for the PG has to write it to
> the journal (on your slow, non-SSD disks) and send it to the secondary
> OSD, which has to ACK the write to its journal back to the primary one,
> which in turn ACKs it to the client (rados bench), and only then can
> rados bench send the next packet.
> You get the drift.
>
> Using your parameters I can get 0.17MB/s on a pre-production cluster
> that uses 4x QDR InfiniBand (IPoIB) connections; on my shitty test
> cluster with 1Gb/s links I get similar results to you, unsurprisingly.
>
> Ceph excels only with lots of parallelism, so an individual thread might
> be slow (and in your case HAS to be slow, which has nothing to do with
> Ceph per se) but many parallel ones will utilize the resources available.
>
> Having data blocks that are adequately sized (4MB, the default rados
> object size) will help with bandwidth, and the rbd cache inside a
> properly configured VM should make that happen.
>
> Of course, in most real-life scenarios you will run out of IOPS long
> before you run out of bandwidth.
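The parallelism effect described above can be tried directly by repeating the
same rados bench run with more threads and larger objects. A rough sketch,
with the pool name `test` taken from the original test and the thread count
and object size as assumptions:

```shell
# Original, latency-bound run: 1 thread, 512-byte writes
rados bench -p test 60 write --no-cleanup -t 1 -b 512

# Parallel, bandwidth-oriented run: 32 threads, 4 MB objects
# (4194304 bytes, the default rados object size)
rados bench -p test 60 write --no-cleanup -t 32 -b 4194304

# Remove the benchmark objects afterwards (on older releases the
# cleanup subcommand may need the benchmark object prefix)
rados -p test cleanup
```

The first run measures round-trip latency per object; the second should come
much closer to the disks' and network's throughput limits.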
> >
> >  Maintaining 1 concurrent writes of 512 bytes for up to 60 seconds or 0
> > objects
> >  Object prefix: benchmark_data_node-17.domain.tld_15862
> >    sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
> >      0       0         0         0         0         0         -         0
> >      1       1        83        82 0.0400341 0.0400391  0.008465 0.0120985
> >      2       1       169       168 0.0410111 0.0419922  0.080433 0.0118995
> >      3       1       240       239 0.0388959  0.034668  0.008052 0.0125385
> >      4       1       356       355 0.0433309 0.0566406   0.00837 0.0112662
> >      5       1       472       471 0.0459919 0.0566406  0.008343 0.0106034
> >      6       1       550       549 0.0446735 0.0380859  0.036639 0.0108791
> >      7       1       581       580 0.0404538 0.0151367  0.008614 0.0120654
> >
> > My test environment configuration:
> > Hardware servers with 1Gb network interfaces, 64Gb RAM and 16 CPU cores
> > per node, HDDs WDC WD5003ABYX-01WERA0.
>
> For anything production, consider faster network connections and SSD
> journals.
>
> > OpenStack with 1 controller, 1 compute and 2 ceph nodes (ceph on
> > separate nodes).
> > CentOS 6.5, kernel 2.6.32-431.el6.x86_64.
>
> You will probably want a 3.14 or 3.16 kernel for various reasons.
>
> Regards,
>
> Christian
>
> > I tested several config options for optimization, like in
> > /etc/ceph/ceph.conf:
> >
> > [default]
> > ...
> > osd_pool_default_pg_num = 1024
> > osd_pool_default_pgp_num = 1024
> > osd_pool_default_flag_hashpspool = true
> > ...
> > [osd]
> > osd recovery max active = 1
> > osd max backfills = 1
> > filestore max sync interval = 30
> > filestore min sync interval = 29
> > filestore flusher = false
> > filestore queue max ops = 10000
> > filestore op threads = 16
> > osd op threads = 16
> > ...
> > [client]
> > rbd_cache = true
> > rbd_cache_writethrough_until_flush = true
> >
> > and in /etc/cinder/cinder.conf:
> >
> > [DEFAULT]
> > volume_tmp_dir=/tmp
> >
> > but as a result performance increased by only ~30%, which doesn't look
> > like a huge success.
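On the SSD-journal suggestion above: relocating a FileStore journal to a
separate SSD partition can be sketched roughly as below. The partition name
/dev/sdf1 and OSD id 0 are assumptions for illustration; this is untested,
so please double-check the exact steps against your Ceph release before
running it.

```shell
# Prevent rebalancing while the OSD is down
ceph osd set noout

# Stop the OSD and flush its journal to the data disk
service ceph stop osd.0
ceph-osd -i 0 --flush-journal

# Point the journal at the new SSD partition (hypothetical /dev/sdf1)
rm /var/lib/ceph/osd/ceph-0/journal
ln -s /dev/sdf1 /var/lib/ceph/osd/ceph-0/journal
ceph-osd -i 0 --mkjournal

# Bring the OSD back and re-enable rebalancing
service ceph start osd.0
ceph osd unset noout
```

Done one OSD at a time, this keeps the cluster serving I/O throughout the
migration.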
> >
> > Non-default mount options and TCP optimization increased the speed by
> > about 1%:
> >
> > [root@node-17 ~]# mount | grep ceph
> > /dev/sda4 on /var/lib/ceph/osd/ceph-0 type xfs
> > (rw,noexec,nodev,noatime,nodiratime,user_xattr,data=writeback,barrier=0)
> >
> > [root@node-17 ~]# cat /etc/sysctl.conf
> > net.core.rmem_max = 16777216
> > net.core.wmem_max = 16777216
> > net.ipv4.tcp_rmem = 4096 87380 16777216
> > net.ipv4.tcp_wmem = 4096 65536 16777216
> > net.ipv4.tcp_window_scaling = 1
> > net.ipv4.tcp_timestamps = 1
> > net.ipv4.tcp_sack = 1
> >
> > Do we have other ways to significantly improve Ceph storage
> > performance? Any feedback and comments are welcome!
> >
> > Thank you!
>
>
> --
> Christian Balzer        Network/Systems Engineer
> [email protected]          Global OnLine Japan/Fusion Communications
> http://www.gol.com/

--
Timur,
QA Engineer
OpenStack Projects
Mirantis Inc
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
