Measuring resource load as outlined earlier will show whether the drives are
performing well or not. Also, how many OSDs do you have?
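
For a quick count, something like:

ceph osd stat
ceph osd tree

will show the totals and the per-host layout.
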
On 2017-10-18 19:26, Russell Glaue wrote:
> The SSD drives are Crucial M500.
> A Ceph user did some benchmarks and found they performed well:
> https://forum.proxmox.com/threads/ceph-bad-performance-in-qemu-guests.21551/
> [1]
>
> However, a user comment from 3 years ago on the blog post you linked to says
> to avoid the Crucial M500.
>
> Yet this performance post indicates the Crucial M500 is good:
> https://inside.servers.com/ssd-performance-2017-c4307a92dea [2]
>
> On Wed, Oct 18, 2017 at 11:53 AM, Maged Mokhtar <mmokh...@petasan.org> wrote:
>
> Check out the following link: some SSDs perform badly in Ceph due to sync
> writes to the journal:
>
> https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
> [3]
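>
> The test in that post boils down to an fio sync-write run; a sketch (the
> device path is a placeholder, and note it writes to the raw device, so use
> only a disk with no data you care about):
>
> fio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k \
>     --numjobs=1 --iodepth=1 --runtime=60 --time_based \
>     --group_reporting --name=journal-test
>
> Journal-suitable SSDs sustain thousands of iops on this test; drives with
> slow synchronous writes drop to a few hundred or less.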
>
> Another thing that can help: re-run the rados bench with 32 threads as a
> stress test and watch resource usage with atop (or collectl/sar), checking
> %busy for cpu and disks, to get an idea of what is holding your cluster
> back. For example: if cpu/disk % are all low, check your network/switches.
> If disk %busy is high (90%) for all disks, then your disks are the
> bottleneck, which means either your SSDs are not suitable for Ceph or you
> have too few disks (which I doubt is the case). If only one disk's %busy is
> high, there may be something wrong with that disk and it should be removed.
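>
> For example, something along these lines on each storage node while the
> bench runs (whichever tool is installed):
>
> atop 2
> iostat -x 2
> sar -d 2 15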
>
> Maged
>
> On 2017-10-18 18:13, Russell Glaue wrote:
>
> In my previous post, one of my points was wondering whether the request
> size would increase if I enabled jumbo packets. Currently they are
> disabled.
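>
> (A quick way to check, assuming the storage interface is eth0 and
> <storage-node-ip> is a peer on the storage network; both are placeholders:
>
> ip link show eth0 | grep mtu
> ping -M do -s 8972 <storage-node-ip>
>
> With a 9000-byte MTU the 8972-byte unfragmentable ping should succeed,
> since 8972 + 28 bytes of IP/ICMP headers = 9000.)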
>
> @jdillama: The qemu settings for both of these guest machines, with
> RAID/LVM and Ceph/rbd images, are the same. I do not think that changing
> the qemu settings of "min_io_size=<limited to 16bits>,opt_io_size=<RBD
> image object size>" will directly address the issue.
> @mmokhtar: Ok. So you suggest the request size is a result of the problem
> and not its cause, meaning I should go after a different issue.
>
> I have been trying to get write speeds up to what people on this mailing
> list are discussing.
> It seems that for our configuration, since it matches others', we should be
> getting about 70MB/s write speed.
> But we are not getting that.
> Single writes to disk are lucky to reach 5MB/s to 6MB/s, but are typically
> 1MB/s to 2MB/s.
> Monitoring the entire Ceph cluster (using http://cephdash.crapworks.de/), I
> have seen very rare momentary spikes up to 30MB/s.
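>
> (As a crude sanity check of the single-writer number, a direct synchronous
> dd should behave similarly; the file path is arbitrary:
>
> dd if=/dev/zero of=/var/tmp/ddtest bs=4k count=20000 oflag=direct,dsync
>
> which mimics a queue-depth-1 4k write stream like the MySQL workload.)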
>
> My storage network is connected via a 10Gb switch.
> I have 4 storage servers, each with an LSI Logic MegaRAID SAS 2208
> controller.
> Each storage server has nine 1TB SSD drives, each drive serving as one OSD
> (no RAID).
> Each drive is one LVM volume group with two volumes: one for the OSD data,
> one for the journal.
>
> Each OSD is formatted with XFS.
> The crush map is simple: default->rack->[host[1..4]->osd] with evenly
> distributed weights.
> The redundancy is triple replication.
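>
> (For reference, these can be confirmed with "ceph osd tree" for the crush
> layout and "ceph osd pool get <pool> size" for the replication factor,
> where <pool> is whichever pool backs the RBD images.)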
>
> While I have read comments that having the OSD and journal on the same disk
> decreases write speed, I have also read that once past 8 OSDs per node this
> is the recommended configuration. That is also the reason why SSD drives
> are used exclusively for OSDs in the storage nodes.
> Nonetheless, I was still expecting write speeds to be above 30MB/s, not
> below 6MB/s.
> Even at 12x slower than the RAID, using my previously posted iostat data
> set, I should be seeing write speeds that average 10MB/s, not 2MB/s.
>
> Regarding the rados benchmark tests you asked me to run, here is the
> output:
>
> [centos7]# rados bench -p scbench -b 4096 30 write -t 1
> Maintaining 1 concurrent writes of 4096 bytes to objects of size 4096 for up
> to 30 seconds or 0 objects
> Object prefix: benchmark_data_hamms.sys.cu.cait.org_85049
> sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
> 0 0 0 0 0 0 - 0
> 1 1 201 200 0.78356 0.78125 0.00522307 0.00496574
> 2 1 469 468 0.915303 1.04688 0.00437497 0.00426141
> 3 1 741 740 0.964371 1.0625 0.00512853 0.0040434
> 4 1 888 887 0.866739 0.574219 0.00307699 0.00450177
> 5 1 1147 1146 0.895725 1.01172 0.00376454 0.0043559
> 6 1 1325 1324 0.862293 0.695312 0.00459443 0.004525
> 7 1 1494 1493 0.83339 0.660156 0.00461002 0.00458452
> 8 1 1736 1735 0.847369 0.945312 0.00253971 0.00460458
> 9 1 1998 1997 0.866922 1.02344 0.00236573 0.00450172
> 10 1 2260 2259 0.882563 1.02344 0.00262179 0.00442152
> 11 1 2526 2525 0.896775 1.03906 0.00336914 0.00435092
> 12 1 2760 2759 0.898203 0.914062 0.00351827 0.00434491
> 13 1 3016 3015 0.906025 1 0.00335703 0.00430691
> 14 1 3257 3256 0.908545 0.941406 0.00332344 0.00429495
> 15 1 3490 3489 0.908644 0.910156 0.00318815 0.00426387
> 16 1 3728 3727 0.909952 0.929688 0.0032881 0.00428895
> 17 1 3986 3985 0.915703 1.00781 0.00274809 0.0042614
> 18 1 4250 4249 0.922116 1.03125 0.00287411 0.00423214
> 19 1 4505 4504 0.926003 0.996094 0.00375435 0.00421442
> 2017-10-18 10:56:31.267173 min lat: 0.00181259 max lat: 0.270553 avg lat:
> 0.00420118
> sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
> 20 1 4757 4756 0.928915 0.984375 0.00463972 0.00420118
> 21 1 5009 5008 0.93155 0.984375 0.00360065 0.00418937
> 22 1 5235 5234 0.929329 0.882812 0.00626214 0.004199
> 23 1 5500 5499 0.933925 1.03516 0.00466584 0.00417836
> 24 1 5708 5707 0.928861 0.8125 0.00285727 0.00420146
> 25 0 5964 5964 0.931858 1.00391 0.00417383 0.0041881
> 26 1 6216 6215 0.933722 0.980469 0.0041009 0.00417915
> 27 1 6481 6480 0.937474 1.03516 0.00307484 0.00416118
> 28 1 6745 6744 0.940819 1.03125 0.00266329 0.00414777
> 29 1 7003 7002 0.943124 1.00781 0.00305905 0.00413758
> 30 1 7271 7270 0.946578 1.04688 0.00391017 0.00412238
> Total time run: 30.006060
> Total writes made: 7272
> Write size: 4096
> Object size: 4096
> Bandwidth (MB/sec): 0.946684
> Stddev Bandwidth: 0.123762
> Max bandwidth (MB/sec): 1.0625
> Min bandwidth (MB/sec): 0.574219
> Average IOPS: 242
> Stddev IOPS: 31
> Max IOPS: 272
> Min IOPS: 147
> Average Latency(s): 0.00412247
> Stddev Latency(s): 0.00648437
> Max latency(s): 0.270553
> Min latency(s): 0.00175318
> Cleaning up (deleting benchmark objects)
> Clean up completed and total clean up time :29.069423
>
> [centos7]# rados bench -p scbench -b 4096 30 write -t 32
> Maintaining 32 concurrent writes of 4096 bytes to objects of size 4096 for up
> to 30 seconds or 0 objects
> Object prefix: benchmark_data_hamms.sys.cu.cait.org_86076
> sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
> 0 0 0 0 0 0 - 0
> 1 32 3013 2981 11.6438 11.6445 0.00247906 0.00572026
> 2 32 5349 5317 10.3834 9.125 0.00246662 0.00932016
> 3 32 5707 5675 7.3883 1.39844 0.00389774 0.0156726
> 4 32 5895 5863 5.72481 0.734375 1.13137 0.0167946
> 5 32 6869 6837 5.34068 3.80469 0.0027652 0.0226577
> 6 32 8901 8869 5.77306 7.9375 0.0053211 0.0216259
> 7 32 10800 10768 6.00785 7.41797 0.00358187 0.0207418
> 8 32 11825 11793 5.75728 4.00391 0.00217575 0.0215494
> 9 32 12941 12909 5.6019 4.35938 0.00278512 0.0220567
> 10 32 13317 13285 5.18849 1.46875 0.0034973 0.0240665
> 11 32 16189 16157 5.73653 11.2188 0.00255841 0.0212708
> 12 32 16749 16717 5.44077 2.1875 0.00330334 0.0215915
> 13 32 16756 16724 5.02436 0.0273438 0.00338994 0.021849
> 14 32 17908 17876 4.98686 4.5 0.00402598 0.0244568
> 15 32 17936 17904 4.66171 0.109375 0.00375799 0.0245545
> 16 32 18279 18247 4.45409 1.33984 0.00483873 0.0267929
> 17 32 18372 18340 4.21346 0.363281 0.00505187 0.0275887
> 18 32 19403 19371 4.20309 4.02734 0.00545154 0.029348
> 19 31 19845 19814 4.07295 1.73047 0.00254726 0.0306775
> 2017-10-18 10:57:58.160536 min lat: 0.0015005 max lat: 2.27707 avg lat:
> 0.0307559
> sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
> 20 31 20401 20370 3.97788 2.17188 0.00307238 0.0307559
> 21 32 21338 21306 3.96254 3.65625 0.00464563 0.0312288
> 22 32 23057 23025 4.0876 6.71484 0.00296295 0.0299267
> 23 32 23057 23025 3.90988 0 - 0.0299267
> 24 32 23803 23771 3.86837 1.45703 0.00301471 0.0312804
> 25 32 24112 24080 3.76191 1.20703 0.00191063 0.0331462
> 26 31 25303 25272 3.79629 4.65625 0.00794399 0.0329129
> 27 32 28803 28771 4.16183 13.668 0.0109817 0.0297469
> 28 32 29592 29560 4.12325 3.08203 0.00188185 0.0301911
> 29 32 30595 30563 4.11616 3.91797 0.00379099 0.0296794
> 30 32 31031 30999 4.03572 1.70312 0.00283347 0.0302411
> Total time run: 30.822350
> Total writes made: 31032
> Write size: 4096
> Object size: 4096
> Bandwidth (MB/sec): 3.93282
> Stddev Bandwidth: 3.66265
> Max bandwidth (MB/sec): 13.668
> Min bandwidth (MB/sec): 0
> Average IOPS: 1006
> Stddev IOPS: 937
> Max IOPS: 3499
> Min IOPS: 0
> Average Latency(s): 0.0317779
> Stddev Latency(s): 0.164076
> Max latency(s): 2.27707
> Min latency(s): 0.0013848
> Cleaning up (deleting benchmark objects)
> Clean up completed and total clean up time :20.166559
>
> On Wed, Oct 18, 2017 at 8:51 AM, Maged Mokhtar <mmokh...@petasan.org> wrote:
>
> First a general comment: local RAID will be faster than Ceph for a
> single-threaded (queue depth=1) io operation test. A single-threaded Ceph
> client will see at best the same speed as a single disk for reads, and
> writes 4-6 times slower than a single disk. Not to mention the latency of
> local disks will be much better.
> Where Ceph shines is when you have many concurrent ios: it scales, whereas
> RAID speed per client decreases as you add more clients.
>
> Having said that, i would recommend running rados/rbd bench-write and measure
> 4k iops at 1 and 32 threads to get a better idea of how your cluster
> performs:
>
> ceph osd pool create testpool 256 256
> rados bench -p testpool -b 4096 30 write -t 1
> rados bench -p testpool -b 4096 30 write -t 32
> ceph osd pool delete testpool testpool --yes-i-really-really-mean-it
>
> rbd bench-write test-image --io-threads=1 --io-size 4096 --io-pattern rand
> --rbd_cache=false
> rbd bench-write test-image --io-threads=32 --io-size 4096 --io-pattern rand
> --rbd_cache=false
>
> I think the request size difference you see is because, with local disks,
> the io scheduler has more ios to re-group and so has a better chance of
> generating larger requests. Depending on your kernel, the io scheduler may
> be different for rbd (blk-mq) vs sdx (cfq), but again I would think the
> request size is a result, not a cause.
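>
> (Each device's scheduler can be checked like this, with sda and rbd0 as
> example device names:
>
> cat /sys/block/sda/queue/scheduler
> cat /sys/block/rbd0/queue/scheduler
>
> The active scheduler is shown in brackets; blk-mq devices typically report
> "none".)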
>
> Maged
>
> On 2017-10-17 23:12, Russell Glaue wrote:
>
> I am running ceph jewel on 5 nodes with SSD OSDs.
> I have an LVM image on a local RAID of spinning disks.
> I have an RBD image in a pool of SSD disks.
> Both disks are used to run an almost identical CentOS 7 system.
> Both systems were installed with the same kickstart, though the disk
> partitioning is different.
>
> I want to make writes on the ceph image faster. For example, lots of
> writes to MySQL (via MySQL replication) on a ceph SSD image are about 10x
> slower than on a spindle RAID disk image. The MySQL server on the ceph rbd
> image has a hard time keeping up in replication.
>
> So I wanted to test writes on these two systems
> I have a 10GB compressed (gzip) file on both servers.
> I simply gunzip the file on both systems, while running iostat.
>
> The primary difference I see in the results is the average size of the
> requests to the disk.
> CentOS7-lvm-raid-sata writes a lot faster to disk and its request size is
> about 40x larger, while the number of writes per second is about the same.
> This makes me want to conclude that the smaller request size on the
> CentOS7-ceph-rbd-ssd system is the cause of it being slow.
>
> How can I make the size of the request larger for ceph rbd images, so I can
> increase the write throughput?
> Would this be related to having jumbo packets enabled in my ceph storage
> network?
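>
> (One block-layer knob I have been looking at, though I am not sure it is
> the relevant one, is the per-device cap on request size, e.g. for rbd0:
>
> cat /sys/block/rbd0/queue/max_sectors_kb
>
> which limits how large a single request to the device can get.)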
>
> Here is a sample of the results:
>
> [CentOS7-lvm-raid-sata]
> $ gunzip large10gFile.gz &
> $ iostat -x vg_root-lv_var -d 5 -m -N
> Device:         rrqm/s wrqm/s    r/s    w/s  rMB/s  wMB/s avgrq-sz avgqu-sz  await r_await w_await svctm  %util
> ...
> vg_root-lv_var    0.00   0.00  30.60 452.20  13.60 222.15  1000.04     8.69  14.05    0.99   14.93  2.07 100.04
> vg_root-lv_var    0.00   0.00  88.20 182.00  39.20  89.43   974.95     4.65   9.82    0.99   14.10  3.70 100.00
> vg_root-lv_var    0.00   0.00  75.45 278.24  33.53 136.70   985.73     4.36  33.26    1.34   41.91  0.59  20.84
> vg_root-lv_var    0.00   0.00 111.60 181.80  49.60  89.34   969.84     2.60   8.87    0.81   13.81  0.13   3.90
> vg_root-lv_var    0.00   0.00  68.40 109.60  30.40  53.63   966.87     1.51   8.46    0.84   13.22  0.80  14.16
> ...
>
> [CentOS7-ceph-rbd-ssd]
> $ gunzip large10gFile.gz &
> $ iostat -x vg_root-lv_data -d 5 -m -N
> Device:         rrqm/s wrqm/s    r/s    w/s  rMB/s  wMB/s avgrq-sz avgqu-sz  await r_await w_await svctm  %util
> ...
> vg_root-lv_data   0.00   0.00  46.40 167.80   0.88   1.46    22.36     1.23   5.66    2.47    6.54  4.52  96.82
> vg_root-lv_data   0.00   0.00  16.60  55.20   0.36   0.14    14.44     0.99  13.91    9.12   15.36 13.71  98.46
> vg_root-lv_data   0.00   0.00  69.00 173.80   1.34   1.32    22.48     1.25   5.19    3.77    5.75  3.94  95.68
> vg_root-lv_data   0.00   0.00  74.40 293.40   1.37   1.47    15.83     1.22   3.31    2.06    3.63  2.54  93.26
> vg_root-lv_data   0.00   0.00  90.80 359.00   1.96   3.41    24.45     1.63   3.63    1.94    4.05  2.10  94.38
> ...
>
> [iostat key]
> w/s == The number (after merges) of write requests completed per second
> for the device.
> wMB/s == The number of megabytes written to the device per second (iostat
> was run with -m).
> avgrq-sz == The average size (in 512-byte sectors) of the requests that
> were issued to the device.
> avgqu-sz == The average queue length of the requests that were issued to
> the device.
>
Links:
------
[1]
https://forum.proxmox.com/threads/ceph-bad-performance-in-qemu-guests.21551/
[2] https://inside.servers.com/ssd-performance-2017-c4307a92dea
[3]
https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com