Hi Alexandre,

I agree with your rationale of one iothread per disk; CPU time spent in iowait
is quite high in each VM. But I have not found a way to set this on a Nova
instance. I am using OpenStack Juno with QEMU+KVM. According to the libvirt
documentation for iothreads, I could edit domain.xml directly and achieve the
same effect. However, in an OpenStack environment the domain XML is generated
by Nova with some additional metadata, so editing it with 'virsh edit' does
not work (I agree it is a hack rather than a very cloud way of doing things).
Changes made there vanish after saving, because libvirt validation fails on
the edited XML:

#virsh dumpxml instance-000000c5 > vm.xml
#virt-xml-validate vm.xml
Relax-NG validity error : Extra element cpu in interleave
vm.xml:1: element domain: Relax-NG validity error : Element domain
failed to validate content
vm.xml fails to validate
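
For reference, this is a minimal sketch of what I was trying to inject,
following the libvirt documentation (the iothread count and the
iothread-to-disk mapping here are illustrative, not what Nova generates):

<domain type='kvm'>
  <iothreads>2</iothreads>
  ...
  <devices>
    <disk type='network' device='disk'>
      <!-- pin this virtio-blk disk to iothread 1 (needs libvirt >= 1.2.8) -->
      <driver name='qemu' type='raw' cache='none' iothread='1'/>
      ...
    </disk>
  </devices>
</domain>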

The second approach I took was setting QoS on volume types. But there is no
option to set iothreads per volume; the available parameters are related to
max read/write ops and bytes.
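
For the record, this is roughly what volume-type QoS looks like (the spec name
and limits are examples only; these keys throttle ops/bytes and do not touch
iothreads):

#cinder qos-create iops-limit consumer="front-end" read_iops_sec=20000 write_iops_sec=20000
#cinder qos-associate <qos-spec-id> <volume-type-id>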

Thirdly, editing the Nova flavor and providing extra specs like
hw:cpu_sockets/cores/threads can change the guest CPU topology, but again
there is no way to set iothreads. It does accept hw_disk_iothreads (with no
type check in place, I believe), but does not pass it through into domain.xml.
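
For example (the flavor name and values are illustrative; the CPU topology
keys take effect, while the iothreads key is accepted but never rendered into
the generated domain.xml):

#nova flavor-key m1.large set hw:cpu_sockets=2 hw:cpu_cores=2 hw:cpu_threads=1
#nova flavor-key m1.large set hw_disk_iothreads=4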

Could you suggest a way to set this?

-Pushpesh

On Wed, Jun 10, 2015 at 12:59 PM, Alexandre DERUMIER
<aderum...@odiso.com> wrote:
>>>I need to try out the performance on qemu soon and may come back to you if I 
>>>need some qemu setting trick :-)
>
> Sure no problem.
>
> (BTW, I can reach around 200k iops in 1 qemu vm with 5 virtio disks and 1
> iothread per disk)
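>
> For reference, a minimal sketch of how that is wired up on the qemu command
> line (the drive ids and rbd image names here are illustrative):
>
> # one iothread object per virtio-blk disk (qemu >= 2.1)
> qemu-system-x86_64 ... \
>   -object iothread,id=iothread1 \
>   -drive file=rbd:rbd/vm-disk1,if=none,id=drive1,cache=none \
>   -device virtio-blk-pci,drive=drive1,iothread=iothread1 \
>   -object iothread,id=iothread2 \
>   -drive file=rbd:rbd/vm-disk2,if=none,id=drive2,cache=none \
>   -device virtio-blk-pci,drive=drive2,iothread=iothread2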
>
>
> ----- Original Message -----
> From: "Somnath Roy" <somnath....@sandisk.com>
> To: "aderumier" <aderum...@odiso.com>, "Irek Fasikhov" <malm...@gmail.com>
> Cc: "ceph-devel" <ceph-de...@vger.kernel.org>, "pushpesh sharma" 
> <pushpesh....@gmail.com>, "ceph-users" <ceph-users@lists.ceph.com>
> Sent: Wednesday, 10 June 2015 09:06:32
> Subject: RE: rbd_cache, limiting read on high iops around 40k
>
> Hi Alexandre,
> Thanks for sharing the data.
> I need to try out the performance on qemu soon and may come back to you if I 
> need some qemu setting trick :-)
>
> Regards
> Somnath
>
> -----Original Message-----
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
> Alexandre DERUMIER
> Sent: Tuesday, June 09, 2015 10:42 PM
> To: Irek Fasikhov
> Cc: ceph-devel; pushpesh sharma; ceph-users
> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k
>
>>>Very good work!
>>>Do you have a rpm-file?
>>>Thanks.
> No, sorry, I compiled it manually (and I'm using Debian Jessie as the
> client).
>
>
>
> ----- Original Message -----
> From: "Irek Fasikhov" <malm...@gmail.com>
> To: "aderumier" <aderum...@odiso.com>
> Cc: "Robert LeBlanc" <rob...@leblancnet.us>, "ceph-devel" 
> <ceph-de...@vger.kernel.org>, "pushpesh sharma" <pushpesh....@gmail.com>, 
> "ceph-users" <ceph-users@lists.ceph.com>
> Sent: Wednesday, 10 June 2015 07:21:42
> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k
>
> Hi, Alexandre.
>
> Very good work!
> Do you have a rpm-file?
> Thanks.
>
> 2015-06-10 7:10 GMT+03:00 Alexandre DERUMIER < aderum...@odiso.com > :
>
>
> Hi,
>
> I have tested qemu with the latest tcmalloc (2.4), and the improvement with
> an iothread is huge: 50k iops (+45%)!
>
>
>
> qemu : no iothread : glibc : iops=33395
> qemu : no-iothread : tcmalloc (2.2.1) : iops=34516 (+3%)
> qemu : no-iothread : jemalloc : iops=42226 (+26%)
> qemu : no-iothread : tcmalloc (2.4) : iops=35974 (+7%)
>
>
> qemu : iothread : glibc : iops=34516
> qemu : iothread : tcmalloc : iops=38676 (+12%)
> qemu : iothread : jemalloc : iops=28023 (-19%)
> qemu : iothread : tcmalloc (2.4) : iops=50276 (+45%)
>
>
>
>
>
> qemu : iothread : tcmalloc (2.4) : iops=50276 (+45%)
> ------------------------------------------------------
> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, 
> ioengine=libaio, iodepth=32
> fio-2.1.11
> Starting 1 process
> Jobs: 1 (f=1): [r(1)] [100.0% done] [214.7MB/0KB/0KB /s] [54.1K/0/0 iops] 
> [eta 00m:00s]
> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=894: Wed Jun 10 05:54:24 2015
> read : io=5120.0MB, bw=201108KB/s, iops=50276, runt= 26070msec
> slat (usec): min=1, max=1136, avg= 3.54, stdev= 3.58
> clat (usec): min=128, max=6262, avg=631.41, stdev=197.71
> lat (usec): min=149, max=6265, avg=635.27, stdev=197.40
> clat percentiles (usec):
> | 1.00th=[ 318], 5.00th=[ 378], 10.00th=[ 418], 20.00th=[ 474],
> | 30.00th=[ 516], 40.00th=[ 564], 50.00th=[ 612], 60.00th=[ 652],
> | 70.00th=[ 700], 80.00th=[ 756], 90.00th=[ 860], 95.00th=[ 980],
> | 99.00th=[ 1272], 99.50th=[ 1384], 99.90th=[ 1688], 99.95th=[ 1896],
> | 99.99th=[ 3760]
> bw (KB /s): min=145608, max=249688, per=100.00%, avg=201108.00, stdev=21718.87
> lat (usec) : 250=0.04%, 500=25.84%, 750=53.00%, 1000=16.63%
> lat (msec) : 2=4.46%, 4=0.03%, 10=0.01%
> cpu : usr=9.73%, sys=24.93%, ctx=66417, majf=0, minf=38
> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
> latency : target=0, window=0, percentile=100.00%, depth=32
>
> Run status group 0 (all jobs):
> READ: io=5120.0MB, aggrb=201107KB/s, minb=201107KB/s, maxb=201107KB/s, 
> mint=26070msec, maxt=26070msec
>
> Disk stats (read/write):
> vdb: ios=1302555/0, merge=0/0, ticks=715176/0, in_queue=714840, util=99.73%
>
>
>
>
>
>
> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, 
> ioengine=libaio, iodepth=32
> fio-2.1.11
> Starting 1 process
> Jobs: 1 (f=1): [r(1)] [100.0% done] [158.7MB/0KB/0KB /s] [40.6K/0/0 iops] 
> [eta 00m:00s]
> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=889: Wed Jun 10 06:05:06 2015
> read : io=5120.0MB, bw=143897KB/s, iops=35974, runt= 36435msec
> slat (usec): min=1, max=710, avg= 3.31, stdev= 3.35
> clat (usec): min=191, max=4740, avg=884.66, stdev=315.65
> lat (usec): min=289, max=4743, avg=888.31, stdev=315.51
> clat percentiles (usec):
> | 1.00th=[ 462], 5.00th=[ 516], 10.00th=[ 548], 20.00th=[ 596],
> | 30.00th=[ 652], 40.00th=[ 764], 50.00th=[ 868], 60.00th=[ 940],
> | 70.00th=[ 1004], 80.00th=[ 1096], 90.00th=[ 1256], 95.00th=[ 1416],
> | 99.00th=[ 2024], 99.50th=[ 2224], 99.90th=[ 2544], 99.95th=[ 2640],
> | 99.99th=[ 3632]
> bw (KB /s): min=98352, max=177328, per=99.91%, avg=143772.11, stdev=21782.39
> lat (usec) : 250=0.01%, 500=3.48%, 750=35.69%, 1000=30.01%
> lat (msec) : 2=29.74%, 4=1.07%, 10=0.01%
> cpu : usr=7.10%, sys=16.90%, ctx=54855, majf=0, minf=38
> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
> latency : target=0, window=0, percentile=100.00%, depth=32
>
> Run status group 0 (all jobs):
> READ: io=5120.0MB, aggrb=143896KB/s, minb=143896KB/s, maxb=143896KB/s, 
> mint=36435msec, maxt=36435msec
>
> Disk stats (read/write):
> vdb: ios=1301357/0, merge=0/0, ticks=1033036/0, in_queue=1032716, util=99.85%
>
>
> ----- Original Message -----
> From: "aderumier" <aderum...@odiso.com>
> To: "Robert LeBlanc" <rob...@leblancnet.us>
> Cc: "Mark Nelson" <mnel...@redhat.com>, "ceph-devel" 
> <ceph-de...@vger.kernel.org>, "pushpesh sharma" <pushpesh....@gmail.com>, 
> "ceph-users" <ceph-users@lists.ceph.com>
> Sent: Tuesday, 9 June 2015 18:47:27
> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k
>
> Hi Robert,
>
>>>What I found was that Ceph OSDs performed well with either tcmalloc or
>>>jemalloc (except when RocksDB was built with jemalloc instead of
>>>tcmalloc, I'm still working to dig into why that might be the case).
> Yes, from my tests, for the OSD tcmalloc is a little faster (but only very
> little) than jemalloc.
>
>
>
>>>However, I found that tcmalloc with QEMU/KVM was very detrimental to
>>>small I/O, but provided huge gains in I/O >=1MB. Jemalloc was much
>>>better for QEMU/KVM in the tests that we ran. [1]
>
>
> I have just run a qemu test (4k randread, rbd_cache=off); I don't see a
> speed regression with tcmalloc.
> With a qemu iothread, tcmalloc shows a speed increase over glibc.
> With a qemu iothread, jemalloc shows a speed decrease.
>
> Without an iothread, jemalloc shows a big speed increase.
>
> This is with:
> - qemu 2.3
> - tcmalloc 2.2.1
> - jemalloc 3.6
> - libc6 2.19
>
>
> qemu : no iothread : glibc : iops=33395
> qemu : no-iothread : tcmalloc : iops=34516 (+3%)
> qemu : no-iothread : jemalloc : iops=42226 (+26%)
>
> qemu : iothread : glibc : iops=34516
> qemu : iothread : tcmalloc : iops=38676 (+12%)
> qemu : iothread : jemalloc : iops=28023 (-19%)
>
>
> (The benefit of iothreads is that we can scale with more disks in 1vm)
>
>
> fio results:
> ------------
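>
> All runs below use the same fio job inside the guest; here is a minimal
> sketch of the job file, reconstructed from the output (the test disk is
> assumed to be /dev/vdb):
>
> [rbd_iodepth32-test]
> ioengine=libaio
> filename=/dev/vdb
> direct=1
> rw=randread
> bs=4k
> iodepth=32
> size=5G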
>
> qemu : iothread : tcmalloc : iops=38676
> -----------------------------------------
> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, 
> ioengine=libaio, iodepth=32
> fio-2.1.11
> Starting 1 process
> Jobs: 1 (f=0): [r(1)] [100.0% done] [123.5MB/0KB/0KB /s] [31.6K/0/0 iops] 
> [eta 00m:00s]
> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=1265: Tue Jun 9 18:16:53 
> 2015
> read : io=5120.0MB, bw=154707KB/s, iops=38676, runt= 33889msec
> slat (usec): min=1, max=715, avg= 3.63, stdev= 3.42
> clat (usec): min=152, max=5736, avg=822.12, stdev=289.34
> lat (usec): min=231, max=5740, avg=826.10, stdev=289.08
> clat percentiles (usec):
> | 1.00th=[ 402], 5.00th=[ 466], 10.00th=[ 510], 20.00th=[ 572],
> | 30.00th=[ 636], 40.00th=[ 716], 50.00th=[ 780], 60.00th=[ 852],
> | 70.00th=[ 932], 80.00th=[ 1020], 90.00th=[ 1160], 95.00th=[ 1352],
> | 99.00th=[ 1800], 99.50th=[ 1944], 99.90th=[ 2256], 99.95th=[ 2448],
> | 99.99th=[ 3888]
> bw (KB /s): min=123888, max=198584, per=100.00%, avg=154824.40, stdev=16978.03
> lat (usec) : 250=0.01%, 500=8.91%, 750=36.44%, 1000=32.63%
> lat (msec) : 2=21.65%, 4=0.37%, 10=0.01%
> cpu : usr=8.29%, sys=19.76%, ctx=55882, majf=0, minf=39
> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
> latency : target=0, window=0, percentile=100.00%, depth=32
>
> Run status group 0 (all jobs):
> READ: io=5120.0MB, aggrb=154707KB/s, minb=154707KB/s, maxb=154707KB/s, 
> mint=33889msec, maxt=33889msec
>
> Disk stats (read/write):
> vdb: ios=1302739/0, merge=0/0, ticks=934444/0, in_queue=934096, util=99.77%
>
>
>
> qemu : no-iothread : tcmalloc : iops=34516
> ---------------------------------------------
> Jobs: 1 (f=1): [r(1)] [100.0% done] [163.2MB/0KB/0KB /s] [41.8K/0/0 iops] 
> [eta 00m:00s]
> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=896: Tue Jun 9 18:19:08 
> 2015
> read : io=5120.0MB, bw=138065KB/s, iops=34516, runt= 37974msec
> slat (usec): min=1, max=708, avg= 3.98, stdev= 3.57
> clat (usec): min=208, max=11858, avg=921.43, stdev=333.61
> lat (usec): min=266, max=11862, avg=925.77, stdev=333.40
> clat percentiles (usec):
> | 1.00th=[ 434], 5.00th=[ 510], 10.00th=[ 564], 20.00th=[ 652],
> | 30.00th=[ 732], 40.00th=[ 812], 50.00th=[ 876], 60.00th=[ 940],
> | 70.00th=[ 1020], 80.00th=[ 1112], 90.00th=[ 1320], 95.00th=[ 1576],
> | 99.00th=[ 1992], 99.50th=[ 2128], 99.90th=[ 2736], 99.95th=[ 3248],
> | 99.99th=[ 4320]
> bw (KB /s): min=77312, max=185576, per=99.74%, avg=137709.88, stdev=16883.77
> lat (usec) : 250=0.01%, 500=4.36%, 750=27.61%, 1000=35.60%
> lat (msec) : 2=31.49%, 4=0.92%, 10=0.02%, 20=0.01%
> cpu : usr=7.19%, sys=19.52%, ctx=55903, majf=0, minf=38
> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
> latency : target=0, window=0, percentile=100.00%, depth=32
>
> Run status group 0 (all jobs):
> READ: io=5120.0MB, aggrb=138064KB/s, minb=138064KB/s, maxb=138064KB/s, 
> mint=37974msec, maxt=37974msec
>
> Disk stats (read/write):
> vdb: ios=1309902/0, merge=0/0, ticks=1068768/0, in_queue=1068396, util=99.86%
>
>
>
> qemu : iothread : glibc : iops=34516
> -------------------------------------
>
> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, 
> ioengine=libaio, iodepth=32
> fio-2.1.11
> Starting 1 process
> Jobs: 1 (f=1): [r(1)] [100.0% done] [133.4MB/0KB/0KB /s] [34.2K/0/0 iops] 
> [eta 00m:00s]
> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=876: Tue Jun 9 18:24:01 
> 2015
> read : io=5120.0MB, bw=137786KB/s, iops=34446, runt= 38051msec
> slat (usec): min=1, max=496, avg= 3.88, stdev= 3.66
> clat (usec): min=283, max=7515, avg=923.34, stdev=300.28
> lat (usec): min=286, max=7519, avg=927.58, stdev=300.02
> clat percentiles (usec):
> | 1.00th=[ 506], 5.00th=[ 564], 10.00th=[ 596], 20.00th=[ 652],
> | 30.00th=[ 724], 40.00th=[ 804], 50.00th=[ 884], 60.00th=[ 964],
> | 70.00th=[ 1048], 80.00th=[ 1144], 90.00th=[ 1304], 95.00th=[ 1448],
> | 99.00th=[ 1896], 99.50th=[ 2096], 99.90th=[ 2480], 99.95th=[ 2640],
> | 99.99th=[ 3984]
> bw (KB /s): min=102680, max=171112, per=100.00%, avg=137877.78, stdev=15521.30
> lat (usec) : 500=0.84%, 750=32.97%, 1000=30.82%
> lat (msec) : 2=34.65%, 4=0.71%, 10=0.01%
> cpu : usr=7.42%, sys=19.47%, ctx=52455, majf=0, minf=38
> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
> latency : target=0, window=0, percentile=100.00%, depth=32
>
> Run status group 0 (all jobs):
> READ: io=5120.0MB, aggrb=137785KB/s, minb=137785KB/s, maxb=137785KB/s, 
> mint=38051msec, maxt=38051msec
>
> Disk stats (read/write):
> vdb: ios=1307426/0, merge=0/0, ticks=1051416/0, in_queue=1050972, util=99.85%
>
>
>
> qemu : no iothread : glibc : iops=33395
> -----------------------------------------
> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, 
> ioengine=libaio, iodepth=32
> fio-2.1.11
> Starting 1 process
> Jobs: 1 (f=1): [r(1)] [100.0% done] [125.4MB/0KB/0KB /s] [32.9K/0/0 iops] 
> [eta 00m:00s]
> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=886: Tue Jun 9 18:27:18 
> 2015
> read : io=5120.0MB, bw=133583KB/s, iops=33395, runt= 39248msec
> slat (usec): min=1, max=1054, avg= 3.86, stdev= 4.29
> clat (usec): min=139, max=12635, avg=952.85, stdev=335.51
> lat (usec): min=303, max=12638, avg=957.01, stdev=335.29
> clat percentiles (usec):
> | 1.00th=[ 516], 5.00th=[ 564], 10.00th=[ 596], 20.00th=[ 652],
> | 30.00th=[ 724], 40.00th=[ 820], 50.00th=[ 924], 60.00th=[ 996],
> | 70.00th=[ 1080], 80.00th=[ 1176], 90.00th=[ 1336], 95.00th=[ 1528],
> | 99.00th=[ 2096], 99.50th=[ 2320], 99.90th=[ 2672], 99.95th=[ 2928],
> | 99.99th=[ 4832]
> bw (KB /s): min=98136, max=171624, per=100.00%, avg=133682.64, stdev=19121.91
> lat (usec) : 250=0.01%, 500=0.57%, 750=32.57%, 1000=26.98%
> lat (msec) : 2=38.59%, 4=1.28%, 10=0.01%, 20=0.01%
> cpu : usr=9.24%, sys=15.92%, ctx=51219, majf=0, minf=38
> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
> latency : target=0, window=0, percentile=100.00%, depth=32
>
> Run status group 0 (all jobs):
> READ: io=5120.0MB, aggrb=133583KB/s, minb=133583KB/s, maxb=133583KB/s, 
> mint=39248msec, maxt=39248msec
>
> Disk stats (read/write):
> vdb: ios=1304526/0, merge=0/0, ticks=1075020/0, in_queue=1074536, util=99.84%
>
>
>
> qemu : iothread : jemalloc : iops=28023
> ----------------------------------------
> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, 
> ioengine=libaio, iodepth=32
> fio-2.1.11
> Starting 1 process
> Jobs: 1 (f=1): [r(1)] [97.9% done] [155.2MB/0KB/0KB /s] [39.1K/0/0 iops] [eta 
> 00m:01s]
> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=899: Tue Jun 9 18:30:26 
> 2015
> read : io=5120.0MB, bw=112094KB/s, iops=28023, runt= 46772msec
> slat (usec): min=1, max=467, avg= 4.33, stdev= 4.77
> clat (usec): min=253, max=11307, avg=1135.63, stdev=346.55
> lat (usec): min=256, max=11309, avg=1140.39, stdev=346.22
> clat percentiles (usec):
> | 1.00th=[ 510], 5.00th=[ 628], 10.00th=[ 700], 20.00th=[ 820],
> | 30.00th=[ 924], 40.00th=[ 1032], 50.00th=[ 1128], 60.00th=[ 1224],
> | 70.00th=[ 1320], 80.00th=[ 1416], 90.00th=[ 1560], 95.00th=[ 1688],
> | 99.00th=[ 2096], 99.50th=[ 2224], 99.90th=[ 2544], 99.95th=[ 2832],
> | 99.99th=[ 3760]
> bw (KB /s): min=91792, max=174416, per=99.90%, avg=111985.27, stdev=17381.70
> lat (usec) : 500=0.80%, 750=13.10%, 1000=23.33%
> lat (msec) : 2=61.30%, 4=1.46%, 10=0.01%, 20=0.01%
> cpu : usr=7.12%, sys=17.43%, ctx=54507, majf=0, minf=38
> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
> latency : target=0, window=0, percentile=100.00%, depth=32
>
> Run status group 0 (all jobs):
> READ: io=5120.0MB, aggrb=112094KB/s, minb=112094KB/s, maxb=112094KB/s, 
> mint=46772msec, maxt=46772msec
>
> Disk stats (read/write):
> vdb: ios=1309169/0, merge=0/0, ticks=1305796/0, in_queue=1305376, util=98.68%
>
>
>
> qemu : no-iothread : jemalloc : iops=42226
> --------------------------------------------
> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, 
> ioengine=libaio, iodepth=32
> fio-2.1.11
> Starting 1 process
> Jobs: 1 (f=1): [r(1)] [100.0% done] [171.2MB/0KB/0KB /s] [43.9K/0/0 iops] 
> [eta 00m:00s]
> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=892: Tue Jun 9 18:34:11 
> 2015
> read : io=5120.0MB, bw=177130KB/s, iops=44282, runt= 29599msec
> slat (usec): min=1, max=527, avg= 3.80, stdev= 3.74
> clat (usec): min=174, max=3841, avg=717.08, stdev=237.53
> lat (usec): min=210, max=3844, avg=721.23, stdev=237.22
> clat percentiles (usec):
> | 1.00th=[ 354], 5.00th=[ 422], 10.00th=[ 462], 20.00th=[ 516],
> | 30.00th=[ 572], 40.00th=[ 628], 50.00th=[ 684], 60.00th=[ 740],
> | 70.00th=[ 804], 80.00th=[ 884], 90.00th=[ 1004], 95.00th=[ 1128],
> | 99.00th=[ 1544], 99.50th=[ 1672], 99.90th=[ 1928], 99.95th=[ 2064],
> | 99.99th=[ 2608]
> bw (KB /s): min=138120, max=230816, per=100.00%, avg=177192.14, stdev=23440.79
> lat (usec) : 250=0.01%, 500=16.24%, 750=45.93%, 1000=27.46%
> lat (msec) : 2=10.30%, 4=0.07%
> cpu : usr=10.14%, sys=23.84%, ctx=60938, majf=0, minf=39
> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
> issued : total=r=1310720/w=0/d=0, short=r=0/w=0/d=0
> latency : target=0, window=0, percentile=100.00%, depth=32
>
> Run status group 0 (all jobs):
> READ: io=5120.0MB, aggrb=177130KB/s, minb=177130KB/s, maxb=177130KB/s, 
> mint=29599msec, maxt=29599msec
>
> Disk stats (read/write):
> vdb: ios=1303992/0, merge=0/0, ticks=798008/0, in_queue=797636, util=99.80%
>
>
>
> ----- Original Message -----
> From: "Robert LeBlanc" <rob...@leblancnet.us>
> To: "aderumier" <aderum...@odiso.com>
> Cc: "Mark Nelson" <mnel...@redhat.com>, "ceph-devel" 
> <ceph-de...@vger.kernel.org>, "pushpesh sharma" <pushpesh....@gmail.com>, 
> "ceph-users" <ceph-users@lists.ceph.com>
> Sent: Tuesday, 9 June 2015 18:00:29
> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k
>
> I also saw a similar performance increase by using alternative memory
> allocators. What I found was that Ceph OSDs performed well with either
> tcmalloc or jemalloc (except when RocksDB was built with jemalloc
> instead of tcmalloc; I'm still working to dig into why that might be
> the case).
>
> However, I found that tcmalloc with QEMU/KVM was very detrimental to
> small I/O, but provided huge gains in I/O >=1MB. Jemalloc was much
> better for QEMU/KVM in the tests that we ran. [1]
>
> I'm currently looking into I/O bottlenecks around the 16KB range, and I'm
> seeing a lot of time spent in thread creation and destruction; the memory
> allocators are quite a bit down the list (both with fio's rbd ioengine and
> on the OSDs). I wonder what the difference can be. I've tried using the
> async messenger, but there wasn't a huge difference. [2]
>
> Further down the rabbit hole....
>
> [1] https://www.mail-archive.com/ceph-users@lists.ceph.com/msg20197.html
> [2] https://www.mail-archive.com/ceph-devel@vger.kernel.org/msg23982.html
> ----------------
> Robert LeBlanc
> GPG Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
>
>
> On Tue, Jun 9, 2015 at 6:02 AM, Alexandre DERUMIER < aderum...@odiso.com > 
> wrote:
>>>>Frankly, I'm a little impressed that without RBD cache we can hit 80K
>>>>IOPS from 1 VM!
>>
>> Note that these results are not from inside a VM (fio-rbd on the host), so
>> in a VM we'll have additional overhead.
>> (I'm planning to send qemu results soon)
>>
>>>>How fast are the SSDs in those 3 OSDs?
>>
>> These results are with data in the buffer memory of the OSD nodes.
>>
>> When reading entirely from SSD (Intel S3500),
>>
>> for 1 client,
>>
>> I'm around 33k iops without cache and 32k iops with cache, with 1 osd.
>> I'm around 55k iops without cache and 38k iops with cache, with 3 osds.
>>
>> With multiple client jobs, I can reach around 70k iops per osd, and 250k
>> iops per osd when data is in the buffer cache.
>>
>> (server/client CPUs are 2x 10-core 3.1GHz Xeon E5)
>>
>>
>>
>> Small tip:
>> I'm using tcmalloc with fio-rbd and rados bench to improve latencies by
>> around 20%:
>>
>> LD_PRELOAD=/usr/lib/libtcmalloc_minimal.so.4 fio ...
>> LD_PRELOAD=/usr/lib/libtcmalloc_minimal.so.4 rados bench ...
>>
>> since a lot of time is spent in malloc/free.
>>
>>
>> (qemu has also supported tcmalloc for some months now; I'll benchmark it
>> too: https://lists.gnu.org/archive/html/qemu-devel/2015-03/msg05372.html )
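>>
>> For anyone building it themselves: assuming a qemu release that includes
>> the patch linked above (2.3 or later), tcmalloc is enabled at build time:
>>
>> ./configure --enable-tcmalloc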
>>
>>
>>
>> I'll try to send full bench results soon, from 1 to 18 ssd osd.
>>
>>
>>
>>
>> ----- Original Message -----
>> From: "Mark Nelson" <mnel...@redhat.com>
>> To: "aderumier" <aderum...@odiso.com>, "pushpesh sharma" 
>> <pushpesh....@gmail.com>
>> Cc: "ceph-devel" <ceph-de...@vger.kernel.org>, "ceph-users" 
>> <ceph-users@lists.ceph.com>
>> Sent: Tuesday, 9 June 2015 13:36:31
>> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k
>>
>> Hi All,
>>
>> In the past we've hit some performance issues with RBD cache that we've
>> fixed, but we've never really tried pushing a single VM beyond 40+K read
>> IOPS in testing (or at least I never have). I suspect there's a couple
>> of possibilities as to why it might be slower, but perhaps joshd can
>> chime in as he's more familiar with what that code looks like.
>>
>> Frankly, I'm a little impressed that without RBD cache we can hit 80K
>> IOPS from 1 VM! How fast are the SSDs in those 3 OSDs?
>>
>> Mark
>>
>> On 06/09/2015 03:36 AM, Alexandre DERUMIER wrote:
>>> It seems that the limit mainly appears at high queue depths (roughly > 16).
>>>
>>> Here are the results in iops with 1 client, 4k randread, 3 osds, at
>>> different queue depths.
>>> rbd_cache is almost the same as no cache at queue depths < 16.
>>>
>>>
>>> cache
>>> -----
>>> qd1: 1651
>>> qd2: 3482
>>> qd4: 7958
>>> qd8: 17912
>>> qd16: 36020
>>> qd32: 42765
>>> qd64: 46169
>>>
>>> no cache
>>> --------
>>> qd1: 1748
>>> qd2: 3570
>>> qd4: 8356
>>> qd8: 17732
>>> qd16: 41396
>>> qd32: 78633
>>> qd64: 79063
>>> qd128: 79550
>>>
>>>
>>> ----- Original Message -----
>>> From: "aderumier" <aderum...@odiso.com>
>>> To: "pushpesh sharma" <pushpesh....@gmail.com>
>>> Cc: "ceph-devel" <ceph-de...@vger.kernel.org>, "ceph-users" 
>>> <ceph-users@lists.ceph.com>
>>> Sent: Tuesday, 9 June 2015 09:28:21
>>> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k
>>>
>>> Hi,
>>>
>>>>> We tried adding more RBDs to single VM, but no luck.
>>>
>>> If you want to scale with more disks in a single qemu VM, you need to use
>>> the qemu iothread feature and assign 1 iothread per disk (works with
>>> virtio-blk).
>>> It's working for me; I can scale by adding more disks.
>>>
>>>
>>> My benchmarks here are done with fio-rbd on the host.
>>> I can scale up to 400k iops with 10 clients and rbd_cache=off on a single
>>> host, and around 250k iops with 10 clients and rbd_cache=on.
>>>
>>>
>>> I just wonder why I don't see a performance decrease around 30k iops with
>>> 1 osd.
>>>
>>> I'm going to see if this tracker
>>> http://tracker.ceph.com/issues/11056
>>>
>>> could be the cause.
>>>
>>> (My master build was done some weeks ago.)
>>>
>>>
>>>
>>> ----- Original Message -----
>>> From: "pushpesh sharma" <pushpesh....@gmail.com>
>>> To: "aderumier" <aderum...@odiso.com>
>>> Cc: "ceph-devel" <ceph-de...@vger.kernel.org>, "ceph-users" 
>>> <ceph-users@lists.ceph.com>
>>> Sent: Tuesday, 9 June 2015 09:21:04
>>> Subject: Re: rbd_cache, limiting read on high iops around 40k
>>>
>>> Hi Alexandre,
>>>
>>> We have also seen something very similar on Hammer (0.94-1). We were
>>> benchmarking VMs hosted on a hypervisor (QEMU-KVM, OpenStack Juno). Each
>>> Ubuntu VM has an RBD as its root disk and 1 RBD as additional storage. For
>>> some strange reason, 4K random-read iops would not scale beyond 35-40k per
>>> VM. We tried adding more RBDs to a single VM, but no luck. However,
>>> increasing the number of VMs to 4 on a single hypervisor did scale to some
>>> extent; beyond that, there was not much benefit from adding more VMs.
>>>
>>> Here is the trend we have seen (the x-axis is the number of hypervisors;
>>> each hypervisor has 4 VMs, each VM has 1 RBD):
>>>
>>> [inline chart not preserved in the archive]
>>>
>>> VDbench was used as the benchmarking tool. We were not saturating the
>>> network or the CPUs on the OSD nodes. We were also unable to saturate the
>>> CPUs on the hypervisors, which is where we suspected some throttling
>>> effect. However, we have not set any such limits on the Nova or KVM side.
>>> We tried some CPU pinning and other KVM-related tuning as well, but no
>>> luck.
>>>
>>> We tried the same experiment on bare metal. There, 4K random-read IOPs
>>> scaled from 40K (1 RBD) to 180K (4 RBDs). But beyond that point, rather
>>> than scaling further, the numbers actually degraded (single pipe, more
>>> congestion).
>>>
>>> We never suspected that enabling the rbd cache could be detrimental to
>>> performance. It would be nice to root-cause the problem if that is the
>>> case.
>>>
>>> On Tue, Jun 9, 2015 at 11:21 AM, Alexandre DERUMIER < aderum...@odiso.com > 
>>> wrote:
>>>
>>>
>>> Hi,
>>>
>>> I'm benchmarking (ceph master branch) with 4k randread at qdepth=32, and
>>> rbd_cache=true seems to limit the iops to around 40k.
>>>
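>>> A minimal sketch of the fio job used here (the pool and image names are
>>> illustrative; rbd_cache itself is toggled in the [client] section of
>>> ceph.conf):
>>>
>>> [rbd_iodepth32-test]
>>> ioengine=rbd
>>> clientname=admin
>>> pool=rbd
>>> rbdname=test-image
>>> rw=randread
>>> bs=4k
>>> iodepth=32
>>>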
>>>
>>> no cache
>>> --------
>>> 1 client - rbd_cache=false - 1osd : 38300 iops
>>> 1 client - rbd_cache=false - 2osd : 69073 iops
>>> 1 client - rbd_cache=false - 3osd : 78292 iops
>>>
>>>
>>> cache
>>> -----
>>> 1 client - rbd_cache=true - 1osd : 38100 iops
>>> 1 client - rbd_cache=true - 2osd : 42457 iops
>>> 1 client - rbd_cache=true - 3osd : 45823 iops
>>>
>>>
>>>
>>> Is this expected?
>>>
>>>
>>>
>>> fio result rbd_cache=false 3 osd
>>> --------------------------------
>>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, 
>>> iodepth=32
>>> fio-2.1.11
>>> Starting 1 process
>>> rbd engine: RBD version: 0.1.9
>>> Jobs: 1 (f=1): [r(1)] [100.0% done] [307.5MB/0KB/0KB /s] [78.8K/0/0 iops] 
>>> [eta 00m:00s]
>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=113548: Tue Jun 9 
>>> 07:48:42 2015
>>> read : io=10000MB, bw=313169KB/s, iops=78292, runt= 32698msec
>>> slat (usec): min=5, max=530, avg=11.77, stdev= 6.77
>>> clat (usec): min=70, max=2240, avg=336.08, stdev=94.82
>>> lat (usec): min=101, max=2247, avg=347.84, stdev=95.49
>>> clat percentiles (usec):
>>> | 1.00th=[ 173], 5.00th=[ 209], 10.00th=[ 231], 20.00th=[ 262],
>>> | 30.00th=[ 282], 40.00th=[ 302], 50.00th=[ 322], 60.00th=[ 346],
>>> | 70.00th=[ 370], 80.00th=[ 402], 90.00th=[ 454], 95.00th=[ 506],
>>> | 99.00th=[ 628], 99.50th=[ 692], 99.90th=[ 860], 99.95th=[ 948],
>>> | 99.99th=[ 1176]
>>> bw (KB /s): min=238856, max=360448, per=100.00%, avg=313402.34, 
>>> stdev=25196.21
>>> lat (usec) : 100=0.01%, 250=15.94%, 500=78.60%, 750=5.19%, 1000=0.23%
>>> lat (msec) : 2=0.03%, 4=0.01%
>>> cpu : usr=74.48%, sys=13.25%, ctx=703225, majf=0, minf=12452
>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.8%, 16=87.0%, 32=12.1%, >=64=0.0%
>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>> complete : 0=0.0%, 4=91.6%, 8=3.4%, 16=4.5%, 32=0.4%, 64=0.0%, >=64=0.0%
>>> issued : total=r=2560000/w=0/d=0, short=r=0/w=0/d=0
>>> latency : target=0, window=0, percentile=100.00%, depth=32
>>>
>>> Run status group 0 (all jobs):
>>> READ: io=10000MB, aggrb=313169KB/s, minb=313169KB/s, maxb=313169KB/s, 
>>> mint=32698msec, maxt=32698msec
>>>
>>> Disk stats (read/write):
>>> dm-0: ios=0/45, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=0/24, 
>>> aggrmerge=0/21, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%
>>> sda: ios=0/24, merge=0/21, ticks=0/0, in_queue=0, util=0.00%
>>>
>>>
>>>
>>>
>>> fio result rbd_cache=true 3osd
>>> ------------------------------
>>>
>>> rbd_iodepth32-test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, 
>>> iodepth=32
>>> fio-2.1.11
>>> Starting 1 process
>>> rbd engine: RBD version: 0.1.9
>>> Jobs: 1 (f=1): [r(1)] [100.0% done] [171.6MB/0KB/0KB /s] [43.1K/0/0 iops] 
>>> [eta 00m:00s]
>>> rbd_iodepth32-test: (groupid=0, jobs=1): err= 0: pid=113389: Tue Jun 9 
>>> 07:47:30 2015
>>> read : io=10000MB, bw=183296KB/s, iops=45823, runt= 55866msec
>>> slat (usec): min=7, max=805, avg=21.26, stdev=15.84
>>> clat (usec): min=101, max=4602, avg=478.55, stdev=143.73
>>> lat (usec): min=123, max=4669, avg=499.80, stdev=146.03
>>> clat percentiles (usec):
>>> | 1.00th=[ 227], 5.00th=[ 274], 10.00th=[ 306], 20.00th=[ 350],
>>> | 30.00th=[ 390], 40.00th=[ 430], 50.00th=[ 470], 60.00th=[ 506],
>>> | 70.00th=[ 548], 80.00th=[ 596], 90.00th=[ 660], 95.00th=[ 724],
>>> | 99.00th=[ 844], 99.50th=[ 908], 99.90th=[ 1112], 99.95th=[ 1288],
>>> | 99.99th=[ 2192]
>>> bw (KB /s): min=115280, max=204416, per=100.00%, avg=183315.10, 
>>> stdev=15079.93
>>> lat (usec) : 250=2.42%, 500=55.61%, 750=38.48%, 1000=3.28%
>>> lat (msec) : 2=0.19%, 4=0.01%, 10=0.01%
>>> cpu : usr=60.27%, sys=12.01%, ctx=2995393, majf=0, minf=14100
>>> IO depths : 1=0.1%, 2=0.1%, 4=0.2%, 8=13.5%, 16=81.0%, 32=5.3%, >=64=0.0%
>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>> complete : 0=0.0%, 4=95.0%, 8=0.1%, 16=1.0%, 32=4.0%, 64=0.0%, >=64=0.0%
>>> issued : total=r=2560000/w=0/d=0, short=r=0/w=0/d=0
>>> latency : target=0, window=0, percentile=100.00%, depth=32
>>>
>>> Run status group 0 (all jobs):
>>> READ: io=10000MB, aggrb=183295KB/s, minb=183295KB/s, maxb=183295KB/s, 
>>> mint=55866msec, maxt=55866msec
>>>
>>> Disk stats (read/write):
>>> dm-0: ios=0/61, merge=0/0, ticks=0/8, in_queue=8, util=0.01%, aggrios=0/29, 
>>> aggrmerge=0/32, aggrticks=0/8, aggrin_queue=8, aggrutil=0.01%
>>> sda: ios=0/29, merge=0/32, ticks=0/8, in_queue=8, util=0.01%
>>>
>
>
>
>
>
>
> --
> Regards, Irek Fasikhov
> Mobile: +79229045757
>



-- 
-Pushpesh
