> Let's take IOPS, assuming the spinners can do 50 (4k) synced sustained IOPS 
> (I hope they can do more ^^), we should be around 50x84/3 = 1400 IOPS, which 
> is far from rados bench (538) and fio (847). And surprisingly fio numbers are 
> greater than rados.
> 

I think the missing factor here is filesystem journal overhead - that would 
explain the strange numbers you are seeing and the low performance in rados 
bench. Every filesystem metadata operation has to do at least one (synced) op 
to the journal, and that's not only file creation but also file growth (or 
filling the holes). And that happens on the OSD as well as on the client 
filesystem side(!). You can actually see a hint of the client-side part in 
your 4k fio output: fio issued 254206 writes but rbd0 reports 320813 ios, so 
roughly a quarter on top of your benchmark writes is presumably filesystem 
metadata and log traffic.
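
If you want to see where those extra synced ops land, watching the drives on 
one of the OSD nodes while the 4k test runs should make it obvious; plain 
iostat from sysstat is enough, for example:

  iostat -x 1
  # compare w/s and w_await on the data spinners vs. the journal SSD
  # partitions - with small synced writes the spinners tend to be the ones
  # pegged at high await while the SSDs sit far below what your fio test
  # showed they can sustain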


To do a proper benchmark, fill the RBD-mounted filesystem completely with data 
first and then try again with fio on a preallocated file (and don't enable 
discard if that's supported).
Better yet, run fio on the block device itself, but write it over with dd 
if=/dev/zero first - something like the sketches below.
I think you'll get somewhat different numbers then.
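
Roughly like this, for example (file name, device and sizes are just 
placeholders, adjust to your setup; the block device variant obviously wipes 
whatever filesystem is on it, and overwriting the whole image takes a while 
since it has to allocate all the backing objects first):

  # variant 1: preallocated file - fill it with real data first, then benchmark it
  dd if=/dev/zero of=/mnt/rbd/fio-test.bin bs=1M count=8192 oflag=direct
  fio --name=prealloc-4k-randwrite --filename=/mnt/rbd/fio-test.bin --size=8G \
      --ioengine=libaio --direct=1 --sync=1 --rw=randwrite --bs=4k \
      --iodepth=1 --numjobs=32 --runtime=300 --group_reporting

  # variant 2: raw block device - overwrite it once with dd, then run fio on the device
  dd if=/dev/zero of=/dev/rbd0 bs=4M oflag=direct
  fio --name=blockdev-4k-randwrite --filename=/dev/rbd0 \
      --ioengine=libaio --direct=1 --sync=1 --rw=randwrite --bs=4k \
      --iodepth=1 --numjobs=32 --runtime=300 --group_reporting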
Of course whether that's representative of what your usage pattern might be is 
another story.

Can you tell us what workload should be running on this and what the 
expectations were?
Can you see something maxed out while the benchmark is running? (CPU or drives?)
Have you tried switching schedulers on the drives?
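
In case it helps, checking and switching the scheduler is quick (sdb below is 
just an example data drive name):

  cat /sys/block/sdb/queue/scheduler          # the one in brackets is active
  echo deadline > /sys/block/sdb/queue/scheduler

deadline or noop is usually worth a try on the OSD data drives; it's a runtime 
change, so it's easy to flip back.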

Jan

> On 02 Dec 2015, at 22:33, Adrien Gillard <gillard.adr...@gmail.com> wrote:
> 
> Hi everyone,
> 
>  
> I am currently testing our new cluster and I would like some feedback on the 
> numbers I am getting.
> 
>  
> For the hardware :
> 
> 7 x OSD : 2 x Intel 2640v3 (8x2.6GHz), 64GB RAM, 2x10Gbits LACP for public 
> net., 2x10Gbits LACP for cluster net., MTU 9000
> 
> 1 x MON : 2 x Intel 2630L (6x2GHz), 32GB RAM and Intel DC SSD, 2x10Gbits LACP 
> for public net., MTU 9000
> 
> 2 x MON : VMs (8 cores, 8GB RAM), backed by SSD
> 
>  
> Journals are 20GB partitions on SSD
> 
>  
> The system is CentOS 7.1 with stock kernel (3.10.0-229.20.1.el7.x86_64). No 
> particular system optimizations.
> 
>  
> Ceph is Infernalis from Ceph repository  : ceph version 9.2.0 
> (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299)
> 
>  
> [cephadm@cph-adm-01  ~/scripts]$ ceph -s
> 
>     cluster 259f65a3-d6c8-4c90-a9c2-71d4c3c55cce
> 
>      health HEALTH_OK
> 
>      monmap e1: 3 mons at 
> {clb-cph-frpar1-mon-02=x.x.x.2:6789/0,clb-cph-frpar2-mon-01=x.x.x.1:6789/0,clb-cph-frpar2-mon-03=x.x.x.3:6789/0}
> 
>             election epoch 62, quorum 0,1,2 
> clb-cph-frpar2-mon-01,clb-cph-frpar1-mon-02,clb-cph-frpar2-mon-03
> 
>      osdmap e844: 84 osds: 84 up, 84 in
> 
>             flags sortbitwise
> 
>       pgmap v111655: 3136 pgs, 3 pools, 3166 GB data, 19220 kobjects
> 
>             8308 GB used, 297 TB / 305 TB avail
> 
>                 3136 active+clean
> 
>  
> My ceph.conf :
> 
>  
> [global]
> 
> fsid = 259f65a3-d6c8-4c90-a9c2-71d4c3c55cce
> 
> mon_initial_members = clb-cph-frpar2-mon-01, clb-cph-frpar1-mon-02, 
> clb-cph-frpar2-mon-03
> 
> mon_host = x.x.x.1,x.x.x.2,x.x.x.3
> 
> auth_cluster_required = cephx
> 
> auth_service_required = cephx
> 
> auth_client_required = cephx
> 
> filestore_xattr_use_omap = true
> 
> public network = 10.25.25.0/24
> cluster network = 10.25.26.0/24
> debug_lockdep = 0/0
> 
> debug_context = 0/0
> 
> debug_crush = 0/0
> 
> debug_buffer = 0/0
> 
> debug_timer = 0/0
> 
> debug_filer = 0/0
> 
> debug_objecter = 0/0
> 
> debug_rados = 0/0
> 
> debug_rbd = 0/0
> 
> debug_journaler = 0/0
> 
> debug_objectcatcher = 0/0
> 
> debug_client = 0/0
> 
> debug_osd = 0/0
> 
> debug_optracker = 0/0
> 
> debug_objclass = 0/0
> 
> debug_filestore = 0/0
> 
> debug_journal = 0/0
> 
> debug_ms = 0/0
> 
> debug_monc = 0/0
> 
> debug_tp = 0/0
> 
> debug_auth = 0/0
> 
> debug_finisher = 0/0
> 
> debug_heartbeatmap = 0/0
> 
> debug_perfcounter = 0/0
> 
> debug_asok = 0/0
> 
> debug_throttle = 0/0
> 
> debug_mon = 0/0
> 
> debug_paxos = 0/0
> 
> debug_rgw = 0/0
> 
>  
> [osd]
> 
> osd journal size = 0
> 
> osd mount options xfs = "rw,noatime,inode64,logbufs=8,logbsize=256k"
> 
> filestore min sync interval = 5
> 
> filestore max sync interval = 15
> 
> filestore queue max ops = 2048
> 
> filestore queue max bytes = 1048576000
> 
> filestore queue committing max ops = 4096
> 
> filestore queue committing max bytes = 1048576000
> 
> filestore op thread = 32
> 
> filestore journal writeahead = true
> 
> filestore merge threshold = 40
> 
> filestore split multiple = 8
> 
>  
> journal max write bytes = 1048576000
> 
> journal max write entries = 4096
> 
> journal queue max ops = 8092
> 
> journal queue max bytes = 1048576000
> 
>  
> osd max write size = 512
> 
> osd op threads = 16
> 
> osd disk threads = 2
> 
> osd op num threads per shard = 3
> 
> osd op num shards = 10
> 
> osd map cache size = 1024
> 
> osd max backfills = 1
> 
> osd recovery max active = 2
> 
>  
> I have set up 2 pools : one for cache with 3x replication in front of an EC 
> pool. At the moment I am only interested in the cache pool, so no 
> promotions/flushes/evictions happen.
> 
> (I know, I am using the same set of OSDs for hot and cold data, but in my use 
> case they should not be used at the same time.)
> 
>  
> I am accessing the cluster via RBD volumes mapped with the kernel module on 
> CentOS 7.1. These volumes are formatted in XFS on the clients.
> 
>  
> The journal SSDs seem to perform quite well according to the results of 
> Sebastien Han’s benchmark suggestion (they are Sandisk) :
> 
> write: io=22336MB, bw=381194KB/s, iops=95298, runt= 60001msec (this is for 
> numjob=10)
> 
>  
> Here are the rados bench tests :
> 
>  
> rados bench -p rbdcache 120 write -b 4K -t 32 --no-cleanup
> 
> 
> 
> Total time run:         121.410763
> 
> Total writes made:      65357
> 
> Write size:             4096
> 
> Bandwidth (MB/sec):     2.1
> 
> Stddev Bandwidth:       0.597
> 
> Max bandwidth (MB/sec): 3.89
> 
> Min bandwidth (MB/sec): 0.00781
> 
> Average IOPS:           538
> 
> Stddev IOPS:            152
> 
> Max IOPS:               995
> 
> Min IOPS:               2
> 
> Average Latency:        0.0594
> 
> Stddev Latency:         0.18
> 
> Max latency:            2.82
> 
> Min latency:            0.00494
> 
>  
> And the results of the fio test with the following parameters :
> 
>  
> [global]
> 
> size=8G
> 
> runtime=300
> 
> ioengine=libaio
> 
> invalidate=1
> 
> direct=1
> 
> sync=1
> 
> fsync=1
> 
> numjobs=32
> 
> rw=randwrite
> 
> name=4k-32-1-randwrite-libaio
> 
> blocksize=4K
> 
> iodepth=1
> 
> directory=/mnt/rbd
> 
> group_reporting=1
> 
> 
> 
> 4k-32-1-randwrite-libaio: (groupid=0, jobs=32): err= 0: pid=20442: Wed Dec  2 
> 21:38:30 2015
> 
>   write: io=992.11MB, bw=3389.3KB/s, iops=847, runt=300011msec
> 
>     slat (usec): min=5, max=4726, avg=40.32, stdev=41.28
> 
>     clat (msec): min=2, max=2208, avg=19.35, stdev=74.34
> 
>      lat (msec): min=2, max=2208, avg=19.39, stdev=74.34
> 
>     clat percentiles (msec):
> 
>      |  1.00th=[    3],  5.00th=[    4], 10.00th=[    4], 20.00th=[    4],
> 
>      | 30.00th=[    4], 40.00th=[    5], 50.00th=[    5], 60.00th=[    5],
> 
>      | 70.00th=[    6], 80.00th=[    7], 90.00th=[   38], 95.00th=[   63],
> 
>      | 99.00th=[  322], 99.50th=[  570], 99.90th=[ 1074], 99.95th=[ 1221],
> 
>      | 99.99th=[ 1532]
> 
>     bw (KB  /s): min=    1, max=  448, per=3.64%, avg=123.48, stdev=102.09
> 
>     lat (msec) : 4=30.30%, 10=51.27%, 20=1.71%, 50=9.91%, 100=4.03%
> 
>     lat (msec) : 250=1.55%, 500=0.62%, 750=0.33%, 1000=0.16%
> 
>   cpu          : usr=0.09%, sys=0.25%, ctx=963114, majf=0, minf=928
> 
>   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
> 
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
> >=64=0.0%
> 
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
> >=64=0.0%
> 
>      issued    : total=r=0/w=254206/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
> 
>      latency   : target=0, window=0, percentile=100.00%, depth=1
> 
> 
> 
> Run status group 0 (all jobs):
> 
>   WRITE: io=992.11MB, aggrb=3389KB/s, minb=3389KB/s, maxb=3389KB/s, 
> mint=300011msec, maxt=300011msec
> 
> 
> 
> Disk stats (read/write):
> 
>   rbd0: ios=0/320813, merge=0/10001, ticks=0/5670847, in_queue=5677825, 
> util=100.00%
> 
> 
> 
> 
> And a job closer to what the actual workload would be (blocksize=200K, 
> numjob=16, QD=32)
> 
> 
> 
> 200k-16-32-randwrite-libaio: (groupid=0, jobs=16): err= 0: pid=4828: Wed Dec  
> 2 18:58:53 2015
> 
>   write: io=47305MB, bw=161367KB/s, iops=806, runt=300189msec
> 
>     slat (usec): min=17, max=358430, avg=155.11, stdev=2361.49
> 
>     clat (msec): min=9, max=3584, avg=613.88, stdev=168.68
> 
>      lat (msec): min=10, max=3584, avg=614.04, stdev=168.66
> 
>     clat percentiles (msec):
> 
>      |  1.00th=[  375],  5.00th=[  469], 10.00th=[  502], 20.00th=[  537],
> 
>      | 30.00th=[  553], 40.00th=[  578], 50.00th=[  594], 60.00th=[  603],
> 
>      | 70.00th=[  627], 80.00th=[  652], 90.00th=[  701], 95.00th=[  881],
> 
>      | 99.00th=[ 1205], 99.50th=[ 1483], 99.90th=[ 2638], 99.95th=[ 2671],
> 
>      | 99.99th=[ 2999]
> 
>     bw (KB  /s): min=  260, max=18181, per=6.31%, avg=10189.40, stdev=2009.86
> 
>     lat (msec) : 10=0.01%, 20=0.01%, 50=0.01%, 100=0.02%, 250=0.08%
> 
>     lat (msec) : 500=9.26%, 750=83.21%, 1000=4.09%
> 
>   cpu          : usr=0.22%, sys=0.55%, ctx=719279, majf=0, minf=433
> 
>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=99.8%, >=64=0.0%
> 
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
> >=64=0.0%
> 
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, 
> >=64=0.0%
> 
>      issued    : total=r=0/w=242203/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
> 
>      latency   : target=0, window=0, percentile=100.00%, depth=32
> 
> 
> 
> Run status group 0 (all jobs):
> 
>   WRITE: io=47305MB, aggrb=161367KB/s, minb=161367KB/s, maxb=161367KB/s, 
> mint=300189msec, maxt=300189msec
> 
> 
> 
> Disk stats (read/write):
> 
>   rbd0: ios=1/287809, merge=0/18393, ticks=50/5887593, in_queue=5887504, 
> util=99.91%
> 
> 
> 
> 
> 
> The 4k block performance does not interest me so much but is given as a 
> reference. I am more looking for throughput, but anyway, the numbers seem 
> quite low.
> 
> 
> 
> Let's take IOPS, assuming the spinners can do 50 (4k) synced sustained IOPS 
> (I hope they can do more ^^), we should be around 50x84/3 = 1400 IOPS, which 
> is far from rados bench (538) and fio (847). And surprisingly fio numbers are 
> greater than rados.
> 
> 
> 
> So I don't know whether I am missing something here or if something is going 
> wrong (maybe both!).
> 
> 
> 
> Any input would be very valuable.
> 
> 
> 
> Thank you,
> 
> 
> 
> Adrien
> 

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
