Hi Kris,

Indeed, I am seeing some latency spikes, and they seem to be correlated with spikes in throughput and cluster-wide IOPS. I also see spikes on the OSD data disks (I guess this is when the journal is flushed), but IO on the journals themselves is quite steady. I have already tuned the OSD filestore and journal parameters a bit to check that there wasn't a limit hidden somewhere that could explain the behaviour I originally didn't understand.
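For reference, the kind of fio job I have in mind to capture and graph the per-IO latency is roughly the following. This is only a minimal sketch; the device path, log prefix and runtime are placeholders.

    [lat-probe]
    # log completion latency over time so the spikes can be lined up with OSD/journal IO
    ioengine=libaio
    direct=1
    rw=randwrite
    blocksize=4k
    iodepth=1
    time_based
    runtime=600
    filename=/dev/rbd0
    # write_lat_log produces rbd_lat_*.log files for graphing;
    # log_avg_msec averages samples so the logs stay small enough to plot
    write_lat_log=rbd_lat
    log_avg_msec=100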
As you said I will need to check the behavior of the cluster under the actual workload and adjust accordingly. That should happen some time next week. Thanks for your input :) On Wed, Dec 9, 2015 at 5:10 PM, Kris Gillespie <[email protected]> wrote: > One thing I noticed with all my testing, as the speed difference between > the SSDs and the spinning rust can be quite high and as your journal needs > to flush every X bytes (configurable), the impact of this flush can be > hard, as IO to this journal will stop until it’s finished (I believe). > Something to see, run a fio test but also log the latency stats and then > graph them. Should make the issue pretty clear. I’ll predict you’re gonna > see some spikes. > > If so, you may need to > > a) decide if its a problem with the future defined workload - maybe it’s > not so bursty…. > b) have a look at > http://docs.ceph.com/docs/hammer/rados/configuration/journal-ref/ and > maybe tweak the “journal max writes bytes” or the others > > There won’t be a golden rule here however and it’s one of the reasons some > benchmarks can lead to unfounded worrying. > > Cheers > > Kris > > > On 04 Dec 2015, at 15:10, Jan Schermer <[email protected]> wrote: > > > On 04 Dec 2015, at 14:31, Adrien Gillard <[email protected]> wrote: > > After some more tests : > > - The pool being used as cache pool has no impact on performance, I get > the same results with a "dedicated" replicated pool. > - You are right Jan, on raw devices I get better performance on a volume > if I fill it first, or at least if I write a zone that already has been > allocated > - The same seem to apply when the test is run on the mounted filesystem. > > > Yeah. The the first (raw device) is because the objects on OSDs get > "thick" in the process. > The second (filesystem) is because of both the OSD objects getting thick > and the guest filesystem getting thick. > Preallocating the space can speed up things considerably (like 100x)). > Unfortunately I haven't found a way to convince fallocate() &co. to thick > provision files. > > Jan > > > > > > On Thu, Dec 3, 2015 at 2:49 PM, Adrien Gillard <[email protected]> > wrote: > >> I did some more tests : >> >> fio on a raw RBD volume (4K, numjob=32, QD=1) gives me around 3000 IOPS >> >> I also tuned xfs mount options on client (I realized I didn't do that >> already) and with >> "largeio,inode64,swalloc,logbufs=8,logbsize=256k,attr2,auto,nodev,noatime,nodiratime" >> I get better performance : >> >> 4k-32-1-randwrite-libaio: (groupid=0, jobs=32): err= 0: pid=26793: Thu >> Dec 3 10:45:55 2015 >> write: io=1685.3MB, bw=5720.1KB/s, iops=1430, runt=301652msec >> slat (usec): min=5, max=1620, avg=41.61, stdev=25.82 >> clat (msec): min=1, max=4141, avg=14.61, stdev=112.55 >> lat (msec): min=1, max=4141, avg=14.65, stdev=112.55 >> clat percentiles (msec): >> | 1.00th=[ 3], 5.00th=[ 4], 10.00th=[ 4], 20.00th=[ 4], >> | 30.00th=[ 4], 40.00th=[ 5], 50.00th=[ 5], 60.00th=[ 5], >> | 70.00th=[ 5], 80.00th=[ 6], 90.00th=[ 7], 95.00th=[ 7], >> | 99.00th=[ 227], 99.50th=[ 717], 99.90th=[ 1844], 99.95th=[ 2245], >> | 99.99th=[ 3097] >> >> So, more than 50% improvement but it actually varies quite a lot between >> tests (sometimes I get a bit more than 1000). If I run the test fo 30 >> minutes it drops to 900 IOPS. >> >> As you suggested I also filled a volume with zeros (dd if=/dev/zero >> of=/dev/rbd1 bs=1M) and then ran fio on the raw device, I didn't see a lot >> of improvement. 
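One thing I still want to try on the filesystem side is pre-writing the fio data files themselves, so that both the guest XFS and the underlying RADOS objects are allocated before the measurement starts (the same idea as the dd fill on the raw device above). A rough sketch, assuming fio's default jobname.jobnum.filenum file naming and the 32 x 8G job quoted above; sizes and paths are placeholders:

    # hypothetical prefill of the 32 fio working files before the real run
    for i in $(seq 0 31); do
        dd if=/dev/zero of=/mnt/rbd/4k-32-1-randwrite-libaio.$i.0 bs=1M count=8192 oflag=direct
    done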
>> >> If I run fio test directly on block devices I seem to saturate the >> spinners, [1] is a graph of IO load on one of the OSD host. >> [2] is the same OSD graph but when the test is done on a device mounted >> and formatted with XFS on the client. >> If I get half of the IOPS on the XFS volume because of the journal, >> shouldn't I get the same amount of IOPS on the backend ? >> [3] shows what happen if I run the test for 30 minutes. >> >> During the fio tests on the raw device, load average on the OSD servers >> increases up to 13/14 and I get a bit of iowait (I guess because the OSD >> are busy) >> During the fio tests on the raw device, load average on the OSD servers >> peaks at the beginning and decreases to 5/6, but goes trough the roof on >> the client. >> Scheduler is deadline for all the drives, I didn't try to change it yet. >> >> What I don't understand, even with your explanations, are the rados >> results. From what I understand it performs at the RADOS level and thus >> should not be impacted by client filesystem. >> Given the results above I guess you are right and this has to do with the >> client filesystem. >> >> The cluster will be used for backups, write IO size during backups is >> around 150/200K (I guess mostly sequential) and I am looking for the >> highest bandwith and parallelization. >> >> @Nick, I will try to create a new stand alone replicated pool. >> >> >> [1] http://postimg.org/image/qvtvdq1n1/ >> [2] http://postimg.org/image/nhf6lzwgl/ >> [3] http://postimg.org/image/h7l0obw7h/ >> >> On Thu, Dec 3, 2015 at 1:30 PM, Nick Fisk <[email protected]> wrote: >> >>> Couple of things to check >>> >>> 1. Can you create just a normal non cached pool and test >>> performance to rule out any funnies going on there. >>> >>> 2. Can you also run something like iostat during the benchmarks >>> and see if it looks like all your disks are getting saturated. >>> >>> >>> >>> >>> >>> _____________________________________________ >>> *From:* ceph-users [mailto:[email protected] >>> <[email protected]>]* On Behalf Of* Adrien Gillard >>> *Sent:* 02 December 2015 21:33 >>> *To:* [email protected] >>> *Subject:* [ceph-users] New cluster performance analysis >>> >>> Hi everyone, >>> >>> >>> I am currently testing our new cluster and I would like some >>> feedback on the numbers I am getting. >>> >>> >>> For the hardware : >>> >>> 7 x OSD : 2 x Intel 2640v3 (8x2.6GHz), 64B RAM, 2x10Gbits LACP >>> for public net., 2x10Gbits LACP for cluster net., MTU 9000 >>> >>> 1 x MON : 2 x Intel 2630L (6x2GHz), 32GB RAM and Intel DC SSD, >>> 2x10Gbits LACP for public net., MTU 9000 >>> >>> 2 x MON : VMs (8 cores, 8GB RAM), backed by SSD >>> >>> >>> Journals are 20GB partitions on SSD >>> >>> >>> The system is CentOS 7.1 with stock kernel >>> (3.10.0-229.20.1.el7.x86_64). No particular system optimizations. 
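As a side note, the per-disk view behind those graphs can be reproduced with plain iostat on each OSD host while the benchmark runs; something along these lines, with the device names being placeholders for the spinners:

    # extended per-device stats every 5 seconds; %util close to 100 and a rising
    # await on the data disks would confirm the spinners are the bottleneck
    iostat -xmt /dev/sd[b-m] 5

    # current and available IO schedulers for one of the data disks
    cat /sys/block/sdb/queue/scheduler
    # e.g. to try noop instead of deadline
    echo noop > /sys/block/sdb/queue/scheduler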
>>> >>> >>> Ceph is Infernalis from Ceph repository : ceph version 9.2.0 >>> (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299) >>> >>> >>> [cephadm@cph-adm-01 ~/scripts]$ ceph -s >>> >>> cluster 259f65a3-d6c8-4c90-a9c2-71d4c3c55cce >>> >>> health HEALTH_OK >>> >>> monmap e1: 3 mons at >>> >>> {clb-cph-frpar1-mon-02=x.x.x.2:6789/0,clb-cph-frpar2-mon-01=x.x.x.1:6789/0,clb-cph-frpar2-mon-03=x.x.x.3:6789/0} >>> >>> election epoch 62, quorum 0,1,2 >>> clb-cph-frpar2-mon-01,clb-cph-frpar1-mon-02,clb-cph-frpar2-mon-03 >>> >>> osdmap e844: 84 osds: 84 up, 84 in >>> >>> flags sortbitwise >>> >>> pgmap v111655: 3136 pgs, 3 pools, 3166 GB data, 19220 >>> kobjects >>> >>> 8308 GB used, 297 TB / 305 TB avail >>> >>> 3136 active+clean >>> >>> >>> My ceph.conf : >>> >>> >>> [global] >>> >>> fsid = 259f65a3-d6c8-4c90-a9c2-71d4c3c55cce >>> >>> mon_initial_members = clb-cph-frpar2-mon-01, >>> clb-cph-frpar1-mon-02, clb-cph-frpar2-mon-03 >>> >>> mon_host = x.x.x.1,x.x.x.2,x.x.x.3 >>> >>> auth_cluster_required = cephx >>> >>> auth_service_required = cephx >>> >>> auth_client_required = cephx >>> >>> filestore_xattr_use_omap = true >>> >>> public network = *10.25.25.0/24* >>> >>> <http://xo4t.mj.am/link/xo4t/r36izu2/1/7CTRRFQ_Kf3wNWPYd7QbWA/aHR0cDovLzEwLjI1LjI1LjAvMjQ> >>> >>> cluster network = *10.25.26.0/24* >>> >>> <http://xo4t.mj.am/link/xo4t/r36izu2/2/HsIkz-1efpdFK1tjIFEU0A/aHR0cDovLzEwLjI1LjI2LjAvMjQ> >>> >>> debug_lockdep = 0/0 >>> >>> debug_context = 0/0 >>> >>> debug_crush = 0/0 >>> >>> debug_buffer = 0/0 >>> >>> debug_timer = 0/0 >>> >>> debug_filer = 0/0 >>> >>> debug_objecter = 0/0 >>> >>> debug_rados = 0/0 >>> >>> debug_rbd = 0/0 >>> >>> debug_journaler = 0/0 >>> >>> debug_objectcatcher = 0/0 >>> >>> debug_client = 0/0 >>> >>> debug_osd = 0/0 >>> >>> debug_optracker = 0/0 >>> >>> debug_objclass = 0/0 >>> >>> debug_filestore = 0/0 >>> >>> debug_journal = 0/0 >>> >>> debug_ms = 0/0 >>> >>> debug_monc = 0/0 >>> >>> debug_tp = 0/0 >>> >>> debug_auth = 0/0 >>> >>> debug_finisher = 0/0 >>> >>> debug_heartbeatmap = 0/0 >>> >>> debug_perfcounter = 0/0 >>> >>> debug_asok = 0/0 >>> >>> debug_throttle = 0/0 >>> >>> debug_mon = 0/0 >>> >>> debug_paxos = 0/0 >>> >>> debug_rgw = 0/0 >>> >>> >>> [osd] >>> >>> osd journal size = 0 >>> >>> osd mount options xfs = >>> "rw,noatime,inode64,logbufs=8,logbsize=256k" >>> >>> filestore min sync interval = 5 >>> >>> filestore max sync interval = 15 >>> >>> filestore queue max ops = 2048 >>> >>> filestore queue max bytes = 1048576000 >>> >>> filestore queue committing max ops = 4096 >>> >>> filestore queue committing max bytes = 1048576000 >>> >>> filestore op thread = 32 >>> >>> filestore journal writeahead = true >>> >>> filestore merge threshold = 40 >>> >>> filestore split multiple = 8 >>> >>> >>> journal max write bytes = 1048576000 >>> >>> journal max write entries = 4096 >>> >>> journal queue max ops = 8092 >>> >>> journal queue max bytes = 1048576000 >>> >>> >>> osd max write size = 512 >>> >>> osd op threads = 16 >>> >>> osd disk threads = 2 >>> >>> osd op num threads per shard = 3 >>> >>> osd op num shards = 10 >>> >>> osd map cache size = 1024 >>> >>> osd max backfills = 1 >>> >>> osd recovery max active = 2 >>> >>> >>> I have set up 2 pools : one for cache with 3x replication in >>> front of an EC pool. At the moment I am only interested in the cache >>> pool, >>> so no promotions/flushes/evictions happen. >>> >>> (I know, I am using the same set of OSD for hot and cold data, >>> but in my use case they should not be used at the same time.) 
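For the standalone replicated pool Nick suggested, the plan is simply something along these lines. The pool name and PG count are only examples, sized roughly for 84 OSDs at 3x replication:

    # plain 3x replicated pool, no cache tier, just for benchmarking
    ceph osd pool create benchtest 2048 2048 replicated
    ceph osd pool set benchtest size 3
    rados bench -p benchtest 120 write -b 4K -t 32 --no-cleanup
    # drop the pool afterwards to get rid of the benchmark objects
    ceph osd pool delete benchtest benchtest --yes-i-really-really-mean-it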
>>> >>> >>> I am accessing the cluster via RBD volumes mapped with the kernel >>> module on CentOS 7.1. These volumes are formatted in XFS on the >>> clients. >>> >>> >>> The journal SSDs seem to perform quite well according to the >>> results of Sebastien Han’s benchmark suggestion (they are Sandisk) : >>> >>> write: io=22336MB, bw=381194KB/s, iops=95298, runt= 60001msec >>> (this is for numjob=10) >>> >>> >>> Here are the rados bench tests : >>> >>> >>> rados bench -p rbdcache 120 write -b 4K -t 32 --no-cleanup >>> >>> Total time run: 121.410763 >>> >>> Total writes made: 65357 >>> >>> Write size: 4096 >>> >>> Bandwidth (MB/sec): 2.1 >>> >>> Stddev Bandwidth: 0.597 >>> >>> Max bandwidth (MB/sec): 3.89 >>> >>> Min bandwidth (MB/sec): 0.00781 >>> >>> Average IOPS: 538 >>> >>> Stddev IOPS: 152 >>> >>> Max IOPS: 995 >>> >>> Min IOPS: 2 >>> >>> Average Latency: 0.0594 >>> >>> Stddev Latency: 0.18 >>> >>> Max latency: 2.82 >>> >>> Min latency: 0.00494 >>> >>> >>> And the results of the fio test with the following parameters : >>> >>> >>> [global] >>> >>> size=8G >>> >>> runtime=300 >>> >>> ioengine=libaio >>> >>> invalidate=1 >>> >>> direct=1 >>> >>> sync=1 >>> >>> fsync=1 >>> >>> numjobs=32 >>> >>> rw=randwrite >>> >>> name=4k-32-1-randwrite-libaio >>> >>> blocksize=4K >>> >>> iodepth=1 >>> >>> directory=/mnt/rbd >>> >>> group_reporting=1 >>> >>> 4k-32-1-randwrite-libaio: (groupid=0, jobs=32): err= 0: >>> pid=20442: Wed Dec 2 21:38:30 2015 >>> >>> write: io=992.11MB, bw=3389.3KB/s, iops=847, runt=300011msec >>> >>> slat (usec): min=5, max=4726, avg=40.32, stdev=41.28 >>> >>> clat (msec): min=2, max=2208, avg=19.35, stdev=74.34 >>> >>> lat (msec): min=2, max=2208, avg=19.39, stdev=74.34 >>> >>> clat percentiles (msec): >>> >>> | 1.00th=[ 3], 5.00th=[ 4], 10.00th=[ 4], >>> 20.00th=[ 4], >>> >>> | 30.00th=[ 4], 40.00th=[ 5], 50.00th=[ 5], >>> 60.00th=[ 5], >>> >>> | 70.00th=[ 6], 80.00th=[ 7], 90.00th=[ 38], >>> 95.00th=[ 63], >>> >>> | 99.00th=[ 322], 99.50th=[ 570], 99.90th=[ 1074], >>> 99.95th=[ 1221], >>> >>> | 99.99th=[ 1532] >>> >>> bw (KB /s): min= 1, max= 448, per=3.64%, avg=123.48, >>> stdev=102.09 >>> >>> lat (msec) : 4=30.30%, 10=51.27%, 20=1.71%, 50=9.91%, >>> 100=4.03% >>> >>> lat (msec) : 250=1.55%, 500=0.62%, 750=0.33%, 1000=0.16% >>> >>> cpu : usr=0.09%, sys=0.25%, ctx=963114, majf=0, >>> minf=928 >>> >>> IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, >>> 32=0.0%, >=64=0.0% >>> >>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, >>> 64=0.0%, >=64=0.0% >>> >>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, >>> 64=0.0%, >=64=0.0% >>> >>> issued : total=r=0/w=254206/d=0, short=r=0/w=0/d=0, >>> drop=r=0/w=0/d=0 >>> >>> latency : target=0, window=0, percentile=100.00%, depth=1 >>> >>> Run status group 0 (all jobs): >>> >>> WRITE: io=992.11MB, aggrb=3389KB/s, minb=3389KB/s, >>> maxb=3389KB/s, mint=300011msec, maxt=300011msec >>> >>> Disk stats (read/write): >>> >>> rbd0: ios=0/320813, merge=0/10001, ticks=0/5670847, >>> in_queue=5677825, util=100.00% >>> >>> >>> And a job closer to what the actual workload would be >>> (blocksize=200K, numjob=16, QD=32) >>> >>> 200k-16-32-randwrite-libaio: (groupid=0, jobs=16): err= 0: >>> pid=4828: Wed Dec 2 18:58:53 2015 >>> >>> write: io=47305MB, bw=161367KB/s, iops=806, runt=300189msec >>> >>> slat (usec): min=17, max=358430, avg=155.11, stdev=2361.49 >>> >>> clat (msec): min=9, max=3584, avg=613.88, stdev=168.68 >>> >>> lat (msec): min=10, max=3584, avg=614.04, stdev=168.66 >>> >>> clat percentiles (msec): >>> >>> | 1.00th=[ 
375], 5.00th=[ 469], 10.00th=[ 502], >>> 20.00th=[ 537], >>> >>> | 30.00th=[ 553], 40.00th=[ 578], 50.00th=[ 594], >>> 60.00th=[ 603], >>> >>> | 70.00th=[ 627], 80.00th=[ 652], 90.00th=[ 701], >>> 95.00th=[ 881], >>> >>> | 99.00th=[ 1205], 99.50th=[ 1483], 99.90th=[ 2638], >>> 99.95th=[ 2671], >>> >>> | 99.99th=[ 2999] >>> >>> bw (KB /s): min= 260, max=18181, per=6.31%, avg=10189.40, >>> stdev=2009.86 >>> >>> lat (msec) : 10=0.01%, 20=0.01%, 50=0.01%, 100=0.02%, >>> 250=0.08% >>> >>> lat (msec) : 500=9.26%, 750=83.21%, 1000=4.09% >>> >>> cpu : usr=0.22%, sys=0.55%, ctx=719279, majf=0, >>> minf=433 >>> >>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, >>> 32=99.8%, >=64=0.0% >>> >>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, >>> 64=0.0%, >=64=0.0% >>> >>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, >>> 64=0.0%, >=64=0.0% >>> >>> issued : total=r=0/w=242203/d=0, short=r=0/w=0/d=0, >>> drop=r=0/w=0/d=0 >>> >>> latency : target=0, window=0, percentile=100.00%, depth=32 >>> >>> Run status group 0 (all jobs): >>> >>> WRITE: io=47305MB, aggrb=161367KB/s, minb=161367KB/s, >>> maxb=161367KB/s, mint=300189msec, maxt=300189msec >>> >>> Disk stats (read/write): >>> >>> rbd0: ios=1/287809, merge=0/18393, ticks=50/5887593, >>> in_queue=5887504, util=99.91% >>> >>> The 4k block performance does not interest me so much but is >>> given as a reference. I am more looking for throughput, but anyway, >>> the >>> numbers seem quite low. >>> >>> Let's take IOPS, assuming the spinners can do 50 (4k) synced >>> sustained IOPS (I hope they can do more ^^), we should be around >>> 50x84/3 = >>> 1400 IOPS, which is far from rados bench (538) and fio (847). And >>> surprisingly fio numbers are greater than rados. >>> >>> So I don't know wether I am missing something here or if >>> something is going wrong (maybe both !). >>> >>> Any input would be very valuable. >>> >>> Thank you, >>> >>> Adrien << File: ATT00001.txt >> >>> >>> >>> >> >> >> -- >> >> ----------------------------------------------------------------------------------------- >> Adrien GILLARD >> >> +33 (0)6 29 06 16 31 >> [email protected] >> > > > > -- > > ----------------------------------------------------------------------------------------- > Adrien GILLARD > > +33 (0)6 29 06 16 31 > [email protected] > _______________________________________________ > ceph-users mailing list > [email protected] > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > > _______________________________________________ > ceph-users mailing list > [email protected] > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > > -- ----------------------------------------------------------------------------------------- Adrien GILLARD +33 (0)6 29 06 16 31 [email protected]
