Re: [ceph-users] New cluster performance analysis

Adrien Gillard Thu, 10 Dec 2015 00:38:38 -0800

Hi Kris,

Indeed I am seeing some spikes on the latency, they seem to be linked to
other spikes on throughput and cluster global IOPS. I also see some spikes
on the OSD (I guess this is when the journal is flushed) but IO on the
journals are quite steady. I already tuned a bit the osd filestore and
journal parameters to check there wasn't a limitation hidden somewhere that
could explain what I originally didn't understand.


As you said I will need to check the behavior of the cluster under the
actual workload and adjust accordingly. That should happen some time next
week.

Thanks for your input :)

On Wed, Dec 9, 2015 at 5:10 PM, Kris Gillespie <[email protected]> wrote:

> One thing I noticed with all my testing, as the speed difference between
> the SSDs and the spinning rust can be quite high and as your journal needs
> to flush every X bytes (configurable), the impact of this flush can be
> hard, as IO to this journal will stop until it’s finished (I believe).
> Something to see, run a fio test but also log the latency stats and then
> graph them. Should make the issue pretty clear. I’ll predict you’re gonna
> see some spikes.
>
> If so, you may need to
>
> a) decide if its a problem with the future defined workload - maybe it’s
> not so bursty….
> b) have a look at
> http://docs.ceph.com/docs/hammer/rados/configuration/journal-ref/ and
> maybe tweak the “journal max writes bytes” or the others
>
> There won’t be a golden rule here however and it’s one of the reasons some
> benchmarks can lead to unfounded worrying.
>
> Cheers
>
> Kris
>
>
> On 04 Dec 2015, at 15:10, Jan Schermer <[email protected]> wrote:
>
>
> On 04 Dec 2015, at 14:31, Adrien Gillard <[email protected]> wrote:
>
> After some more tests :
>
>  - The pool being used as cache pool has no impact on performance, I get
> the same results with a "dedicated" replicated pool.
>  - You are right Jan, on raw devices I get better performance on a volume
> if I fill it first, or at least if I write a zone that already has been
> allocated
>  - The same seem to apply when the test is run on the mounted filesystem.
>
>
> Yeah. The the first (raw device) is because the objects on OSDs get
> "thick" in the process.
> The second (filesystem) is because of both the OSD objects getting thick
> and the guest filesystem getting thick.
> Preallocating the space can speed up things considerably (like 100x)).
> Unfortunately I haven't found a way to convince fallocate() &co. to thick
> provision files.
>
> Jan
>
>
>
>
>
> On Thu, Dec 3, 2015 at 2:49 PM, Adrien Gillard <[email protected]>
> wrote:
>
>> I did some more tests :
>>
>> fio on a raw RBD volume (4K, numjob=32, QD=1) gives me around 3000 IOPS
>>
>> I also tuned xfs mount options on client (I realized I didn't do that
>> already) and with
>> "largeio,inode64,swalloc,logbufs=8,logbsize=256k,attr2,auto,nodev,noatime,nodiratime"
>> I get better performance :
>>
>> 4k-32-1-randwrite-libaio: (groupid=0, jobs=32): err= 0: pid=26793: Thu
>> Dec  3 10:45:55 2015
>>   write: io=1685.3MB, bw=5720.1KB/s, iops=1430, runt=301652msec
>>     slat (usec): min=5, max=1620, avg=41.61, stdev=25.82
>>     clat (msec): min=1, max=4141, avg=14.61, stdev=112.55
>>      lat (msec): min=1, max=4141, avg=14.65, stdev=112.55
>>     clat percentiles (msec):
>>      |  1.00th=[    3],  5.00th=[    4], 10.00th=[    4], 20.00th=[    4],
>>      | 30.00th=[    4], 40.00th=[    5], 50.00th=[    5], 60.00th=[    5],
>>      | 70.00th=[    5], 80.00th=[    6], 90.00th=[    7], 95.00th=[    7],
>>      | 99.00th=[  227], 99.50th=[  717], 99.90th=[ 1844], 99.95th=[ 2245],
>>      | 99.99th=[ 3097]
>>
>> So, more than 50% improvement but it actually varies quite a lot between
>> tests (sometimes I get a bit more than 1000). If I run the test fo 30
>> minutes it drops to 900 IOPS.
>>
>> As you suggested I also filled a volume with zeros (dd if=/dev/zero
>> of=/dev/rbd1 bs=1M) and then ran fio on the raw device, I didn't see a lot
>> of improvement.
>>
>> If I run fio test directly on block devices I seem to saturate the
>> spinners, [1] is a graph of IO load on one of the OSD host.
>> [2] is the same OSD graph but when the test is done on a device mounted
>> and formatted with XFS on the client.
>> If I get half of the IOPS on the XFS volume because of the journal,
>> shouldn't I get the same amount of IOPS on the backend ?
>> [3] shows what happen if I run the test for 30 minutes.
>>
>> During the fio tests on the raw device, load average on the OSD servers
>> increases up to 13/14 and I get a bit of iowait (I guess because the OSD
>> are busy)
>> During the fio tests on the raw device, load average on the OSD servers
>> peaks at the beginning and decreases to 5/6, but goes trough the roof on
>> the client.
>> Scheduler is deadline for all the drives, I didn't try to change it yet.
>>
>> What I don't understand, even with your explanations, are the rados
>> results. From what I understand it performs at the RADOS level and thus
>> should not be impacted by client filesystem.
>> Given the results above I guess you are right and this has to do with the
>> client filesystem.
>>
>> The cluster will be used for backups, write IO size during backups is
>> around 150/200K (I guess mostly sequential) and I am looking for the
>> highest bandwith and parallelization.
>>
>> @Nick, I will try to create a new stand alone replicated pool.
>>
>>
>> [1] http://postimg.org/image/qvtvdq1n1/
>> [2] http://postimg.org/image/nhf6lzwgl/
>> [3] http://postimg.org/image/h7l0obw7h/
>>
>> On Thu, Dec 3, 2015 at 1:30 PM, Nick Fisk <[email protected]> wrote:
>>
>>> Couple of things to check
>>>
>>> 1.      Can you create just a normal non cached pool and test
>>> performance to rule out any funnies going on there.
>>>
>>> 2.      Can you also run something like iostat during the benchmarks
>>> and see if it looks like all your disks are getting saturated.
>>>
>>>
>>>
>>>
>>>
>>>    _____________________________________________
>>>       *From:* ceph-users [mailto:[email protected]
>>>       <[email protected]>]* On Behalf Of* Adrien Gillard
>>>       *Sent:* 02 December 2015 21:33
>>>       *To:* [email protected]
>>>       *Subject:* [ceph-users] New cluster performance analysis
>>>
>>>       Hi everyone,
>>>
>>>
>>>       I am currently testing our new cluster and I would like some
>>>       feedback on the numbers I am getting.
>>>
>>>
>>>       For the hardware :
>>>
>>>       7 x OSD : 2 x Intel 2640v3 (8x2.6GHz), 64B RAM, 2x10Gbits LACP
>>>       for public net., 2x10Gbits LACP for cluster net., MTU 9000
>>>
>>>       1 x MON : 2 x Intel 2630L (6x2GHz), 32GB RAM and Intel DC SSD,
>>>       2x10Gbits LACP for public net., MTU 9000
>>>
>>>       2 x MON : VMs (8 cores, 8GB RAM), backed by SSD
>>>
>>>
>>>       Journals are 20GB partitions on SSD
>>>
>>>
>>>       The system is CentOS 7.1 with stock kernel
>>>       (3.10.0-229.20.1.el7.x86_64). No particular system optimizations.
>>>
>>>
>>>       Ceph is Infernalis from Ceph repository  : ceph version 9.2.0
>>>       (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299)
>>>
>>>
>>>       [cephadm@cph-adm-01  ~/scripts]$ ceph -s
>>>
>>>           cluster 259f65a3-d6c8-4c90-a9c2-71d4c3c55cce
>>>
>>>            health HEALTH_OK
>>>
>>>            monmap e1: 3 mons at
>>>       
>>> {clb-cph-frpar1-mon-02=x.x.x.2:6789/0,clb-cph-frpar2-mon-01=x.x.x.1:6789/0,clb-cph-frpar2-mon-03=x.x.x.3:6789/0}
>>>
>>>                   election epoch 62, quorum 0,1,2
>>>       clb-cph-frpar2-mon-01,clb-cph-frpar1-mon-02,clb-cph-frpar2-mon-03
>>>
>>>            osdmap e844: 84 osds: 84 up, 84 in
>>>
>>>                   flags sortbitwise
>>>
>>>             pgmap v111655: 3136 pgs, 3 pools, 3166 GB data, 19220
>>>       kobjects
>>>
>>>                   8308 GB used, 297 TB / 305 TB avail
>>>
>>>                       3136 active+clean
>>>
>>>
>>>       My ceph.conf :
>>>
>>>
>>>       [global]
>>>
>>>       fsid = 259f65a3-d6c8-4c90-a9c2-71d4c3c55cce
>>>
>>>       mon_initial_members = clb-cph-frpar2-mon-01,
>>>       clb-cph-frpar1-mon-02, clb-cph-frpar2-mon-03
>>>
>>>       mon_host = x.x.x.1,x.x.x.2,x.x.x.3
>>>
>>>       auth_cluster_required = cephx
>>>
>>>       auth_service_required = cephx
>>>
>>>       auth_client_required = cephx
>>>
>>>       filestore_xattr_use_omap = true
>>>
>>>       public network = *10.25.25.0/24*
>>>       
>>> <http://xo4t.mj.am/link/xo4t/r36izu2/1/7CTRRFQ_Kf3wNWPYd7QbWA/aHR0cDovLzEwLjI1LjI1LjAvMjQ>
>>>
>>>       cluster network = *10.25.26.0/24*
>>>       
>>> <http://xo4t.mj.am/link/xo4t/r36izu2/2/HsIkz-1efpdFK1tjIFEU0A/aHR0cDovLzEwLjI1LjI2LjAvMjQ>
>>>
>>>       debug_lockdep = 0/0
>>>
>>>       debug_context = 0/0
>>>
>>>       debug_crush = 0/0
>>>
>>>       debug_buffer = 0/0
>>>
>>>       debug_timer = 0/0
>>>
>>>       debug_filer = 0/0
>>>
>>>       debug_objecter = 0/0
>>>
>>>       debug_rados = 0/0
>>>
>>>       debug_rbd = 0/0
>>>
>>>       debug_journaler = 0/0
>>>
>>>       debug_objectcatcher = 0/0
>>>
>>>       debug_client = 0/0
>>>
>>>       debug_osd = 0/0
>>>
>>>       debug_optracker = 0/0
>>>
>>>       debug_objclass = 0/0
>>>
>>>       debug_filestore = 0/0
>>>
>>>       debug_journal = 0/0
>>>
>>>       debug_ms = 0/0
>>>
>>>       debug_monc = 0/0
>>>
>>>       debug_tp = 0/0
>>>
>>>       debug_auth = 0/0
>>>
>>>       debug_finisher = 0/0
>>>
>>>       debug_heartbeatmap = 0/0
>>>
>>>       debug_perfcounter = 0/0
>>>
>>>       debug_asok = 0/0
>>>
>>>       debug_throttle = 0/0
>>>
>>>       debug_mon = 0/0
>>>
>>>       debug_paxos = 0/0
>>>
>>>       debug_rgw = 0/0
>>>
>>>
>>>       [osd]
>>>
>>>       osd journal size = 0
>>>
>>>       osd mount options xfs =
>>>       "rw,noatime,inode64,logbufs=8,logbsize=256k"
>>>
>>>       filestore min sync interval = 5
>>>
>>>       filestore max sync interval = 15
>>>
>>>       filestore queue max ops = 2048
>>>
>>>       filestore queue max bytes = 1048576000
>>>
>>>       filestore queue committing max ops = 4096
>>>
>>>       filestore queue committing max bytes = 1048576000
>>>
>>>       filestore op thread = 32
>>>
>>>       filestore journal writeahead = true
>>>
>>>       filestore merge threshold = 40
>>>
>>>       filestore split multiple = 8
>>>
>>>
>>>       journal max write bytes = 1048576000
>>>
>>>       journal max write entries = 4096
>>>
>>>       journal queue max ops = 8092
>>>
>>>       journal queue max bytes = 1048576000
>>>
>>>
>>>       osd max write size = 512
>>>
>>>       osd op threads = 16
>>>
>>>       osd disk threads = 2
>>>
>>>       osd op num threads per shard = 3
>>>
>>>       osd op num shards = 10
>>>
>>>       osd map cache size = 1024
>>>
>>>       osd max backfills = 1
>>>
>>>       osd recovery max active = 2
>>>
>>>
>>>       I have set up 2 pools : one for cache with 3x replication in
>>>       front of an EC pool. At the moment I am only interested in the cache 
>>> pool,
>>>       so no promotions/flushes/evictions happen.
>>>
>>>       (I know, I am using the same set of OSD for hot and cold data,
>>>       but in my use case they should not be used at the same time.)
>>>
>>>
>>>       I am accessing the cluster via RBD volumes mapped with the kernel
>>>       module on CentOS 7.1. These volumes are formatted in XFS on the 
>>> clients.
>>>
>>>
>>>       The journal SSDs seem to perform quite well according to the
>>>       results of Sebastien Han’s benchmark suggestion (they are Sandisk) :
>>>
>>>       write: io=22336MB, bw=381194KB/s, iops=95298, runt= 60001msec
>>>       (this is for numjob=10)
>>>
>>>
>>>       Here are the rados bench tests :
>>>
>>>
>>>       rados bench -p rbdcache 120 write -b 4K -t 32 --no-cleanup
>>>
>>>       Total time run:         121.410763
>>>
>>>       Total writes made:      65357
>>>
>>>       Write size:             4096
>>>
>>>       Bandwidth (MB/sec):     2.1
>>>
>>>       Stddev Bandwidth:       0.597
>>>
>>>       Max bandwidth (MB/sec): 3.89
>>>
>>>       Min bandwidth (MB/sec): 0.00781
>>>
>>>       Average IOPS:           538
>>>
>>>       Stddev IOPS:            152
>>>
>>>       Max IOPS:               995
>>>
>>>       Min IOPS:               2
>>>
>>>       Average Latency:        0.0594
>>>
>>>       Stddev Latency:         0.18
>>>
>>>       Max latency:            2.82
>>>
>>>       Min latency:            0.00494
>>>
>>>
>>>       And the results of the fio test with the following parameters :
>>>
>>>
>>>       [global]
>>>
>>>       size=8G
>>>
>>>       runtime=300
>>>
>>>       ioengine=libaio
>>>
>>>       invalidate=1
>>>
>>>       direct=1
>>>
>>>       sync=1
>>>
>>>       fsync=1
>>>
>>>       numjobs=32
>>>
>>>       rw=randwrite
>>>
>>>       name=4k-32-1-randwrite-libaio
>>>
>>>       blocksize=4K
>>>
>>>       iodepth=1
>>>
>>>       directory=/mnt/rbd
>>>
>>>       group_reporting=1
>>>
>>>       4k-32-1-randwrite-libaio: (groupid=0, jobs=32): err= 0:
>>>       pid=20442: Wed Dec  2 21:38:30 2015
>>>
>>>         write: io=992.11MB, bw=3389.3KB/s, iops=847, runt=300011msec
>>>
>>>           slat (usec): min=5, max=4726, avg=40.32, stdev=41.28
>>>
>>>           clat (msec): min=2, max=2208, avg=19.35, stdev=74.34
>>>
>>>            lat (msec): min=2, max=2208, avg=19.39, stdev=74.34
>>>
>>>           clat percentiles (msec):
>>>
>>>            |  1.00th=[    3],  5.00th=[    4], 10.00th=[    4],
>>>       20.00th=[    4],
>>>
>>>            | 30.00th=[    4], 40.00th=[    5], 50.00th=[    5],
>>>       60.00th=[    5],
>>>
>>>            | 70.00th=[    6], 80.00th=[    7], 90.00th=[   38],
>>>       95.00th=[   63],
>>>
>>>            | 99.00th=[  322], 99.50th=[  570], 99.90th=[ 1074],
>>>       99.95th=[ 1221],
>>>
>>>            | 99.99th=[ 1532]
>>>
>>>           bw (KB  /s): min=    1, max=  448, per=3.64%, avg=123.48,
>>>       stdev=102.09
>>>
>>>           lat (msec) : 4=30.30%, 10=51.27%, 20=1.71%, 50=9.91%,
>>>       100=4.03%
>>>
>>>           lat (msec) : 250=1.55%, 500=0.62%, 750=0.33%, 1000=0.16%
>>>
>>>         cpu          : usr=0.09%, sys=0.25%, ctx=963114, majf=0,
>>>       minf=928
>>>
>>>         IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%,
>>>       32=0.0%, >=64=0.0%
>>>
>>>            submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%,
>>>       64=0.0%, >=64=0.0%
>>>
>>>            complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%,
>>>       64=0.0%, >=64=0.0%
>>>
>>>            issued    : total=r=0/w=254206/d=0, short=r=0/w=0/d=0,
>>>       drop=r=0/w=0/d=0
>>>
>>>            latency   : target=0, window=0, percentile=100.00%, depth=1
>>>
>>>       Run status group 0 (all jobs):
>>>
>>>         WRITE: io=992.11MB, aggrb=3389KB/s, minb=3389KB/s,
>>>       maxb=3389KB/s, mint=300011msec, maxt=300011msec
>>>
>>>       Disk stats (read/write):
>>>
>>>         rbd0: ios=0/320813, merge=0/10001, ticks=0/5670847,
>>>       in_queue=5677825, util=100.00%
>>>
>>>
>>>       And a job closer to what the actual workload would be
>>>       (blocksize=200K, numjob=16, QD=32)
>>>
>>>       200k-16-32-randwrite-libaio: (groupid=0, jobs=16): err= 0:
>>>       pid=4828: Wed Dec  2 18:58:53 2015
>>>
>>>         write: io=47305MB, bw=161367KB/s, iops=806, runt=300189msec
>>>
>>>           slat (usec): min=17, max=358430, avg=155.11, stdev=2361.49
>>>
>>>           clat (msec): min=9, max=3584, avg=613.88, stdev=168.68
>>>
>>>            lat (msec): min=10, max=3584, avg=614.04, stdev=168.66
>>>
>>>           clat percentiles (msec):
>>>
>>>            |  1.00th=[  375],  5.00th=[  469], 10.00th=[  502],
>>>       20.00th=[  537],
>>>
>>>            | 30.00th=[  553], 40.00th=[  578], 50.00th=[  594],
>>>       60.00th=[  603],
>>>
>>>            | 70.00th=[  627], 80.00th=[  652], 90.00th=[  701],
>>>       95.00th=[  881],
>>>
>>>            | 99.00th=[ 1205], 99.50th=[ 1483], 99.90th=[ 2638],
>>>       99.95th=[ 2671],
>>>
>>>            | 99.99th=[ 2999]
>>>
>>>           bw (KB  /s): min=  260, max=18181, per=6.31%, avg=10189.40,
>>>       stdev=2009.86
>>>
>>>           lat (msec) : 10=0.01%, 20=0.01%, 50=0.01%, 100=0.02%,
>>>       250=0.08%
>>>
>>>           lat (msec) : 500=9.26%, 750=83.21%, 1000=4.09%
>>>
>>>         cpu          : usr=0.22%, sys=0.55%, ctx=719279, majf=0,
>>>       minf=433
>>>
>>>         IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%,
>>>       32=99.8%, >=64=0.0%
>>>
>>>            submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%,
>>>       64=0.0%, >=64=0.0%
>>>
>>>            complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%,
>>>       64=0.0%, >=64=0.0%
>>>
>>>            issued    : total=r=0/w=242203/d=0, short=r=0/w=0/d=0,
>>>       drop=r=0/w=0/d=0
>>>
>>>            latency   : target=0, window=0, percentile=100.00%, depth=32
>>>
>>>       Run status group 0 (all jobs):
>>>
>>>         WRITE: io=47305MB, aggrb=161367KB/s, minb=161367KB/s,
>>>       maxb=161367KB/s, mint=300189msec, maxt=300189msec
>>>
>>>       Disk stats (read/write):
>>>
>>>         rbd0: ios=1/287809, merge=0/18393, ticks=50/5887593,
>>>       in_queue=5887504, util=99.91%
>>>
>>>       The 4k block performance does not interest me so much but is
>>>       given as a reference. I am more looking for throughput, but anyway, 
>>> the
>>>       numbers seem quite low.
>>>
>>>       Let's take IOPS, assuming the spinners can do 50 (4k) synced
>>>       sustained IOPS (I hope they can do more ^^), we should be around 
>>> 50x84/3 =
>>>       1400 IOPS, which is far from rados bench (538) and fio (847). And
>>>       surprisingly fio numbers are greater than rados.
>>>
>>>       So I don't know wether I am missing something here or if
>>>       something is going wrong (maybe both !).
>>>
>>>       Any input would be very valuable.
>>>
>>>       Thank you,
>>>
>>>       Adrien << File: ATT00001.txt >>
>>>
>>>
>>>
>>
>>
>> --
>>
>> -----------------------------------------------------------------------------------------
>> Adrien GILLARD
>>
>> +33 (0)6 29 06 16 31
>> [email protected]
>>
>
>
>
> --
>
> -----------------------------------------------------------------------------------------
> Adrien GILLARD
>
> +33 (0)6 29 06 16 31
> [email protected]
> _______________________________________________
> ceph-users mailing list
> [email protected]
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
> _______________________________________________
> ceph-users mailing list
> [email protected]
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>


-- 
-----------------------------------------------------------------------------------------
Adrien GILLARD

+33 (0)6 29 06 16 31
[email protected]

_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] New cluster performance analysis

Reply via email to