Thanks Maged for your suggestion.
I have executed rbd bench and here is the result; please have a look at it:
rbd bench-write image01 --pool=rbd --io-threads=32 --io-size 4096
--io-pattern rand --rbd_cache=false
bench-write io_size 4096 io_threads 32 bytes 1073741824 pattern rand
SEC OPS OPS/SEC BYTES/SEC
1 4750 4750.19 19456758.28
2 7152 3068.49 12568516.09
4 7220 1564.41 6407837.20
5 8941 1794.35 7349666.74
6 11938 1994.94 8171294.61
7 12932 1365.21 5591891.85
^C
Not sure why it skipped "3" in the SEC column.
I suppose this also shows slow performance.
Any idea where the issue could be?
I use an LSI 9260-4i controller (firmware 12.13.0-0154) on both nodes with
write-back enabled. I am not sure if this controller is suitable for Ceph.
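
As a sanity check on the controller itself, I am thinking of querying its
cache and battery state; assuming the MegaCli tool is installed (the binary
name and path vary, e.g. /opt/MegaRAID/MegaCli/MegaCli64), something like:

MegaCli64 -LDGetProp -Cache -LAll -aAll
MegaCli64 -AdpBbuCmd -GetBbuStatus -aAll

should show whether write-back is actually in effect and whether the BBU is
healthy.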
Regards,
Kevin
On Sat, Jan 7, 2017 at 1:23 PM, Maged Mokhtar <[email protected]> wrote:
> The numbers are very low. I would first benchmark the system without the
> VM client, using an rbd 4k test such as:
>
> rbd bench-write image01 --pool=rbd --io-threads=32 --io-size 4096
> --io-pattern rand --rbd_cache=false
>
>
>
> -------- Original message --------
> From: kevin parrikar <[email protected]>
> Date: 07/01/2017 05:48 (GMT+02:00)
> To: Christian Balzer <[email protected]>
> Cc: [email protected]
> Subject: Re: [ceph-users] Analysing ceph performance with SSD journal,
> 10gbe NIC and 2 replicas -Hammer release
>
> I really need some help here :(
>
> Replaced all 7.2k rpm SAS disks with new Samsung 840 EVO 512GB SSDs, with
> no separate journal disk. Now both OSD nodes have 2 SSD disks each, with a
> replica count of *2*.
> The total number of OSD processes in the cluster is *4*, all SSD.
>
>
> But throughput has gone down from 1.4 MB/s to 1.3 MB/s for 4k writes, and
> for 4M it has gone down from 140 MB/s to 126 MB/s.
>
> Now atop no longer shows the OSD devices as 100% busy.
>
> However, I can see both ceph-osd processes in atop with 53% and 47% disk
> utilization.
>
> PID    RDDSK  WRDSK   WCANCL  DSK  CMD
> 20771  0K     648.8M  0K      53%  ceph-osd
> 19547  0K     576.7M  0K      47%  ceph-osd
>
>
> OSD disk (SSD) utilization from atop:
>
> DSK | sdc | busy 6% | read 0 | write 517 | KiB/r 0 | KiB/w 293 | MBr/s 0.00 | MBw/s 148.18 | avq 9.44 | avio 0.12 ms |
>
> DSK | sdd | busy 5% | read 0 | write 336 | KiB/r 0 | KiB/w 292 | MBr/s 0.00 | MBw/s 96.12 | avq 7.62 | avio 0.15 ms |
>
>
> Queue depth of the OSD disks:
>
> cat /sys/block/sdd/device/queue_depth
> 256
>
> atop inside the virtual machine [4 CPU / 3GB RAM]:
>
> DSK | vdc | busy 96% | read 0 | write 256 | KiB/r 0 | KiB/w 512 | MBr/s 0.00 | MBw/s 128.00 | avq 7.96 | avio 3.77 ms |
>
>
> Both guest and host are using the deadline I/O scheduler.
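>
> For reference, this can be double-checked with something like the following
> (sdc and vdc being my host and guest device names):
>
> cat /sys/block/sdc/queue/scheduler
> cat /sys/block/vdc/queue/scheduler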
>
>
> Virtual Machine Configuration:
>
> <disk type='network' device='disk'>
> <driver name='qemu' type='raw' cache='writeback'/>
> <auth username='compute'>
> <secret type='ceph' uuid='a5d0dd94-57c4-ae55-ffe0-7e3732a24455'/>
> </auth>
> <source protocol='rbd' name='volumes/volume-449da0e7-6223-457c-b2c6-b5e112099212'>
> <host name='172.16.1.8' port='6789'/>
> <host name='172.16.1.11' port='6789'/>
> <host name='172.16.1.12' port='6789'/>
> </source>
> <target dev='vdb' bus='virtio'/>
> <serial>449da0e7-6223-457c-b2c6-b5e112099212</serial>
> <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/>
> </disk>
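>
> To confirm which cache mode is actually in effect for a running guest, the
> live XML can be dumped with something like the following, where <domain> is
> a placeholder for the instance name:
>
> virsh dumpxml <domain> | grep cache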
>
>
>
> ceph.conf
>
> cat /etc/ceph/ceph.conf
>
> [global]
> fsid = c4e1a523-9017-492e-9c30-8350eba1bd51
> mon_initial_members = node-16 node-30 node-31
> mon_host = 172.16.1.11 172.16.1.12 172.16.1.8
> auth_cluster_required = cephx
> auth_service_required = cephx
> auth_client_required = cephx
> filestore_xattr_use_omap = true
> log_to_syslog_level = info
> log_to_syslog = True
> osd_pool_default_size = 2
> osd_pool_default_min_size = 1
> osd_pool_default_pg_num = 64
> public_network = 172.16.1.0/24
> log_to_syslog_facility = LOG_LOCAL0
> osd_journal_size = 2048
> auth_supported = cephx
> osd_pool_default_pgp_num = 64
> osd_mkfs_type = xfs
> cluster_network = 172.16.1.0/24
> osd_recovery_max_active = 1
> osd_max_backfills = 1
>
>
> [client]
> rbd_cache_writethrough_until_flush = True
> rbd_cache = True
>
> [client.radosgw.gateway]
> rgw_keystone_accepted_roles = _member_, Member, admin, swiftoperator
> keyring = /etc/ceph/keyring.radosgw.gateway
> rgw_frontends = fastcgi socket_port=9000 socket_host=127.0.0.1
> rgw_socket_path = /tmp/radosgw.sock
> rgw_keystone_revocation_interval = 1000000
>
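> To take the VM out of the picture, I will also try a native 4k benchmark
> from one of the nodes against my rbd pool, something like:
>
> rados bench -p rbd 60 write -b 4096 -t 32
>
> and watch per-OSD commit/apply latency with "ceph osd perf" while it runs.
>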
> Any guidance on where to look for issues would be appreciated.
>
> Regards,
> Kevin
>
> On Fri, Jan 6, 2017 at 4:42 PM, kevin parrikar <[email protected]>
> wrote:
>
>> Thanks Christian for your valuable comments; each comment is a new
>> learning for me.
>> Please see my replies inline.
>>
>> On Fri, Jan 6, 2017 at 9:32 AM, Christian Balzer <[email protected]> wrote:
>>
>>>
>>> Hello,
>>>
>>> On Fri, 6 Jan 2017 08:40:36 +0530 kevin parrikar wrote:
>>>
>>> > Hello All,
>>> >
>>> > I have set up a Ceph cluster based on the 0.94.6 release on 2 servers,
>>> > each with an 80GB Intel S3510 and 2x 3TB 7.2k SATA disks, 16 CPUs, 24GB
>>> > RAM, connected to a 10G switch with a replica count of 2 [I will add 3
>>> > more servers to the cluster], and 3 separate monitor nodes which are VMs.
>>> >
>>> I'd go to the latest Hammer; this version has a lethal cache-tier bug if
>>> you should decide to try that.
>>>
>>> 80GB Intel DC S3510s are a) slow and b) have only 0.3 DWPD.
>>> You're going to wear those out quickly and, if they are not replaced in
>>> time, lose data.
>>>
>>> 2 HDDs give you a theoretical speed of something like 300MB/s sustained;
>>> when used as OSDs I'd expect the usual 50-60MB/s per OSD due to seeks,
>>> journal (file system) and leveldb overheads.
>>> Which perfectly matches your results.
>>>
>>
>> Hmmm, that makes sense; it's hitting the 7.2k rpm OSDs' peak write speed.
>> I was under the assumption that the journal would flush to the OSDs slowly
>> at a later time, and hence that I could use slower and cheaper disks for
>> the OSDs. But in practice, the many articles on the internet that talk
>> about a fast journal in front of slower OSDs don't seem to be correct.
>>
>> Will adding more OSD disks per node improve the overall performance?
>>
>> I can add 4 more disks to each node, but all are 7.2k rpm disks. I am
>> expecting some kind of parallel writes across these disks that magically
>> improve performance :D
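>>
>> To check whether writes actually spread evenly across the OSDs, I suppose
>> I can watch dstat on both nodes during a test, and look at the per-OSD
>> distribution with (assuming "ceph osd df" exists in my Hammer release):
>>
>> ceph osd df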
>>
>> This is my second experiment with Ceph; last time I gave up and purchased
>> another costly solution from a vendor. But this time I am determined to
>> fix all issues and bring up a solid cluster.
>> Last time the cluster was giving a throughput of around 900 KB/s for 1G
>> writes from a virtual machine; now things have improved and it's giving
>> 1.4 MB/s, but that is still far slower than the target of 24 MB/s.
>>
>> Expecting to make some progress with the help of experts here :)
>>
>>>
>>> > rbd_cache is enabled in the configuration, XFS filesystem, LSI 9260-4i
>>> > raid card with 512MB cache [the SSD is in writeback mode with BBU]
>>> >
>>> >
>>> > Before installing Ceph, I tried to check the max throughput of the
>>> > Intel 3500 80G SSD using a block size of 4M [I read somewhere that Ceph
>>> > uses 4M objects] and it was giving 220MB/s {dd if=/dev/zero of=/dev/sdb
>>> > bs=4M count=1000 oflag=direct}
>>> >
>>> Irrelevant, sustained sequential writes will be limited by what your OSDs
>>> (HDDs) can sustain.
>>>
>>> > *Observation:*
>>> > Now the cluster is up and running, and from the VM I am trying to write
>>> > a 4G file to its volume using dd if=/dev/zero of=/dev/sdb bs=4M
>>> > count=1000 oflag=direct. It takes around 39 seconds to write
>>> > (~105 MB/s).
>>> >
>>> > During this time the SSD journal was showing a disk write rate of 104M
>>> > on both ceph servers (dstat sdb), and the compute node a network
>>> > transfer rate of ~110M on its 10G storage interface (dstat -nN eth2).
>>> >
>>> As I said, sounds about right.
>>>
>>> >
>>> > my questions are:
>>> >
>>> >
>>> > - Is this the best throughput Ceph can offer, or can anything in my
>>> > environment be optimised to get more performance? [iperf shows a max
>>> > throughput of 9.8Gbits/s]
>>> >
>>> Not your network.
>>>
>>> Watch your nodes with atop and you will note that your HDDs are maxed
>>> out.
>>>
>>> >
>>> >
>>> > - I guess the network/SSD is underutilized and can handle more writes;
>>> > how can this be improved to send more data over the network to the SSD?
>>> >
>>> As jiajia wrote, a cache-tier might give you some speed boosts.
>>> But with those SSDs I'd advise against it, both too small and too low
>>> endurance.
>>>
>>> >
>>> >
>>> > - The rbd kernel module wasn't loaded on the compute node; I loaded it
>>> > manually using "modprobe" and later destroyed/re-created the VMs, but
>>> > this does not give any performance boost. So are librbd and kernel RBD
>>> > equally fast?
>>> >
>>> Irrelevant and confusing.
>>> Your VMs will use one or the other depending on how they are configured.
>>>
>>> >
>>> >
>>> > - The Samsung 840 EVO 512GB shows a throughput of 500MB/s for 4M writes
>>> > [dd if=/dev/zero of=/dev/sdb bs=4M count=1000 oflag=direct], and for 4KB
>>> > it was equally fast as the Intel S3500 80GB. Does changing my SSD from
>>> > the Intel S3500 100GB to the Samsung 840 500GB make any performance
>>> > difference here, just because the 840 EVO is faster for 4M writes? Can
>>> > Ceph utilize this extra speed?
>>> >
>>> Those SSDs would be an even worse choice for endurance/reliability
>>> reasons, though their larger size offsets that a bit.
>>>
>>> Unless you have a VERY good understanding of, and data on, how much your
>>> cluster is going to write, pick at the very least SSDs with 3+ DWPD
>>> endurance like the DC S3610s (for scale, 0.3 DWPD on an 80GB drive works
>>> out to only ~24GB of writes per day).
>>> In very lightly loaded cases a DC S3520 with 1 DWPD may be OK, but again,
>>> you need to know what you're doing here.
>>>
>>> Christian
>>> >
>>> > Can somebody help me understand this better.
>>> >
>>> > Regards,
>>> > Kevin
>>>
>>>
>>> --
>>> Christian Balzer Network/Systems Engineer
>>> [email protected] Global OnLine Japan/Rakuten Communications
>>> http://www.gol.com/
>>>
>>
>>
>