Thanks Maged for your suggestion.
I have executed rbd bench and here is the result; please have a look at it:
rbd bench-write image01 --pool=rbd --io-threads=32 --io-size 4096
--io-pattern rand --rbd_cache=false
bench-write io_size 4096 io_threads 32 bytes 1073741824 pattern rand
SEC OPS OPS/SEC BYTES/SEC
1 4750 4750.19 19456758.28
2 7152 3068.49 12568516.09
4 7220 1564.41 6407837.20
5 8941 1794.35 7349666.74
6 11938 1994.94 8171294.61
7 12932 1365.21 5591891.85
^C
Not sure why it skipped "3" in the SEC column.
I suppose this also shows slow performance.
Any idea where the issue could be?
I use an LSI 9260-4i controller (firmware 12.13.0-0154) on both nodes with
write-back enabled. I am not sure if this controller is suitable for Ceph.
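
As a sanity check on the controller itself, I am thinking of querying its
cache and battery state; assuming the MegaCli tool is installed (the binary
name and path vary, e.g. /opt/MegaRAID/MegaCli/MegaCli64), something like:

MegaCli64 -LDGetProp -Cache -LAll -aAll
MegaCli64 -AdpBbuCmd -GetBbuStatus -aAll

should show whether write-back is actually in effect and whether the BBU is
healthy.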
Regards,
Kevin
On Sat, Jan 7, 2017 at 1:23 PM, Maged Mokhtar <[email protected]> wrote:
> The numbers are very low. I would first benchmark the system without the
> VM client, using an rbd 4k test such as:
>
> rbd bench-write image01 --pool=rbd --io-threads=32 --io-size 4096
> --io-pattern rand --rbd_cache=false
>
>
>
> -------- Original message --------
> From: kevin parrikar <[email protected]>
> Date: 07/01/2017 05:48 (GMT+02:00)
> To: Christian Balzer <[email protected]>
> Cc: [email protected]
> Subject: Re: [ceph-users] Analysing ceph performance with SSD journal,
> 10gbe NIC and 2 replicas -Hammer release
>
> I really need some help here :(
>
> Replaced all 7.2k rpm SAS disks with new Samsung 840 EVO 512GB SSDs, with
> no separate journal disk. Now both OSD nodes have 2 SSD disks each, with a
> replica count of *2*.
> The total number of OSD processes in the cluster is *4*, all SSD.
>
>
> But throughput has gone down from 1.4 MB/s to 1.3 MB/s for 4k writes, and
> for 4M it has gone down from 140 MB/s to 126 MB/s.
>
> Now atop no longer shows the OSD devices as 100% busy.
>
> However, I can see both ceph-osd processes in atop with 53% and 47% disk
> utilization.
>
> PID    RDDSK  WRDSK   WCANCL  DSK  CMD
> 20771  0K     648.8M  0K      53%  ceph-osd
> 19547  0K     576.7M  0K      47%  ceph-osd
>
>
> OSD disk (SSD) utilization from atop:
>
> DSK | sdc | busy 6% | read 0 | write 517 | KiB/r 0 | KiB/w 293 | MBr/s 0.00 | MBw/s 148.18 | avq 9.44 | avio 0.12 ms |
>
> DSK | sdd | busy 5% | read 0 | write 336 | KiB/r 0 | KiB/w 292 | MBr/s 0.00 | MBw/s 96.12 | avq 7.62 | avio 0.15 ms |
>
>
> Queue depth of the OSD disks:
>
> cat /sys/block/sdd/device/queue_depth
> 256
>
> atop inside the virtual machine [4 CPU / 3GB RAM]:
>
> DSK | vdc | busy 96% | read 0 | write 256 | KiB/r 0 | KiB/w 512 | MBr/s 0.00 | MBw/s 128.00 | avq 7.96 | avio 3.77 ms |
>
>
> Both guest and host are using the deadline I/O scheduler.
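>
> For reference, this can be double-checked with something like the following
> (sdc and vdc being my host and guest device names):
>
> cat /sys/block/sdc/queue/scheduler
> cat /sys/block/vdc/queue/scheduler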
>
>
> Virtual Machine Configuration:
>
> <disk type='network' device='disk'>
> <driver name='qemu' type='raw' cache='writeback'/>
> <auth username='compute'>
> <secret type='ceph' uuid='a5d0dd94-57c4-ae55-ffe0-7e3732a24455'/>
> </auth>
> <source protocol='rbd' name='volumes/volume-449da0e7-6223-457c-b2c6-b5e112099212'>
> <host name='172.16.1.8' port='6789'/>
> <host name='172.16.1.11' port='6789'/>
> <host name='172.16.1.12' port='6789'/>
> </source>
> <target dev='vdb' bus='virtio'/>
> <serial>449da0e7-6223-457c-b2c6-b5e112099212</serial>
> <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/>
> </disk>
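>
> To confirm which cache mode is actually in effect for a running guest, the
> live XML can be dumped with something like the following, where <domain> is
> a placeholder for the instance name:
>
> virsh dumpxml <domain> | grep cache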
>
>
>
> ceph.conf
>
> cat /etc/ceph/ceph.conf
>
> [global]
> fsid = c4e1a523-9017-492e-9c30-8350eba1bd51
> mon_initial_members = node-16 node-30 node-31
> mon_host = 172.16.1.11 172.16.1.12 172.16.1.8
> auth_cluster_required = cephx
> auth_service_required = cephx
> auth_client_required = cephx
> filestore_xattr_use_omap = true
> log_to_syslog_level = info
> log_to_syslog = True
> osd_pool_default_size = 2
> osd_pool_default_min_size = 1
> osd_pool_default_pg_num = 64
> public_network = 172.16.1.0/24
> log_to_syslog_facility = LOG_LOCAL0
> osd_journal_size = 2048
> auth_supported = cephx
> osd_pool_default_pgp_num = 64
> osd_mkfs_type = xfs
> cluster_network = 172.16.1.0/24
> osd_recovery_max_active = 1
> osd_max_backfills = 1
>
>
> [client]
> rbd_cache_writethrough_until_flush = True
> rbd_cache = True
>
> [client.radosgw.gateway]
> rgw_keystone_accepted_roles = _member_, Member, admin, swiftoperator
> keyring = /etc/ceph/keyring.radosgw.gateway
> rgw_frontends = fastcgi socket_port=9000 socket_host=127.0.0.1
> rgw_socket_path = /tmp/radosgw.sock
> rgw_keystone_revocation_interval = 1000000
>
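> To take the VM out of the picture, I will also try a native 4k benchmark
> from one of the nodes against my rbd pool, something like:
>
> rados bench -p rbd 60 write -b 4096 -t 32
>
> and watch per-OSD commit/apply latency with "ceph osd perf" while it runs.
>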
> Any guidance on where to look for issues would be appreciated.
>
> Regards,
> Kevin
>
> On Fri, Jan 6, 2017 at 4:42 PM, kevin parrikar <[email protected]>
> wrote:
>
>> Thanks Christian for your valuable comments; each comment is a new
>> learning for me.
>> Please see my replies inline.
>>
>> On Fri, Jan 6, 2017 at 9:32 AM, Christian Balzer <[email protected]> wrote:
>>
>>>
>>> Hello,
>>>
>>> On Fri, 6 Jan 2017 08:40:36 +0530 kevin parrikar wrote:
>>>
>>> > Hello All,
>>> >
>>> > I have set up a Ceph cluster based on the 0.94.6 release on 2 servers,
>>> > each with an 80GB Intel S3510 and 2x 3TB 7.2k SATA disks, 16 CPUs, 24GB
>>> > RAM, connected to a 10G switch with a replica count of 2 [I will add 3
>>> > more servers to the cluster], and 3 separate monitor nodes which are VMs.
>>> >
>>> I'd go to the latest Hammer; this version has a lethal cache-tier bug if
>>> you should decide to try that.
>>>
>>> 80GB Intel DC S3510s are a) slow and b) have only 0.3 DWPD.
>>> You're going to wear those out quickly and, if they are not replaced in
>>> time, lose data.
>>>
>>> 2 HDDs give you a theoretical speed of something like 300MB/s sustained;
>>> when used as OSDs I'd expect the usual 50-60MB/s per OSD due to seeks,
>>> journal (file system) and leveldb overheads.
>>> Which perfectly matches your results.
>>>
>>
>> Hmmm, that makes sense; it's hitting the 7.2k rpm OSDs' peak write speed.
>> I was under the assumption that the journal would flush to the OSDs slowly
>> at a later time, and hence that I could use slower and cheaper disks for
>> the OSDs. But in practice, the many articles on the internet that talk
>> about a fast journal in front of slower OSDs don't seem to be correct.
>>
>> Will adding more OSD disks per node improve the overall performance?
>>
>> I can add 4 more disks to each node, but all are 7.2k rpm disks. I am
>> expecting some kind of parallel writes across these disks that magically
>> improve performance :D
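>>
>> To check whether writes actually spread evenly across the OSDs, I suppose
>> I can watch dstat on both nodes during a test, and look at the per-OSD
>> distribution with (assuming "ceph osd df" exists in my Hammer release):
>>
>> ceph osd df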
>>
>> This is my second experiment with Ceph; last time I gave up and purchased
>> another costly solution from a vendor. But this time I am determined to
>> fix all issues and bring up a solid cluster.
>> Last time the cluster was giving a throughput of around 900 KB/s for 1G
>> writes from a virtual machine; now things have improved and it's giving
>> 1.4 MB/s, but that is still far slower than the target of 24 MB/s.
>>
>> Expecting to make some progress with the help of experts here :)
>>
>>>
>>> > rbd_cache is enabled in the configuration, XFS filesystem, LSI 9260-4i
>>> > raid card with 512MB cache [the SSD is in writeback mode with BBU]
>>> >
>>> >
>>> > Before installing Ceph, I tried to check the max throughput of the
>>> > Intel 3500 80G SSD using a block size of 4M [I read somewhere that Ceph
>>> > uses 4M objects] and it was giving 220MB/s {dd if=/dev/zero of=/dev/sdb
>>> > bs=4M count=1000 oflag=direct}
>>> >
>>> Irrelevant, sustained sequential writes will be limited by what your OSDs
>>> (HDDs) can sustain.
>>>
>>> > *Observation:*
>>> > Now the cluster is up and running, and from the VM I am trying to write
>>> > a 4G file to its volume using dd if=/dev/zero of=/dev/sdb bs=4M
>>> > count=1000 oflag=direct. It takes around 39 seconds to write
>>> > (~105 MB/s).
>>> >
>>> > During this time the SSD journal was showing a disk write rate of 104M
>>> > on both ceph servers (dstat sdb), and the compute node a network
>>> > transfer rate of ~110M on its 10G storage interface (dstat -nN eth2).
>>> >
>>> As I said, sounds about right.
>>>
>>> >
>>> > my questions are:
>>> >
>>> >
>>> > - Is this the best throughput Ceph can offer, or can anything in my
>>> > environment be optimised to get more performance? [iperf shows a max
>>> > throughput of 9.8Gbits/s]
>>> >
>>> Not your network.
>>>
>>> Watch your nodes with atop and you will note that your HDDs are maxed
>>> out.
>>>
>>> >
>>> >
>>> > - I guess the network/SSD is underutilized and can handle more writes;
>>> > how can this be improved to send more data over the network to the SSD?
>>> >
>>> As jiajia wrote, a cache-tier might give you some speed boosts.
>>> But with those SSDs I'd advise against it, both too small and too low
>>> endurance.
>>>
>>> >
>>> >
>>> > - The rbd kernel module wasn't loaded on the compute node; I loaded it
>>> > manually using "modprobe" and later destroyed/re-created the VMs, but
>>> > this does not give any performance boost. So are librbd and kernel RBD
>>> > equally fast?
>>> >
>>> Irrelevant and confusing.
>>> Your VMs will use one or the other depending on how they are configured.
>>>
>>> >
>>> >
>>> > - The Samsung 840 EVO 512GB shows a throughput of 500MB/s for 4M writes
>>> > [dd if=/dev/zero of=/dev/sdb bs=4M count=1000 oflag=direct], and for 4KB
>>> > it was equally fast as the Intel S3500 80GB. Does changing my SSD from
>>> > the Intel S3500 100GB to the Samsung 840 500GB make any performance
>>> > difference here, just because the 840 EVO is faster for 4M writes? Can
>>> > Ceph utilize this extra speed?
>>> >
>>> Those SSDs would be an even worse choice for endurance/reliability
>>> reasons, though their larger size offsets that a bit.
>>>
>>> Unless you have a VERY good understanding of, and data on, how much your
>>> cluster is going to write, pick at the very least SSDs with 3+ DWPD
>>> endurance like the DC S3610s (for scale, 0.3 DWPD on an 80GB drive works
>>> out to only ~24GB of writes per day).
>>> In very lightly loaded cases a DC S3520 with 1 DWPD may be OK, but again,
>>> you need to know what you're doing here.
>>>
>>> Christian
>>> >
>>> > Can somebody help me understand this better.
>>> >
>>> > Regards,
>>> > Kevin
>>>
>>>
>>> --
>>> Christian Balzer Network/Systems Engineer
>>> [email protected] Global OnLine Japan/Rakuten Communications
>>> http://www.gol.com/
>>>
>>
>>
>