I really need some help here :(

I replaced all of the 7.2k rpm SAS disks with new Samsung 840 EVO 512GB SSDs,
with no separate journal disk. Both OSD nodes now have 2 SSDs each, with a
replica count of *2*.
The total number of OSD processes in the cluster is *4*, all on SSDs.


But throughput has gone down from 1.4 MB/s to 1.3 MB/s for 4K writes, and
for 4M writes it has gone down from 140 MB/s to 126 MB/s.
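
For what it's worth, the raw synchronous small-write behaviour of one of these
SSDs can be checked outside Ceph with dd and oflag=dsync, which is close to the
write pattern the journal produces (the target path below is a placeholder;
point it at a scratch file on the SSD under test):

```shell
# 4K O_DSYNC writes, one at a time -- roughly what the OSD journal does.
# /tmp/dsync-test is a placeholder; use a scratch file on the SSD itself.
dd if=/dev/zero of=/tmp/dsync-test bs=4k count=1000 oflag=dsync
```

Consumer SSDs without power-loss protection (like the 840 EVO) often manage
only a fraction of their headline speed on this test, which could explain SSD
OSDs being no faster than the old disks. Remove the scratch file afterwards.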

atop no longer shows the OSD devices as 100% busy.

However, I can see both ceph-osd processes in atop with 53% and 47% disk
utilization:

  PID      RDDSK      WRDSK     WCANCL     DSK    CMD
20771         0K     648.8M         0K     53%    ceph-osd
19547         0K     576.7M         0K     47%    ceph-osd


OSD disk (SSD) utilization from atop:

DSK | sdc | busy 6% | read 0 | write 517 | KiB/r 0 | KiB/w 293 | MBr/s 0.00 | MBw/s 148.18 | avq 9.44 | avio 0.12 ms |

DSK | sdd | busy 5% | read 0 | write 336 | KiB/r 0 | KiB/w 292 | MBr/s 0.00 | MBw/s 96.12 | avq 7.62 | avio 0.15 ms |


Queue depth of the OSD disks:

 cat /sys/block/sdd/device/queue_depth
256

atop inside the virtual machine [4 CPU / 3GB RAM]:

DSK | vdc | busy 96% | read 0 | write 256 | KiB/r 0 | KiB/w 512 | MBr/s 0.00 | MBw/s 128.00 | avq 7.96 | avio 3.77 ms |


Both guest and host are using the deadline I/O scheduler.
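
The scheduler actually in effect can be confirmed from sysfs on both host and
guest; the bracketed name in the output is the active one (device names will
differ per machine):

```shell
# Print the active I/O scheduler (shown in brackets) for each block device.
for f in /sys/block/*/queue/scheduler; do
    [ -e "$f" ] || continue   # skip if the glob matched nothing
    printf '%s: %s\n' "$f" "$(cat "$f")"
done
```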


Virtual Machine Configuration:

    <disk type='network' device='disk'>
      <driver name='qemu' type='raw' cache='writeback'/>
      <auth username='compute'>
        <secret type='ceph' uuid='a5d0dd94-57c4-ae55-ffe0-7e3732a24455'/>
      </auth>
      <source protocol='rbd'
name='volumes/volume-449da0e7-6223-457c-b2c6-b5e112099212'>
        <host name='172.16.1.8' port='6789'/>
        <host name='172.16.1.11' port='6789'/>
        <host name='172.16.1.12' port='6789'/>
      </source>
      <target dev='vdb' bus='virtio'/>
      <serial>449da0e7-6223-457c-b2c6-b5e112099212</serial>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x07'
function='0x0'/>
    </disk>



ceph.conf

 cat /etc/ceph/ceph.conf

[global]
fsid = c4e1a523-9017-492e-9c30-8350eba1bd51
mon_initial_members = node-16 node-30 node-31
mon_host = 172.16.1.11 172.16.1.12 172.16.1.8
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
log_to_syslog_level = info
log_to_syslog = True
osd_pool_default_size = 2
osd_pool_default_min_size = 1
osd_pool_default_pg_num = 64
public_network = 172.16.1.0/24
log_to_syslog_facility = LOG_LOCAL0
osd_journal_size = 2048
auth_supported = cephx
osd_pool_default_pgp_num = 64
osd_mkfs_type = xfs
cluster_network = 172.16.1.0/24
osd_recovery_max_active = 1
osd_max_backfills = 1


[client]
rbd_cache_writethrough_until_flush = True
rbd_cache = True

[client.radosgw.gateway]
rgw_keystone_accepted_roles = _member_, Member, admin, swiftoperator
keyring = /etc/ceph/keyring.radosgw.gateway
rgw_frontends = fastcgi socket_port=9000 socket_host=127.0.0.1
rgw_socket_path = /tmp/radosgw.sock
rgw_keystone_revocation_interval = 1000000
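
One thing that might help narrow this down is benchmarking the pool directly
with rados bench from one of the cluster nodes, taking QEMU and librbd out of
the path entirely. A sketch, assuming the "volumes" pool from the libvirt XML
above (guarded so it is a no-op on machines without the rados CLI):

```shell
# 4K writes, 16 concurrent ops, 30 seconds, straight to the "volumes" pool.
# Rerun with -b 4194304 to compare 4M-object throughput.
if command -v rados >/dev/null 2>&1; then
    rados bench -p volumes 30 write -b 4096 -t 16 --no-cleanup
    rados -p volumes cleanup      # remove the benchmark objects
    status=ran
else
    echo "rados CLI not found; run this on a cluster node"
    status=skipped
fi
```

If rados bench is fast while the VM is slow, the problem is in the QEMU/librbd
layer rather than the OSDs.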

Any guidance on where to look for issues would be much appreciated.

Regards,
Kevin

On Fri, Jan 6, 2017 at 4:42 PM, kevin parrikar <[email protected]>
wrote:

> Thanks, Christian, for your valuable comments; each one is a new learning
> for me.
> Please see my replies inline.
>
> On Fri, Jan 6, 2017 at 9:32 AM, Christian Balzer <[email protected]> wrote:
>
>>
>> Hello,
>>
>> On Fri, 6 Jan 2017 08:40:36 +0530 kevin parrikar wrote:
>>
>> > Hello All,
>> >
>> > I have set up a Ceph cluster based on the 0.94.6 release on 2 servers,
>> > each with an 80GB Intel S3510 and 2x 3TB 7.2k rpm SATA disks, 16 CPUs,
>> > 24GB RAM, connected to a 10G switch, with a replica count of 2 [I will
>> > add 3 more servers to the cluster], plus 3 separate monitor nodes which
>> > are VMs.
>> >
>> I'd go to the latest Hammer; the version you're on has a lethal cache-tier
>> bug, should you decide to try that feature.
>>
>> 80GB Intel DC S3510s are a) slow and b) rated for only 0.3 DWPD.
>> You're going to wear those out quickly, and if they aren't replaced in
>> time you'll lose data.
>>
>> 2 HDDs give you a theoretical speed of something like 300MB/s sustained;
>> when used as OSDs I'd expect the usual 50-60MB/s per OSD due to
>> seeks, journal (file system) and leveldb overheads.
>> Which perfectly matches your results.
>>
>
> Hmmm, that makes sense; it's hitting the 7.2k rpm OSDs' peak write speed.
> I was under the assumption that the flush from the SSD journal to the OSD
> would happen slowly at a later time, and that I could therefore use slower
> and cheaper disks for the OSDs. But in practice, the many articles on the
> internet that talk about a fast journal with slow OSDs don't seem to be
> correct.
>
> Will adding more OSD disks per node improve the overall performance?
>
> I can add 4 more disks to each node, but they are all 7.2k rpm disks. I am
> expecting some kind of parallel writes across these disks to magically
> improve performance :D
>
> This is my second experiment with Ceph; last time I gave up and purchased
> another costly solution from a vendor. But this time I am determined to
> fix all the issues and bring up a solid cluster.
> Last time the cluster was giving a throughput of around 900 KB/s for 1G
> writes from a virtual machine; now things have improved to 1.4 MB/s, but
> that is still far slower than the target of 24 MB/s.
>
> Expecting to make some progress with the help of experts here :)
>
>>
>> > rbd_cache is enabled in the configuration; XFS filesystem; LSI 92465-4i
>> > RAID card with 512MB cache [the SSD is in writeback mode with BBU].
>> >
>> > Before installing Ceph, I tried to check the max throughput of the Intel
>> > S3500 80GB SSD using a block size of 4M [I read somewhere that Ceph uses
>> > 4M objects], and it was giving 220MB/s {dd if=/dev/zero of=/dev/sdb
>> > bs=4M count=1000 oflag=direct}.
>> >
>> Irrelevant; sustained sequential writes will be limited by what your OSDs
>> (HDDs) can sustain.
>>
>> > *Observation:*
>> > Now the cluster is up and running, and from the VM I am trying to write
>> > a 4GB file to its volume using dd if=/dev/zero of=/dev/sdb bs=4M
>> > count=1000 oflag=direct. It takes around 39 seconds to write.
>> >
>> > During this time the SSD journal showed a disk write rate of 104MB/s on
>> > both Ceph servers (dstat sdb), and the compute node showed a network
>> > transfer rate of ~110MB/s on its 10G storage interface (dstat -nN eth2).
>> >
>> As I said, sounds about right.
>>
>> >
>> > My questions are:
>> >
>> >
>> >    - Is this the best throughput Ceph can offer, or can anything in my
>> >    environment be optimised to get more performance? [iperf shows a max
>> >    throughput of 9.8Gbit/s]
>> >
>> Not your network.
>>
>> Watch your nodes with atop and you will note that your HDDs are maxed out.
>>
>> >
>> >
>> >    - I guess the network/SSD is under-utilized and can handle more
>> >    writes; how can this be improved to send more data over the network
>> >    to the SSD?
>> >
>> As jiajia wrote, a cache-tier might give you some speed boosts.
>> But with those SSDs I'd advise against it: they are both too small and
>> too low in endurance.
>>
>> >
>> >
>> >    - The rbd kernel module wasn't loaded on the compute node; I loaded
>> >    it manually using "modprobe" and later destroyed/re-created the VMs,
>> >    but this does not give any performance boost. So librbd and kernel
>> >    RBD are equally fast?
>> >
>> Irrelevant and confusing.
>> Your VMs will use one or the other depending on how they are configured.
>>
>> >
>> >
>> >    - The Samsung 840 EVO 512GB shows a throughput of 500MB/s for 4M
>> >    writes [dd if=/dev/zero of=/dev/sdb bs=4M count=1000 oflag=direct],
>> >    and for 4KB writes it was equally fast as the Intel S3500 80GB. Does
>> >    changing my SSD from the Intel S3500 to the Samsung 840 EVO 512GB
>> >    make any performance difference here, just because the 840 EVO is
>> >    faster for 4M writes? Can Ceph utilize this extra speed?
>> >
>> Those SSDs would be an even worse choice for endurance/reliability
>> reasons, though their larger size offsets that a bit.
>>
>> Unless you have a VERY good understanding and data on how much your
>> cluster is going to write, pick at the very least SSDs with 3+ DWPD
>> endurance like the DC S3610s.
>> In very lightly loaded cases a DC S3520 with 1 DWPD may be OK, but again,
>> you need to know what you're doing here.
>>
>> Christian
>> >
>> > Can somebody help me understand this better.
>> >
>> > Regards,
>> > Kevin
>>
>>
>> --
>> Christian Balzer        Network/Systems Engineer
>> [email protected]           Global OnLine Japan/Rakuten Communications
>> http://www.gol.com/
>>
>
>
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
