I really need some help here :(
I replaced all the 7.2k rpm SAS disks with new Samsung 840 EVO 512GB SSDs,
with no separate journal disk. Both OSD nodes now have 2 SSDs each, with a
replica count of *2*.
The total number of OSD processes in the cluster is *4*, all on SSD.
But throughput has gone down from 1.4MB/s to 1.3MB/s for 4k writes, and for
4M writes it has gone down from 140MB/s to 126MB/s.
Now atop no longer shows the OSD devices as 100% busy. However, I can see
both ceph-osd processes in atop with 53% and 47% disk utilization:
PID    RDDSK  WRDSK   WCANCL  DSK  CMD
20771  0K     648.8M  0K      53%  ceph-osd
19547  0K     576.7M  0K      47%  ceph-osd
OSD disk (SSD) utilization from atop:
DSK | sdc | busy 6% | read 0 | write 517 | KiB/r 0 | KiB/w 293 | MBr/s 0.00 | MBw/s 148.18 | avq 9.44 | avio 0.12 ms |
DSK | sdd | busy 5% | read 0 | write 336 | KiB/r 0 | KiB/w 292 | MBr/s 0.00 | MBw/s 96.12 | avq 7.62 | avio 0.15 ms |
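One thing worth checking (a minimal sketch, not something I have run yet):
Ceph journals write with O_DSYNC, and consumer SSDs without power-loss
protection are often far slower at sync writes than at plain direct writes.
The OSD mount point below is an assumption; substitute the real path.

# 4k O_DSYNC writes -- the journal-style workload
# (/var/lib/ceph/osd/ceph-0 is a guessed mount point, adjust as needed)
dd if=/dev/zero of=/var/lib/ceph/osd/ceph-0/dsync-test bs=4k count=10000 oflag=direct,dsync
rm /var/lib/ceph/osd/ceph-0/dsync-test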
Queue depth of the OSD disks:
cat /sys/block/sdd/device/queue_depth
256
atop inside the virtual machine [4 CPU / 3GB RAM]:
DSK | vdc | busy 96% | read 0 | write 256 | KiB/r 0 | KiB/w 512 | MBr/s 0.00 | MBw/s 128.00 | avq 7.96 | avio 3.77 ms |
Both guest and host are using the deadline I/O scheduler.
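To confirm the scheduler on both sides (a quick sanity check; sdc and vdc
are the device names from the atop output above):

cat /sys/block/sdc/queue/scheduler   # on the host, should print [deadline]
cat /sys/block/vdc/queue/scheduler   # inside the guest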
Virtual Machine Configuration:
<disk type='network' device='disk'>
<driver name='qemu' type='raw' cache='writeback'/>
<auth username='compute'>
<secret type='ceph' uuid='a5d0dd94-57c4-ae55-ffe0-7e3732a24455'/>
</auth>
<source protocol='rbd'
name='volumes/volume-449da0e7-6223-457c-b2c6-b5e112099212'>
<host name='172.16.1.8' port='6789'/>
<host name='172.16.1.11' port='6789'/>
<host name='172.16.1.12' port='6789'/>
</source>
<target dev='vdb' bus='virtio'/>
<serial>449da0e7-6223-457c-b2c6-b5e112099212</serial>
<address type='pci' domain='0x0000' bus='0x00' slot='0x07'
function='0x0'/>
</disk>
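To take the VM and libvirt layers out of the picture, the RBD image could
also be benchmarked directly from a client node (a sketch; 'benchtest' is a
hypothetical scratch image so the live volume stays untouched, and the pool
name is taken from the XML above):

rbd create volumes/benchtest --size 1024
rbd bench-write volumes/benchtest --io-size 4194304 --io-threads 16 --io-total 1073741824
rbd rm volumes/benchtest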
ceph.conf
cat /etc/ceph/ceph.conf
[global]
fsid = c4e1a523-9017-492e-9c30-8350eba1bd51
mon_initial_members = node-16 node-30 node-31
mon_host = 172.16.1.11 172.16.1.12 172.16.1.8
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
log_to_syslog_level = info
log_to_syslog = True
osd_pool_default_size = 2
osd_pool_default_min_size = 1
osd_pool_default_pg_num = 64
public_network = 172.16.1.0/24
log_to_syslog_facility = LOG_LOCAL0
osd_journal_size = 2048
auth_supported = cephx
osd_pool_default_pgp_num = 64
osd_mkfs_type = xfs
cluster_network = 172.16.1.0/24
osd_recovery_max_active = 1
osd_max_backfills = 1
[client]
rbd_cache_writethrough_until_flush = True
rbd_cache = True
[client.radosgw.gateway]
rgw_keystone_accepted_roles = _member_, Member, admin, swiftoperator
keyring = /etc/ceph/keyring.radosgw.gateway
rgw_frontends = fastcgi socket_port=9000 socket_host=127.0.0.1
rgw_socket_path = /tmp/radosgw.sock
rgw_keystone_revocation_interval = 1000000
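To measure what the cluster itself can sustain, bypassing RBD and the VM
entirely, something like this could be run (a sketch; 'volumes' is the pool
name from the libvirt XML above):

rados bench -p volumes 60 write -b 4194304 -t 16   # 4M sequential writes
rados bench -p volumes 60 write -b 4096 -t 16      # 4k writes
rados -p volumes cleanup                           # remove benchmark objects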
Any guidance on where to look for issues would be much appreciated.
Regards,
Kevin
On Fri, Jan 6, 2017 at 4:42 PM, kevin parrikar <[email protected]>
wrote:
> Thanks Christian for your valuable comments; each one is a new learning
> for me.
> Please see inline
>
> On Fri, Jan 6, 2017 at 9:32 AM, Christian Balzer <[email protected]> wrote:
>
>>
>> Hello,
>>
>> On Fri, 6 Jan 2017 08:40:36 +0530 kevin parrikar wrote:
>>
>> > Hello All,
>> >
>> > I have set up a Ceph cluster based on the 0.94.6 release on 2 servers,
>> > each with an 80GB Intel S3510 and 2x3TB 7.2k SATA disks, 16 CPUs, and
>> > 24GB RAM, connected to a 10G switch with a replica count of 2 [I will
>> > add 3 more servers to the cluster], plus 3 separate monitor nodes
>> > which are VMs.
>> >
>> I'd go to the latest Hammer; this version has a lethal cache-tier bug,
>> should you decide to try that feature.
>>
>> 80GB Intel DC S3510s are a) slow and b) have only 0.3 DWPD.
>> You're going to wear those out quickly, and if they are not replaced in
>> time you will lose data.
>>
>> 2 HDDs give you a theoretical speed of something like 300MB/s sustained;
>> when used as OSDs I'd expect the usual 50-60MB/s per OSD due to seeks,
>> journal (file system) and leveldb overheads.
>> Which perfectly matches your results.
>>
>
> Hmmmm, that makes sense; it's hitting the 7.2k rpm OSDs' peak write speed.
> I was under the assumption that journal-to-OSD flushing would happen
> slowly at a later time, and hence that I could use slower and cheaper
> disks for the OSDs. But in practice, the many articles on the internet
> that talk about a fast journal in front of slow OSDs don't seem to be
> correct.
>
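> Working through the numbers to check my understanding (rough arithmetic on
> the figures above): with 4 HDD OSDs at ~55MB/s each and a replica count of
> 2, every client byte lands on two OSDs, so the expected client throughput
> is about 4 x 55 / 2 = ~110MB/s, which matches the ~104M of journal writes
> I saw on each server.
>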
> Will adding more OSD disks per node improve the overall performance?
>
> I can add 4 more disks to each node, but all are 7.2k rpm disks. I am
> expecting some kind of parallel writes across these disks to magically
> improve performance :D
>
> This is my second experiment with Ceph; last time I gave up and purchased
> another costly solution from a vendor. But this time I am determined to
> fix all issues and bring up a solid cluster.
> Last time the cluster was giving a throughput of around 900KB/s for 1G
> writes from a virtual machine; now things have improved and it's giving
> 1.4MB/s, but that is still far slower than the target of 24MB/s.
>
> Expecting to make some progress with the help of experts here :)
>
>>
>> > rbd_cache is enabled in the configuration, XFS filesystem, LSI 92465-4i
>> > RAID card with 512MB cache [SSD is in writeback mode with BBU]
>> >
>> >
>> > Before installing Ceph, I tried to check the max throughput of the
>> > Intel 3500 80GB SSD using a block size of 4M [I read somewhere that
>> > Ceph uses 4M objects] and it was giving 220MB/s {dd if=/dev/zero
>> > of=/dev/sdb bs=4M count=1000 oflag=direct}
>> >
>> Irrelevant; sustained sequential writes will be limited by what your OSDs
>> (HDDs) can sustain.
>>
>> > *Observation:*
>> > Now the cluster is up and running, and from the VM I am trying to write
>> > a 4G file to its volume using dd if=/dev/zero of=/dev/sdb bs=4M
>> > count=1000 oflag=direct. It takes around 39 seconds to write.
>> >
>> > During this time the SSD journal was showing disk writes of 104M on
>> > both Ceph servers (dstat sdb), and the compute node a network transfer
>> > rate of ~110M on its 10G storage interface (dstat -nN eth2).
>> >
>> As I said, sounds about right.
>>
>> >
>> > my questions are:
>> >
>> >
>> > - Is this the best throughput Ceph can offer, or can anything in my
>> > environment be optimised to get more performance? [iperf shows a max
>> > throughput of 9.8Gbits/s]
>> >
>> Not your network.
>>
>> Watch your nodes with atop and you will note that your HDDs are maxed out.
>>
>> >
>> >
>> > - I guess the network/SSD are under-utilized and can handle more
>> > writes; how can this be improved to send more data over the network to
>> > the SSD?
>> >
>> As jiajia wrote, a cache-tier might give you some speed boosts.
>> But with those SSDs I'd advise against it; they are both too small and
>> too low in endurance.
>>
>> >
>> >
>> > - The rbd kernel module wasn't loaded on the compute node; I loaded it
>> > manually using "modprobe" and later destroyed/re-created the VMs, but
>> > this does not give any performance boost. So librbd and kernel RBD are
>> > equally fast?
>> >
>> Irrelevant and confusing.
>> Your VMs will use one or the other depending on how they are configured.
>>
>> >
>> >
>> > - The Samsung 840 EVO 512GB shows a throughput of 500MB/s for 4M
>> > writes [dd if=/dev/zero of=/dev/sdb bs=4M count=1000 oflag=direct],
>> > and for 4KB it was equally fast as the Intel S3500 80GB. Does changing
>> > my SSD from the Intel S3500 to the Samsung 840 500GB make any
>> > performance difference here, just because the 840 EVO is faster for 4M
>> > writes? Can Ceph utilize this extra speed?
>> >
>> Those SSDs would be an even worse choice for endurance/reliability
>> reasons, though their larger size offsets that a bit.
>>
>> Unless you have a VERY good understanding of, and data on, how much your
>> cluster is going to write, pick at the very least SSDs with 3+ DWPD
>> endurance, like the DC S3610s.
>> In very lightly loaded cases a DC S3520 with 1 DWPD may be OK, but again,
>> you need to know what you're doing here.
>>
>> Christian
>> >
>> > Can somebody help me understand this better.
>> >
>> > Regards,
>> > Kevin
>>
>>
>> --
>> Christian Balzer Network/Systems Engineer
>> [email protected] Global OnLine Japan/Rakuten Communications
>> http://www.gol.com/
>>
>
>
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com