On 07/04/2014 11:33 AM, Daniel Schwager wrote:
Hi,

I think the problem is the rbd device. It's only ONE device.

I fully agree. Ceph excels at parallel performance. You should run multiple fio instances in parallel on different RBD devices, and better still on different clients.

Then you will see a big difference.
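A minimal sketch of what that could look like on a single client (device paths and the job count are assumptions; --filename points each job at its own device):

# run one fio job per RBD device in parallel, then wait for all of them
for i in 1 2 3 4 ; do
    fio --bs=1M --rw=randwrite --ioengine=libaio --direct=1 --iodepth=32 \
        --runtime=60 --name=bench$i --filename=/dev/rbd/pool1/bench$i \
        --output=bench$i.log &
done
wait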

Wido


fio --bs=1M --rw=randwrite --ioengine=libaio --direct=1 --iodepth=32 --runtime=60 --name=/dev/rbd/pool1/bench1

Try creating e.g. 20 (small) rbd devices, putting them all into an LVM VG, and creating a striped (RAID0) logical volume with 20 stripes and e.g. a 1 MB stripe size (better bandwidth) or 4 KB (better IO) - or use md-raid0 (it's maybe 10% faster, but not as flexible):

# create disks

for i in `seq -f "%02.f" 0 19` ; do rbd create --size 40000 vmware/vol6-$i.dsk ; done

# allow LVM to use rbd devices: add to the "types" entry in the devices section of /etc/lvm/lvm.conf
emacs -nw /etc/lvm/lvm.conf

types = [ "rbd", 16 ]

# rbd map ....
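(For completeness, a sketch of the mapping step - one rbd map per image:)

for i in `seq -f "%02.f" 0 19` ; do rbd map vmware/vol6-$i.dsk ; done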

# pvcreate

for i in `seq -f "%02.f" 0 19` ; do pvcreate /dev/rbd/vmware/vol6-$i.dsk ; done

# vgcreate VG

vgcreate VG_RBD20x40_VOL6 /dev/rbd/vmware/vol6-00.dsk

for i in `seq -f "%02.f" 1 19` ; do vgextend VG_RBD20x40_VOL6 /dev/rbd/vmware/vol6-$i.dsk ; done

# lvcreate raid0

# -i, --stripes Stripes - This is equal to the number of physical volumes to scatter the logical volume.

# -I, --stripesize StripeSize - Gives the number of kilobytes for the granularity of the stripes, 2^n, (n = 2 to 9)

# 20 stripes and 1 MB stripe size

lvcreate -i20 -I1024 -L700000m -n VmProd06 VG_RBD20x40_VOL6
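For the md-raid0 variant mentioned above, a minimal sketch (assuming the same 20 mapped devices and a 1 MB chunk):

# md-raid0 alternative: stripe all 20 mapped rbd devices with a 1 MB chunk
mdadm --create /dev/md0 --level=0 --raid-devices=20 --chunk=1024 /dev/rbd/vmware/vol6-*.dsk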

Now, try to run fio against /dev/mapper/VG_RBD20x40_VOL6-VmProd06
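For example, with the same parameters as the single-device run (a sketch; the job name is arbitrary here and --filename selects the target device):

fio --bs=1M --rw=randwrite --ioengine=libaio --direct=1 --iodepth=32 \
    --runtime=60 --name=stripe-bench --filename=/dev/mapper/VG_RBD20x40_VOL6-VmProd06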

I think the performance will be around 10 Gbit/s.

regards

Danny

From: ceph-users [mailto:[email protected]] On Behalf Of Marco Allevato
Sent: Friday, July 04, 2014 11:13 AM
To: [email protected]
Subject: [ceph-users] Bad Write-Performance on Ceph/Possible bottlenecks?

Hello Ceph-Community,

I’m writing here because we have bad write performance on our Ceph cluster of about 350 MB/s (full fio output below).

_As an overview, here are the technical details of our cluster:_

3 x monitoring servers, each with 2 x 1 Gbit/s NICs configured as a bond (link aggregation mode)

5 x datastore servers, each with 10 x 4 TB HDDs serving as OSDs; as journal we use a 15 GB LV on a 256 GB SSD RAID1; 2 x 10 Gbit/s NICs configured as a bond (link aggregation mode)
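(For reference, a minimal sketch of such a bond on a Debian-style system with ifenslave - interface names and the address are assumptions:)

# /etc/network/interfaces - LACP bond of the two 10 GbE ports (sketch)
auto bond0
iface bond0 inet static
        address 172.30.30.21
        netmask 255.255.255.0
        bond-slaves eth0 eth1
        bond-mode 802.3ad
        bond-miimon 100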


_ceph.conf_

[global]
auth_service_required = cephx
filestore_xattr_use_omap = true
auth_client_required = cephx
auth_cluster_required = cephx
mon_host = 172.30.30.8,172.30.30.9
mon_initial_members = monitoring1, monitoring2, monitoring3
fsid = 5f22ab94-8d96-48c2-88d3-cff7bad443a9
public network = 172.30.30.0/24

[mon.monitoring1]
        host = monitoring1
        addr = 172.30.30.8:6789

[mon.monitoring2]
        host = monitoring2
        addr = 172.30.30.9:6789

[mon.monitoring3]
        host = monitoring3
        addr = 172.30.30.10:6789

[filestore]
        filestore max sync interval = 10

[osd]
        osd recovery max active = 1
        osd journal size = 15360
        osd op threads = 40
        osd disk threads = 40

[osd.0]
        host = datastore1
[osd.1]
        host = datastore1
[osd.2]
        host = datastore1
[osd.3]
        host = datastore1
[osd.4]
        host = datastore1
[osd.5]
        host = datastore1
[osd.6]
        host = datastore1
[osd.7]
        host = datastore1
[osd.8]
        host = datastore1
[osd.9]
        host = datastore1
[osd.10]
        host = datastore2
[osd.11]
        host = datastore2
[osd.12]
        host = datastore2
[osd.13]
        host = datastore2
[osd.14]
        host = datastore2
[osd.15]
        host = datastore2
[osd.16]
        host = datastore2
[osd.17]
        host = datastore2
[osd.18]
        host = datastore2
[osd.19]
        host = datastore2
[osd.20]
        host = datastore3
[osd.21]
        host = datastore3
[osd.22]
        host = datastore3
[osd.23]
        host = datastore3
[osd.24]
        host = datastore3
[osd.25]
        host = datastore3
[osd.26]
        host = datastore3
[osd.27]
        host = datastore3
[osd.28]
        host = datastore3
[osd.29]
        host = datastore3
[osd.30]
        host = datastore4
[osd.31]
        host = datastore4
[osd.32]
        host = datastore4
[osd.33]
        host = datastore4
[osd.34]
        host = datastore4
[osd.35]
        host = datastore4
[osd.36]
        host = datastore4
[osd.37]
        host = datastore4
[osd.38]
        host = datastore4
[osd.39]
        host = datastore4
[osd.0]
        host = datastore5
[osd.40]
        host = datastore5
[osd.41]
        host = datastore5
[osd.42]
        host = datastore5
[osd.43]
        host = datastore5
[osd.44]
        host = datastore5
[osd.45]
        host = datastore5
[osd.46]
        host = datastore5
[osd.47]
        host = datastore5
[osd.48]
        host = datastore5

We have 3 pools:

-> 2 pools with 1000 PGs each and 2 replicas, distributing the data equally across two racks (used for datastores 1-4)

-> 1 pool with 100 PGs and no replication; data is stored only on datastore5. This pool is used to compare the performance on local disks without networking.
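(For reference, a sketch of how pools like these are typically created - pool/rule names and the ruleset id are assumptions, not our exact commands:)

# hypothetical names; 1000 PGs, 2 replicas, rack as the failure domain
ceph osd crush rule create-simple replicated_racks default rack
ceph osd pool create pool1 1000 1000
ceph osd pool set pool1 size 2
ceph osd pool set pool1 crush_ruleset 1   # ruleset id assumed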

Here are the performance values I get running fio against a 32 GB RBD:


_On the 1000-PG pool with distribution:_

fio --bs=1M --rw=randwrite --ioengine=libaio --direct=1 --iodepth=32 --runtime=60 --name=/dev/rbd/pool1/bench1

fio-2.0.13
Starting 1 process
Jobs: 1 (f=1): [w] [100.0% done] [0K/312.0M/0K /s] [0 /312 /0 iops] [eta 00m:00s]
/dev/rbd/pool1/bench1: (groupid=0, jobs=1): err= 0: pid=21675: Fri Jul 4 11:03:52 2014
   write: io=21071MB, bw=358989KB/s, iops=350 , runt= 60104msec
     slat (usec): min=127 , max=8040 , avg=511.49, stdev=216.27
     clat (msec): min=5 , max=4018 , avg=90.74, stdev=215.83
      lat (msec): min=6 , max=4018 , avg=91.25, stdev=215.83
     clat percentiles (msec):
      |  1.00th=[    8],  5.00th=[    9], 10.00th=[   11], 20.00th=[   15],
      | 30.00th=[   21], 40.00th=[   30], 50.00th=[   45], 60.00th=[   63],
      | 70.00th=[   83], 80.00th=[  105], 90.00th=[  129], 95.00th=[  190],
      | 99.00th=[ 1254], 99.50th=[ 1680], 99.90th=[ 2409], 99.95th=[ 2638],
      | 99.99th=[ 3556]
     bw (KB/s)  : min=68210, max=479232, per=100.00%, avg=368399.55, stdev=84457.12
     lat (msec) : 10=9.50%, 20=20.02%, 50=23.56%, 100=24.56%, 250=18.09%
     lat (msec) : 500=1.39%, 750=0.81%, 1000=0.65%, 2000=1.13%, >=2000=0.29%
   cpu          : usr=11.17%, sys=7.46%, ctx=17772, majf=0, minf=24
   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=99.9%, >=64=0.0%
      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
      issued    : total=r=0/w=21071/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
   WRITE: io=21071MB, aggrb=358989KB/s, minb=358989KB/s, maxb=358989KB/s, mint=60104msec, maxt=60104msec


_On the 100-PG pool without distribution:_

WRITE: io=5884.0MB, aggrb=297953KB/s, minb=297953KB/s, maxb=297953KB/s, mint=20222msec, maxt=20222msec

Do you have any suggestions on how to improve the performance?

From what I have read on the internet, typical write rates should be around 800-1000 MB/s when using a 10 Gbit/s connection with a similar setup.

Thanks in advance

--

Marco Allevato
Projektteam

Network Engineering GmbH
Maximilianstrasse 93
D-67346 Speyer






--
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
