On 07/04/2014 04:13 AM, Marco Allevato wrote:
Hello Ceph-Community,
I’m writing here because we are seeing poor write performance on our
Ceph cluster of about
_As an overview, here are the technical details of our cluster:_
3 x monitoring servers; each with 2 x 1 Gbit/s NICs configured as a
bond (link aggregation mode)
5 x datastore servers; each with 10 x 4 TB HDDs serving as OSDs; as
the journal we use a 15 GB LVM volume on a 256 GB SSD RAID1; 2 x 10
Gbit/s NICs configured as a bond (link aggregation mode)
What SSDs are you using? If this is just a single pair of typical
consumer grade 256GB SSDs, you are likely oversubscribing them quite a
bit. A RAID1 of two of them won't achieve anywhere near the sequential
write performance the disks can do. Since SSD failure tends to be kind
of clumpy based on the number of writes, I'm not really fond of doing
SSD journal devices in a RAID1. I'd rather reduce write wear and just
use them individually with half the journals on each SSD. Better
performance too.
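A sketch of that layout, assuming 10 OSDs per node with 5 x 15GB raw journal partitions on each of the two SSDs (the device name /dev/sdX and the partition numbering are hypothetical examples, not your actual devices; this needs a running node and will modify the disk):

```shell
# Hypothetical: carve one SSD into five 15GB raw journal partitions for
# osd.0-4; repeat on the second SSD for osd.5-9. The typecode is the
# standard Ceph journal partition GUID used by ceph-disk.
for i in 1 2 3 4 5; do
    sgdisk --new=${i}:0:+15G \
           --typecode=${i}:45b0969e-9b03-4f30-b4c6-b4b80ceff106 /dev/sdX
done
```

Each OSD then points its journal at its own partition (osd journal = /dev/sdX1 for osd.0, and so on), with no RAID layer in between.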
_ceph.conf_
[global]
auth_service_required = cephx
filestore_xattr_use_omap = true
auth_client_required = cephx
auth_cluster_required = cephx
mon_host = 172.30.30.8,172.30.30.9
mon_initial_members = monitoring1, monitoring2, monitoring3
fsid = 5f22ab94-8d96-48c2-88d3-cff7bad443a9
public network = 172.30.30.0/24
[mon.monitoring1]
host = monitoring1
addr = 172.30.30.8:6789
[mon.monitoring2]
host = monitoring2
addr = 172.30.30.9:6789
[mon.monitoring3]
host = monitoring3
addr = 172.30.30.10:6789
[filestore]
filestore max sync interval = 10
This is tough to get right imho. Less frequent synchronization gives
you more time to potentially coalesce write ops, but also may cause
spikier behaviour which we like to avoid. Probably requires just
testing it and seeing what happens in your setup.
[osd]
osd recovery max active = 1
osd journal size = 15360
osd op threads = 40
osd disk threads = 40
Having so many disk threads and op threads probably isn't going to help
you. I'd reduce them unless you've actually tested that this is faster
than the default. In some cases increasing the op threads does improve
performance, but ultimately you probably don't want to go too high. I
often set it to 8, but I suspect the best value is platform dependent,
varying with how many CPU cores and disks you have. Basically you want
the minimum number of threads that lets you achieve saturation.
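For example, something much closer to the defaults as a baseline to test against (a starting point, not a tuned recommendation for your hardware):

```
[osd]
osd op threads = 8
osd disk threads = 1
```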
Some other things to consider:
1) Are you using Kernel RBD or userland RBD? Have you considered using
RBD Cache? This can help dramatically, especially with smaller
sequential writes. In your case you are testing 1MB random writes, so
it may not help as much, but it's still worth trying.
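If you are on the userland (librbd) side, enabling it is just a client-side ceph.conf change; the sizes shown here are simply the defaults spelled out, and kernel RBD ignores these settings entirely:

```
[client]
rbd cache = true
rbd cache size = 33554432        # 32MB (the default)
rbd cache max dirty = 25165824   # 24MB (the default)
rbd cache writethrough until flush = true
```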
2) If you have few PGs in the pool and *lots* of objects (say many large
volumes, and/or small RBD object sizes), you may benefit from increasing
the filestore merge threshold and filestore split multiple parameters.
These govern how the filestore splits directories that objects are
stored in under the OSD per PG. It appears that increasing the
per-directory limits can reduce performance degradation as the number of
objects increases, but may slow down lookup operations. Currently,
however, the default works out to around 300 objects per directory,
which I believe may be too low. Here's what I like to use:
filestore merge threshold = 40
filestore split multiple = 8
3) You might find that disabling in-memory debug logging helps, though
this typically helps more with small random IOs.
debug_lockdep = "0/0"
debug_context = "0/0"
debug_crush = "0/0"
debug_mds = "0/0"
debug_mds_balancer = "0/0"
debug_mds_locker = "0/0"
debug_mds_log = "0/0"
debug_mds_log_expire = "0/0"
debug_mds_migrator = "0/0"
debug_buffer = "0/0"
debug_timer = "0/0"
debug_filer = "0/0"
debug_objecter = "0/0"
debug_rados = "0/0"
debug_rbd = "0/0"
debug_journaler = "0/0"
debug_objectcacher = "0/0"
debug_client = "0/0"
debug_osd = "0/0"
debug_optracker = "0/0"
debug_objclass = "0/0"
debug_filestore = "0/0"
debug_journal = "0/0"
debug_ms = "0/0"
debug_mon = "0/0"
debug_monc = "0/0"
debug_paxos = "0/0"
debug_tp = "0/0"
debug_auth = "0/0"
debug_finisher = "0/0"
debug_heartbeatmap = "0/0"
debug_perfcounter = "0/0"
debug_rgw = "0/0"
debug_hadoop = "0/0"
debug_asok = "0/0"
debug_throttle = "0/0"
[osd.0]
host = datastore1
[osd.1]
host = datastore1
[osd.2]
host = datastore1
[osd.3]
host = datastore1
[osd.4]
host = datastore1
[osd.5]
host = datastore1
[osd.6]
host = datastore1
[osd.7]
host = datastore1
[osd.8]
host = datastore1
[osd.9]
host = datastore1
[osd.10]
host = datastore2
[osd.11]
host = datastore2
[osd.11]
host = datastore2
[osd.12]
host = datastore2
[osd.13]
host = datastore2
[osd.14]
host = datastore2
[osd.15]
host = datastore2
[osd.16]
host = datastore2
[osd.17]
host = datastore2
[osd.18]
host = datastore2
[osd.19]
host = datastore2
[osd.20]
host = datastore3
[osd.21]
host = datastore3
[osd.22]
host = datastore3
[osd.23]
host = datastore3
[osd.24]
host = datastore3
[osd.25]
host = datastore3
[osd.26]
host = datastore3
[osd.27]
host = datastore3
[osd.28]
host = datastore3
[osd.29]
host = datastore3
[osd.30]
host = datastore4
[osd.31]
host = datastore4
[osd.32]
host = datastore4
[osd.33]
host = datastore4
[osd.34]
host = datastore4
[osd.35]
host = datastore4
[osd.36]
host = datastore4
[osd.37]
host = datastore4
[osd.38]
host = datastore4
[osd.39]
host = datastore4
[osd.0]
host = datastore5
[osd.40]
host = datastore5
[osd.41]
host = datastore5
[osd.42]
host = datastore5
[osd.43]
host = datastore5
[osd.44]
host = datastore5
[osd.45]
host = datastore5
[osd.46]
host = datastore5
[osd.47]
host = datastore5
[osd.48]
host = datastore5
You appear to have osd.0 defined twice, on datastore1 and datastore5
(osd.11 is also listed twice on datastore2). Not sure if that was a
transcription mistake or is actually in the ceph configuration?
We have 3 pools:
-> 2 pools with 1000 PGs each and 2 replicas, distributing the data
equally across two racks (used for datastores 1-4)
1000 PGs isn't imho enough for good distribution over 40 OSDs. I'd up
this to around 4096.
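pg_num can be raised on an existing pool (it can't be decreased); using the pool1 name from your fio command as an example, and noting that this requires a live cluster and will trigger data movement:

```shell
ceph osd pool set pool1 pg_num 4096
# once the new PGs have been created, raise pgp_num so the
# data actually rebalances across them
ceph osd pool set pool1 pgp_num 4096
```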
-> 1 pool with 100 PGs and no replication; data stored only on
datastore5. This pool is used to compare performance on local disks
without networking involved
100 PGs *really* isn't enough. For 10 OSDs I'd do at least 1024.
Remember the filestore split multiple parameter from above? The fewer
PGs you have, the more important it becomes, because you are shoving
more objects into each PG and triggering directory splits much sooner.
Consider that a 32GB volume is 32768MB, and with 4MB RBD objects that's
8192 objects. Split that over the 100 PGs in your pool and you end up
with around 80 objects per PG on average. The distribution isn't
perfectly even though, so some PGs are going to have more objects than
others. Now in this case you probably haven't hit the point where
directories start splitting, but had you used replication, bigger
volumes, or multiple volumes on multiple clients, you very easily could
have started invoking directory splits on the OSDs.
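The arithmetic above, sketched out with the numbers from this thread (32GB volume, 4MB objects, the 100-PG pool):

```python
volume_mb = 32 * 1024     # 32GB volume, in MB
object_mb = 4             # default RBD object size
pg_count = 100            # the no-replication pool

objects = volume_mb // object_mb   # total RBD objects backing the volume
per_pg = objects / pg_count        # average objects per PG
print(objects, per_pg)             # 8192 objects, ~82 per PG on average
```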
At some point you have to be careful increasing the number of PGs
because it puts more load on the monitors, but typically this won't be a
problem until you get up closer to 100,000 PGs or more. I've
successfully tested a cluster with 1 million PGs.
Here are the performance values I get using fio on a 32GB RBD:
_On the 1000-PG pool with distribution:_
fio --bs=1M --rw=randwrite --ioengine=libaio --direct=1 --iodepth=32
--runtime=60 --name=/dev/rbd/pool1/bench1
fio-2.0.13
Starting 1 process
Jobs: 1 (f=1): [w] [100.0% done] [0K/312.0M/0K /s] [0 /312 /0 iops]
[eta 00m:00s]
/dev/rbd/pool1/bench1: (groupid=0, jobs=1): err= 0: pid=21675: Fri Jul
4 11:03:52 2014
write: io=21071MB, bw=358989KB/s, iops=350 , runt= 60104msec
slat (usec): min=127 , max=8040 , avg=511.49, stdev=216.27
clat (msec): min=5 , max=4018 , avg=90.74, stdev=215.83
lat (msec): min=6 , max=4018 , avg=91.25, stdev=215.83
clat percentiles (msec):
| 1.00th=[ 8], 5.00th=[ 9], 10.00th=[ 11], 20.00th=[ 15],
| 30.00th=[ 21], 40.00th=[ 30], 50.00th=[ 45], 60.00th=[ 63],
| 70.00th=[ 83], 80.00th=[ 105], 90.00th=[ 129], 95.00th=[ 190],
| 99.00th=[ 1254], 99.50th=[ 1680], 99.90th=[ 2409], 99.95th=[ 2638],
| 99.99th=[ 3556]
bw (KB/s) : min=68210, max=479232, per=100.00%, avg=368399.55,
stdev=84457.12
lat (msec) : 10=9.50%, 20=20.02%, 50=23.56%, 100=24.56%, 250=18.09%
lat (msec) : 500=1.39%, 750=0.81%, 1000=0.65%, 2000=1.13%, >=2000=0.29%
cpu : usr=11.17%, sys=7.46%, ctx=17772, majf=0, minf=24
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=99.9%,
>=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%,
>=64=0.0%
issued : total=r=0/w=21071/d=0, short=r=0/w=0/d=0
Run status group 0 (all jobs):
WRITE: io=21071MB, aggrb=358989KB/s, minb=358989KB/s,
maxb=358989KB/s, mint=60104msec, maxt=60104msec
_On the 100-PG pool without distribution:_
WRITE: io=5884.0MB, aggrb=297953KB/s, minb=297953KB/s,
maxb=297953KB/s, mint=20222msec, maxt=20222msec
Do you have any suggestions on how to improve the performance?
Just a general note, do you expect 1MB random writes to be a typical
workload for your cluster? Having said that, what I'm noticing here is
that in the single-node setup with no replication, you get around
300MB/s. If you have a single SSD RAID1 for your journals, 300MB/s
sounds about right for many ~250GB models.
For the 40 OSD pool you are doing 2x replication, so you are getting an
aggregate of about 180MB/s per node, which is quite a bit lower. On the
other hand, the amount of concurrency you have isn't really very high
for that many disks and you will lose some aggregate performance when
replication is enabled.
The first thing I would do is just try your test with more concurrency.
If you don't have more clients, just increase the number of fio jobs.
I'd also increase the number of PGs and possibly the split multiplier
when you do this! Next I'd try enabling RBD cache and making a couple
of the ceph.conf tunable tweaks mentioned above. Finally I would
seriously consider switching the SSD journal configuration so that half
of the journals are on each SSD, placed directly in raw partitions.
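Concretely, the same fio invocation from your mail with more parallelism might look like this (the numjobs value is just an example to start from; this needs your actual RBD device to run):

```shell
fio --bs=1M --rw=randwrite --ioengine=libaio --direct=1 --iodepth=32 \
    --runtime=60 --numjobs=4 --group_reporting \
    --name=/dev/rbd/pool1/bench1
```

--group_reporting aggregates the per-job statistics so the results stay comparable with your single-job runs.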
From reading on the internet, typical write rates should be around
800-1000 MB/s when using a 10 Gbit/s connection with a similar setup.
You won't get that unless you have very fast SSDs or I'm
misunderstanding your SSD journal setup.
Thanks in advance
Hope this helps!
--
Marco Allevato
Projektteam
Network Engineering GmbH
Maximilianstrasse 93
D-67346 Speyer
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com