On 08/29/2014 06:10 AM, Dan Van Der Ster wrote:
Hi Sebastien,
Here’s my recipe for max IOPS on a _testing_ instance with SSDs:

   osd op threads = 2

With SSDs, I've seen in the past that increasing the osd op thread count can help random reads.

   osd disk threads = 2
   journal max write bytes = 10485760000
   journal queue max bytes = 10485760000
   journal max write entries = 10000
   journal queue max ops = 50000
   filestore op threads = 2
   filestore max sync interval = 60
   filestore queue max ops = 50000
   filestore queue max bytes = 10485760000
   filestore queue committing max bytes = 10485760000
   filestore queue committing max ops = 50000
   filestore wbthrottle xfs bytes start flusher = 4194304000
   filestore wbthrottle xfs bytes hard limit = 41943040000
   filestore wbthrottle xfs ios start flusher = 50000
   filestore wbthrottle xfs ios hard limit = 500000
   filestore wbthrottle xfs inodes start flusher = 50000
   filestore wbthrottle xfs inodes hard limit = 500000
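
For a quick test, most of those can be pushed to a running OSD with injectargs instead of editing ceph.conf and restarting, although some of them may only be honoured at start-up, so a restart is the safer way to make sure everything applies. A minimal sketch (the OSD id is just an example):

   ceph tell osd.0 injectargs '--filestore_max_sync_interval 60'
   ceph tell osd.0 injectargs '--filestore_queue_max_ops 50000'
   # check what the daemon actually ended up with via the admin socket:
   ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep filestore_queue_max_ops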

It's also probably worth trying to disable all in-memory logging. Unfortunately I don't think we have a global flag for this, so you have to do it on a per-log basis, which is annoying.
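
Something like the loop below should work against a running OSD; the subsystem list is just the usual set and may not be exhaustive for your version, so treat it as a sketch:

   for sub in lockdep context crush buffer timer journaler osd optracker \
              objclass filestore journal ms monc tp auth finisher \
              heartbeatmap perfcounter asok throttle; do
       ceph tell osd.0 injectargs "--debug-$sub 0/0"   # 0/0 turns off both file and in-memory logging
   done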


Basically the goal of that recipe is to ensure no IOs are blocked from entering any 
queue. (And don’t run that in production!)
IIRC I can get up to around 5000 IOPS from a single fio/rbd client. Related to 
the sync interval, I was also playing with vm.dirty_expire_centisecs and 
vm.dirty_writeback_centisecs to disable background page flushing (which 
effectively disables FileStore flushing). That way, the only disk activity 
becomes the journal writes. You can confirm that with iostat.
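
For reference, that boils down to the two sysctls below; the values are only illustrative (a writeback interval of 0 disables the periodic flusher entirely):

   sysctl -w vm.dirty_writeback_centisecs=0      # stop the periodic background flusher
   sysctl -w vm.dirty_expire_centisecs=360000    # let dirty pages sit for up to an hour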

Another thing that comes to mind is at some point a single fio/librbd client 
will be the bottleneck. Did you try running two simultaneous fio’s (then adding 
the results)?
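
Roughly like this, then add the iops= lines from the two outputs; the job file names are made up, and ideally each job points at a different RBD image so the two clients don't overlap:

   fio client-a.fio --output=client-a.log &
   fio client-b.fio --output=client-b.log &
   wait
   grep iops= client-a.log client-b.log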

Cheers, Dan

-- Dan van der Ster || Data & Storage Services || CERN IT Department --


On 28 Aug 2014, at 18:11, Sebastien Han <sebastien....@enovance.com> wrote:

Hey all,

It has been a while since the last performance-related thread on the ML :p
I’ve been running some experiments to see how much I can get from an SSD in a 
Ceph cluster.
To achieve that I did something pretty simple:

* Debian wheezy 7.6
* kernel from debian 3.14-0.bpo.2-amd64
* 1 cluster, 3 mons (I’d like to keep this realistic since in a real deployment 
I’ll use 3)
* 1 OSD backed by an SSD (journal and osd data on the same device)
* replica count of 1
* partitions are perfectly aligned
* io scheduler is set to noop, but deadline was showing the same results (see the snippet after this list)
* no updatedb running
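
For completeness, switching and checking the scheduler is just (device name is an assumption):

   echo noop > /sys/block/sdo/queue/scheduler
   cat /sys/block/sdo/queue/scheduler    # the active scheduler is shown in [brackets]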

About the box:

* 32GB of RAM
* 12 cores with HT @ 2.4 GHz
* WB cache is enabled on the controller
* 10Gbps network (doesn’t help here)

The SSD is a 200GB Intel DC S3700 and is capable of delivering around 29K IOPS 
with random 4k writes (my fio results).
As a benchmark tool I used fio with the rbd engine (thanks Deutsche Telekom 
guys!).
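
For comparison, the kind of raw-device fio run that a figure like 29K would come from looks roughly like this; the device name and exact options are assumptions, and it overwrites the device, so only run it on a disk you can wipe:

   fio --name=raw-4k-randwrite --filename=/dev/sdo --ioengine=libaio \
       --direct=1 --rw=randwrite --bs=4k --iodepth=32 \
       --runtime=60 --time_based --group_reporting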

O_DIRECT and O_DSYNC don’t seem to be a problem for the SSD:

# dd if=/dev/urandom of=rand.file bs=4k count=65536
65536+0 records in
65536+0 records out
268435456 bytes (268 MB) copied, 29.5477 s, 9.1 MB/s

# du -sh rand.file
256M    rand.file

# dd if=rand.file of=/dev/sdo bs=4k count=65536 oflag=dsync,direct
65536+0 records in
65536+0 records out
268435456 bytes (268 MB) copied, 2.73628 s, 98.1 MB/s

See my ceph.conf:

[global]
  auth cluster required = cephx
  auth service required = cephx
  auth client required = cephx
  fsid = 857b8609-8c9b-499e-9161-2ea67ba51c97
  osd pool default pg num = 4096
  osd pool default pgp num = 4096
  osd pool default size = 2
  osd crush chooseleaf type = 0

   debug lockdep = 0/0
        debug context = 0/0
        debug crush = 0/0
        debug buffer = 0/0
        debug timer = 0/0
        debug journaler = 0/0
        debug osd = 0/0
        debug optracker = 0/0
        debug objclass = 0/0
        debug filestore = 0/0
        debug journal = 0/0
        debug ms = 0/0
        debug monc = 0/0
        debug tp = 0/0
        debug auth = 0/0
        debug finisher = 0/0
        debug heartbeatmap = 0/0
        debug perfcounter = 0/0
        debug asok = 0/0
        debug throttle = 0/0

[mon]
  mon osd down out interval = 600
  mon osd min down reporters = 13
[mon.ceph-01]
  host = ceph-01
  mon addr = 172.20.20.171
[mon.ceph-02]
  host = ceph-02
  mon addr = 172.20.20.172
[mon.ceph-03]
  host = ceph-03
  mon addr = 172.20.20.173

        debug lockdep = 0/0
        debug context = 0/0
        debug crush = 0/0
        debug buffer = 0/0
        debug timer = 0/0
        debug journaler = 0/0
        debug osd = 0/0
        debug optracker = 0/0
        debug objclass = 0/0
        debug filestore = 0/0
        debug journal = 0/0
        debug ms = 0/0
        debug monc = 0/0
        debug tp = 0/0
        debug auth = 0/0
        debug finisher = 0/0
        debug heartbeatmap = 0/0
        debug perfcounter = 0/0
        debug asok = 0/0
        debug throttle = 0/0

[osd]
  osd mkfs type = xfs
  osd mkfs options xfs = -f -i size=2048
  osd mount options xfs = rw,noatime,logbsize=256k,delaylog
  osd journal size = 20480
  cluster_network = 172.20.20.0/24
  public_network = 172.20.20.0/24
  osd mon heartbeat interval = 30
  # Performance tuning
  filestore merge threshold = 40
  filestore split multiple = 8
  osd op threads = 8
  # Recovery tuning
  osd recovery max active = 1
  osd max backfills = 1
  osd recovery op priority = 1


        debug lockdep = 0/0
        debug context = 0/0
        debug crush = 0/0
        debug buffer = 0/0
        debug timer = 0/0
        debug journaler = 0/0
        debug osd = 0/0
        debug optracker = 0/0
        debug objclass = 0/0
        debug filestore = 0/0
        debug journal = 0/0
        debug ms = 0/0
        debug monc = 0/0
        debug tp = 0/0
        debug auth = 0/0
        debug finisher = 0/0
        debug heartbeatmap = 0/0
        debug perfcounter = 0/0
        debug asok = 0/0
        debug throttle = 0/0

Disabling all debugging gained me 200-300 more IOPS.

See my fio template:

[global]
#logging
#write_iops_log=write_iops_log
#write_bw_log=write_bw_log
#write_lat_log=write_lat_lo

time_based
runtime=60

ioengine=rbd
clientname=admin
pool=test
rbdname=fio
invalidate=0    # mandatory
#rw=randwrite
rw=write
bs=4k
#bs=32m
size=5G
group_reporting

[rbd_iodepth32]
iodepth=32
direct=1

See my fio output:

rbd_iodepth32: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=32
fio-2.1.11-14-gb74e
Starting 1 process
rbd engine: RBD version: 0.1.8
Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/12876KB/0KB /s] [0/3219/0 iops] [eta 
00m:00s]
rbd_iodepth32: (groupid=0, jobs=1): err= 0: pid=32116: Thu Aug 28 00:28:26 2014
  write: io=771448KB, bw=12855KB/s, iops=3213, runt= 60010msec
    slat (usec): min=42, max=1578, avg=66.50, stdev=16.96
    clat (msec): min=1, max=28, avg= 9.85, stdev= 1.48
     lat (msec): min=1, max=28, avg= 9.92, stdev= 1.47
    clat percentiles (usec):
     |  1.00th=[ 6368],  5.00th=[ 8256], 10.00th=[ 8640], 20.00th=[ 9152],
     | 30.00th=[ 9408], 40.00th=[ 9664], 50.00th=[ 9792], 60.00th=[10048],
     | 70.00th=[10176], 80.00th=[10560], 90.00th=[10944], 95.00th=[11456],
     | 99.00th=[13120], 99.50th=[16768], 99.90th=[25984], 99.95th=[27008],
     | 99.99th=[28032]
    bw (KB  /s): min=11864, max=13808, per=100.00%, avg=12864.36, stdev=407.35
    lat (msec) : 2=0.03%, 4=0.54%, 10=59.79%, 20=39.24%, 50=0.41%
  cpu          : usr=19.15%, sys=4.69%, ctx=326309, majf=0, minf=426088
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=33.9%, 32=66.1%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=99.6%, 8=0.4%, 16=0.1%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=192862/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
  WRITE: io=771448KB, aggrb=12855KB/s, minb=12855KB/s, maxb=12855KB/s, 
mint=60010msec, maxt=60010msec

Disk stats (read/write):
    dm-1: ios=0/49, merge=0/0, ticks=0/12, in_queue=12, util=0.01%, 
aggrios=0/22, aggrmerge=0/27, aggrticks=0/12, aggrin_queue=12, aggrutil=0.01%
  sda: ios=0/22, merge=0/27, ticks=0/12, in_queue=12, util=0.01%

I tried to tweak several parameters like:

filestore_wbthrottle_xfs_ios_start_flusher = 10000
filestore_wbthrottle_xfs_ios_hard_limit = 10000
filestore_wbthrottle_btrfs_ios_start_flusher = 10000
filestore_wbthrottle_btrfs_ios_hard_limit = 10000
filestore queue max ops = 2000

But I didn’t see any improvement.

Then I tried other things:

* Increasing the io_depth up to 256 or 512 gave me between 50 and 100 more IOPS, 
but it’s not a realistic workload anymore and not that significant.
* Adding another SSD for the journal: still getting 3.2K IOPS.
* I tried with rbd bench and I also got 3K IOPS.
* I ran the test on a client machine and then locally on the server: still 
getting 3.2K IOPS.
* Putting the journal in memory (see the sketch after this list): still getting 3.2K IOPS.
* With 2 clients running the test in parallel I got a total of 3.6K IOPS, but I 
don’t seem to be able to go beyond that.
* I added another OSD to that SSD, so I had 2 OSDs and 2 journals on 1 SSD, and 
got 4.5K IOPS, yay!
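
A rough sketch of the journal-in-memory test mentioned above; the OSD id, paths and init commands are assumptions, and this is obviously only sane on a throw-away test cluster:

   service ceph stop osd.0
   ln -sf /dev/shm/osd.0.journal /var/lib/ceph/osd/ceph-0/journal   # point the journal at tmpfs
   ceph-osd -i 0 --mkjournal                                        # recreate the journal there
   service ceph start osd.0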

Given the results of that last test, it seems that something is limiting the 
number of IOPS per OSD process.

Running the test on a client or locally didn’t show any difference, so it looks 
to me like there is some contention within Ceph causing this.

I also ran perf and looked at the output; everything looks decent, but someone 
might want to have a look at it :).
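
For anyone who wants to reproduce the profile, something along these lines should do it (assumes a single ceph-osd process on the box):

   perf record -g -p $(pidof ceph-osd) -- sleep 60   # sample the OSD for a minute while fio runs
   perf report --stdio > perf-osd.txt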

We have been able to reproduce this on 3 distinct platforms with some 
deviations (because of the hardware), but the behaviour is the same.
Any thoughts will be highly appreciated; only getting 3.2K IOPS out of a 29K IOPS 
SSD is a bit frustrating :).

Cheers.
––––
Sébastien Han
Cloud Architect

"Always give 100%. Unless you're giving blood."

Phone: +33 (0)1 49 70 99 72
Mail: sebastien....@enovance.com
Address : 11 bis, rue Roquépine - 75008 Paris
Web : www.enovance.com - Twitter : @enovance

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
