On Thu, Sep 18, 2014 at 03:36:48PM +0200, Alexandre DERUMIER wrote:
> >>Has anyone ever tested multi-volume performance on a *FULL* SSD setup?
> 
> I know that Stefan Priebe runs full SSD clusters in production and has done 
> benchmarks. 
> (As far as I remember, he benchmarked around 20k peak with dumpling.)
> 
> >>We are able to get ~18K IOPS for 4K random read on a single volume with fio 
> >>(with the rbd engine) on a 12x DC3700 setup, but only ~23K (peak) IOPS even 
> >>with multiple volumes.
> >>It seems the maximum random write performance we can get on the entire 
> >>cluster is quite close to single-volume performance.
> Firefly or Giant ?

It seems the maximum possible 4k sequential-write IOPS you can get is around ~20K,
regardless of whether you have 2 or 400 OSDs, SAS or SSD, or 3 or 9 
storage nodes.
The CPU is the limiting resource, because of the overhead in the code.

My I/O subsystem would be able to handle 2 million IOPS of 4K writes with replica=2:
. 9 storage nodes
. 18x Intel P3700 PCIe SSDs over NVMe in total (each 150k random-write IOPS at 4K)
. 357x 2.5" SAS drives in total, via 18x LSI MegaRAID 2208
. 10 GbE to the 9 client nodes
. 56 Gb IB as cluster interconnect

There was an improvement between 0.80.x and 0.81,
but then the performance dropped again ...
(see attachment)
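
(For anyone who wants to reproduce this kind of measurement, a minimal fio job
using the rbd engine might look like the sketch below; the pool and image names
are placeholders and the image has to exist beforehand.)

[global]
ioengine=rbd
clientname=admin
# pool and rbdname are placeholders: any test pool with an existing RBD image
pool=test
rbdname=test
# invalidate=0 is mandatory for the rbd engine
invalidate=0
rw=randwrite
bs=4k
direct=1
numjobs=4
group_reporting=1

[rbd_iodepth32]
iodepth=32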

-Dieter


> 
> I'll do benchmarks with 6 DC S3500 OSDs tomorrow to compare Firefly and Giant.
> 
> ----- Original Message -----
> 
> De: "Jian Zhang" <jian.zh...@intel.com>
> À: "Sebastien Han" <sebastien....@enovance.com>, "Alexandre DERUMIER" 
> <aderum...@odiso.com>
> Cc: ceph-users@lists.ceph.com
> Envoyé: Jeudi 18 Septembre 2014 08:12:32
> Objet: RE: [ceph-users] [Single OSD performance on SSD] Can't go over 3, 2K 
> IOPS
> 
> Has anyone ever tested multi-volume performance on a *FULL* SSD setup?
> We are able to get ~18K IOPS for 4K random read on a single volume with fio 
> (with the rbd engine) on a 12x DC3700 setup, but only ~23K (peak) IOPS even 
> with multiple volumes.
> It seems the maximum random write performance we can get on the entire 
> cluster is quite close to single-volume performance.
> 
> Thanks
> Jian
> 
> 
> -----Original Message-----
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
> Sebastien Han
> Sent: Tuesday, September 16, 2014 9:33 PM
> To: Alexandre DERUMIER
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] [Single OSD performance on SSD] Can't go over 3, 2K 
> IOPS
> 
> Hi,
> 
> Thanks for keeping us updated on this subject.
> dsync is definitely killing the ssd.
> 
> I don't have much to add; I'm just surprised that you're only getting 5,299 
> with 0.85 since I've been able to get 6.4K. Well, I was using the 200GB model, 
> which might explain this.
> 
> 
> On 12 Sep 2014, at 16:32, Alexandre DERUMIER <aderum...@odiso.com> wrote:
> 
> > Here are the results for the Intel S3500
> > ------------------------------------
> > Max performance is with ceph 0.85 + optracker disabled.
> > The Intel S3500 doesn't have the d_sync problem the Crucial has.
> >
> > %util shows almost 100% for read and write, so maybe the SSD disk 
> > performance is the limit.
> >
> > I have some STEC ZeusRAM 8GB in stock (I used them for ZFS ZIL); I'll try 
> > to bench them next week.
> >
> >
> >
> >
> >
> >
> > INTEL s3500
> > -----------
> > raw disk
> > --------
> >
> > randread: fio --filename=/dev/sdb --direct=1 --rw=randread --bs=4k
> > --iodepth=32 --group_reporting --invalidate=0 --name=abc
> > --ioengine=aio bw=288207KB/s, iops=72051
> >
> > Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await 
> > w_await svctm %util
> > sdb 0,00 0,00 73454,00 0,00 293816,00 0,00 8,00 30,96 0,42 0,42 0,00 0,01 
> > 99,90
> >
> > randwrite: fio --filename=/dev/sdb --direct=1 --rw=randwrite --bs=4k
> > --iodepth=32 --group_reporting --invalidate=0 --name=abc --ioengine=aio 
> > --sync=1 bw=48131KB/s, iops=12032
> > Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await 
> > w_await svctm %util
> > sdb 0,00 0,00 0,00 24120,00 0,00 48240,00 4,00 2,08 0,09 0,00 0,09 0,04 
> > 100,00
> >
> >
> > ceph 0.80
> > ---------
> > randread: no tuning: bw=24578KB/s, iops=6144
> >
> >
> > randwrite: bw=10358KB/s, iops=2589
> > Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await 
> > w_await svctm %util
> > sdb 0,00 373,00 0,00 8878,00 0,00 34012,50 7,66 1,63 0,18 0,00 0,18 0,06 
> > 50,90
> >
> >
> > ceph 0.85 :
> > ---------
> >
> > randread : bw=41406KB/s, iops=10351
> > Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await 
> > w_await svctm %util
> > sdb 2,00 0,00 10425,00 0,00 41816,00 0,00 8,02 1,36 0,13 0,13 0,00 0,07 
> > 75,90
> >
> > randwrite : bw=17204KB/s, iops=4301
> >
> > Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await 
> > w_await svctm %util
> > sdb 0,00 333,00 0,00 9788,00 0,00 57909,00 11,83 1,46 0,15 0,00 0,15 0,07 
> > 67,80
> >
> >
> > ceph 0.85 tuning op_tracker=false
> > ----------------
> >
> > randread : bw=86537KB/s, iops=21634
> > Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await 
> > w_await svctm %util
> > sdb 25,00 0,00 21428,00 0,00 86444,00 0,00 8,07 3,13 0,15 0,15 0,00 0,05 
> > 98,00
> >
> > randwrite: bw=21199KB/s, iops=5299
> > Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await 
> > w_await svctm %util
> > sdb 0,00 1563,00 0,00 9880,00 0,00 75223,50 15,23 2,09 0,21 0,00 0,21 0,07 
> > 80,00
> >
> >
> > ----- Original Message -----
> >
> > From: "Alexandre DERUMIER" <aderum...@odiso.com>
> > To: "Cedric Lemarchand" <ced...@yipikai.org>
> > Cc: ceph-users@lists.ceph.com
> > Sent: Friday, September 12, 2014 08:15:08
> > Subject: Re: [ceph-users] [Single OSD performance on SSD] Can't go over
> > 3, 2K IOPS
> >
> > results of fio on rbd with kernel patch
> >
> >
> >
> > fio rbd crucial m550 1 osd 0.85 (osd_enable_op_tracker true or false, same 
> > result):
> > ---------------------------
> > bw=12327KB/s, iops=3081
> >
> > So not much better than before, but this time iostat shows only 15%
> > util, and latencies are lower:
> >
> > Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> > sdb 0,00 29,00 0,00 3075,00 0,00 36748,50 23,90 0,29 0,10 0,00 0,10 0,05 15,20
> >
> >
> > So, the write bottleneck seems to be in Ceph.
> >
> >
> >
> > I will send the S3500 results today.
> >
> > ----- Original Message -----
> >
> > From: "Alexandre DERUMIER" <aderum...@odiso.com>
> > To: "Cedric Lemarchand" <ced...@yipikai.org>
> > Cc: ceph-users@lists.ceph.com
> > Sent: Friday, September 12, 2014 07:58:05
> > Subject: Re: [ceph-users] [Single OSD performance on SSD] Can't go over
> > 3, 2K IOPS
> >
> >>> For the Crucial, I'll try to apply the patch from Stefan Priebe to
> >>> ignore flushes (as the Crucial M550 has supercaps):
> >>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-November/035707.html
> > Here are the results with cache flushes disabled:
> >
> > crucial m550
> > ------------
> > #fio --filename=/dev/sdb --direct=1 --rw=write --bs=4k --numjobs=2
> > --group_reporting --invalidate=0 --name=ab --sync=1 bw=177575KB/s,
> > iops=44393
> >
> >
> > ----- Original Message -----
> >
> > From: "Alexandre DERUMIER" <aderum...@odiso.com>
> > To: "Cedric Lemarchand" <ced...@yipikai.org>
> > Cc: ceph-users@lists.ceph.com
> > Sent: Friday, September 12, 2014 04:55:21
> > Subject: Re: [ceph-users] [Single OSD performance on SSD] Can't go over
> > 3, 2K IOPS
> >
> > Hi,
> > it seems the Intel S3500 performs a lot better with o_dsync.
> >
> > crucial m550
> > ------------
> > #fio --filename=/dev/sdb --direct=1 --rw=write --bs=4k --numjobs=2
> > --group_reporting --invalidate=0 --name=ab --sync=1 bw=1249.9KB/s,
> > iops=312
> >
> > intel s3500
> > -----------
> > fio --filename=/dev/sdb --direct=1 --rw=write --bs=4k --numjobs=2
> > --group_reporting --invalidate=0 --name=ab --sync=1 #bw=41794KB/s,
> > iops=10448
> >
> > ok, so 30x faster.
> >
> >
> >
> > For the Crucial, I have tried to apply the patch from Stefan Priebe to
> > ignore flushes (as the Crucial M550 has supercaps):
> > http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-November/035707.html
> > Coming from ZFS, this sounds like "zfs_nocacheflush".
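
A side note on the flush question: before relying on a kernel patch to drop flushes, it can be worth checking what the drive itself reports about its volatile write cache, since that is what the flush requests protect. A quick check with hdparm, assuming /dev/sdb is the SSD under test:

# is a volatile write cache present and enabled on the drive?
hdparm -W /dev/sdb
hdparm -I /dev/sdb | grep -i 'write cache'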
> >
> > Now results:
> >
> > crucial m550
> > ------------
> > #fio --filename=/dev/sdb --direct=1 --rw=write --bs=4k --numjobs=2
> > --group_reporting --invalidate=0 --name=ab --sync=1 bw=177575KB/s,
> > iops=44393
> >
> >
> >
> > fio rbd crucial m550 1 osd 0.85 (osd_enable_op_tracker true or false, same 
> > result):
> > ---------------------------
> > bw=12327KB/s, iops=3081
> >
> > So not much better than before, but this time iostat shows only 15%
> > util, and latencies are lower:
> >
> > Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> > sdb 0,00 29,00 0,00 3075,00 0,00 36748,50 23,90 0,29 0,10 0,00 0,10 0,05 15,20
> >
> >
> > So, the write bottleneck seems to be in Ceph.
> >
> >
> >
> > I will send the S3500 results today.
> >
> > ----- Original Message -----
> >
> > From: "Cedric Lemarchand" <ced...@yipikai.org>
> > To: ceph-users@lists.ceph.com
> > Sent: Thursday, September 11, 2014 21:23:23
> > Subject: Re: [ceph-users] [Single OSD performance on SSD] Can't go over
> > 3, 2K IOPS
> >
> >
> > On 11/09/2014 19:33, Cedric Lemarchand wrote:
> >> On 11/09/2014 08:20, Alexandre DERUMIER wrote:
> >>> Hi Sebastien,
> >>>
> >>> here are my first results with the Crucial M550 (I'll send results with the 
> >>> Intel S3500 later):
> >>>
> >>> - 3 nodes
> >>> - Dell R620 without expander backplane
> >>> - SAS controller: LSI 9207 (no hardware RAID or cache)
> >>> - 2x E5-2603v2 1.8GHz (4 cores)
> >>> - 32GB RAM
> >>> - network: 2x gigabit LACP + 2x gigabit LACP for cluster replication
> >>>
> >>> - OS: Debian Wheezy, with kernel 3.10
> >>>
> >>> OS + ceph mon: 2x Intel S3500 100GB (Linux soft RAID); OSD: Crucial
> >>> M550 (1TB).
> >>>
> >>>
> >>> 3 mons in the ceph cluster,
> >>> and 1 OSD (journal and data on the same disk)
> >>>
> >>>
> >>> ceph.conf
> >>> ---------
> >>> debug_lockdep = 0/0
> >>> debug_context = 0/0
> >>> debug_crush = 0/0
> >>> debug_buffer = 0/0
> >>> debug_timer = 0/0
> >>> debug_filer = 0/0
> >>> debug_objecter = 0/0
> >>> debug_rados = 0/0
> >>> debug_rbd = 0/0
> >>> debug_journaler = 0/0
> >>> debug_objectcacher = 0/0
> >>> debug_client = 0/0
> >>> debug_osd = 0/0
> >>> debug_optracker = 0/0
> >>> debug_objclass = 0/0
> >>> debug_filestore = 0/0
> >>> debug_journal = 0/0
> >>> debug_ms = 0/0
> >>> debug_monc = 0/0
> >>> debug_tp = 0/0
> >>> debug_auth = 0/0
> >>> debug_finisher = 0/0
> >>> debug_heartbeatmap = 0/0
> >>> debug_perfcounter = 0/0
> >>> debug_asok = 0/0
> >>> debug_throttle = 0/0
> >>> debug_mon = 0/0
> >>> debug_paxos = 0/0
> >>> debug_rgw = 0/0
> >>> osd_op_threads = 5
> >>> filestore_op_threads = 4
> >>>
> >>> ms_nocrc = true
> >>> cephx sign messages = false
> >>> cephx require signatures = false
> >>>
> >>> ms_dispatch_throttle_bytes = 0
> >>>
> >>> #0.85
> >>> throttler_perf_counter = false
> >>> filestore_fd_cache_size = 64
> >>> filestore_fd_cache_shards = 32
> >>> osd_op_num_threads_per_shard = 1
> >>> osd_op_num_shards = 25
> >>> osd_enable_op_tracker = true
> >>>
> >>>
> >>>
> >>> Fio disk 4K benchmark
> >>> ------------------
> >>> rand read 4k : fio --filename=/dev/sdb --direct=1 --rw=randread
> >>> --bs=4k --iodepth=32 --group_reporting --invalidate=0 --name=abc
> >>> --ioengine=aio bw=271755KB/s, iops=67938
> >>>
> >>> rand write 4k : fio --filename=/dev/sdb --direct=1 --rw=randwrite
> >>> --bs=4k --iodepth=32 --group_reporting --invalidate=0 --name=abc
> >>> --ioengine=aio bw=228293KB/s, iops=57073
> >>>
> >>>
> >>>
> >>> fio osd benchmark (through librbd)
> >>> ----------------------------------
> >>> [global]
> >>> ioengine=rbd
> >>> clientname=admin
> >>> pool=test
> >>> rbdname=test
> >>> invalidate=0 # mandatory
> >>> rw=randwrite
> >>> rw=randread
> >>> bs=4k
> >>> direct=1
> >>> numjobs=4
> >>> group_reporting=1
> >>>
> >>> [rbd_iodepth32]
> >>> iodepth=32
> >>>
> >>>
> >>>
> >>> FIREFLY RESULTS
> >>> ----------------
> >>> fio randwrite : bw=5009.6KB/s, iops=1252
> >>>
> >>> fio randread: bw=37820KB/s, iops=9455
> >>>
> >>>
> >>>
> >>> O.85 RESULTS
> >>> ------------
> >>>
> >>> fio randwrite : bw=11658KB/s, iops=2914
> >>>
> >>> fio randread : bw=38642KB/s, iops=9660
> >>>
> >>>
> >>>
> >>> 0.85 + osd_enable_op_tracker=false
> >>> -----------------------------------
> >>> fio randwrite : bw=11630KB/s, iops=2907
> >>> fio randread : bw=80606KB/s, iops=20151 (cpu 100% - GREAT!)
> >>>
> >>>
> >>>
> >>> So, for read, it seems that osd_enable_op_tracker is the bottleneck.
> >>>
> >>>
> >>> Now for write, I really don't understand why it's so low.
> >>>
> >>>
> >>> I have done some iostat:
> >>>
> >>>
> >>> FIO directly on /dev/sdb
> >>> bw=228293KB/s, iops=57073
> >>>
> >>> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> >>> sdb 0,00 0,00 0,00 63613,00 0,00 254452,00 8,00 31,24 0,49 0,00 0,49 0,02 100,00
> >>>
> >>>
> >>> FIO directly on osd through librbd
> >>> bw=11658KB/s, iops=2914
> >>>
> >>> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> >>> sdb 0,00 355,00 0,00 5225,00 0,00 29678,00 11,36 57,63 11,03 0,00 11,03 0,19 99,70
> >>>
> >>>
> >>> (I don't understand exactly what %util means; it is ~100% in both cases,
> >>> even though Ceph is 10x slower.)
> >> It would be interesting if you could capture the size of the writes on the SSD
> >> during the bench through librbd (I know nmon can do that).
> > Replying to myself ... I asked a bit too quickly, since we already have
> > this information (29678 / 5225 = 5.68 KB), but this is irrelevant.
> >
> > Cheers
> >
> >>> It could be a dsync problem; the results seem pretty poor:
> >>>
> >>> # dd if=rand.file of=/dev/sdb bs=4k count=65536 oflag=direct
> >>> 65536+0 records in
> >>> 65536+0 records out
> >>> 268435456 bytes (268 MB) copied, 2.77433 s, 96.8 MB/s
> >>>
> >>>
> >>> # dd if=rand.file of=/dev/sdb bs=4k count=65536 oflag=dsync,direct
> >>> ^C17228+0 records in
> >>> 17228+0 records out
> >>> 70565888 bytes (71 MB) copied, 70.4098 s, 1.0 MB/s
> >>>
> >>>
> >>>
> >>> I'll do tests with intel s3500 tomorrow to compare
> >>>
> >>> ----- Original Message -----
> >>>
> >>> From: "Sebastien Han" <sebastien....@enovance.com>
> >>> To: "Warren Wang" <warren_w...@cable.comcast.com>
> >>> Cc: ceph-users@lists.ceph.com
> >>> Sent: Monday, September 8, 2014 22:58:25
> >>> Subject: Re: [ceph-users] [Single OSD performance on SSD] Can't go
> >>> over 3, 2K IOPS
> >>>
> >>> They definitely are Warren!
> >>>
> >>> Thanks for bringing this here :).
> >>>
> >>> On 05 Sep 2014, at 23:02, Wang, Warren <warren_w...@cable.comcast.com> 
> >>> wrote:
> >>>
> >>>> +1 to what Cedric said.
> >>>>
> >>>> Anything more than a few minutes of heavy sustained writes tended to get 
> >>>> our solid state devices into a state where garbage collection could not 
> >>>> keep up. Originally we used small SSDs and did not overprovision the 
> >>>> journals by much. Manufacturers publish their SSD stats, and then in 
> >>>> very small font, state that the attained IOPS are with empty drives, and 
> >>>> the tests are only run for very short amounts of time. Even if the 
> >>>> drives are new, it's a good idea to perform an hdparm secure erase on 
> >>>> them (so that the SSD knows that the blocks are truly unused), and then 
> >>>> overprovision them. You'll know if you have a problem by watching for 
> >>>> utilization and wait data on the journals.
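
For reference, a rough sketch of the secure-erase plus overprovisioning steps with hdparm and parted; the device name, temporary password, and the 80% figure are placeholders, the drive must not be in the "frozen" security state, and the erase destroys all data on it:

# check the security state; if "frozen", a suspend/resume cycle usually clears it
hdparm -I /dev/sdX | grep -i frozen
# set a temporary password, then issue the ATA secure erase
hdparm --user-master u --security-set-pass tmppass /dev/sdX
hdparm --user-master u --security-erase tmppass /dev/sdX
# overprovision by partitioning only part of the drive, e.g. ~80%
parted -s /dev/sdX mklabel gpt mkpart journal 0% 80%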
> >>>>
> >>>> One of the other interesting performance issues is that the Intel 10GbE 
> >>>> NICs + default kernel that we typically use max out around 1 million 
> >>>> packets/sec. It's worth tracking this metric to see if you are close.
> >>>>
> >>>> I know these aren't necessarily relevant to the test parameters you gave 
> >>>> below, but they're worth keeping in mind.
> >>>>
> >>>> --
> >>>> Warren Wang
> >>>> Comcast Cloud (OpenStack)
> >>>>
> >>>>
> >>>> From: Cedric Lemarchand <ced...@yipikai.org>
> >>>> Date: Wednesday, September 3, 2014 at 5:14 PM
> >>>> To: "ceph-users@lists.ceph.com" <ceph-users@lists.ceph.com>
> >>>> Subject: Re: [ceph-users] [Single OSD performance on SSD] Can't go
> >>>> over 3, 2K IOPS
> >>>>
> >>>>
> >>>>> On 03/09/2014 22:11, Sebastien Han wrote:
> >>>>> Hi Warren,
> >>>>>
> >>>>> What do you mean exactly by secure erase? At the firmware level with 
> >>>>> the manufacturer's software?
> >>>>> The SSDs were pretty new, so I don't think we hit that sort of thing. I 
> >>>>> believe that only aged SSDs have this behaviour, but I might be wrong.
> >>>>>
> >>>> Sorry, I forgot to reply to the real question ;-) So yes, it only comes
> >>>> into play after some time; in your case, if the SSD still delivers the write 
> >>>> IOPS specified by the manufacturer, it won't help in any way.
> >>>>
> >>>> But it seems this practice is nowadays increasingly used.
> >>>>
> >>>> Cheers
> >>>>> On 02 Sep 2014, at 18:23, Wang, Warren
> >>>>> <warren_w...@cable.comcast.com>
> >>>>> wrote:
> >>>>>
> >>>>>
> >>>>>> Hi Sebastien,
> >>>>>>
> >>>>>> Something I didn't see in the thread so far: did you secure erase the 
> >>>>>> SSDs before they got used? I assume these were probably repurposed for 
> >>>>>> this test. We have seen some pretty significant garbage collection 
> >>>>>> issues on various SSDs and other forms of solid state storage to the 
> >>>>>> point where we are overprovisioning pretty much every solid state 
> >>>>>> device now. By as much as 50% to handle sustained write operations. 
> >>>>>> Especially important for the journals, as we've found.
> >>>>>>
> >>>>>> Maybe not an issue on the short fio run below, but certainly evident 
> >>>>>> on longer runs or lots of historical data on the drives. The max 
> >>>>>> transaction time looks pretty good for your test. Something to 
> >>>>>> consider though.
> >>>>>>
> >>>>>> Warren
> >>>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: ceph-users [
> >>>>>> mailto:ceph-users-boun...@lists.ceph.com
> >>>>>> ] On Behalf Of Sebastien Han
> >>>>>> Sent: Thursday, August 28, 2014 12:12 PM
> >>>>>> To: ceph-users
> >>>>>> Cc: Mark Nelson
> >>>>>> Subject: [ceph-users] [Single OSD performance on SSD] Can't go
> >>>>>> over 3, 2K IOPS
> >>>>>>
> >>>>>> Hey all,
> >>>>>>
> >>>>>> It has been a while since the last performance-related thread on the 
> >>>>>> ML :p I've been running some experiments to see how much I can get from 
> >>>>>> an SSD on a Ceph cluster.
> >>>>>> To achieve that I did something pretty simple:
> >>>>>>
> >>>>>> * Debian wheezy 7.6
> >>>>>> * kernel from debian 3.14-0.bpo.2-amd64
> >>>>>> * 1 cluster, 3 mons (I'd like to keep this realistic since in a
> >>>>>> real deployment I'll use 3)
> >>>>>> * 1 OSD backed by an SSD (journal and osd data on the same
> >>>>>> device)
> >>>>>> * 1 replica count of 1
> >>>>>> * partitions are perfectly aligned
> >>>>>> * io scheduler is set to noop but deadline was showing the same
> >>>>>> results
> >>>>>> * no updatedb running
> >>>>>>
> >>>>>> About the box:
> >>>>>>
> >>>>>> * 32GB of RAM
> >>>>>> * 12 cores with HT @ 2,4 GHz
> >>>>>> * WB cache is enabled on the controller
> >>>>>> * 10Gbps network (doesn't help here)
> >>>>>>
> >>>>>> The SSD is a 200GB Intel DC S3700 and is capable of delivering around 
> >>>>>> 29K IOPS with random 4k writes (my fio results). As a benchmark tool I 
> >>>>>> used fio with the rbd engine (thanks Deutsche Telekom guys!).
> >>>>>>
> >>>>>> O_DIRECT and D_SYNC don't seem to be a problem for the SSD:
> >>>>>>
> >>>>>> # dd if=/dev/urandom of=rand.file bs=4k count=65536
> >>>>>> 65536+0 records in
> >>>>>> 65536+0 records out
> >>>>>> 268435456 bytes (268 MB) copied, 29.5477 s, 9.1 MB/s
> >>>>>>
> >>>>>> # du -sh rand.file
> >>>>>> 256M rand.file
> >>>>>>
> >>>>>> # dd if=rand.file of=/dev/sdo bs=4k count=65536
> >>>>>> oflag=dsync,direct
> >>>>>> 65536+0 records in
> >>>>>> 65536+0 records out
> >>>>>> 268435456 bytes (268 MB) copied, 2.73628 s, 98.1 MB/s
> >>>>>>
> >>>>>> See my ceph.conf:
> >>>>>>
> >>>>>> [global]
> >>>>>> auth cluster required = cephx
> >>>>>> auth service required = cephx
> >>>>>> auth client required = cephx
> >>>>>> fsid = 857b8609-8c9b-499e-9161-2ea67ba51c97
> >>>>>> osd pool default pg num = 4096
> >>>>>> osd pool default pgp num = 4096
> >>>>>> osd pool default size = 2
> >>>>>> osd crush chooseleaf type = 0
> >>>>>>
> >>>>>> debug lockdep = 0/0
> >>>>>> debug context = 0/0
> >>>>>> debug crush = 0/0
> >>>>>> debug buffer = 0/0
> >>>>>> debug timer = 0/0
> >>>>>> debug journaler = 0/0
> >>>>>> debug osd = 0/0
> >>>>>> debug optracker = 0/0
> >>>>>> debug objclass = 0/0
> >>>>>> debug filestore = 0/0
> >>>>>> debug journal = 0/0
> >>>>>> debug ms = 0/0
> >>>>>> debug monc = 0/0
> >>>>>> debug tp = 0/0
> >>>>>> debug auth = 0/0
> >>>>>> debug finisher = 0/0
> >>>>>> debug heartbeatmap = 0/0
> >>>>>> debug perfcounter = 0/0
> >>>>>> debug asok = 0/0
> >>>>>> debug throttle = 0/0
> >>>>>>
> >>>>>> [mon]
> >>>>>> mon osd down out interval = 600
> >>>>>> mon osd min down reporters = 13
> >>>>>> [mon.ceph-01]
> >>>>>> host = ceph-01
> >>>>>> mon addr = 172.20.20.171
> >>>>>> [mon.ceph-02]
> >>>>>> host = ceph-02
> >>>>>> mon addr = 172.20.20.172
> >>>>>> [mon.ceph-03]
> >>>>>> host = ceph-03
> >>>>>> mon addr = 172.20.20.173
> >>>>>>
> >>>>>> debug lockdep = 0/0
> >>>>>> debug context = 0/0
> >>>>>> debug crush = 0/0
> >>>>>> debug buffer = 0/0
> >>>>>> debug timer = 0/0
> >>>>>> debug journaler = 0/0
> >>>>>> debug osd = 0/0
> >>>>>> debug optracker = 0/0
> >>>>>> debug objclass = 0/0
> >>>>>> debug filestore = 0/0
> >>>>>> debug journal = 0/0
> >>>>>> debug ms = 0/0
> >>>>>> debug monc = 0/0
> >>>>>> debug tp = 0/0
> >>>>>> debug auth = 0/0
> >>>>>> debug finisher = 0/0
> >>>>>> debug heartbeatmap = 0/0
> >>>>>> debug perfcounter = 0/0
> >>>>>> debug asok = 0/0
> >>>>>> debug throttle = 0/0
> >>>>>>
> >>>>>> [osd]
> >>>>>> osd mkfs type = xfs
> >>>>>> osd mkfs options xfs = -f -i size=2048
> >>>>>> osd mount options xfs = rw,noatime,logbsize=256k,delaylog
> >>>>>> osd journal size = 20480
> >>>>>> cluster_network = 172.20.20.0/24
> >>>>>> public_network = 172.20.20.0/24
> >>>>>> osd mon heartbeat interval = 30
> >>>>>> # Performance tuning
> >>>>>> filestore merge threshold = 40
> >>>>>> filestore split multiple = 8
> >>>>>> osd op threads = 8
> >>>>>> # Recovery tuning
> >>>>>> osd recovery max active = 1
> >>>>>> osd max backfills = 1
> >>>>>> osd recovery op priority = 1
> >>>>>>
> >>>>>>
> >>>>>> debug lockdep = 0/0
> >>>>>> debug context = 0/0
> >>>>>> debug crush = 0/0
> >>>>>> debug buffer = 0/0
> >>>>>> debug timer = 0/0
> >>>>>> debug journaler = 0/0
> >>>>>> debug osd = 0/0
> >>>>>> debug optracker = 0/0
> >>>>>> debug objclass = 0/0
> >>>>>> debug filestore = 0/0
> >>>>>> debug journal = 0/0
> >>>>>> debug ms = 0/0
> >>>>>> debug monc = 0/0
> >>>>>> debug tp = 0/0
> >>>>>> debug auth = 0/0
> >>>>>> debug finisher = 0/0
> >>>>>> debug heartbeatmap = 0/0
> >>>>>> debug perfcounter = 0/0
> >>>>>> debug asok = 0/0
> >>>>>> debug throttle = 0/0
> >>>>>>
> >>>>>> Disabling all debugging gained me 200/300 more IOPS.
> >>>>>>
> >>>>>> See my fio template:
> >>>>>>
> >>>>>> [global]
> >>>>>> #logging
> >>>>>> #write_iops_log=write_iops_log
> >>>>>> #write_bw_log=write_bw_log
> >>>>>> #write_lat_log=write_lat_lo
> >>>>>>
> >>>>>> time_based
> >>>>>> runtime=60
> >>>>>>
> >>>>>> ioengine=rbd
> >>>>>> clientname=admin
> >>>>>> pool=test
> >>>>>> rbdname=fio
> >>>>>> invalidate=0 # mandatory
> >>>>>> #rw=randwrite
> >>>>>> rw=write
> >>>>>> bs=4k
> >>>>>> #bs=32m
> >>>>>> size=5G
> >>>>>> group_reporting
> >>>>>>
> >>>>>> [rbd_iodepth32]
> >>>>>> iodepth=32
> >>>>>> direct=1
> >>>>>>
> >>>>>> See my fio output:
> >>>>>>
> >>>>>> rbd_iodepth32: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K,
> >>>>>> ioengine=rbd, iodepth=32 fio-2.1.11-14-gb74e Starting 1 process
> >>>>>> rbd engine: RBD version: 0.1.8
> >>>>>> Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/12876KB/0KB /s]
> >>>>>> [0/3219/0 iops] [eta 00m:00s]
> >>>>>> rbd_iodepth32: (groupid=0, jobs=1): err= 0: pid=32116: Thu Aug 28
> >>>>>> 00:28:26 2014
> >>>>>> write: io=771448KB, bw=12855KB/s, iops=3213, runt= 60010msec slat
> >>>>>> (usec): min=42, max=1578, avg=66.50, stdev=16.96 clat (msec):
> >>>>>> min=1, max=28, avg= 9.85, stdev= 1.48 lat (msec): min=1, max=28,
> >>>>>> avg= 9.92, stdev= 1.47 clat percentiles (usec):
> >>>>>> | 1.00th=[ 6368], 5.00th=[ 8256], 10.00th=[ 8640], 20.00th=[
> >>>>>> | 9152], 30.00th=[ 9408], 40.00th=[ 9664], 50.00th=[ 9792],
> >>>>>> | 60.00th=[10048], 70.00th=[10176], 80.00th=[10560],
> >>>>>> | 90.00th=[10944], 95.00th=[11456], 99.00th=[13120],
> >>>>>> | 99.50th=[16768], 99.90th=[25984], 99.95th=[27008],
> >>>>>> | 99.99th=[28032]
> >>>>>> bw (KB /s): min=11864, max=13808, per=100.00%, avg=12864.36,
> >>>>>> stdev=407.35 lat (msec) : 2=0.03%, 4=0.54%, 10=59.79%, 20=39.24%,
> >>>>>> 50=0.41% cpu : usr=19.15%, sys=4.69%, ctx=326309, majf=0,
> >>>>>> minf=426088 IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=33.9%,
> >>>>>> 32=66.1%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%,
> >>>>>> 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=99.6%, 8=0.4%,
> >>>>>> 16=0.1%, 32=0.1%, 64=0.0%, >=64=0.0% issued :
> >>>>>> total=r=0/w=192862/d=0, short=r=0/w=0/d=0 latency : target=0,
> >>>>>> window=0, percentile=100.00%, depth=32
> >>>>>>
> >>>>>> Run status group 0 (all jobs):
> >>>>>> WRITE: io=771448KB, aggrb=12855KB/s, minb=12855KB/s,
> >>>>>> maxb=12855KB/s, mint=60010msec, maxt=60010msec
> >>>>>>
> >>>>>> Disk stats (read/write):
> >>>>>> dm-1: ios=0/49, merge=0/0, ticks=0/12, in_queue=12, util=0.01%,
> >>>>>> aggrios=0/22, aggrmerge=0/27, aggrticks=0/12, aggrin_queue=12,
> >>>>>> aggrutil=0.01%
> >>>>>> sda: ios=0/22, merge=0/27, ticks=0/12, in_queue=12, util=0.01%
> >>>>>>
> >>>>>> I tried to tweak several parameters like:
> >>>>>>
> >>>>>> filestore_wbthrottle_xfs_ios_start_flusher = 10000
> >>>>>> filestore_wbthrottle_xfs_ios_hard_limit = 10000
> >>>>>> filestore_wbthrottle_btrfs_ios_start_flusher = 10000
> >>>>>> filestore_wbthrottle_btrfs_ios_hard_limit = 10000
> >>>>>> filestore queue max ops = 2000
> >>>>>>
> >>>>>> But that didn't bring any improvement.
> >>>>>>
> >>>>>> Then I tried other things:
> >>>>>>
> >>>>>> * Increasing the io_depth up to 256 or 512 gave me between 50 to 100 
> >>>>>> more IOPS but it's not a realistic workload anymore and not that 
> >>>>>> significant.
> >>>>>> * adding another SSD for the journal, still getting 3,2K IOPS
> >>>>>> * I tried with rbd bench and I also got 3K IOPS
> >>>>>> * I ran the test on a client machine and then locally on the
> >>>>>> server, still getting 3,2K IOPS
> >>>>>> * put the journal in memory, still getting 3,2K IOPS
> >>>>>> * with 2 clients running the test in parallel I got a total of
> >>>>>> 3,6K IOPS but I don't seem to be able to go over
> >>>>>> * I tried to add another OSD to that SSD, so I had 2 OSDs and 2 
> >>>>>> journals on 1 SSD, and got 4,5K IOPS, YAY!
> >>>>>>
> >>>>>> Given the results of the last time it seems that something is limiting 
> >>>>>> the number of IOPS per OSD process.
> >>>>>>
> >>>>>> Running the test on a client or locally didn't show any difference.
> >>>>>> So it looks to me that there is some contention within Ceph that might 
> >>>>>> cause this.
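
A quick way to see whether a single OSD is CPU-bound rather than disk-bound, assuming one ceph-osd process on the box and the sysstat and perf tools installed:

# per-thread CPU usage of the OSD while fio is running
pidstat -t -p $(pidof ceph-osd) 1
# where the cycles go inside the OSD process
perf top -p $(pidof ceph-osd)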
> >>>>>>
> >>>>>> I also ran perf and looked at the output, everything looks decent, but 
> >>>>>> someone might want to have a look at it :).
> >>>>>>
> >>>>>> We have been able to reproduce this on 3 distinct platforms with some 
> >>>>>> deviations (because of the hardware) but the behaviour is the same.
> >>>>>> Any thoughts would be highly appreciated; only getting 3,2K out of a 
> >>>>>> 29K IOPS SSD is a bit frustrating :).
> >>>>>>
> >>>>>> Cheers.
> >>>>>> ----
> >>>>>> Sébastien Han
> >>>>>> Cloud Architect
> >>>>>>
> >>>>>> "Always give 100%. Unless you're giving blood."
> >>>>>>
> >>>>>> Phone: +33 (0)1 49 70 99 72
> >>>>>> Mail:
> >>>>>> sebastien....@enovance.com
> >>>>>>
> >>>>>> Address : 11 bis, rue Roquépine - 75008 Paris Web :
> >>>>>> www.enovance.com
> >>>>>> - Twitter : @enovance
> >>>>>>
> >>>>>>
> >>>>> Cheers.
> >>>>> ----
> >>>>> Sébastien Han
> >>>>> Cloud Architect
> >>>>>
> >>>>> "Always give 100%. Unless you're giving blood."
> >>>>>
> >>>>> Phone: +33 (0)1 49 70 99 72
> >>>>> Mail:
> >>>>> sebastien....@enovance.com
> >>>>>
> >>>>> Address : 11 bis, rue Roquépine - 75008 Paris Web :
> >>>>> www.enovance.com
> >>>>> - Twitter : @enovance
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>> --
> >>>> Cédric
> >>>>
> >>> Cheers.
> >>> ----
> >>> Sébastien Han
> >>> Cloud Architect
> >>>
> >>> "Always give 100%. Unless you're giving blood."
> >>>
> >>> Phone: +33 (0)1 49 70 99 72
> >>> Mail: sebastien....@enovance.com
> >>> Address : 11 bis, rue Roquépine - 75008 Paris Web : www.enovance.com
> >>> - Twitter : @enovance
> >>>
> >>>
> >
> > --
> > Cédric
> >
> 
> 
> Cheers.
> ----
> Sébastien Han
> Cloud Architect
> 
> "Always give 100%. Unless you're giving blood."
> 
> Phone: +33 (0)1 49 70 99 72
> Mail: sebastien....@enovance.com
> Address : 11 bis, rue Roquépine - 75008 Paris Web : www.enovance.com - 
> Twitter : @enovance

Attachment: FJ-20140915-Best-Practice_Distributed-Intelligent-Storage_NVMe-SSD_fast-IC_v8_versions,ksp.pdf
Description: Adobe PDF document

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
