>>I'll run benchmarks with 6 DC3500 OSDs tomorrow to compare Firefly and Giant.
Here are the results (big Giant improvements).
3 nodes with 2 OSDs each, replication x1
network: 2 gigabit links with LACP for nodes and client
firefly: no tuning
---------------------
bw=45880KB/s, iops=11469
firefly with tuning:
--------------------
debug lockdep = 0/0
debug context = 0/0
debug crush = 0/0
debug buffer = 0/0
debug timer = 0/0
debug journaler = 0/0
debug osd = 0/0
debug optracker = 0/0
debug objclass = 0/0
debug filestore = 0/0
debug journal = 0/0
debug ms = 0/0
debug monc = 0/0
debug tp = 0/0
debug auth = 0/0
debug finisher = 0/0
debug heartbeatmap = 0/0
debug perfcounter = 0/0
debug asok = 0/0
debug throttle = 0/0
osd_op_threads = 5
filestore_op_threads = 4
bw=62094KB/s, iops=15523
giant with same tuning
-----------------------
bw=247073KB/s, iops=61768 !
I think I could reach more, but my 2 gigabit links are saturated.
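As a rough sanity check on that saturation claim, the reported bandwidth converts to line rate like this (an illustrative sketch; I assume fio's KB/s means KiB/s):

```python
# Convert fio-reported bandwidth (KiB/s) to Gbit/s to compare against
# the 2 x 1 gigabit LACP links.
def kib_s_to_gbit_s(bw_kib_s: float) -> float:
    return bw_kib_s * 1024 * 8 / 1e9

print(round(kib_s_to_gbit_s(247073), 2))  # giant:   ~2.02 Gbit/s, right at the 2 Gbit limit
print(round(kib_s_to_gbit_s(62094), 2))   # firefly: ~0.51 Gbit/s, nowhere near the link limit
```

So the Giant result is indeed network-bound rather than a ceiling of the OSDs.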
----- Original Message -----
From: "Alexandre DERUMIER" <[email protected]>
To: "Jian Zhang" <[email protected]>
Cc: [email protected]
Sent: Thursday, September 18, 2014 15:36:48
Subject: Re: [ceph-users] [Single OSD performance on SSD] Can't go over 3.2K IOPS
>>Has anyone ever tested multi-volume performance on a *FULL* SSD setup?
I know that Stefan Priebe runs full SSD clusters in production and has done
benchmarks. (As far as I remember, he benched around 20K peak with Dumpling.)
>>We are able to get ~18K IOPS for 4K random read on a single volume with fio
>>(with the rbd engine) on a 12x DC3700 setup, but only able to get ~23K (peak)
>>IOPS even with multiple volumes.
>>It seems the maximum random write performance we can get on the entire
>>cluster is quite close to single-volume performance.
Firefly or Giant?
I'll run benchmarks with 6 DC3500 OSDs tomorrow to compare Firefly and Giant.
----- Original Message -----
From: "Jian Zhang" <[email protected]>
To: "Sebastien Han" <[email protected]>, "Alexandre DERUMIER" <[email protected]>
Cc: [email protected]
Sent: Thursday, September 18, 2014 08:12:32
Subject: RE: [ceph-users] [Single OSD performance on SSD] Can't go over 3.2K IOPS
Has anyone ever tested multi-volume performance on a *FULL* SSD setup?
We are able to get ~18K IOPS for 4K random read on a single volume with fio
(with the rbd engine) on a 12x DC3700 setup, but only able to get ~23K (peak)
IOPS even with multiple volumes.
It seems the maximum random write performance we can get on the entire cluster
is quite close to single-volume performance.
Thanks
Jian
-----Original Message-----
From: ceph-users [mailto:[email protected]] On Behalf Of
Sebastien Han
Sent: Tuesday, September 16, 2014 9:33 PM
To: Alexandre DERUMIER
Cc: [email protected]
Subject: Re: [ceph-users] [Single OSD performance on SSD] Can't go over 3.2K IOPS
Hi,
Thanks for keeping us updated on this subject.
dsync is definitely killing the SSD.
I don't have much to add; I'm just surprised that you're only getting 5299 with
0.85, since I've been able to get 6.4K. Then again, I was using the 200GB
model, which might explain this.
On 12 Sep 2014, at 16:32, Alexandre DERUMIER <[email protected]> wrote:
> Here are the results for the Intel S3500.
> ------------------------------------
> Max performance is with ceph 0.85 + optracker disabled.
> The Intel S3500 doesn't have the d_sync problem the Crucial has.
>
> %util shows almost 100% for read and write, so maybe the SSD disk
> performance is the limit.
>
> I have some STEC ZeusRAM 8GB in stock (I used them for ZFS ZIL); I'll try to
> bench them next week.
>
>
>
>
>
>
> INTEL s3500
> -----------
> raw disk
> --------
>
> randread: fio --filename=/dev/sdb --direct=1 --rw=randread --bs=4k --iodepth=32 --group_reporting --invalidate=0 --name=abc --ioengine=aio
> bw=288207KB/s, iops=72051
>
> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> sdb 0,00 0,00 73454,00 0,00 293816,00 0,00 8,00 30,96 0,42 0,42 0,00 0,01 99,90
>
> randwrite: fio --filename=/dev/sdb --direct=1 --rw=randwrite --bs=4k --iodepth=32 --group_reporting --invalidate=0 --name=abc --ioengine=aio --sync=1
> bw=48131KB/s, iops=12032
> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> sdb 0,00 0,00 0,00 24120,00 0,00 48240,00 4,00 2,08 0,09 0,00 0,09 0,04 100,00
>
>
> ceph 0.80
> ---------
> randread: no tuning: bw=24578KB/s, iops=6144
>
>
> randwrite: bw=10358KB/s, iops=2589
> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> sdb 0,00 373,00 0,00 8878,00 0,00 34012,50 7,66 1,63 0,18 0,00 0,18 0,06 50,90
>
>
> ceph 0.85 :
> ---------
>
> randread : bw=41406KB/s, iops=10351
> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> sdb 2,00 0,00 10425,00 0,00 41816,00 0,00 8,02 1,36 0,13 0,13 0,00 0,07 75,90
>
> randwrite : bw=17204KB/s, iops=4301
>
> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> sdb 0,00 333,00 0,00 9788,00 0,00 57909,00 11,83 1,46 0,15 0,00 0,15 0,07 67,80
>
>
> ceph 0.85 tuning op_tracker=false
> ----------------
>
> randread : bw=86537KB/s, iops=21634
> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> sdb 25,00 0,00 21428,00 0,00 86444,00 0,00 8,07 3,13 0,15 0,15 0,00 0,05 98,00
>
> randwrite: bw=21199KB/s, iops=5299
> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> sdb 0,00 1563,00 0,00 9880,00 0,00 75223,50 15,23 2,09 0,21 0,00 0,21 0,07 80,00
>
>
> ----- Original Message -----
>
> From: "Alexandre DERUMIER" <[email protected]>
> To: "Cedric Lemarchand" <[email protected]>
> Cc: [email protected]
> Sent: Friday, September 12, 2014 08:15:08
> Subject: Re: [ceph-users] [Single OSD performance on SSD] Can't go over 3.2K IOPS
>
> results of fio on rbd with kernel patch
>
>
>
> fio rbd crucial m550, 1 osd, 0.85 (osd_enable_op_tracker true or false, same
> result):
> ---------------------------
> bw=12327KB/s, iops=3081
>
> So not much better than before, but this time iostat shows only 15%
> util, and latencies are lower.
>
> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> sdb 0,00 29,00 0,00 3075,00 0,00 36748,50 23,90 0,29 0,10 0,00 0,10 0,05 15,20
>
>
> So, the write bottleneck seems to be in ceph.
>
>
>
> I will send the S3500 results today.
>
> ----- Original Message -----
>
> From: "Alexandre DERUMIER" <[email protected]>
> To: "Cedric Lemarchand" <[email protected]>
> Cc: [email protected]
> Sent: Friday, September 12, 2014 07:58:05
> Subject: Re: [ceph-users] [Single OSD performance on SSD] Can't go over 3.2K IOPS
>
>>> For crucial, I'll try to apply the patch from stefan priebe, to
>>> ignore flushes (as crucial m550 have supercaps)
>>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-November/035707.html
> Here are the results with cache flush disabled:
>
> crucial m550
> ------------
> #fio --filename=/dev/sdb --direct=1 --rw=write --bs=4k --numjobs=2 --group_reporting --invalidate=0 --name=ab --sync=1
> bw=177575KB/s, iops=44393
>
>
> ----- Original Message -----
>
> From: "Alexandre DERUMIER" <[email protected]>
> To: "Cedric Lemarchand" <[email protected]>
> Cc: [email protected]
> Sent: Friday, September 12, 2014 04:55:21
> Subject: Re: [ceph-users] [Single OSD performance on SSD] Can't go over 3.2K IOPS
>
> Hi,
> It seems the Intel S3500 performs a lot better with O_DSYNC.
>
> crucial m550
> ------------
> #fio --filename=/dev/sdb --direct=1 --rw=write --bs=4k --numjobs=2 --group_reporting --invalidate=0 --name=ab --sync=1
> bw=1249.9KB/s, iops=312
>
> intel s3500
> -----------
> fio --filename=/dev/sdb --direct=1 --rw=write --bs=4k --numjobs=2 --group_reporting --invalidate=0 --name=ab --sync=1
> bw=41794KB/s, iops=10448
>
> ok, so 30x faster.
>
>
>
> For the Crucial, I tried to apply the patch from Stefan Priebe to
> ignore flushes (as the Crucial M550 has supercaps):
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-November/035707.html
> Coming from ZFS, this sounds like "zfs_nocacheflush".
>
> Now results:
>
> crucial m550
> ------------
> #fio --filename=/dev/sdb --direct=1 --rw=write --bs=4k --numjobs=2 --group_reporting --invalidate=0 --name=ab --sync=1
> bw=177575KB/s, iops=44393
>
>
>
> fio rbd crucial m550, 1 osd, 0.85 (osd_enable_op_tracker true or false, same
> result):
> ---------------------------
> bw=12327KB/s, iops=3081
>
> So not much better than before, but this time iostat shows only 15%
> util, and latencies are lower.
>
> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> sdb 0,00 29,00 0,00 3075,00 0,00 36748,50 23,90 0,29 0,10 0,00 0,10 0,05 15,20
>
>
> So, the write bottleneck seems to be in ceph.
>
>
>
> I will send the S3500 results today.
>
> ----- Original Message -----
>
> From: "Cedric Lemarchand" <[email protected]>
> To: [email protected]
> Sent: Thursday, September 11, 2014 21:23:23
> Subject: Re: [ceph-users] [Single OSD performance on SSD] Can't go over 3.2K IOPS
>
>
> On 11/09/2014 19:33, Cedric Lemarchand wrote:
>> On 11/09/2014 08:20, Alexandre DERUMIER wrote:
>>> Hi Sebastien,
>>>
>>> Here are my first results with the Crucial M550 (I'll send results with the
>>> Intel S3500 later):
>>>
>>> - 3 nodes
>>> - dell r620 without expander backplane
>>> - sas controller : lsi LSI 9207 (no hardware raid or cache)
>>> - 2 x E5-2603v2 1.8GHz (4cores)
>>> - 32GB ram
>>> - network : 2xgigabit link lacp + 2xgigabit lacp for cluster replication.
>>>
>>> -os : debian wheezy, with kernel 3.10
>>>
>>> os + ceph mon: 2x Intel S3500 100GB (Linux soft RAID)
>>> osd: Crucial M550 (1TB)
>>>
>>>
>>> 3 mons in the ceph cluster,
>>> and 1 osd (journal and data on the same disk)
>>>
>>>
>>> ceph.conf
>>> ---------
>>> debug_lockdep = 0/0
>>> debug_context = 0/0
>>> debug_crush = 0/0
>>> debug_buffer = 0/0
>>> debug_timer = 0/0
>>> debug_filer = 0/0
>>> debug_objecter = 0/0
>>> debug_rados = 0/0
>>> debug_rbd = 0/0
>>> debug_journaler = 0/0
>>> debug_objectcacher = 0/0
>>> debug_client = 0/0
>>> debug_osd = 0/0
>>> debug_optracker = 0/0
>>> debug_objclass = 0/0
>>> debug_filestore = 0/0
>>> debug_journal = 0/0
>>> debug_ms = 0/0
>>> debug_monc = 0/0
>>> debug_tp = 0/0
>>> debug_auth = 0/0
>>> debug_finisher = 0/0
>>> debug_heartbeatmap = 0/0
>>> debug_perfcounter = 0/0
>>> debug_asok = 0/0
>>> debug_throttle = 0/0
>>> debug_mon = 0/0
>>> debug_paxos = 0/0
>>> debug_rgw = 0/0
>>> osd_op_threads = 5
>>> filestore_op_threads = 4
>>>
>>> ms_nocrc = true
>>> cephx sign messages = false
>>> cephx require signatures = false
>>>
>>> ms_dispatch_throttle_bytes = 0
>>>
>>> #0.85
>>> throttler_perf_counter = false
>>> filestore_fd_cache_size = 64
>>> filestore_fd_cache_shards = 32
>>> osd_op_num_threads_per_shard = 1
>>> osd_op_num_shards = 25
>>> osd_enable_op_tracker = true
>>>
>>>
>>>
>>> Fio disk 4K benchmark
>>> ------------------
>>> rand read 4k : fio --filename=/dev/sdb --direct=1 --rw=randread --bs=4k --iodepth=32 --group_reporting --invalidate=0 --name=abc --ioengine=aio
>>> bw=271755KB/s, iops=67938
>>>
>>> rand write 4k : fio --filename=/dev/sdb --direct=1 --rw=randwrite --bs=4k --iodepth=32 --group_reporting --invalidate=0 --name=abc --ioengine=aio
>>> bw=228293KB/s, iops=57073
>>>
>>>
>>>
>>> fio osd benchmark (through librbd)
>>> ----------------------------------
>>> [global]
>>> ioengine=rbd
>>> clientname=admin
>>> pool=test
>>> rbdname=test
>>> invalidate=0 # mandatory
>>> rw=randwrite
>>> rw=randread
>>> bs=4k
>>> direct=1
>>> numjobs=4
>>> group_reporting=1
>>>
>>> [rbd_iodepth32]
>>> iodepth=32
>>>
>>>
>>>
>>> FIREFLY RESULTS
>>> ----------------
>>> fio randwrite : bw=5009.6KB/s, iops=1252
>>>
>>> fio randread: bw=37820KB/s, iops=9455
>>>
>>>
>>>
>>> O.85 RESULTS
>>> ------------
>>>
>>> fio randwrite : bw=11658KB/s, iops=2914
>>>
>>> fio randread : bw=38642KB/s, iops=9660
>>>
>>>
>>>
>>> 0.85 + osd_enable_op_tracker=false
>>> -----------------------------------
>>> fio randwrite : bw=11630KB/s, iops=2907
>>> fio randread : bw=80606KB/s, iops=20151 (cpu 100% - GREAT!)
>>>
>>>
>>>
>>> So, for read, it seems that osd_enable_op_tracker is the bottleneck.
>>>
>>>
>>> Now for write, I really don't understand why it's so low.
>>>
>>>
>>> I have done some iostat:
>>>
>>>
>>> FIO directly on /dev/sdb
>>> bw=228293KB/s, iops=57073
>>>
>>> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
>>> sdb 0,00 0,00 0,00 63613,00 0,00 254452,00 8,00 31,24 0,49 0,00 0,49 0,02 100,00
>>>
>>>
>>> FIO directly on osd through librbd
>>> bw=11658KB/s, iops=2914
>>>
>>> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
>>> sdb 0,00 355,00 0,00 5225,00 0,00 29678,00 11,36 57,63 11,03 0,00 11,03 0,19 99,70
>>>
>>>
>>> (I don't understand exactly what %util means; it's ~100% in both cases,
>>> even though ceph is 10x slower.)
>> It would be interesting if you could capture the size of the writes on the
>> SSD during the librbd bench (I know nmon can do that).
> Replying to myself... I asked a bit quickly, since we already have this
> information (29678 / 5225 = 5.68 KB), but it is irrelevant.
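Spelling out that back-of-the-envelope computation (a small illustrative sketch; the numbers come from the iostat lines quoted above):

```python
# Average write size hitting the SSD = wkB/s divided by w/s (iostat columns).
def avg_write_kb(wkb_per_s: float, writes_per_s: float) -> float:
    return wkb_per_s / writes_per_s

print(round(avg_write_kb(254452, 63613), 2))  # fio directly on /dev/sdb: ~4 KB, matches bs=4k
print(round(avg_write_kb(29678, 5225), 2))    # fio through librbd: ~5.68 KB (plausibly filestore/journal overhead)
```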
>
> Cheers
>
>>> It could be a dsync problem; the results seem pretty poor:
>>>
>>> # dd if=rand.file of=/dev/sdb bs=4k count=65536 oflag=direct
>>> 65536+0 records in
>>> 65536+0 records out
>>> 268435456 bytes (268 MB) copied, 2.77433 s, 96.8 MB/s
>>>
>>>
>>> # dd if=rand.file of=/dev/sdb bs=4k count=65536 oflag=dsync,direct
>>> ^C17228+0 records in
>>> 17228+0 records out
>>> 70565888 bytes (71 MB) copied, 70.4098 s, 1.0 MB/s
>>>
>>>
>>>
>>> I'll do tests with intel s3500 tomorrow to compare
>>>
>>> ----- Original Message -----
>>>
>>> From: "Sebastien Han" <[email protected]>
>>> To: "Warren Wang" <[email protected]>
>>> Cc: [email protected]
>>> Sent: Monday, September 8, 2014 22:58:25
>>> Subject: Re: [ceph-users] [Single OSD performance on SSD] Can't go over 3.2K IOPS
>>>
>>> They definitely are Warren!
>>>
>>> Thanks for bringing this here :).
>>>
>>> On 05 Sep 2014, at 23:02, Wang, Warren <[email protected]>
>>> wrote:
>>>
>>>> +1 to what Cedric said.
>>>>
>>>> Anything more than a few minutes of heavy sustained writes tended to get
>>>> our solid state devices into a state where garbage collection could not
>>>> keep up. Originally we used small SSDs and did not overprovision the
>>>> journals by much. Manufacturers publish their SSD stats, and then in very
>>>> small font, state that the attained IOPS are with empty drives, and the
>>>> tests are only run for very short amounts of time. Even if the drives are
>>>> new, it's a good idea to perform an hdparm secure erase on them (so that
>>>> the SSD knows that the blocks are truly unused), and then overprovision
>>>> them. You'll know if you have a problem by watching for utilization and
>>>> wait data on the journals.
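As an illustrative sketch of the overprovisioning arithmetic (hypothetical sector counts; on a real drive you would pass the result to `hdparm -N p<sectors>` to shrink the visible capacity):

```python
# Hypothetical helper: how many LBA sectors to leave visible when reserving
# a fraction of an SSD as overprovisioned spare area (the rest stays hidden
# from the OS, e.g. via a host protected area set with `hdparm -N p<sectors>`).
def visible_sectors(total_sectors: int, op_fraction: float) -> int:
    return int(total_sectors * (1 - op_fraction))

# e.g. a nominal 200 GB drive (390,721,968 512-byte sectors) with 50% reserved:
print(visible_sectors(390_721_968, 0.50))  # -> 195360984
```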
>>>>
>>>> One of the other interesting performance issues is that the Intel 10GbE
>>>> NICs + the default kernel that we typically use max out around 1 million
>>>> packets/sec. It's worth tracking this metric to see if you are close.
>>>>
>>>> I know these aren't necessarily relevant to the test parameters you gave
>>>> below, but they're worth keeping in mind.
>>>>
>>>> --
>>>> Warren Wang
>>>> Comcast Cloud (OpenStack)
>>>>
>>>>
>>>> From: Cedric Lemarchand <[email protected]>
>>>> Date: Wednesday, September 3, 2014 at 5:14 PM
>>>> To: "[email protected]" <[email protected]>
>>>> Subject: Re: [ceph-users] [Single OSD performance on SSD] Can't go over 3.2K IOPS
>>>>
>>>>
>>>> On 03/09/2014 22:11, Sebastien Han wrote:
>>>>> Hi Warren,
>>>>>
>>>>> What do you mean exactly by secure erase? At the firmware level, with
>>>>> vendor software?
>>>>> The SSDs were pretty new, so I don't think we hit that sort of thing. I
>>>>> believe that only aged SSDs have this behaviour, but I might be wrong.
>>>>>
>>>> Sorry, I forgot to reply to the real question ;-) So yes, it only comes
>>>> into play after some time; in your case, if the SSD still delivers the
>>>> write IOPS specified by the manufacturer, it won't help in any way.
>>>>
>>>> But it seems this practice is nowadays increasingly common.
>>>>
>>>> Cheers
>>>>> On 02 Sep 2014, at 18:23, Wang, Warren
>>>>> <[email protected]>
>>>>> wrote:
>>>>>
>>>>>
>>>>>> Hi Sebastien,
>>>>>>
>>>>>> Something I didn't see in the thread so far: did you secure erase the
>>>>>> SSDs before they got used? I assume these were probably repurposed for
>>>>>> this test. We have seen some pretty significant garbage collection issues
>>>>>> on various SSDs and other forms of solid state storage, to the point where
>>>>>> we are now overprovisioning pretty much every solid state device by as
>>>>>> much as 50% to handle sustained write operations. Especially important
>>>>>> for the journals, as we've found.
>>>>>>
>>>>>> Maybe not an issue on the short fio run below, but certainly evident on
>>>>>> longer runs or lots of historical data on the drives. The max
>>>>>> transaction time looks pretty good for your test. Something to consider
>>>>>> though.
>>>>>>
>>>>>> Warren
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: ceph-users [mailto:[email protected]] On Behalf Of Sebastien Han
>>>>>> Sent: Thursday, August 28, 2014 12:12 PM
>>>>>> To: ceph-users
>>>>>> Cc: Mark Nelson
>>>>>> Subject: [ceph-users] [Single OSD performance on SSD] Can't go over 3.2K IOPS
>>>>>>
>>>>>> Hey all,
>>>>>>
>>>>>> It has been a while since the last performance-related thread on the ML
>>>>>> :p I've been running some experiments to see how much I can get from an
>>>>>> SSD on a Ceph cluster.
>>>>>> To achieve that I did something pretty simple:
>>>>>>
>>>>>> * Debian wheezy 7.6
>>>>>> * kernel from debian 3.14-0.bpo.2-amd64
>>>>>> * 1 cluster, 3 mons (I'd like to keep this realistic since in a
>>>>>> real deployment I'll use 3)
>>>>>> * 1 OSD backed by an SSD (journal and osd data on the same
>>>>>> device)
>>>>>> * a replica count of 1
>>>>>> * partitions are perfectly aligned
>>>>>> * io scheduler is set to noop, but deadline was showing the same
>>>>>> results
>>>>>> * no updatedb running
>>>>>>
>>>>>> About the box:
>>>>>>
>>>>>> * 32GB of RAM
>>>>>> * 12 cores with HT @ 2.4 GHz
>>>>>> * WB cache is enabled on the controller
>>>>>> * 10Gbps network (doesn't help here)
>>>>>>
>>>>>> The SSD is a 200GB Intel DC S3700, capable of delivering around 29K
>>>>>> IOPS with random 4K writes (my fio results). As a benchmark tool I used
>>>>>> fio with the rbd engine (thanks Deutsche Telekom guys!).
>>>>>>
>>>>>> O_DIRECT and O_DSYNC don't seem to be a problem for the SSD:
>>>>>>
>>>>>> # dd if=/dev/urandom of=rand.file bs=4k count=65536
>>>>>> 65536+0 records in
>>>>>> 65536+0 records out
>>>>>> 268435456 bytes (268 MB) copied, 29.5477 s, 9.1 MB/s
>>>>>>
>>>>>> # du -sh rand.file
>>>>>> 256M rand.file
>>>>>>
>>>>>> # dd if=rand.file of=/dev/sdo bs=4k count=65536 oflag=dsync,direct
>>>>>> 65536+0 records in
>>>>>> 65536+0 records out
>>>>>> 268435456 bytes (268 MB) copied, 2.73628 s, 98.1 MB/s
>>>>>>
>>>>>> See my ceph.conf:
>>>>>>
>>>>>> [global]
>>>>>> auth cluster required = cephx
>>>>>> auth service required = cephx
>>>>>> auth client required = cephx
>>>>>> fsid = 857b8609-8c9b-499e-9161-2ea67ba51c97
>>>>>> osd pool default pg num = 4096
>>>>>> osd pool default pgp num = 4096
>>>>>> osd pool default size = 2
>>>>>> osd crush chooseleaf type = 0
>>>>>>
>>>>>> debug lockdep = 0/0
>>>>>> debug context = 0/0
>>>>>> debug crush = 0/0
>>>>>> debug buffer = 0/0
>>>>>> debug timer = 0/0
>>>>>> debug journaler = 0/0
>>>>>> debug osd = 0/0
>>>>>> debug optracker = 0/0
>>>>>> debug objclass = 0/0
>>>>>> debug filestore = 0/0
>>>>>> debug journal = 0/0
>>>>>> debug ms = 0/0
>>>>>> debug monc = 0/0
>>>>>> debug tp = 0/0
>>>>>> debug auth = 0/0
>>>>>> debug finisher = 0/0
>>>>>> debug heartbeatmap = 0/0
>>>>>> debug perfcounter = 0/0
>>>>>> debug asok = 0/0
>>>>>> debug throttle = 0/0
>>>>>>
>>>>>> [mon]
>>>>>> mon osd down out interval = 600
>>>>>> mon osd min down reporters = 13
>>>>>> [mon.ceph-01]
>>>>>> host = ceph-01
>>>>>> mon addr = 172.20.20.171
>>>>>> [mon.ceph-02]
>>>>>> host = ceph-02
>>>>>> mon addr = 172.20.20.172
>>>>>> [mon.ceph-03]
>>>>>> host = ceph-03
>>>>>> mon addr = 172.20.20.173
>>>>>>
>>>>>> debug lockdep = 0/0
>>>>>> debug context = 0/0
>>>>>> debug crush = 0/0
>>>>>> debug buffer = 0/0
>>>>>> debug timer = 0/0
>>>>>> debug journaler = 0/0
>>>>>> debug osd = 0/0
>>>>>> debug optracker = 0/0
>>>>>> debug objclass = 0/0
>>>>>> debug filestore = 0/0
>>>>>> debug journal = 0/0
>>>>>> debug ms = 0/0
>>>>>> debug monc = 0/0
>>>>>> debug tp = 0/0
>>>>>> debug auth = 0/0
>>>>>> debug finisher = 0/0
>>>>>> debug heartbeatmap = 0/0
>>>>>> debug perfcounter = 0/0
>>>>>> debug asok = 0/0
>>>>>> debug throttle = 0/0
>>>>>>
>>>>>> [osd]
>>>>>> osd mkfs type = xfs
>>>>>> osd mkfs options xfs = -f -i size=2048
>>>>>> osd mount options xfs = rw,noatime,logbsize=256k,delaylog
>>>>>> osd journal size = 20480
>>>>>> cluster_network = 172.20.20.0/24
>>>>>> public_network = 172.20.20.0/24
>>>>>> osd mon heartbeat interval = 30
>>>>>> # Performance tuning
>>>>>> filestore merge threshold = 40
>>>>>> filestore split multiple = 8
>>>>>> osd op threads = 8
>>>>>> # Recovery tuning
>>>>>> osd recovery max active = 1
>>>>>> osd max backfills = 1
>>>>>> osd recovery op priority = 1
>>>>>>
>>>>>>
>>>>>> debug lockdep = 0/0
>>>>>> debug context = 0/0
>>>>>> debug crush = 0/0
>>>>>> debug buffer = 0/0
>>>>>> debug timer = 0/0
>>>>>> debug journaler = 0/0
>>>>>> debug osd = 0/0
>>>>>> debug optracker = 0/0
>>>>>> debug objclass = 0/0
>>>>>> debug filestore = 0/0
>>>>>> debug journal = 0/0
>>>>>> debug ms = 0/0
>>>>>> debug monc = 0/0
>>>>>> debug tp = 0/0
>>>>>> debug auth = 0/0
>>>>>> debug finisher = 0/0
>>>>>> debug heartbeatmap = 0/0
>>>>>> debug perfcounter = 0/0
>>>>>> debug asok = 0/0
>>>>>> debug throttle = 0/0
>>>>>>
>>>>>> Disabling all debugging gained me 200-300 more IOPS.
>>>>>>
>>>>>> See my fio template:
>>>>>>
>>>>>> [global]
>>>>>> #logging
>>>>>> #write_iops_log=write_iops_log
>>>>>> #write_bw_log=write_bw_log
>>>>>> #write_lat_log=write_lat_lo
>>>>>>
>>>>>> time_based
>>>>>> runtime=60
>>>>>>
>>>>>> ioengine=rbd
>>>>>> clientname=admin
>>>>>> pool=test
>>>>>> rbdname=fio
>>>>>> invalidate=0 # mandatory
>>>>>> #rw=randwrite
>>>>>> rw=write
>>>>>> bs=4k
>>>>>> #bs=32m
>>>>>> size=5G
>>>>>> group_reporting
>>>>>>
>>>>>> [rbd_iodepth32]
>>>>>> iodepth=32
>>>>>> direct=1
>>>>>>
>>>>>> See my fio output:
>>>>>>
>>>>>> rbd_iodepth32: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=32
>>>>>> fio-2.1.11-14-gb74e
>>>>>> Starting 1 process
>>>>>> rbd engine: RBD version: 0.1.8
>>>>>> Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/12876KB/0KB /s] [0/3219/0 iops] [eta 00m:00s]
>>>>>> rbd_iodepth32: (groupid=0, jobs=1): err= 0: pid=32116: Thu Aug 28 00:28:26 2014
>>>>>>   write: io=771448KB, bw=12855KB/s, iops=3213, runt= 60010msec
>>>>>>     slat (usec): min=42, max=1578, avg=66.50, stdev=16.96
>>>>>>     clat (msec): min=1, max=28, avg= 9.85, stdev= 1.48
>>>>>>      lat (msec): min=1, max=28, avg= 9.92, stdev= 1.47
>>>>>>     clat percentiles (usec):
>>>>>>      |  1.00th=[ 6368],  5.00th=[ 8256], 10.00th=[ 8640], 20.00th=[ 9152],
>>>>>>      | 30.00th=[ 9408], 40.00th=[ 9664], 50.00th=[ 9792], 60.00th=[10048],
>>>>>>      | 70.00th=[10176], 80.00th=[10560], 90.00th=[10944], 95.00th=[11456],
>>>>>>      | 99.00th=[13120], 99.50th=[16768], 99.90th=[25984], 99.95th=[27008],
>>>>>>      | 99.99th=[28032]
>>>>>>     bw (KB /s): min=11864, max=13808, per=100.00%, avg=12864.36, stdev=407.35
>>>>>>     lat (msec) : 2=0.03%, 4=0.54%, 10=59.79%, 20=39.24%, 50=0.41%
>>>>>>   cpu : usr=19.15%, sys=4.69%, ctx=326309, majf=0, minf=426088
>>>>>>   IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=33.9%, 32=66.1%, >=64=0.0%
>>>>>>      submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>>>>      complete : 0=0.0%, 4=99.6%, 8=0.4%, 16=0.1%, 32=0.1%, 64=0.0%, >=64=0.0%
>>>>>>      issued : total=r=0/w=192862/d=0, short=r=0/w=0/d=0
>>>>>>      latency : target=0, window=0, percentile=100.00%, depth=32
>>>>>>
>>>>>> Run status group 0 (all jobs):
>>>>>>   WRITE: io=771448KB, aggrb=12855KB/s, minb=12855KB/s, maxb=12855KB/s, mint=60010msec, maxt=60010msec
>>>>>>
>>>>>> Disk stats (read/write):
>>>>>>   dm-1: ios=0/49, merge=0/0, ticks=0/12, in_queue=12, util=0.01%, aggrios=0/22, aggrmerge=0/27, aggrticks=0/12, aggrin_queue=12, aggrutil=0.01%
>>>>>>   sda: ios=0/22, merge=0/27, ticks=0/12, in_queue=12, util=0.01%
>>>>>>
>>>>>> I tried to tweak several parameters like:
>>>>>>
>>>>>> filestore_wbthrottle_xfs_ios_start_flusher = 10000
>>>>>> filestore_wbthrottle_xfs_ios_hard_limit = 10000
>>>>>> filestore_wbthrottle_btrfs_ios_start_flusher = 10000
>>>>>> filestore_wbthrottle_btrfs_ios_hard_limit = 10000
>>>>>> filestore queue max ops = 2000
>>>>>>
>>>>>> But I didn't see any improvement.
>>>>>>
>>>>>> Then I tried other things:
>>>>>>
>>>>>> * Increasing the iodepth up to 256 or 512 gave me between 50 and 100
>>>>>> more IOPS, but it's not a realistic workload anymore and not that
>>>>>> significant.
>>>>>> * Adding another SSD for the journal: still getting 3.2K IOPS
>>>>>> * I tried with rbd bench and I also got 3K IOPS
>>>>>> * I ran the test on a client machine and then locally on the
>>>>>> server: still getting 3.2K IOPS
>>>>>> * Put the journal in memory: still getting 3.2K IOPS
>>>>>> * With 2 clients running the test in parallel I got a total of
>>>>>> 3.6K IOPS, but I don't seem to be able to go over that
>>>>>> * I tried to add another OSD to that SSD, so I had 2 OSDs and 2
>>>>>> journals on 1 SSD, and got 4.5K IOPS. YAY!
>>>>>>
>>>>>> Given the results of the last time it seems that something is limiting
>>>>>> the number of IOPS per OSD process.
>>>>>>
>>>>>> Running the test on a client or locally didn't show any difference.
>>>>>> So it looks to me that there is some contention within Ceph that might
>>>>>> cause this.
>>>>>>
>>>>>> I also ran perf and looked at the output, everything looks decent, but
>>>>>> someone might want to have a look at it :).
>>>>>>
>>>>>> We have been able to reproduce this on 3 distinct platforms with some
>>>>>> deviations (because of the hardware) but the behaviour is the same.
>>>>>> Any thoughts will be highly appreciated; only getting 3.2K out of a 29K
>>>>>> IOPS SSD is a bit frustrating :).
>>>>>>
>>>>>> Cheers.
>>>>>> ----
>>>>>> Sébastien Han
>>>>>> Cloud Architect
>>>>>>
>>>>>> "Always give 100%. Unless you're giving blood."
>>>>>>
>>>>>> Phone: +33 (0)1 49 70 99 72
>>>>>> Mail:
>>>>>> [email protected]
>>>>>>
>>>>>> Address : 11 bis, rue Roquépine - 75008 Paris Web :
>>>>>> www.enovance.com
>>>>>> - Twitter : @enovance
>>>>>>
>>>>>>
>>>>> _______________________________________________
>>>>> ceph-users mailing list
>>>>>
>>>>> [email protected]http://lists.ceph.com/listinfo.cgi/ceph-u
>>>>> sers-ceph.com
>>>> --
>>>> Cédric
>>>>
>