Hi Sebastien,
Here are my first results with the Crucial M550 (I'll send the Intel S3500
results later):
- 3 nodes
- Dell R620 without expander backplane
- SAS controller: LSI 9207 (no hardware RAID or cache)
- 2x E5-2603v2 1.8GHz (4 cores)
- 32GB RAM
- network: 2x gigabit LACP + 2x gigabit LACP for cluster replication
- os: Debian Wheezy, with kernel 3.10
- os + ceph mon: 2x Intel S3500 100GB, Linux soft RAID
- osd: Crucial M550 (1TB)
3 mons in the ceph cluster,
and 1 OSD (journal and data on the same disk)
ceph.conf
---------
debug_lockdep = 0/0
debug_context = 0/0
debug_crush = 0/0
debug_buffer = 0/0
debug_timer = 0/0
debug_filer = 0/0
debug_objecter = 0/0
debug_rados = 0/0
debug_rbd = 0/0
debug_journaler = 0/0
debug_objectcacher = 0/0
debug_client = 0/0
debug_osd = 0/0
debug_optracker = 0/0
debug_objclass = 0/0
debug_filestore = 0/0
debug_journal = 0/0
debug_ms = 0/0
debug_monc = 0/0
debug_tp = 0/0
debug_auth = 0/0
debug_finisher = 0/0
debug_heartbeatmap = 0/0
debug_perfcounter = 0/0
debug_asok = 0/0
debug_throttle = 0/0
debug_mon = 0/0
debug_paxos = 0/0
debug_rgw = 0/0
osd_op_threads = 5
filestore_op_threads = 4
ms_nocrc = true
cephx sign messages = false
cephx require signatures = false
ms_dispatch_throttle_bytes = 0
# options new in 0.85
throttler_perf_counter = false
filestore_fd_cache_size = 64
filestore_fd_cache_shards = 32
osd_op_num_threads_per_shard = 1
osd_op_num_shards = 25
osd_enable_op_tracker = true
Fio disk 4K benchmark
---------------------
rand read 4k : fio --filename=/dev/sdb --direct=1 --rw=randread --bs=4k
--iodepth=32 --group_reporting --invalidate=0 --name=abc --ioengine=aio
bw=271755KB/s, iops=67938
rand write 4k : fio --filename=/dev/sdb --direct=1 --rw=randwrite --bs=4k
--iodepth=32 --group_reporting --invalidate=0 --name=abc --ioengine=aio
bw=228293KB/s, iops=57073
fio osd benchmark (through librbd)
----------------------------------
[global]
ioengine=rbd
clientname=admin
pool=test
rbdname=test
invalidate=0 # mandatory
rw=randwrite
#rw=randread   (swapped in for the read runs)
bs=4k
direct=1
numjobs=4
group_reporting=1
[rbd_iodepth32]
iodepth=32
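For reference, a sketch of how a job file like this can be driven end to end (the pool and image names match the job file above; the PG count, image size, and the `rbd-4k.fio` filename are my assumptions):

```shell
# Create the pool and RBD image that the job file points at
# (128 PGs and 10240 MB are placeholder values):
ceph osd pool create test 128
rbd create test/test --size 10240

# Run the librbd job (requires fio built with the rbd engine):
fio rbd-4k.fio
```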
FIREFLY RESULTS
----------------
fio randwrite : bw=5009.6KB/s, iops=1252
fio randread: bw=37820KB/s, iops=9455
0.85 RESULTS
------------
fio randwrite : bw=11658KB/s, iops=2914
fio randread : bw=38642KB/s, iops=9660
0.85 + osd_enable_op_tracker=false
-----------------------------------
fio randwrite : bw=11630KB/s, iops=2907
fio randread : bw=80606KB/s, iops=20151, (cpu 100% - GREAT !)
So, for reads, it seems that osd_enable_op_tracker is the bottleneck.
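For what it's worth, the tracker can be toggled on a live OSD without a restart to check this (osd.0 is an assumption; substitute your own OSD ids):

```shell
# Disable the op tracker at runtime:
ceph tell osd.0 injectargs '--osd_enable_op_tracker=false'

# Check the value the running daemon actually uses (via the admin socket):
ceph daemon osd.0 config show | grep osd_enable_op_tracker
```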
Now for writes, I really don't understand why they're so low.
I have done some iostat:
FIO directly on /dev/sdb
bw=228293KB/s, iops=57073
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz
avgqu-sz await r_await w_await svctm %util
sdb 0.00 0.00 0.00 63613.00 0.00 254452.00 8.00
31.24 0.49 0.00 0.49 0.02 100.00
FIO directly on osd through librbd
bw=11658KB/s, iops=2914
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz
avgqu-sz await r_await w_await svctm %util
sdb 0.00 355.00 0.00 5225.00 0.00 29678.00 11.36
57.63 11.03 0.00 11.03 0.19 99.70
(I don't understand what exactly %util means: it's 100% in both cases, even
though it's 10x slower with ceph)
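Quantifying that gap from the two iostat runs above (just the ratio of the two write rates; a rough sketch):

```shell
# 63613 w/s raw vs 5225 w/s through the OSD:
awk 'BEGIN { printf "%.1fx\n", 63613 / 5225 }'   # -> 12.2x
```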
It could be a dsync problem; the results seem pretty poor:
# dd if=rand.file of=/dev/sdb bs=4k count=65536 oflag=direct
65536+0 records in
65536+0 records out
268435456 bytes (268 MB) copied, 2.77433 s, 96.8 MB/s
# dd if=rand.file of=/dev/sdb bs=4k count=65536 oflag=dsync,direct
^C17228+0 records in
17228+0 records out
70565888 bytes (71 MB) copied, 70.4098 s, 1.0 MB/s
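Back-of-the-envelope, that interrupted dsync run implies only a few hundred synchronous 4k write IOPS, which is the figure the journal actually sees (computed from the dd numbers above):

```shell
# 17228 x 4 KiB dsync writes completed in 70.4098 s:
awk 'BEGIN { printf "%.0f sync write IOPS\n", 17228 / 70.4098 }'   # -> 245 sync write IOPS
```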
I'll run tests with the Intel S3500 tomorrow to compare.
----- Original Message -----
From: "Sebastien Han" <[email protected]>
To: "Warren Wang" <[email protected]>
Cc: [email protected]
Sent: Monday, September 8, 2014 22:58:25
Subject: Re: [ceph-users] [Single OSD performance on SSD] Can't go over 3,2K
IOPS
They definitely are Warren!
Thanks for bringing this here :).
On 05 Sep 2014, at 23:02, Wang, Warren <[email protected]> wrote:
> +1 to what Cedric said.
>
> Anything more than a few minutes of heavy sustained writes tended to get our
> solid state devices into a state where garbage collection could not keep up.
> Originally we used small SSDs and did not overprovision the journals by much.
> Manufacturers publish their SSD stats, and then in very small font, state
> that the attained IOPS are with empty drives, and the tests are only run for
> very short amounts of time. Even if the drives are new, it's a good idea to
> perform an hdparm secure erase on them (so that the SSD knows that the blocks
> are truly unused), and then overprovision them. You'll know if you have a
> problem by watching for utilization and wait data on the journals.
>
> One of the other interesting performance issues is that the Intel 10Gbe NICs
> + default kernel that we typically use max out around 1million packets/sec.
> It's worth tracking this metric to see if you are close.
>
> I know these aren't necessarily relevant to the test parameters you gave
> below, but they're worth keeping in mind.
>
> --
> Warren Wang
> Comcast Cloud (OpenStack)
>
>
> From: Cedric Lemarchand <[email protected]>
> Date: Wednesday, September 3, 2014 at 5:14 PM
> To: "[email protected]" <[email protected]>
> Subject: Re: [ceph-users] [Single OSD performance on SSD] Can't go over 3, 2K
> IOPS
>
>
> Le 03/09/2014 22:11, Sebastien Han a écrit :
>> Hi Warren,
>>
>> What do you mean exactly by secure erase? At the firmware level, with the
>> manufacturer's tools?
>> The SSDs were pretty new so I don't think we hit that sort of thing. I
>> believe only aged SSDs have this behaviour, but I might be wrong.
>>
> Sorry I forgot to reply to the real question ;-)
> So yes, it only kicks in after some time. In your case, if the SSD still
> delivers the write IOPS specified by the manufacturer, it won't help in
> any way.
>
> But it seems this practice is increasingly common nowadays.
>
> Cheers
>> On 02 Sep 2014, at 18:23, Wang, Warren <[email protected]>
>> wrote:
>>
>>
>>> Hi Sebastien,
>>>
>>> Something I didn't see in the thread so far, did you secure erase the SSDs
>>> before they got used? I assume these were probably repurposed for this
>>> test. We have seen some pretty significant garbage collection issue on
>>> various SSD and other forms of solid state storage to the point where we
>>> are overprovisioning pretty much every solid state device now. By as much
>>> as 50% to handle sustained write operations. Especially important for the
>>> journals, as we've found.
>>>
>>> Maybe not an issue on the short fio run below, but certainly evident on
>>> longer runs or lots of historical data on the drives. The max transaction
>>> time looks pretty good for your test. Something to consider though.
>>>
>>> Warren
>>>
>>> -----Original Message-----
>>> From: ceph-users [
>>> mailto:[email protected]
>>> ] On Behalf Of Sebastien Han
>>> Sent: Thursday, August 28, 2014 12:12 PM
>>> To: ceph-users
>>> Cc: Mark Nelson
>>> Subject: [ceph-users] [Single OSD performance on SSD] Can't go over 3, 2K
>>> IOPS
>>>
>>> Hey all,
>>>
>>> It has been a while since the last performance-related thread on the ML :p
>>> I've been running some experiment to see how much I can get from an SSD on
>>> a Ceph cluster.
>>> To achieve that I did something pretty simple:
>>>
>>> * Debian wheezy 7.6
>>> * kernel from debian 3.14-0.bpo.2-amd64
>>> * 1 cluster, 3 mons (i'd like to keep this realistic since in a real
>>> deployment i'll use 3)
>>> * 1 OSD backed by an SSD (journal and osd data on the same device)
>>> * 1 replica count of 1
>>> * partitions are perfectly aligned
>>> * io scheduler is set to noop but deadline was showing the same results
>>> * no updatedb running
>>>
>>> About the box:
>>>
>>> * 32GB of RAM
>>> * 12 cores with HT @ 2,4 GHz
>>> * WB cache is enabled on the controller
>>> * 10Gbps network (doesn't help here)
>>>
>>> The SSD is a 200G Intel DC S3700 and is capable of delivering around 29K
>>> iops with random 4k writes (my fio results) As a benchmark tool I used fio
>>> with the rbd engine (thanks deutsche telekom guys!).
>>>
>>> O_DIRECT and O_DSYNC don't seem to be a problem for the SSD:
>>>
>>> # dd if=/dev/urandom of=rand.file bs=4k count=65536
>>> 65536+0 records in
>>> 65536+0 records out
>>> 268435456 bytes (268 MB) copied, 29.5477 s, 9.1 MB/s
>>>
>>> # du -sh rand.file
>>> 256M rand.file
>>>
>>> # dd if=rand.file of=/dev/sdo bs=4k count=65536 oflag=dsync,direct
>>> 65536+0 records in
>>> 65536+0 records out
>>> 268435456 bytes (268 MB) copied, 2.73628 s, 98.1 MB/s
>>>
>>> See my ceph.conf:
>>>
>>> [global]
>>> auth cluster required = cephx
>>> auth service required = cephx
>>> auth client required = cephx
>>> fsid = 857b8609-8c9b-499e-9161-2ea67ba51c97
>>> osd pool default pg num = 4096
>>> osd pool default pgp num = 4096
>>> osd pool default size = 2
>>> osd crush chooseleaf type = 0
>>>
>>> debug lockdep = 0/0
>>> debug context = 0/0
>>> debug crush = 0/0
>>> debug buffer = 0/0
>>> debug timer = 0/0
>>> debug journaler = 0/0
>>> debug osd = 0/0
>>> debug optracker = 0/0
>>> debug objclass = 0/0
>>> debug filestore = 0/0
>>> debug journal = 0/0
>>> debug ms = 0/0
>>> debug monc = 0/0
>>> debug tp = 0/0
>>> debug auth = 0/0
>>> debug finisher = 0/0
>>> debug heartbeatmap = 0/0
>>> debug perfcounter = 0/0
>>> debug asok = 0/0
>>> debug throttle = 0/0
>>>
>>> [mon]
>>> mon osd down out interval = 600
>>> mon osd min down reporters = 13
>>> [mon.ceph-01]
>>> host = ceph-01
>>> mon addr = 172.20.20.171
>>> [mon.ceph-02]
>>> host = ceph-02
>>> mon addr = 172.20.20.172
>>> [mon.ceph-03]
>>> host = ceph-03
>>> mon addr = 172.20.20.173
>>>
>>> debug lockdep = 0/0
>>> debug context = 0/0
>>> debug crush = 0/0
>>> debug buffer = 0/0
>>> debug timer = 0/0
>>> debug journaler = 0/0
>>> debug osd = 0/0
>>> debug optracker = 0/0
>>> debug objclass = 0/0
>>> debug filestore = 0/0
>>> debug journal = 0/0
>>> debug ms = 0/0
>>> debug monc = 0/0
>>> debug tp = 0/0
>>> debug auth = 0/0
>>> debug finisher = 0/0
>>> debug heartbeatmap = 0/0
>>> debug perfcounter = 0/0
>>> debug asok = 0/0
>>> debug throttle = 0/0
>>>
>>> [osd]
>>> osd mkfs type = xfs
>>> osd mkfs options xfs = -f -i size=2048
>>> osd mount options xfs = rw,noatime,logbsize=256k,delaylog
>>> osd journal size = 20480
>>> cluster_network = 172.20.20.0/24
>>> public_network = 172.20.20.0/24
>>> osd mon heartbeat interval = 30
>>> # Performance tuning
>>> filestore merge threshold = 40
>>> filestore split multiple = 8
>>> osd op threads = 8
>>> # Recovery tuning
>>> osd recovery max active = 1
>>> osd max backfills = 1
>>> osd recovery op priority = 1
>>>
>>>
>>> debug lockdep = 0/0
>>> debug context = 0/0
>>> debug crush = 0/0
>>> debug buffer = 0/0
>>> debug timer = 0/0
>>> debug journaler = 0/0
>>> debug osd = 0/0
>>> debug optracker = 0/0
>>> debug objclass = 0/0
>>> debug filestore = 0/0
>>> debug journal = 0/0
>>> debug ms = 0/0
>>> debug monc = 0/0
>>> debug tp = 0/0
>>> debug auth = 0/0
>>> debug finisher = 0/0
>>> debug heartbeatmap = 0/0
>>> debug perfcounter = 0/0
>>> debug asok = 0/0
>>> debug throttle = 0/0
>>>
>>> Disabling all debugging gained me 200-300 more IOPS.
>>>
>>> See my fio template:
>>>
>>> [global]
>>> #logging
>>> #write_iops_log=write_iops_log
>>> #write_bw_log=write_bw_log
>>> #write_lat_log=write_lat_lo
>>>
>>> time_based
>>> runtime=60
>>>
>>> ioengine=rbd
>>> clientname=admin
>>> pool=test
>>> rbdname=fio
>>> invalidate=0 # mandatory
>>> #rw=randwrite
>>> rw=write
>>> bs=4k
>>> #bs=32m
>>> size=5G
>>> group_reporting
>>>
>>> [rbd_iodepth32]
>>> iodepth=32
>>> direct=1
>>>
>>> See my fio output:
>>>
>>> rbd_iodepth32: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd,
>>> iodepth=32 fio-2.1.11-14-gb74e Starting 1 process rbd engine: RBD version:
>>> 0.1.8
>>> Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/12876KB/0KB /s] [0/3219/0 iops]
>>> [eta 00m:00s]
>>> rbd_iodepth32: (groupid=0, jobs=1): err= 0: pid=32116: Thu Aug 28 00:28:26
>>> 2014
>>> write: io=771448KB, bw=12855KB/s, iops=3213, runt= 60010msec
>>> slat (usec): min=42, max=1578, avg=66.50, stdev=16.96
>>> clat (msec): min=1, max=28, avg= 9.85, stdev= 1.48
>>> lat (msec): min=1, max=28, avg= 9.92, stdev= 1.47
>>> clat percentiles (usec):
>>> | 1.00th=[ 6368], 5.00th=[ 8256], 10.00th=[ 8640], 20.00th=[ 9152],
>>> | 30.00th=[ 9408], 40.00th=[ 9664], 50.00th=[ 9792], 60.00th=[10048],
>>> | 70.00th=[10176], 80.00th=[10560], 90.00th=[10944], 95.00th=[11456],
>>> | 99.00th=[13120], 99.50th=[16768], 99.90th=[25984], 99.95th=[27008],
>>> | 99.99th=[28032]
>>> bw (KB /s): min=11864, max=13808, per=100.00%, avg=12864.36, stdev=407.35
>>> lat (msec) : 2=0.03%, 4=0.54%, 10=59.79%, 20=39.24%, 50=0.41%
>>> cpu : usr=19.15%, sys=4.69%, ctx=326309, majf=0, minf=426088
>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=33.9%, 32=66.1%, >=64=0.0%
>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>> complete : 0=0.0%, 4=99.6%, 8=0.4%, 16=0.1%, 32=0.1%, 64=0.0%, >=64=0.0%
>>> issued : total=r=0/w=192862/d=0, short=r=0/w=0/d=0
>>> latency : target=0, window=0, percentile=100.00%, depth=32
>>>
>>> Run status group 0 (all jobs):
>>> WRITE: io=771448KB, aggrb=12855KB/s, minb=12855KB/s, maxb=12855KB/s,
>>> mint=60010msec, maxt=60010msec
>>>
>>> Disk stats (read/write):
>>> dm-1: ios=0/49, merge=0/0, ticks=0/12, in_queue=12, util=0.01%,
>>> aggrios=0/22, aggrmerge=0/27, aggrticks=0/12, aggrin_queue=12,
>>> aggrutil=0.01%
>>> sda: ios=0/22, merge=0/27, ticks=0/12, in_queue=12, util=0.01%
>>>
>>> I tried to tweak several parameters like:
>>>
>>> filestore_wbthrottle_xfs_ios_start_flusher = 10000
>>> filestore_wbthrottle_xfs_ios_hard_limit = 10000
>>> filestore_wbthrottle_btrfs_ios_start_flusher = 10000
>>> filestore_wbthrottle_btrfs_ios_hard_limit = 10000
>>> filestore queue max ops = 2000
>>>
>>> But I didn't see any improvement.
>>>
>>> Then I tried other things:
>>>
>>> * Increasing the io_depth up to 256 or 512 gave me between 50 to 100 more
>>> IOPS but it's not a realistic workload anymore and not that significant.
>>> * adding another SSD for the journal, still getting 3,2K IOPS
>>> * I tried with rbd bench and I also got 3K IOPS
>>> * I ran the test on a client machine and then locally on the server, still
>>> getting 3,2K IOPS
>>> * put the journal in memory, still getting 3,2K IOPS
>>> * with 2 clients running the test in parallel I got a total of 3,6K IOPS
>>> but I don't seem to be able to go over
>>> * I tried adding another OSD to that SSD, so I had 2 OSDs and 2 journals
>>> on 1 SSD, got 4,5K IOPS YAY!
>>>
>>> Given the results of the last test, it seems that something is limiting
>>> the number of IOPS per OSD process.
>>>
>>> Running the test on a client or locally didn't show any difference.
>>> So it looks to me that there is some contention within Ceph that might
>>> cause this.
>>>
>>> I also ran perf and looked at the output, everything looks decent, but
>>> someone might want to have a look at it :).
>>>
>>> We have been able to reproduce this on 3 distinct platforms with some
>>> deviations (because of the hardware) but the behaviour is the same.
>>> Any thoughts will be highly appreciated; only getting 3,2K out of a 29K
>>> IOPS SSD is a bit frustrating :).
>>>
>>> Cheers.
>>> ----
>>> Sébastien Han
>>> Cloud Architect
>>>
>>> "Always give 100%. Unless you're giving blood."
>>>
>>> Phone: +33 (0)1 49 70 99 72
>>> Mail:
>>> [email protected]
>>>
>>> Address : 11 bis, rue Roquépine - 75008 Paris Web :
>>> www.enovance.com
>>> - Twitter : @enovance
>>>
>>>
>> Cheers.
>> ––––
>> Sébastien Han
>> Cloud Architect
>>
>> "Always give 100%. Unless you're giving blood."
>>
>> Phone: +33 (0)1 49 70 99 72
>> Mail:
>> [email protected]
>>
>> Address : 11 bis, rue Roquépine - 75008 Paris
>> Web :
>> www.enovance.com
>> - Twitter : @enovance
>>
>>
>>
>>
>> _______________________________________________
>> ceph-users mailing list
>>
>> [email protected]http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
> --
> Cédric
>
> _______________________________________________
> ceph-users mailing list
> [email protected]
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Cheers.
––––
Sébastien Han
Cloud Architect
"Always give 100%. Unless you're giving blood."
Phone: +33 (0)1 49 70 99 72
Mail: [email protected]
Address : 11 bis, rue Roquépine - 75008 Paris
Web : www.enovance.com - Twitter : @enovance
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com