Hi,
I did some digging into the blktrace output to understand why this read_ahead_kb
setting is impacting performance in my setup (which is a single-OSD cluster).
Here is the result.
99% of the IOs were performed by the following processes during the blktrace
collection window.
1. For the ceph-osd process (including the "unknown" processes, which I figured
out are just different threads of the OSD):
Events             Read_ahead_kb = 128   Read_ahead_kb = 0   Direct_io
Reads Queued                   4140687             4168816     4042634
Read Dispatches                7734617             5660597     4839428
Reads Requeued                 4574032             1789149      944688
Reads Completed                2532893             2996269     3027387
Reads Merges                      6415                   2           0
IO unplugs                     3380175              100911     4042714
2. For the swapper process:
Events             Read_ahead_kb = 128   Read_ahead_kb = 0   Direct_io
Reads Queued                         0                   0           0
Read Dispatches                  1836K              459028      258743
Reads Requeued                   1129K              254808      132605
Reads Completed                  1175K              937138      891107
Reads Merges                         0                   0           0
IO unplugs                           0                   0           0
Now, if we compare the total amount of reads that happened during this time for
the three different settings:
Events             Read_ahead_kb = 128   Read_ahead_kb = 0   Direct_io
Reads Queued                     4140K               4168K       4042K
Read Dispatches                 10390K               6363K       5151K
Reads Requeued                   6256K               2194K       1108K
Reads Completed                  4134K               4168K       4042K
Reads Merges                      6415                   2           0
IO unplugs                     3380183              100924     4042721
Here is my analysis:
1. There are a lot more read dispatches (~4M more than with read_ahead_kb = 0)
when we set read_ahead_kb = 128.
2. The swapper process (which I think is doing the readahead?) is issuing a lot
more reads when read_ahead_kb = 128.
3. Read merges are almost 0 in all cases other than the first one, which
suggests the workload is very random (?). The higher number of merges in the
first case is probably because of read_ahead (?).
Some open questions:
1. Why is the reads-completed count lower? Is it ceph read completions plus
swapper read completions? Even then, it still does not match the dispatches.
2. Why are IO unplugs huge for read_ahead_kb = 128 and direct_io compared to
read_ahead_kb = 0?
3. Why are there so many requeues?
4. Should requeued + queued equal dispatched? (See the quick check below.)
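A quick arithmetic check on question 4 against the totals above: with
read_ahead_kb = 128, 4140K queued + 6256K requeued = 10396K, which is close to
the 10390K dispatched; with read_ahead_kb = 0, 4168K + 2194K = 6362K vs. 6363K
dispatched; and with direct_io, 4042K + 1108K = 5150K vs. 5151K dispatched. So
queued + requeued does appear to roughly equal dispatched in all three cases.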
I tried setting different kernel parameters like
nr_requests/scheduler/rq_affinity/vm_cache_pressure etc., but in my workload I
am still consistently getting a ~50% improvement by setting read_ahead_kb = 0.
I don't have much expertise in the Linux block layer, so I am reaching out to
the community for answers/suggestions.
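For reference, the Direct_io numbers above come from a read path switched to
O_DIRECT. A minimal, hypothetical sketch of such an aligned direct read
(illustrative only, not the actual FileStore::read change; the direct_read
helper name, the fixed 4 KB alignment and the error handling are my own
simplifications) might look roughly like this:

    // Hypothetical sketch only: an O_DIRECT aligned read similar in spirit to
    // the modified read path used for the "Direct_io" runs above.
    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE            // O_DIRECT needs this on Linux with some compilers
    #endif
    #include <fcntl.h>
    #include <unistd.h>
    #include <cerrno>
    #include <cstdlib>
    #include <cstring>

    // Reads 'len' bytes at 'offset' from 'path', bypassing the page cache.
    // O_DIRECT requires the buffer, offset and length to be block aligned
    // (a 4096-byte logical block size is assumed here).
    ssize_t direct_read(const char* path, off_t offset, size_t len, void* out)
    {
        const size_t align = 4096;                     // assumed block size
        int fd = ::open(path, O_RDONLY | O_DIRECT);
        if (fd < 0)
            return -errno;

        void* buf = nullptr;
        if (::posix_memalign(&buf, align, len) != 0) { // aligned buffer for O_DIRECT
            ::close(fd);
            return -ENOMEM;
        }

        ssize_t r = ::pread(fd, buf, len, offset);     // bypasses page cache, no readahead
        int saved_errno = errno;
        if (r > 0)
            ::memcpy(out, buf, r);                     // copy out to the caller's buffer

        ::free(buf);
        ::close(fd);
        return r < 0 ? -saved_errno : r;
    }

In a real read path one would presumably also round the offset/length to block
boundaries and copy out only the requested range; the point is just that an
O_DIRECT read skips the page cache, so there is no kernel readahead and no
extra copy through the cache.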
Thanks & Regards
Somnath
-----Original Message-----
From: Somnath Roy
Sent: Thursday, September 25, 2014 12:11 AM
To: 'Chen, Xiaoxi'; Haomai Wang
Cc: Sage Weil; Milosz Tanski; [email protected]
Subject: RE: Impact of page cache on OSD read performance for SSD
Well, you never know!
It depends upon a lot of factors, starting from your workload/different kernel
params/RAID controller etc. I have shared my observations from my environment
with a 4K pseudo-random fio_rbd workload. A truly random workload should not
kick off read_ahead, though.
The OP_QUEUE optimization brings more parallelism to the filestore reads, so
more reads going to disk in parallel may have exposed this.
Anyway, I am in the process of analyzing why the default read_ahead is causing
problems for me; I will update if I find anything.
Thanks & Regards
Somnath
-----Original Message-----
From: Chen, Xiaoxi [mailto:[email protected]]
Sent: Wednesday, September 24, 2014 10:00 PM
To: Somnath Roy; Haomai Wang
Cc: Sage Weil; Milosz Tanski; [email protected]
Subject: RE: Impact of page cache on OSD read performance for SSD
Have you ever seen a large read_ahead_kb hurt random performance?
We usually set it very large (2M), and the random read performance stays steady,
even in an all-SSD setup. Maybe with your optimization code for OP_QUEUE,
things are different?
-----Original Message-----
From: [email protected]
[mailto:[email protected]] On Behalf Of Somnath Roy
Sent: Thursday, September 25, 2014 11:15 AM
To: Haomai Wang
Cc: Sage Weil; Milosz Tanski; [email protected]
Subject: RE: Impact of page cache on OSD read performance for SSD
It will definitely be hampered.
There will not be a single solution that fits all. These parameters need to be
tuned based on the workload.
Thanks & Regards
Somnath
-----Original Message-----
From: Haomai Wang [mailto:[email protected]]
Sent: Wednesday, September 24, 2014 7:56 PM
To: Somnath Roy
Cc: Sage Weil; Milosz Tanski; [email protected]
Subject: Re: Impact of page cache on OSD read performance for SSD
On Thu, Sep 25, 2014 at 7:49 AM, Somnath Roy <[email protected]> wrote:
> Hi,
> After going through the blktrace, I think I have figured out what is
> going on there. I think kernel read_ahead is causing the extra reads
> in the buffered read case. If I set read_ahead = 0, the performance I
> am getting is similar (or better, when cache hits actually happen) to
> direct_io :-)
Hmm, BTW if you set read_ahead=0, how does seq read performance compare to
before?
> IMHO, if a user doesn't want these nasty kernel effects and is sure of the
> random workload pattern, we should provide a configurable direct_io read
> option (need to quantify direct_io write also), as Sage suggested.
>
> Thanks & Regards
> Somnath
>
>
> -----Original Message-----
> From: Haomai Wang [mailto:[email protected]]
> Sent: Wednesday, September 24, 2014 9:06 AM
> To: Sage Weil
> Cc: Somnath Roy; Milosz Tanski; [email protected]
> Subject: Re: Impact of page cache on OSD read performance for SSD
>
> On Wed, Sep 24, 2014 at 8:38 PM, Sage Weil <[email protected]> wrote:
>> On Wed, 24 Sep 2014, Haomai Wang wrote:
>>> I agree that direct reads will help for disk reads. But if the read
>>> data is hot and small enough to fit in memory, the page cache is a good
>>> place to hold cached data. If we discard the page cache, we need to
>>> implement a cache of our own with an effective lookup impl.
>>
>> This is true for some workloads, but not necessarily true for all.
>> Many clients (notably RBD) will be caching at the client side (in the
>> VM's fs, and possibly in librbd itself) such that caching at the OSD
>> is largely wasted effort. For RGW the same is often true, unless
>> there is a varnish cache or something in front.
>
> Even now, I don't think the librbd cache can meet all the cache demands for
> rbd usage. Even if we have an effective librbd cache impl, we still need a
> buffer cache at the ObjectStore level, just like databases have. Client cache
> and host cache are both needed.
>
>>
>> We should probably have a direct_io config option for filestore. But
>> even better would be some hint from the client about whether it is
>> caching or not so that FileStore could conditionally cache...
>
> Yes, I remember we already did some early work like that.
>
>>
>> sage
>>
>> >
>>> BTW, on whether to use direct IO, we can look at MySQL's InnoDB engine,
>>> which uses direct IO, and PostgreSQL, which uses the page cache.
>>>
>>> On Wed, Sep 24, 2014 at 10:29 AM, Somnath Roy <[email protected]>
>>> wrote:
>>> > Haomai,
>>> > I am considering only random reads, and the changes I made affect only
>>> > reads. For writes, I have not measured yet. But, yes, the page cache may
>>> > be helpful for write coalescing. I still need to evaluate how it behaves
>>> > compared to direct_io on SSD, though. I think the Ceph code path will be
>>> > much shorter if we use direct_io in the write path where it is actually
>>> > executing the transactions. Probably the sync thread and all will not be
>>> > needed.
>>> >
>>> > I am trying to analyze where the extra reads are coming from in the
>>> > buffered IO case by using blktrace etc. This should give us a clear
>>> > understanding of what exactly is going on, and it may turn out that by
>>> > tuning kernel parameters alone we can achieve performance similar to
>>> > direct_io.
>>> >
>>> > Thanks & Regards
>>> > Somnath
>>> >
>>> > -----Original Message-----
>>> > From: Haomai Wang [mailto:[email protected]]
>>> > Sent: Tuesday, September 23, 2014 7:07 PM
>>> > To: Sage Weil
>>> > Cc: Somnath Roy; Milosz Tanski; [email protected]
>>> > Subject: Re: Impact of page cache on OSD read performance for SSD
>>> >
>>> > Good point, but have you considered the impact on write ops?
>>> > And if we skip the page cache, is FileStore responsible for the data cache?
>>> >
>>> > On Wed, Sep 24, 2014 at 3:29 AM, Sage Weil <[email protected]> wrote:
>>> >> On Tue, 23 Sep 2014, Somnath Roy wrote:
>>> >>> Milosz,
>>> >>> Thanks for the response. I will see if I can get any information out of
>>> >>> perf.
>>> >>>
>>> >>> Here is my OS information.
>>> >>>
>>> >>> root@emsclient:~# lsb_release -a
>>> >>> No LSB modules are available.
>>> >>> Distributor ID: Ubuntu
>>> >>> Description: Ubuntu 13.10
>>> >>> Release: 13.10
>>> >>> Codename: saucy
>>> >>> root@emsclient:~# uname -a
>>> >>> Linux emsclient 3.11.0-12-generic #19-Ubuntu SMP Wed Oct 9
>>> >>> 16:20:46 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
>>> >>>
>>> >>> BTW, it's not a 45% drop; as you can see, by tuning the OSD parameters
>>> >>> I was able to get almost a *2X* performance improvement with direct_io.
>>> >>> It's not only the page cache (memory) lookup; in the case of buffered_io
>>> >>> the following could be problems:
>>> >>>
>>> >>> 1. Double copy (disk -> file buffer cache, file buffer cache -> user
>>> >>> buffer).
>>> >>>
>>> >>> 2. As the iostat output shows, it is not reading only 4K; it is reading
>>> >>> more data from disk than required, and in the end that will be wasted in
>>> >>> the case of a random workload.
>>> >>
>>> >> It might be worth using blktrace to see what the IOs it is issuing are,
>>> >> which ones are > 4K, and what they point to...
>>> >>
>>> >> sage
>>> >>
>>> >>
>>> >>>
>>> >>> Thanks & Regards
>>> >>> Somnath
>>> >>>
>>> >>> -----Original Message-----
>>> >>> From: Milosz Tanski [mailto:[email protected]]
>>> >>> Sent: Tuesday, September 23, 2014 12:09 PM
>>> >>> To: Somnath Roy
>>> >>> Cc: [email protected]
>>> >>> Subject: Re: Impact of page cache on OSD read performance for
>>> >>> SSD
>>> >>>
>>> >>> Somnath,
>>> >>>
>>> >>> I wonder if there's a bottleneck or a point of contention in the
>>> >>> kernel. For an entirely uncached workload I expect the page cache lookup
>>> >>> to cause a slowdown (since the lookup should be wasted). What I wouldn't
>>> >>> expect is a 45% performance drop. Memory speed should be an order of
>>> >>> magnitude faster than a modern SATA SSD drive (so the overhead should be
>>> >>> closer to negligible).
>>> >>>
>>> >>> Is there any way you could perform the same test but monitor what's
>>> >>> going on with the OSD process using the perf tool? Whatever the default
>>> >>> cpu-time hardware counter is, is fine. Make sure you have the kernel
>>> >>> debug info package installed so you can get symbol information for
>>> >>> kernel and module calls. With any luck the diff of the perf output from
>>> >>> the two runs will show us the culprit.
>>> >>>
>>> >>> Also, can you tell us what OS/kernel version you're using on the OSD
>>> >>> machines?
>>> >>>
>>> >>> - Milosz
>>> >>>
>>> >>> On Tue, Sep 23, 2014 at 2:05 PM, Somnath Roy <[email protected]>
>>> >>> wrote:
>>> >>> > Hi Sage,
>>> >>> > I have created the following setup in order to examine how a single
>>> >>> > OSD behaves if, say, ~80-90% of IOs are hitting the SSDs.
>>> >>> >
>>> >>> > My test includes the following steps.
>>> >>> >
>>> >>> > 1. Created a single OSD cluster.
>>> >>> > 2. Created two rbd images (110GB each) on 2 different pools.
>>> >>> > 3. Populated both images entirely, so my working set is ~210GB. My
>>> >>> > system memory is ~16GB.
>>> >>> > 4. Dropped the page cache before every run.
>>> >>> > 5. Ran fio_rbd (QD 32, 8 instances) in parallel on these two
>>> >>> > images.
>>> >>> >
>>> >>> > Here is my disk iops/bandwidth..
>>> >>> >
>>> >>> > root@emsclient:~/fio_test# fio rad_resd_disk.job
>>> >>> > random-reads: (g=0): rw=randread, bs=4K-4K/4K-4K,
>>> >>> > ioengine=libaio, iodepth=64
>>> >>> > 2.0.8
>>> >>> > Starting 1 process
>>> >>> > Jobs: 1 (f=1): [r] [100.0% done] [154.1M/0K /s] [39.7K/0
>>> >>> > iops] [eta 00m:00s]
>>> >>> > random-reads: (groupid=0, jobs=1): err= 0: pid=1431
>>> >>> > read : io=9316.4MB, bw=158994KB/s, iops=39748 , runt=
>>> >>> > 60002msec
>>> >>> >
>>> >>> > My fio_rbd config..
>>> >>> >
>>> >>> > [global]
>>> >>> > ioengine=rbd
>>> >>> > clientname=admin
>>> >>> > pool=rbd1
>>> >>> > rbdname=ceph_regression_test1
>>> >>> > invalidate=0 # mandatory
>>> >>> > rw=randread
>>> >>> > bs=4k
>>> >>> > direct=1
>>> >>> > time_based
>>> >>> > runtime=2m
>>> >>> > size=109G
>>> >>> > numjobs=8
>>> >>> > [rbd_iodepth32]
>>> >>> > iodepth=32
>>> >>> >
>>> >>> > Now, I have run Giant Ceph on top of that..
>>> >>> >
>>> >>> > 1. OSD config with 25 shards/1 thread per shard :
>>> >>> > -------------------------------------------------------
>>> >>> >
>>> >>> > avg-cpu: %user %nice %system %iowait %steal %idle
>>> >>> > 22.04 0.00 16.46 45.86 0.00 15.64
>>> >>> >
>>> >>> > Device:          rrqm/s   wrqm/s       r/s     w/s      rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>>> >>> > sda                0.00     9.00      0.00    6.00       0.00    92.00    30.67     0.01    1.33    0.00    1.33   1.33   0.80
>>> >>> > sdd                0.00     0.00      0.00    0.00       0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
>>> >>> > sde                0.00     0.00      0.00    0.00       0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
>>> >>> > sdg                0.00     0.00      0.00    0.00       0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
>>> >>> > sdf                0.00     0.00      0.00    0.00       0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
>>> >>> > sdh              181.00     0.00  34961.00    0.00  176740.00     0.00    10.11   102.71    2.92    2.92    0.00   0.03 100.00
>>> >>> > sdc                0.00     0.00      0.00    0.00       0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
>>> >>> > sdb                0.00     0.00      0.00    0.00       0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
>>> >>> >
>>> >>> >
>>> >>> > ceph -s:
>>> >>> > ----------
>>> >>> > root@emsclient:~# ceph -s
>>> >>> > cluster 94991097-7638-4240-b922-f525300a9026
>>> >>> > health HEALTH_OK
>>> >>> > monmap e1: 1 mons at {a=10.196.123.24:6789/0}, election epoch 1,
>>> >>> > quorum 0 a
>>> >>> > osdmap e498: 1 osds: 1 up, 1 in
>>> >>> > pgmap v386366: 832 pgs, 7 pools, 308 GB data, 247 kobjects
>>> >>> > 366 GB used, 1122 GB / 1489 GB avail
>>> >>> > 832 active+clean
>>> >>> > client io 75215 kB/s rd, 18803 op/s
>>> >>> >
>>> >>> > cpu util:
>>> >>> > ----------
>>> >>> > Gradually decreases from ~21 core (serving from cache) to ~10 core
>>> >>> > (while serving from disks).
>>> >>> >
>>> >>> > My Analysis:
>>> >>> > -----------------
>>> >>> > In this case "all is well" while IOs are served from cache
>>> >>> > (XFS is smart enough to cache some data). Once they start hitting
>>> >>> > disks, throughput decreases. As you can see, the disk is delivering
>>> >>> > ~35K IOPS, but OSD throughput is only ~18.8K! So, a cache miss in the
>>> >>> > buffered IO case seems to be very expensive; half of the IOPS are
>>> >>> > wasted. Also, looking at the bandwidth, it is obvious that not
>>> >>> > everything is a 4K read; maybe kernel read_ahead is kicking in (?).
>>> >>> >
>>> >>> >
>>> >>> > Now, I thought of making the Ceph disk reads direct_io and doing the
>>> >>> > same experiment. I have changed FileStore::read to do direct_io only;
>>> >>> > the rest is kept as is. Here is the result with that.
>>> >>> >
>>> >>> >
>>> >>> > Iostat:
>>> >>> > -------
>>> >>> >
>>> >>> > avg-cpu: %user %nice %system %iowait %steal %idle
>>> >>> > 24.77 0.00 19.52 21.36 0.00 34.36
>>> >>> >
>>> >>> > Device:          rrqm/s   wrqm/s       r/s     w/s      rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>>> >>> > sda                0.00     0.00      0.00    0.00       0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
>>> >>> > sdd                0.00     0.00      0.00    0.00       0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
>>> >>> > sde                0.00     0.00      0.00    0.00       0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
>>> >>> > sdg                0.00     0.00      0.00    0.00       0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
>>> >>> > sdf                0.00     0.00      0.00    0.00       0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
>>> >>> > sdh                0.00     0.00  25295.00    0.00  101180.00     0.00     8.00    12.73    0.50    0.50    0.00   0.04 100.80
>>> >>> > sdc                0.00     0.00      0.00    0.00       0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
>>> >>> > sdb                0.00     0.00      0.00    0.00       0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
>>> >>> >
>>> >>> > ceph -s:
>>> >>> > --------
>>> >>> > root@emsclient:~/fio_test# ceph -s
>>> >>> > cluster 94991097-7638-4240-b922-f525300a9026
>>> >>> > health HEALTH_OK
>>> >>> > monmap e1: 1 mons at {a=10.196.123.24:6789/0}, election epoch 1,
>>> >>> > quorum 0 a
>>> >>> > osdmap e522: 1 osds: 1 up, 1 in
>>> >>> > pgmap v386711: 832 pgs, 7 pools, 308 GB data, 247 kobjects
>>> >>> > 366 GB used, 1122 GB / 1489 GB avail
>>> >>> > 832 active+clean
>>> >>> > client io 100 MB/s rd, 25618 op/s
>>> >>> >
>>> >>> > cpu util:
>>> >>> > --------
>>> >>> > ~14 core while serving from disks.
>>> >>> >
>>> >>> > My Analysis:
>>> >>> > ---------------
>>> >>> > No surprises here. The Ceph throughput almost matches whatever the
>>> >>> > disk throughput is.
>>> >>> >
>>> >>> >
>>> >>> > Let's tweak the shard/thread settings and see the impact.
>>> >>> >
>>> >>> >
>>> >>> > 2. OSD config with 36 shards and 1 thread/shard:
>>> >>> > -----------------------------------------------------------
>>> >>> >
>>> >>> > Buffered read:
>>> >>> > ------------------
>>> >>> > No change, output is very similar to 25 shards.
>>> >>> >
>>> >>> >
>>> >>> > direct_io read:
>>> >>> > ------------------
>>> >>> > Iostat:
>>> >>> > ----------
>>> >>> > avg-cpu: %user %nice %system %iowait %steal %idle
>>> >>> > 33.33 0.00 28.22 23.11 0.00 15.34
>>> >>> >
>>> >>> > Device:          rrqm/s   wrqm/s       r/s     w/s      rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>>> >>> > sda                0.00     0.00      0.00    2.00       0.00    12.00    12.00     0.00    0.00    0.00    0.00   0.00   0.00
>>> >>> > sdd                0.00     0.00      0.00    0.00       0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
>>> >>> > sde                0.00     0.00      0.00    0.00       0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
>>> >>> > sdg                0.00     0.00      0.00    0.00       0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
>>> >>> > sdf                0.00     0.00      0.00    0.00       0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
>>> >>> > sdh                0.00     0.00  31987.00    0.00  127948.00     0.00     8.00    18.06    0.56    0.56    0.00   0.03 100.40
>>> >>> > sdc                0.00     0.00      0.00    0.00       0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
>>> >>> > sdb                0.00     0.00      0.00    0.00       0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
>>> >>> >
>>> >>> > ceph -s:
>>> >>> > --------------
>>> >>> > root@emsclient:~/fio_test# ceph -s
>>> >>> > cluster 94991097-7638-4240-b922-f525300a9026
>>> >>> > health HEALTH_OK
>>> >>> > monmap e1: 1 mons at {a=10.196.123.24:6789/0}, election epoch 1,
>>> >>> > quorum 0 a
>>> >>> > osdmap e525: 1 osds: 1 up, 1 in
>>> >>> > pgmap v386746: 832 pgs, 7 pools, 308 GB data, 247 kobjects
>>> >>> > 366 GB used, 1122 GB / 1489 GB avail
>>> >>> > 832 active+clean
>>> >>> > client io 127 MB/s rd, 32763 op/s
>>> >>> >
>>> >>> > cpu util:
>>> >>> > --------------
>>> >>> > ~19 core while serving from disks.
>>> >>> >
>>> >>> > Analysis:
>>> >>> > ------------------
>>> >>> > It is scaling with the increased number of shards/threads. The
>>> >>> > parallelism has also increased significantly.
>>> >>> >
>>> >>> >
>>> >>> > 3. OSD config with 48 shards and 1 thread/shard:
>>> >>> > ----------------------------------------------------------
>>> >>> > Buffered read:
>>> >>> > -------------------
>>> >>> > No change, output is very similar to 25 shards.
>>> >>> >
>>> >>> >
>>> >>> > direct_io read:
>>> >>> > -----------------
>>> >>> > Iostat:
>>> >>> > --------
>>> >>> >
>>> >>> > avg-cpu: %user %nice %system %iowait %steal %idle
>>> >>> > 37.50 0.00 33.72 20.03 0.00 8.75
>>> >>> >
>>> >>> > Device:          rrqm/s   wrqm/s       r/s     w/s      rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>>> >>> > sda                0.00     0.00      0.00    0.00       0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
>>> >>> > sdd                0.00     0.00      0.00    0.00       0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
>>> >>> > sde                0.00     0.00      0.00    0.00       0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
>>> >>> > sdg                0.00     0.00      0.00    0.00       0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
>>> >>> > sdf                0.00     0.00      0.00    0.00       0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
>>> >>> > sdh                0.00     0.00  39114.00    0.00  156460.00     0.00     8.00    35.58    0.90    0.90    0.00   0.03 100.40
>>> >>> > sdc                0.00     0.00      0.00    0.00       0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
>>> >>> > sdb                0.00     0.00      0.00    0.00       0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
>>> >>> >
>>> >>> > ceph -s:
>>> >>> > --------------
>>> >>> > root@emsclient:~/fio_test# ceph -s
>>> >>> > cluster 94991097-7638-4240-b922-f525300a9026
>>> >>> > health HEALTH_OK
>>> >>> > monmap e1: 1 mons at {a=10.196.123.24:6789/0}, election epoch 1,
>>> >>> > quorum 0 a
>>> >>> > osdmap e534: 1 osds: 1 up, 1 in
>>> >>> > pgmap v386830: 832 pgs, 7 pools, 308 GB data, 247 kobjects
>>> >>> > 366 GB used, 1122 GB / 1489 GB avail
>>> >>> > 832 active+clean
>>> >>> > client io 138 MB/s rd, 35582 op/s
>>> >>> >
>>> >>> > cpu util:
>>> >>> > ----------------
>>> >>> > ~22.5 core while serving from disks.
>>> >>> >
>>> >>> > Analysis:
>>> >>> > --------------------
>>> >>> > It is scaling with the increased number of shards/threads. The
>>> >>> > parallelism has also increased significantly.
>>> >>> >
>>> >>> >
>>> >>> >
>>> >>> > 4. OSD config with 64 shards and 1 thread/shard:
>>> >>> > ---------------------------------------------------------
>>> >>> > Buffered read:
>>> >>> > ------------------
>>> >>> > No change, output is very similar to 25 shards.
>>> >>> >
>>> >>> >
>>> >>> > direct_io read:
>>> >>> > -------------------
>>> >>> > Iostat:
>>> >>> > ---------
>>> >>> > avg-cpu: %user %nice %system %iowait %steal %idle
>>> >>> > 40.18 0.00 34.84 19.81 0.00 5.18
>>> >>> >
>>> >>> > Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s
>>> >>> > avgrq-sz avgqu-sz await r_await w_await svctm %util
>>> >>> > sda 0.00 0.00 0.00 0.00 0.00 0.00
>>> >>> > 0.00 0.00 0.00 0.00 0.00 0.00 0.00
>>> >>> > sdd 0.00 0.00 0.00 0.00 0.00 0.00
>>> >>> > 0.00 0.00 0.00 0.00 0.00 0.00 0.00
>>> >>> > sde 0.00 0.00 0.00 0.00 0.00 0.00
>>> >>> > 0.00 0.00 0.00 0.00 0.00 0.00 0.00
>>> >>> > sdg 0.00 0.00 0.00 0.00 0.00 0.00
>>> >>> > 0.00 0.00 0.00 0.00 0.00 0.00 0.00
>>> >>> > sdf 0.00 0.00 0.00 0.00 0.00 0.00
>>> >>> > 0.00 0.00 0.00 0.00 0.00 0.00 0.00
>>> >>> > sdh 0.00 0.00 39114.00 0.00 156460.00 0.00
>>> >>> > 8.00 35.58 0.90 0.90 0.00 0.03 100.40
>>> >>> > sdc 0.00 0.00 0.00 0.00 0.00 0.00
>>> >>> > 0.00 0.00 0.00 0.00 0.00 0.00 0.00
>>> >>> > sdb 0.00 0.00 0.00 0.00 0.00 0.00
>>> >>> > 0.00 0.00 0.00 0.00 0.00 0.00 0.00
>>> >>> >
>>> >>> > ceph -s:
>>> >>> > ---------------
>>> >>> > root@emsclient:~/fio_test# ceph -s
>>> >>> > cluster 94991097-7638-4240-b922-f525300a9026
>>> >>> > health HEALTH_OK
>>> >>> > monmap e1: 1 mons at {a=10.196.123.24:6789/0}, election epoch 1,
>>> >>> > quorum 0 a
>>> >>> > osdmap e537: 1 osds: 1 up, 1 in
>>> >>> > pgmap v386865: 832 pgs, 7 pools, 308 GB data, 247 kobjects
>>> >>> > 366 GB used, 1122 GB / 1489 GB avail
>>> >>> > 832 active+clean
>>> >>> > client io 153 MB/s rd, 39172 op/s
>>> >>> >
>>> >>> > cpu util:
>>> >>> > ----------------
>>> >>> > ~24.5 core while serving from disks. ~3% cpu left.
>>> >>> >
>>> >>> > Analysis:
>>> >>> > ------------------
>>> >>> > It is scaling with the increased number of shards/threads. The
>>> >>> > parallelism has also increased significantly. It is disk bound now.
>>> >>> >
>>> >>> >
>>> >>> > Summary:
>>> >>> >
>>> >>> > So, it seems buffered IO has a significant impact on performance in
>>> >>> > case the backend is SSD.
>>> >>> > My question is: if the workload is very random and the storage (SSD) is
>>> >>> > very large compared to system memory, shouldn't we always go for
>>> >>> > direct_io instead of buffered IO in Ceph?
>>> >>> >
>>> >>> > Please share your thoughts/suggestion on this.
>>> >>> >
>>> >>> > Thanks & Regards
>>> >>> > Somnath
>>> >>> >
>>> >>>
>>> >>>
>>> >>>
>>> >>> --
>>> >>> Milosz Tanski
>>> >>> CTO
>>> >>> 16 East 34th Street, 15th floor
>>> >>> New York, NY 10016
>>> >>>
>>> >>> p: 646-253-9055
>>> >>> e: [email protected]
>>> >
>>> >
>>> >
>>> > --
>>> > Best Regards,
>>> >
>>> > Wheat
>>>
>>>
>>>
>>> --
>>> Best Regards,
>>>
>>> Wheat
>>>
>>>
>
>
>
> --
> Best Regards,
>
> Wheat
--
Best Regards,
Wheat