Good point, but have you considered the impact on write ops? And if we skip
the page cache, is FileStore then responsible for caching the data itself?

On Wed, Sep 24, 2014 at 3:29 AM, Sage Weil <[email protected]> wrote:
> On Tue, 23 Sep 2014, Somnath Roy wrote:
>> Milosz,
>> Thanks for the response. I will see if I can get any information out of perf.
>>
>> Here is my OS information.
>>
>> root@emsclient:~# lsb_release -a
>> No LSB modules are available.
>> Distributor ID: Ubuntu
>> Description:    Ubuntu 13.10
>> Release:        13.10
>> Codename:       saucy
>> root@emsclient:~# uname -a
>> Linux emsclient 3.11.0-12-generic #19-Ubuntu SMP Wed Oct 9 16:20:46 UTC 2013 
>> x86_64 x86_64 x86_64 GNU/Linux
>>
>> BTW, it's not a 45% drop; as you can see, by tuning the OSD parameters I was
>> able to get almost a *2X* performance improvement with direct_io.
>> It's not only the page cache (memory) lookup; in the buffered_io case the
>> following could be problems:
>>
>> 1. Double copy (disk -> file buffer cache, file buffer cache -> user buffer)
>>
>> 2. As the iostat output shows, it is not reading only 4K; it is reading
>> more data from disk than required, which in the end is wasted in the case
>> of a random workload.
>
> It might be worth using blktrace to see what IOs it is issuing:
> which ones are > 4K and what they point to...
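>
> For example, something along these lines against the data disk while the fio
> workload is running (device name taken from the iostat output above; the
> 30-second run time is arbitrary):
>
>     blktrace -d /dev/sdh -o sdh_trace -w 30
>     blkparse -i sdh_trace | less
>
> The size field of the Q/D (queue/dispatch) records shows which requests are
> larger than 8 sectors (4K) and where they land.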
>
> sage
>
>
>>
>> Thanks & Regards
>> Somnath
>>
>> -----Original Message-----
>> From: Milosz Tanski [mailto:[email protected]]
>> Sent: Tuesday, September 23, 2014 12:09 PM
>> To: Somnath Roy
>> Cc: [email protected]
>> Subject: Re: Impact of page cache on OSD read performance for SSD
>>
>> Somnath,
>>
>> I wonder if there's a bottleneck or a point of contention in the kernel.
>> For an entirely uncached workload I expect the page cache lookup to cause a
>> slowdown (since the lookup is wasted effort). What I wouldn't expect is a
>> 45% performance drop. Memory speed should be an order of magnitude faster than
>> a modern SATA SSD drive (so the overhead should be close to negligible).
>>
>> Is there any way you could perform the same test but monitor what's going on
>> with the OSD process using the perf tool? The default CPU-time hardware
>> counter is fine. Make sure you have the kernel debug info package installed
>> so you can get symbol information for kernel and module calls.
>> With any luck the diff of the perf output from the two runs will show us the culprit.
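>>
>> A minimal way to do that, assuming a single ceph-osd process (the duration
>> and file name below are just illustrative):
>>
>>     perf record -g -p $(pidof ceph-osd) -- sleep 60
>>     perf report --stdio > buffered_run.txt
>>
>> Repeat for the direct_io run and diff the two reports.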
>>
>> Also, can you tell us what OS/kernel version you're using on the OSD 
>> machines?
>>
>> - Milosz
>>
>> On Tue, Sep 23, 2014 at 2:05 PM, Somnath Roy <[email protected]> wrote:
>> > Hi Sage,
>> > I have created the following setup in order to examine how a single OSD
>> > behaves when, say, ~80-90% of IOs hit the SSDs.
>> >
>> > My test includes the following steps.
>> >
>> >         1. Created a single OSD cluster.
>> >         2. Created two rbd images (110GB each) on 2 different pools.
>> >         3. Populated both images entirely, so my working set is ~210GB. My
>> > system memory is ~16GB.
>> >         4. Dropped the page cache before every run (command shown below).
>> >         5. Ran fio_rbd (QD 32, 8 instances) in parallel on these two 
>> > images.
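>> >
>> > (Step 4 refers to the standard drop_caches knob; a typical invocation is:
>> >
>> >         sync; echo 3 > /proc/sys/vm/drop_caches
>> > )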
>> >
>> > Here is my disk iops/bandwidth..
>> >
>> >         root@emsclient:~/fio_test# fio rad_resd_disk.job
>> >         random-reads: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, 
>> > iodepth=64
>> >         2.0.8
>> >         Starting 1 process
>> >         Jobs: 1 (f=1): [r] [100.0% done] [154.1M/0K /s] [39.7K/0  iops] 
>> > [eta 00m:00s]
>> >         random-reads: (groupid=0, jobs=1): err= 0: pid=1431
>> >         read : io=9316.4MB, bw=158994KB/s, iops=39748 , runt=
>> > 60002msec
>> >
>> > My fio_rbd config..
>> >
>> > [global]
>> > ioengine=rbd
>> > clientname=admin
>> > pool=rbd1
>> > rbdname=ceph_regression_test1
>> > invalidate=0    # mandatory
>> > rw=randread
>> > bs=4k
>> > direct=1
>> > time_based
>> > runtime=2m
>> > size=109G
>> > numjobs=8
>> > [rbd_iodepth32]
>> > iodepth=32
>> >
>> > Now, I have run Giant Ceph on top of that..
>> >
>> > 1. OSD config with 25 shards/1 thread per shard :
>> > -------------------------------------------------------
>> >
>> >          avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>> >           22.04    0.00   16.46   45.86    0.00   15.64
>> >
>> > Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz 
>> > avgqu-sz   await r_await w_await  svctm  %util
>> > sda               0.00     9.00    0.00    6.00     0.00    92.00    30.67 
>> >     0.01    1.33    0.00    1.33   1.33   0.80
>> > sdd               0.00     0.00    0.00    0.00     0.00     0.00     0.00 
>> >     0.00    0.00    0.00    0.00   0.00   0.00
>> > sde               0.00     0.00    0.00    0.00     0.00     0.00     0.00 
>> >     0.00    0.00    0.00    0.00   0.00   0.00
>> > sdg               0.00     0.00    0.00    0.00     0.00     0.00     0.00 
>> >     0.00    0.00    0.00    0.00   0.00   0.00
>> > sdf               0.00     0.00    0.00    0.00     0.00     0.00     0.00 
>> >     0.00    0.00    0.00    0.00   0.00   0.00
>> > sdh             181.00     0.00 34961.00    0.00 176740.00     0.00    
>> > 10.11   102.71    2.92    2.92    0.00   0.03 100.00
>> > sdc               0.00     0.00    0.00    0.00     0.00     0.00     0.00 
>> >     0.00    0.00    0.00    0.00   0.00   0.00
>> > sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00 
>> >     0.00    0.00    0.00    0.00   0.00   0.00
>> >
>> >
>> > ceph -s:
>> >  ----------
>> > root@emsclient:~# ceph -s
>> >     cluster 94991097-7638-4240-b922-f525300a9026
>> >      health HEALTH_OK
>> >      monmap e1: 1 mons at {a=10.196.123.24:6789/0}, election epoch 1, 
>> > quorum 0 a
>> >      osdmap e498: 1 osds: 1 up, 1 in
>> >       pgmap v386366: 832 pgs, 7 pools, 308 GB data, 247 kobjects
>> >             366 GB used, 1122 GB / 1489 GB avail
>> >                  832 active+clean
>> >   client io 75215 kB/s rd, 18803 op/s
>> >
>> >  cpu util:
>> > ----------
>> >  Gradually decreases from ~21 core (serving from cache) to ~10 core (while 
>> > serving from disks).
>> >
>> >  My Analysis:
>> > -----------------
>> >  In this case all is well as long as IOs are served from cache (XFS is
>> > smart enough to cache some data). Once IOs start hitting the disks,
>> > throughput decreases. As you can see, the disk is delivering ~35K IOPS, but
>> > OSD throughput is only ~18.8K! So, a cache miss in the buffered IO case
>> > seems to be very expensive: roughly half of the IOPS are wasted. Also, looking
>> > at the bandwidth, it is obvious that not every read is 4K; maybe kernel
>> > read_ahead is kicking in (?).
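>> >
>> > (One quick way to check the readahead theory: the device readahead setting
>> > can be read with either of the commands below, and set to 0 to rule it out.
>> >
>> >         blockdev --getra /dev/sdh
>> >         cat /sys/block/sdh/queue/read_ahead_kb
>> > )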
>> >
>> >
>> > Now, I thought of making the Ceph disk reads direct_io and running the same
>> > experiment. I changed FileStore::read to use direct_io only; the rest was
>> > kept as is. Here is the result with that.
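>> >
>> > Not the actual change, just a minimal standalone sketch of what an O_DIRECT
>> > read path looks like, with a hypothetical helper name, to show the alignment
>> > requirements involved:
>> >
>> >     // Illustration only, not FileStore code. O_DIRECT needs the buffer,
>> >     // offset and length aligned, typically to the logical block size.
>> >     // Compile as C++ (g++ defines _GNU_SOURCE, which exposes O_DIRECT).
>> >     #include <fcntl.h>
>> >     #include <unistd.h>
>> >     #include <cstdlib>
>> >
>> >     static const size_t ALIGN = 4096;            // assumed alignment
>> >
>> >     ssize_t direct_read(const char *path, off_t off, size_t len, void **out)
>> >     {
>> >       int fd = open(path, O_RDONLY | O_DIRECT);
>> >       if (fd < 0)
>> >         return -1;
>> >       void *buf = NULL;
>> >       if (posix_memalign(&buf, ALIGN, len)) {    // aligned buffer for O_DIRECT
>> >         close(fd);
>> >         return -1;
>> >       }
>> >       ssize_t r = pread(fd, buf, len, off);      // off/len assumed aligned
>> >       close(fd);
>> >       *out = buf;                                // caller frees with free()
>> >       return r;
>> >     }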
>> >
>> >
>> > Iostat:
>> > -------
>> >
>> > avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>> >           24.77    0.00   19.52   21.36    0.00   34.36
>> >
>> > Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz 
>> > avgqu-sz   await r_await w_await  svctm  %util
>> > sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00 
>> >     0.00    0.00    0.00    0.00   0.00   0.00
>> > sdd               0.00     0.00    0.00    0.00     0.00     0.00     0.00 
>> >     0.00    0.00    0.00    0.00   0.00   0.00
>> > sde               0.00     0.00    0.00    0.00     0.00     0.00     0.00 
>> >     0.00    0.00    0.00    0.00   0.00   0.00
>> > sdg               0.00     0.00    0.00    0.00     0.00     0.00     0.00 
>> >     0.00    0.00    0.00    0.00   0.00   0.00
>> > sdf               0.00     0.00    0.00    0.00     0.00     0.00     0.00 
>> >     0.00    0.00    0.00    0.00   0.00   0.00
>> > sdh               0.00     0.00 25295.00    0.00 101180.00     0.00     
>> > 8.00    12.73    0.50    0.50    0.00   0.04 100.80
>> > sdc               0.00     0.00    0.00    0.00     0.00     0.00     0.00 
>> >     0.00    0.00    0.00    0.00   0.00   0.00
>> > sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00 
>> >     0.00    0.00    0.00    0.00   0.00   0.00
>> >
>> > ceph -s:
>> >  --------
>> > root@emsclient:~/fio_test# ceph -s
>> >     cluster 94991097-7638-4240-b922-f525300a9026
>> >      health HEALTH_OK
>> >      monmap e1: 1 mons at {a=10.196.123.24:6789/0}, election epoch 1, 
>> > quorum 0 a
>> >      osdmap e522: 1 osds: 1 up, 1 in
>> >       pgmap v386711: 832 pgs, 7 pools, 308 GB data, 247 kobjects
>> >             366 GB used, 1122 GB / 1489 GB avail
>> >                  832 active+clean
>> >   client io 100 MB/s rd, 25618 op/s
>> >
>> > cpu util:
>> > --------
>> >   ~14 core while serving from disks.
>> >
>> >  My Analysis:
>> >  ---------------
>> > No surprises here. Ceph throughput almost matches the disk throughput.
>> >
>> >
>> > Let's tweak the shard/thread settings and see the impact.
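>> >
>> > (The shard/thread counts here are the OSD's sharded op worker queue
>> > settings; in ceph.conf, assuming the Giant option names, e.g.:
>> >
>> >         [osd]
>> >         osd_op_num_shards = 36
>> >         osd_op_num_threads_per_shard = 1
>> > )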
>> >
>> >
>> > 2. OSD config with 36 shards and 1 thread/shard:
>> > -----------------------------------------------------------
>> >
>> >    Buffered read:
>> >    ------------------
>> >   No change, output is very similar to 25 shards.
>> >
>> >
>> >   direct_io read:
>> >   ------------------
>> >        Iostat:
>> >       ----------
>> > avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>> >           33.33    0.00   28.22   23.11    0.00   15.34
>> >
>> > Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz 
>> > avgqu-sz   await r_await w_await  svctm  %util
>> > sda               0.00     0.00    0.00    2.00     0.00    12.00    12.00 
>> >     0.00    0.00    0.00    0.00   0.00   0.00
>> > sdd               0.00     0.00    0.00    0.00     0.00     0.00     0.00 
>> >     0.00    0.00    0.00    0.00   0.00   0.00
>> > sde               0.00     0.00    0.00    0.00     0.00     0.00     0.00 
>> >     0.00    0.00    0.00    0.00   0.00   0.00
>> > sdg               0.00     0.00    0.00    0.00     0.00     0.00     0.00 
>> >     0.00    0.00    0.00    0.00   0.00   0.00
>> > sdf               0.00     0.00    0.00    0.00     0.00     0.00     0.00 
>> >     0.00    0.00    0.00    0.00   0.00   0.00
>> > sdh               0.00     0.00 31987.00    0.00 127948.00     0.00     
>> > 8.00    18.06    0.56    0.56    0.00   0.03 100.40
>> > sdc               0.00     0.00    0.00    0.00     0.00     0.00     0.00 
>> >     0.00    0.00    0.00    0.00   0.00   0.00
>> > sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00 
>> >     0.00    0.00    0.00    0.00   0.00   0.00
>> >
>> >        ceph -s:
>> >     --------------
>> > root@emsclient:~/fio_test# ceph -s
>> >     cluster 94991097-7638-4240-b922-f525300a9026
>> >      health HEALTH_OK
>> >      monmap e1: 1 mons at {a=10.196.123.24:6789/0}, election epoch 1, 
>> > quorum 0 a
>> >      osdmap e525: 1 osds: 1 up, 1 in
>> >       pgmap v386746: 832 pgs, 7 pools, 308 GB data, 247 kobjects
>> >             366 GB used, 1122 GB / 1489 GB avail
>> >                  832 active+clean
>> >   client io 127 MB/s rd, 32763 op/s
>> >
>> >         cpu util:
>> >    --------------
>> >        ~19 core while serving from disks.
>> >
>> >          Analysis:
>> > ------------------
>> >         It is scaling with the increased number of shards/threads; the
>> > parallelism also increased significantly.
>> >
>> >
>> > 3. OSD config with 48 shards and 1 thread/shard:
>> >  ----------------------------------------------------------
>> >     Buffered read:
>> >    -------------------
>> >     No change, output is very similar to 25 shards.
>> >
>> >
>> >    direct_io read:
>> >     -----------------
>> >        Iostat:
>> >       --------
>> >
>> > avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>> >           37.50    0.00   33.72   20.03    0.00    8.75
>> >
>> > Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz 
>> > avgqu-sz   await r_await w_await  svctm  %util
>> > sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00 
>> >     0.00    0.00    0.00    0.00   0.00   0.00
>> > sdd               0.00     0.00    0.00    0.00     0.00     0.00     0.00 
>> >     0.00    0.00    0.00    0.00   0.00   0.00
>> > sde               0.00     0.00    0.00    0.00     0.00     0.00     0.00 
>> >     0.00    0.00    0.00    0.00   0.00   0.00
>> > sdg               0.00     0.00    0.00    0.00     0.00     0.00     0.00 
>> >     0.00    0.00    0.00    0.00   0.00   0.00
>> > sdf               0.00     0.00    0.00    0.00     0.00     0.00     0.00 
>> >     0.00    0.00    0.00    0.00   0.00   0.00
>> > sdh               0.00     0.00 35360.00    0.00 141440.00     0.00     
>> > 8.00    22.25    0.62    0.62    0.00   0.03 100.40
>> > sdc               0.00     0.00    0.00    0.00     0.00     0.00     0.00 
>> >     0.00    0.00    0.00    0.00   0.00   0.00
>> > sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00 
>> >     0.00    0.00    0.00    0.00   0.00   0.00
>> >
>> >          ceph -s:
>> >        --------------
>> > root@emsclient:~/fio_test# ceph -s
>> >     cluster 94991097-7638-4240-b922-f525300a9026
>> >      health HEALTH_OK
>> >      monmap e1: 1 mons at {a=10.196.123.24:6789/0}, election epoch 1, 
>> > quorum 0 a
>> >      osdmap e534: 1 osds: 1 up, 1 in
>> >       pgmap v386830: 832 pgs, 7 pools, 308 GB data, 247 kobjects
>> >             366 GB used, 1122 GB / 1489 GB avail
>> >                  832 active+clean
>> >   client io 138 MB/s rd, 35582 op/s
>> >
>> >          cpu util:
>> >  ----------------
>> >         ~22.5 core while serving from disks.
>> >
>> >           Analysis:
>> >  --------------------
>> >         It is scaling with the increased number of shards/threads; the
>> > parallelism also increased significantly.
>> >
>> >
>> >
>> > 4. OSD config with 64 shards and 1 thread/shard:
>> >  ---------------------------------------------------------
>> >       Buffered read:
>> >      ------------------
>> >      No change, output is very similar to 25 shards.
>> >
>> >
>> >      direct_io read:
>> >      -------------------
>> >        Iostat:
>> >       ---------
>> > avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>> >           40.18    0.00   34.84   19.81    0.00    5.18
>> >
>> > Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz 
>> > avgqu-sz   await r_await w_await  svctm  %util
>> > sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00 
>> >     0.00    0.00    0.00    0.00   0.00   0.00
>> > sdd               0.00     0.00    0.00    0.00     0.00     0.00     0.00 
>> >     0.00    0.00    0.00    0.00   0.00   0.00
>> > sde               0.00     0.00    0.00    0.00     0.00     0.00     0.00 
>> >     0.00    0.00    0.00    0.00   0.00   0.00
>> > sdg               0.00     0.00    0.00    0.00     0.00     0.00     0.00 
>> >     0.00    0.00    0.00    0.00   0.00   0.00
>> > sdf               0.00     0.00    0.00    0.00     0.00     0.00     0.00 
>> >     0.00    0.00    0.00    0.00   0.00   0.00
>> > sdh               0.00     0.00 39114.00    0.00 156460.00     0.00     
>> > 8.00    35.58    0.90    0.90    0.00   0.03 100.40
>> > sdc               0.00     0.00    0.00    0.00     0.00     0.00     0.00 
>> >     0.00    0.00    0.00    0.00   0.00   0.00
>> > sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00 
>> >     0.00    0.00    0.00    0.00   0.00   0.00
>> >
>> >        ceph -s:
>> >  ---------------
>> > root@emsclient:~/fio_test# ceph -s
>> >     cluster 94991097-7638-4240-b922-f525300a9026
>> >      health HEALTH_OK
>> >      monmap e1: 1 mons at {a=10.196.123.24:6789/0}, election epoch 1, 
>> > quorum 0 a
>> >      osdmap e537: 1 osds: 1 up, 1 in
>> >       pgmap v386865: 832 pgs, 7 pools, 308 GB data, 247 kobjects
>> >             366 GB used, 1122 GB / 1489 GB avail
>> >                  832 active+clean
>> >   client io 153 MB/s rd, 39172 op/s
>> >
>> >       cpu util:
>> > ----------------
>> >     ~24.5 core while serving from disks. ~3% cpu left.
>> >
>> >        Analysis:
>> > ------------------
>> >       It is scaling with the increased number of shards/threads; the
>> > parallelism also increased significantly. It is disk bound now.
>> >
>> >
>> > Summary:
>> >
>> > So, it seems buffered IO has a significant impact on performance when the
>> > backend is an SSD.
>> > My question is: if the workload is very random and the storage (SSD) is very
>> > large compared to system memory, shouldn't we always go for direct_io
>> > instead of buffered IO from Ceph?
>> >
>> > Please share your thoughts/suggestion on this.
>> >
>> > Thanks & Regards
>> > Somnath
>> >
>>
>>
>>
>> --
>> Milosz Tanski
>> CTO
>> 16 East 34th Street, 15th floor
>> New York, NY 10016
>>
>> p: 646-253-9055
>> e: [email protected]



-- 
Best Regards,

Wheat
