Hi,
I did some digging into the blktrace output to understand why this read_ahead_kb
setting is impacting performance in my setup (which is a single-OSD cluster).
Here is the result.
99% of the IOs were performed by the following processes during the blktrace
collection window.
1. For the ceph-osd process (including the "unknown" processes, which I figured
out are just different threads of the OSD):
Events             Read_ahead_kb = 128   Read_ahead_kb = 0   Direct_io
Reads Queued                   4140687             4168816     4042634
Read Dispatches                7734617             5660597     4839428
Reads Requeued                 4574032             1789149      944688
Reads Completed                2532893             2996269     3027387
Reads Merges                      6415                   2           0
IO unplugs                     3380175              100911     4042714
2. For the swapper process:
Events             Read_ahead_kb = 128   Read_ahead_kb = 0   Direct_io
Reads Queued                         0                   0           0
Read Dispatches                  1836K              459028      258743
Reads Requeued                   1129K              254808      132605
Reads Completed                  1175K              937138      891107
Reads Merges                         0                   0           0
IO unplugs                           0                   0           0
Now, if we compare the total amount of reads that happened during this time for
the three different settings:
Events             Read_ahead_kb = 128   Read_ahead_kb = 0   Direct_io
Reads Queued                     4140K               4168K       4042K
Read Dispatches                 10390K               6363K       5151K
Reads Requeued                   6256K               2194K       1108K
Reads Completed                  4134K               4168K       4042K
Reads Merges                      6415                   2           0
IO unplugs                     3380183              100924     4042721
Here is my analysis:
1. There are a lot more read dispatches (~4M more than with read_ahead_kb = 0)
when we set read_ahead_kb = 128.
2. The swapper process (which I think is doing the readahead?) is issuing a lot
more reads when read_ahead_kb = 128.
3. Read merges are almost 0 in all cases other than the first one, which
suggests the workload is very random (?). The higher number of merges in the
first case is probably because of read_ahead (?).
Some open questions:
1. Why is the reads-completed count lower? Is it ceph read completions plus
swapper read completions? Even then, it still does not match the dispatches.
2. Why are IO unplugs huge for read_ahead_kb = 128 and direct_io compared to
read_ahead_kb = 0?
3. Why are there so many requeues?
4. Should requeued + queued equal dispatched? (See the quick check below.)
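A quick arithmetic check on question 4 against the totals above: with
read_ahead_kb = 128, 4140K queued + 6256K requeued = 10396K, which is close to
the 10390K dispatched; with read_ahead_kb = 0, 4168K + 2194K = 6362K vs. 6363K
dispatched; and with direct_io, 4042K + 1108K = 5150K vs. 5151K dispatched. So
queued + requeued does appear to roughly equal dispatched in all three cases.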
I tried setting different kernel parameters like
nr_requests/scheduler/rq_affinity/vm_cache_pressure etc., but in my workload I
am still consistently getting a ~50% improvement by setting read_ahead_kb = 0.
I don't have much expertise in the Linux block layer, so I am reaching out to
the community for answers/suggestions.
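For reference, the Direct_io numbers above come from a read path switched to
O_DIRECT. A minimal, hypothetical sketch of such an aligned direct read
(illustrative only, not the actual FileStore::read change; the direct_read
helper name, the fixed 4 KB alignment and the error handling are my own
simplifications) might look roughly like this:

    // Hypothetical sketch only: an O_DIRECT aligned read similar in spirit to
    // the modified read path used for the "Direct_io" runs above.
    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE            // O_DIRECT needs this on Linux with some compilers
    #endif
    #include <fcntl.h>
    #include <unistd.h>
    #include <cerrno>
    #include <cstdlib>
    #include <cstring>

    // Reads 'len' bytes at 'offset' from 'path', bypassing the page cache.
    // O_DIRECT requires the buffer, offset and length to be block aligned
    // (a 4096-byte logical block size is assumed here).
    ssize_t direct_read(const char* path, off_t offset, size_t len, void* out)
    {
        const size_t align = 4096;                     // assumed block size
        int fd = ::open(path, O_RDONLY | O_DIRECT);
        if (fd < 0)
            return -errno;

        void* buf = nullptr;
        if (::posix_memalign(&buf, align, len) != 0) { // aligned buffer for O_DIRECT
            ::close(fd);
            return -ENOMEM;
        }

        ssize_t r = ::pread(fd, buf, len, offset);     // bypasses page cache, no readahead
        int saved_errno = errno;
        if (r > 0)
            ::memcpy(out, buf, r);                     // copy out to the caller's buffer

        ::free(buf);
        ::close(fd);
        return r < 0 ? -saved_errno : r;
    }

In a real read path one would presumably also round the offset/length to block
boundaries and copy out only the requested range; the point is just that an
O_DIRECT read skips the page cache, so there is no kernel readahead and no
extra copy through the cache.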
Thanks & Regards
Somnath
-----Original Message-----
From: Somnath Roy
Sent: Thursday, September 25, 2014 12:11 AM
To: 'Chen, Xiaoxi'; Haomai Wang
Cc: Sage Weil; Milosz Tanski; [email protected]
Subject: RE: Impact of page cache on OSD read performance for SSD
Well, you never know!
It depends upon a lot of factors, starting from your workload/different kernel
params/RAID controller etc. I have shared my observations from my environment
with a 4K pseudo-random fio_rbd workload. A truly random workload should not
kick off read_ahead, though.
The OP_QUEUE optimization brings more parallelism to the filestore reads, so
more reads going to disk in parallel may have exposed this.
Anyway, I am in the process of analyzing why the default read_ahead is causing
problems for me; I will update if I find anything.
Thanks & Regards
Somnath
-----Original Message-----
From: Chen, Xiaoxi [mailto:[email protected]]
Sent: Wednesday, September 24, 2014 10:00 PM
To: Somnath Roy; Haomai Wang
Cc: Sage Weil; Milosz Tanski; [email protected]
Subject: RE: Impact of page cache on OSD read performance for SSD
Have you ever seen a large read_ahead_kb hurt random performance?
We usually set it very large (2M), and the random read performance stays steady,
even in an all-SSD setup. Maybe with your optimization code for OP_QUEUE,
things are different?
-----Original Message-----
From: [email protected]
[mailto:[email protected]] On Behalf Of Somnath Roy
Sent: Thursday, September 25, 2014 11:15 AM
To: Haomai Wang
Cc: Sage Weil; Milosz Tanski; [email protected]
Subject: RE: Impact of page cache on OSD read performance for SSD
It will definitely be hampered.
There will not be a single solution that fits all. These parameters need to be
tuned based on the workload.
Thanks & Regards
Somnath
-----Original Message-----
From: Haomai Wang [mailto:[email protected]]
Sent: Wednesday, September 24, 2014 7:56 PM
To: Somnath Roy
Cc: Sage Weil; Milosz Tanski; [email protected]
Subject: Re: Impact of page cache on OSD read performance for SSD
On Thu, Sep 25, 2014 at 7:49 AM, Somnath Roy <[email protected]> wrote:
> Hi,
> After going through the blktrace, I think I have figured out what is
> going on there. I think kernel read_ahead is causing the extra reads
> in the buffered read case. If I set read_ahead = 0, the performance I
> am getting is similar (or better, when cache hits actually happen) to
> direct_io :-)
Hmm, BTW if you set read_ahead=0, how does seq read performance compare to
before?
> IMHO, if a user doesn't want these nasty kernel effects and is sure of the
> random workload pattern, we should provide a configurable direct_io read
> option (need to quantify direct_io write also), as Sage suggested.
>
> Thanks & Regards
> Somnath
>
>
> -----Original Message-----
> From: Haomai Wang [mailto:[email protected]]
> Sent: Wednesday, September 24, 2014 9:06 AM
> To: Sage Weil
> Cc: Somnath Roy; Milosz Tanski; [email protected]
> Subject: Re: Impact of page cache on OSD read performance for SSD
>
> On Wed, Sep 24, 2014 at 8:38 PM, Sage Weil <[email protected]> wrote:
>> On Wed, 24 Sep 2014, Haomai Wang wrote:
>>> I agree that direct reads will help for disk reads. But if the read
>>> data is hot and small enough to fit in memory, the page cache is a good
>>> place to hold cached data. If we discard the page cache, we need to
>>> implement a cache of our own with an effective lookup impl.
>>
>> This is true for some workloads, but not necessarily true for all.
>> Many clients (notably RBD) will be caching at the client side (in the
>> VM's fs, and possibly in librbd itself) such that caching at the OSD
>> is largely wasted effort. For RGW the same is often true, unless
>> there is a varnish cache or something in front.
>
> Even now, I don't think the librbd cache can meet all the cache demands for
> rbd usage. Even if we have an effective librbd cache impl, we still need a
> buffer cache at the ObjectStore level, just like databases have. Client cache
> and host cache are both needed.
>
>>
>> We should probably have a direct_io config option for filestore. But
>> even better would be some hint from the client about whether it is
>> caching or not so that FileStore could conditionally cache...
>
> Yes, I remember we already did some early work like that.
>
>>
>> sage
>>
>> >
>>> BTW, on whether to use direct IO, we can look at MySQL's InnoDB engine,
>>> which uses direct IO, and PostgreSQL, which uses the page cache.
>>>
>>> On Wed, Sep 24, 2014 at 10:29 AM, Somnath Roy <[email protected]>
>>> wrote:
>>> > Haomai,
>>> > I am considering only random reads, and the changes I made affect only
>>> > reads. For writes, I have not measured yet. But, yes, the page cache may
>>> > be helpful for write coalescing. I still need to evaluate how it behaves
>>> > compared to direct_io on SSD, though. I think the Ceph code path will be
>>> > much shorter if we use direct_io in the write path where it is actually
>>> > executing the transactions. Probably the sync thread and all will not be
>>> > needed.
>>> >
>>> > I am trying to analyze where the extra reads are coming from in the
>>> > buffered IO case by using blktrace etc. This should give us a clear
>>> > understanding of what exactly is going on, and it may turn out that by
>>> > tuning kernel parameters alone we can achieve performance similar to
>>> > direct_io.
>>> >
>>> > Thanks & Regards
>>> > Somnath
>>> >
>>> > -----Original Message-----
>>> > From: Haomai Wang [mailto:[email protected]]
>>> > Sent: Tuesday, September 23, 2014 7:07 PM
>>> > To: Sage Weil
>>> > Cc: Somnath Roy; Milosz Tanski; [email protected]
>>> > Subject: Re: Impact of page cache on OSD read performance for SSD
>>> >
>>> > Good point, but have you considered the impact on write ops?
>>> > And if we skip the page cache, is FileStore responsible for the data cache?
>>> >
>>> > On Wed, Sep 24, 2014 at 3:29 AM, Sage Weil <[email protected]> wrote:
>>> >> On Tue, 23 Sep 2014, Somnath Roy wrote:
>>> >>> Milosz,
>>> >>> Thanks for the response. I will see if I can get any information out of
>>> >>> perf.
>>> >>>
>>> >>> Here is my OS information.
>>> >>>
>>> >>> root@emsclient:~# lsb_release -a
>>> >>> No LSB modules are available.
>>> >>> Distributor ID: Ubuntu
>>> >>> Description: Ubuntu 13.10
>>> >>> Release: 13.10
>>> >>> Codename: saucy
>>> >>> root@emsclient:~# uname -a
>>> >>> Linux emsclient 3.11.0-12-generic #19-Ubuntu SMP Wed Oct 9
>>> >>> 16:20:46 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
>>> >>>
>>> >>> BTW, it's not a 45% drop; as you can see, by tuning the OSD parameters
>>> >>> I was able to get almost a *2X* performance improvement with direct_io.
>>> >>> It's not only the page cache (memory) lookup; in the case of buffered_io
>>> >>> the following could be problems:
>>> >>>
>>> >>> 1. Double copy (disk -> file buffer cache, file buffer cache -> user
>>> >>> buffer).
>>> >>>
>>> >>> 2. As the iostat output shows, it is not reading only 4K; it is reading
>>> >>> more data from disk than required, and in the end that will be wasted in
>>> >>> the case of a random workload.
>>> >>
>>> >> It might be worth using blktrace to see what the IOs it is issuing are,
>>> >> which ones are > 4K, and what they point to...
>>> >>
>>> >> sage
>>> >>
>>> >>
>>> >>>
>>> >>> Thanks & Regards
>>> >>> Somnath
>>> >>>
>>> >>> -----Original Message-----
>>> >>> From: Milosz Tanski [mailto:[email protected]]
>>> >>> Sent: Tuesday, September 23, 2014 12:09 PM
>>> >>> To: Somnath Roy
>>> >>> Cc: [email protected]
>>> >>> Subject: Re: Impact of page cache on OSD read performance for
>>> >>> SSD
>>> >>>
>>> >>> Somnath,
>>> >>>
>>> >>> I wonder if there's a bottleneck or a point of contention in the
>>> >>> kernel. For an entirely uncached workload I expect the page cache lookup
>>> >>> to cause a slowdown (since the lookup should be wasted). What I wouldn't
>>> >>> expect is a 45% performance drop. Memory speed should be an order of
>>> >>> magnitude faster than a modern SATA SSD drive (so the overhead should be
>>> >>> closer to negligible).
>>> >>>
>>> >>> Is there any way you could perform the same test but monitor what's
>>> >>> going on with the OSD process using the perf tool? Whatever the default
>>> >>> cpu-time hardware counter is, is fine. Make sure you have the kernel
>>> >>> debug info package installed so you can get symbol information for
>>> >>> kernel and module calls. With any luck the diff of the perf output from
>>> >>> the two runs will show us the culprit.
>>> >>>
>>> >>> Also, can you tell us what OS/kernel version you're using on the OSD
>>> >>> machines?
>>> >>>
>>> >>> - Milosz
>>> >>>
>>> >>> On Tue, Sep 23, 2014 at 2:05 PM, Somnath Roy <[email protected]>
>>> >>> wrote:
>>> >>> > Hi Sage,
>>> >>> > I have created the following setup in order to examine how a single
>>> >>> > OSD behaves if, say, ~80-90% of IOs are hitting the SSDs.
>>> >>> >
>>> >>> > My test includes the following steps.
>>> >>> >
>>> >>> > 1. Created a single OSD cluster.
>>> >>> > 2. Created two rbd images (110GB each) on 2 different pools.
>>> >>> > 3. Populated both images entirely, so my working set is ~210GB. My
>>> >>> > system memory is ~16GB.
>>> >>> > 4. Dropped the page cache before every run.
>>> >>> > 5. Ran fio_rbd (QD 32, 8 instances) in parallel on these two
>>> >>> > images.
>>> >>> >
>>> >>> > Here is my disk iops/bandwidth..
>>> >>> >
>>> >>> > root@emsclient:~/fio_test# fio rad_resd_disk.job
>>> >>> > random-reads: (g=0): rw=randread, bs=4K-4K/4K-4K,
>>> >>> > ioengine=libaio, iodepth=64
>>> >>> > 2.0.8
>>> >>> > Starting 1 process
>>> >>> > Jobs: 1 (f=1): [r] [100.0% done] [154.1M/0K /s] [39.7K/0
>>> >>> > iops] [eta 00m:00s]
>>> >>> > random-reads: (groupid=0, jobs=1): err= 0: pid=1431
>>> >>> > read : io=9316.4MB, bw=158994KB/s, iops=39748 , runt=
>>> >>> > 60002msec
>>> >>> >
>>> >>> > My fio_rbd config..
>>> >>> >
>>> >>> > [global]
>>> >>> > ioengine=rbd
>>> >>> > clientname=admin
>>> >>> > pool=rbd1
>>> >>> > rbdname=ceph_regression_test1
>>> >>> > invalidate=0 # mandatory
>>> >>> > rw=randread
>>> >>> > bs=4k
>>> >>> > direct=1
>>> >>> > time_based
>>> >>> > runtime=2m
>>> >>> > size=109G
>>> >>> > numjobs=8
>>> >>> > [rbd_iodepth32]
>>> >>> > iodepth=32
>>> >>> >
>>> >>> > Now, I have run Giant Ceph on top of that..
>>> >>> >
>>> >>> > 1. OSD config with 25 shards/1 thread per shard :
>>> >>> > -------------------------------------------------------
>>> >>> >
>>> >>> > avg-cpu: %user %nice %system %iowait %steal %idle
>>> >>> > 22.04 0.00 16.46 45.86 0.00 15.64
>>> >>> >
>>> >>> > Device:          rrqm/s   wrqm/s       r/s     w/s      rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>>> >>> > sda                0.00     9.00      0.00    6.00       0.00    92.00    30.67     0.01    1.33    0.00    1.33   1.33   0.80
>>> >>> > sdd                0.00     0.00      0.00    0.00       0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
>>> >>> > sde                0.00     0.00      0.00    0.00       0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
>>> >>> > sdg                0.00     0.00      0.00    0.00       0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
>>> >>> > sdf                0.00     0.00      0.00    0.00       0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
>>> >>> > sdh              181.00     0.00  34961.00    0.00  176740.00     0.00    10.11   102.71    2.92    2.92    0.00   0.03 100.00
>>> >>> > sdc                0.00     0.00      0.00    0.00       0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
>>> >>> > sdb                0.00     0.00      0.00    0.00       0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
>>> >>> >
>>> >>> >
>>> >>> > ceph -s:
>>> >>> > ----------
>>> >>> > root@emsclient:~# ceph -s
>>> >>> > cluster 94991097-7638-4240-b922-f525300a9026
>>> >>> > health HEALTH_OK
>>> >>> > monmap e1: 1 mons at {a=10.196.123.24:6789/0}, election epoch 1,
>>> >>> > quorum 0 a
>>> >>> > osdmap e498: 1 osds: 1 up, 1 in
>>> >>> > pgmap v386366: 832 pgs, 7 pools, 308 GB data, 247 kobjects
>>> >>> > 366 GB used, 1122 GB / 1489 GB avail
>>> >>> > 832 active+clean
>>> >>> > client io 75215 kB/s rd, 18803 op/s
>>> >>> >
>>> >>> > cpu util:
>>> >>> > ----------
>>> >>> > Gradually decreases from ~21 core (serving from cache) to ~10 core
>>> >>> > (while serving from disks).
>>> >>> >
>>> >>> > My Analysis:
>>> >>> > -----------------
>>> >>> > In this case "all is well" while IOs are served from cache
>>> >>> > (XFS is smart enough to cache some data). Once they start hitting
>>> >>> > disks, throughput decreases. As you can see, the disk is delivering
>>> >>> > ~35K IOPS, but OSD throughput is only ~18.8K! So, a cache miss in the
>>> >>> > buffered IO case seems to be very expensive; half of the IOPS are
>>> >>> > wasted. Also, looking at the bandwidth, it is obvious that not
>>> >>> > everything is a 4K read; maybe kernel read_ahead is kicking in (?).
>>> >>> >
>>> >>> >
>>> >>> > Now, I thought of making the Ceph disk reads direct_io and doing the
>>> >>> > same experiment. I have changed FileStore::read to do direct_io only;
>>> >>> > the rest is kept as is. Here is the result with that.
>>> >>> >
>>> >>> >
>>> >>> > Iostat:
>>> >>> > -------
>>> >>> >
>>> >>> > avg-cpu: %user %nice %system %iowait %steal %idle
>>> >>> > 24.77 0.00 19.52 21.36 0.00 34.36
>>> >>> >
>>> >>> > Device:          rrqm/s   wrqm/s       r/s     w/s      rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>>> >>> > sda                0.00     0.00      0.00    0.00       0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
>>> >>> > sdd                0.00     0.00      0.00    0.00       0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
>>> >>> > sde                0.00     0.00      0.00    0.00       0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
>>> >>> > sdg                0.00     0.00      0.00    0.00       0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
>>> >>> > sdf                0.00     0.00      0.00    0.00       0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
>>> >>> > sdh                0.00     0.00  25295.00    0.00  101180.00     0.00     8.00    12.73    0.50    0.50    0.00   0.04 100.80
>>> >>> > sdc                0.00     0.00      0.00    0.00       0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
>>> >>> > sdb                0.00     0.00      0.00    0.00       0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
>>> >>> >
>>> >>> > ceph -s:
>>> >>> > --------
>>> >>> > root@emsclient:~/fio_test# ceph -s
>>> >>> > cluster 94991097-7638-4240-b922-f525300a9026
>>> >>> > health HEALTH_OK
>>> >>> > monmap e1: 1 mons at {a=10.196.123.24:6789/0}, election epoch 1,
>>> >>> > quorum 0 a
>>> >>> > osdmap e522: 1 osds: 1 up, 1 in
>>> >>> > pgmap v386711: 832 pgs, 7 pools, 308 GB data, 247 kobjects
>>> >>> > 366 GB used, 1122 GB / 1489 GB avail
>>> >>> > 832 active+clean
>>> >>> > client io 100 MB/s rd, 25618 op/s
>>> >>> >
>>> >>> > cpu util:
>>> >>> > --------
>>> >>> > ~14 core while serving from disks.
>>> >>> >
>>> >>> > My Analysis:
>>> >>> > ---------------
>>> >>> > No surprises here. The Ceph throughput almost matches whatever the
>>> >>> > disk throughput is.
>>> >>> >
>>> >>> >
>>> >>> > Let's tweak the shard/thread settings and see the impact.
>>> >>> >
>>> >>> >
>>> >>> > 2. OSD config with 36 shards and 1 thread/shard:
>>> >>> > -----------------------------------------------------------
>>> >>> >
>>> >>> > Buffered read:
>>> >>> > ------------------
>>> >>> > No change, output is very similar to 25 shards.
>>> >>> >
>>> >>> >
>>> >>> > direct_io read:
>>> >>> > ------------------
>>> >>> > Iostat:
>>> >>> > ----------
>>> >>> > avg-cpu: %user %nice %system %iowait %steal %idle
>>> >>> > 33.33 0.00 28.22 23.11 0.00 15.34
>>> >>> >
>>> >>> > Device:          rrqm/s   wrqm/s       r/s     w/s      rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>>> >>> > sda                0.00     0.00      0.00    2.00       0.00    12.00    12.00     0.00    0.00    0.00    0.00   0.00   0.00
>>> >>> > sdd                0.00     0.00      0.00    0.00       0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
>>> >>> > sde                0.00     0.00      0.00    0.00       0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
>>> >>> > sdg                0.00     0.00      0.00    0.00       0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
>>> >>> > sdf                0.00     0.00      0.00    0.00       0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
>>> >>> > sdh                0.00     0.00  31987.00    0.00  127948.00     0.00     8.00    18.06    0.56    0.56    0.00   0.03 100.40
>>> >>> > sdc                0.00     0.00      0.00    0.00       0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
>>> >>> > sdb                0.00     0.00      0.00    0.00       0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
>>> >>> >
>>> >>> > ceph -s:
>>> >>> > --------------
>>> >>> > root@emsclient:~/fio_test# ceph -s
>>> >>> > cluster 94991097-7638-4240-b922-f525300a9026
>>> >>> > health HEALTH_OK
>>> >>> > monmap e1: 1 mons at {a=10.196.123.24:6789/0}, election epoch 1,
>>> >>> > quorum 0 a
>>> >>> > osdmap e525: 1 osds: 1 up, 1 in
>>> >>> > pgmap v386746: 832 pgs, 7 pools, 308 GB data, 247 kobjects
>>> >>> > 366 GB used, 1122 GB / 1489 GB avail
>>> >>> > 832 active+clean
>>> >>> > client io 127 MB/s rd, 32763 op/s
>>> >>> >
>>> >>> > cpu util:
>>> >>> > --------------
>>> >>> > ~19 core while serving from disks.
>>> >>> >
>>> >>> > Analysis:
>>> >>> > ------------------
>>> >>> > It is scaling with the increased number of shards/threads. The
>>> >>> > parallelism has also increased significantly.
>>> >>> >
>>> >>> >
>>> >>> > 3. OSD config with 48 shards and 1 thread/shard:
>>> >>> > ----------------------------------------------------------
>>> >>> > Buffered read:
>>> >>> > -------------------
>>> >>> > No change, output is very similar to 25 shards.
>>> >>> >
>>> >>> >
>>> >>> > direct_io read:
>>> >>> > -----------------
>>> >>> > Iostat:
>>> >>> > --------
>>> >>> >
>>> >>> > avg-cpu: %user %nice %system %iowait %steal %idle
>>> >>> > 37.50 0.00 33.72 20.03 0.00 8.75
>>> >>> >
>>> >>> > Device:          rrqm/s   wrqm/s       r/s     w/s      rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>>> >>> > sda                0.00     0.00      0.00    0.00       0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
>>> >>> > sdd                0.00     0.00      0.00    0.00       0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
>>> >>> > sde                0.00     0.00      0.00    0.00       0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
>>> >>> > sdg                0.00     0.00      0.00    0.00       0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
>>> >>> > sdf                0.00     0.00      0.00    0.00       0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
>>> >>> > sdh                0.00     0.00  39114.00    0.00  156460.00     0.00     8.00    35.58    0.90    0.90    0.00   0.03 100.40
>>> >>> > sdc                0.00     0.00      0.00    0.00       0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
>>> >>> > sdb                0.00     0.00      0.00    0.00       0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
>>> >>> >
>>> >>> > ceph -s:
>>> >>> > --------------
>>> >>> > root@emsclient:~/fio_test# ceph -s
>>> >>> > cluster 94991097-7638-4240-b922-f525300a9026
>>> >>> > health HEALTH_OK
>>> >>> > monmap e1: 1 mons at {a=10.196.123.24:6789/0}, election epoch 1,
>>> >>> > quorum 0 a
>>> >>> > osdmap e534: 1 osds: 1 up, 1 in
>>> >>> > pgmap v386830: 832 pgs, 7 pools, 308 GB data, 247 kobjects
>>> >>> > 366 GB used, 1122 GB / 1489 GB avail
>>> >>> > 832 active+clean
>>> >>> > client io 138 MB/s rd, 35582 op/s
>>> >>> >
>>> >>> > cpu util:
>>> >>> > ----------------
>>> >>> > ~22.5 core while serving from disks.
>>> >>> >
>>> >>> > Analysis:
>>> >>> > --------------------
>>> >>> > It is scaling with the increased number of shards/threads. The
>>> >>> > parallelism has also increased significantly.
>>> >>> >
>>> >>> >
>>> >>> >
>>> >>> > 4. OSD config with 64 shards and 1 thread/shard:
>>> >>> > ---------------------------------------------------------
>>> >>> > Buffered read:
>>> >>> > ------------------
>>> >>> > No change, output is very similar to 25 shards.
>>> >>> >
>>> >>> >
>>> >>> > direct_io read:
>>> >>> > -------------------
>>> >>> > Iostat:
>>> >>> > ---------
>>> >>> > avg-cpu: %user %nice %system %iowait %steal %idle
>>> >>> > 40.18 0.00 34.84 19.81 0.00 5.18
>>> >>> >
>>> >>> > Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s
>>> >>> > avgrq-sz avgqu-sz await r_await w_await svctm %util
>>> >>> > sda 0.00 0.00 0.00 0.00 0.00 0.00
>>> >>> > 0.00 0.00 0.00 0.00 0.00 0.00 0.00
>>> >>> > sdd 0.00 0.00 0.00 0.00 0.00 0.00
>>> >>> > 0.00 0.00 0.00 0.00 0.00 0.00 0.00
>>> >>> > sde 0.00 0.00 0.00 0.00 0.00 0.00
>>> >>> > 0.00 0.00 0.00 0.00 0.00 0.00 0.00
>>> >>> > sdg 0.00 0.00 0.00 0.00 0.00 0.00
>>> >>> > 0.00 0.00 0.00 0.00 0.00 0.00 0.00
>>> >>> > sdf 0.00 0.00 0.00 0.00 0.00 0.00
>>> >>> > 0.00 0.00 0.00 0.00 0.00 0.00 0.00
>>> >>> > sdh 0.00 0.00 39114.00 0.00 156460.00 0.00
>>> >>> > 8.00 35.58 0.90 0.90 0.00 0.03 100.40
>>> >>> > sdc 0.00 0.00 0.00 0.00 0.00 0.00
>>> >>> > 0.00 0.00 0.00 0.00 0.00 0.00 0.00
>>> >>> > sdb 0.00 0.00 0.00 0.00 0.00 0.00
>>> >>> > 0.00 0.00 0.00 0.00 0.00 0.00 0.00
>>> >>> >
>>> >>> > ceph -s:
>>> >>> > ---------------
>>> >>> > root@emsclient:~/fio_test# ceph -s
>>> >>> > cluster 94991097-7638-4240-b922-f525300a9026
>>> >>> > health HEALTH_OK
>>> >>> > monmap e1: 1 mons at {a=10.196.123.24:6789/0}, election epoch 1,
>>> >>> > quorum 0 a
>>> >>> > osdmap e537: 1 osds: 1 up, 1 in
>>> >>> > pgmap v386865: 832 pgs, 7 pools, 308 GB data, 247 kobjects
>>> >>> > 366 GB used, 1122 GB / 1489 GB avail
>>> >>> > 832 active+clean
>>> >>> > client io 153 MB/s rd, 39172 op/s
>>> >>> >
>>> >>> > cpu util:
>>> >>> > ----------------
>>> >>> > ~24.5 core while serving from disks. ~3% cpu left.
>>> >>> >
>>> >>> > Analysis:
>>> >>> > ------------------
>>> >>> > It is scaling with the increased number of shards/threads. The
>>> >>> > parallelism has also increased significantly. It is disk bound now.
>>> >>> >
>>> >>> >
>>> >>> > Summary:
>>> >>> >
>>> >>> > So, it seems buffered IO has a significant impact on performance in
>>> >>> > case the backend is SSD.
>>> >>> > My question is: if the workload is very random and the storage (SSD) is
>>> >>> > very large compared to system memory, shouldn't we always go for
>>> >>> > direct_io instead of buffered IO in Ceph?
>>> >>> >
>>> >>> > Please share your thoughts/suggestion on this.
>>> >>> >
>>> >>> > Thanks & Regards
>>> >>> > Somnath
>>> >>> >
>>> >>>
>>> >>>
>>> >>>
>>> >>> --
>>> >>> Milosz Tanski
>>> >>> CTO
>>> >>> 16 East 34th Street, 15th floor
>>> >>> New York, NY 10016
>>> >>>
>>> >>> p: 646-253-9055
>>> >>> e: [email protected]
>>> >
>>> >
>>> >
>>> > --
>>> > Best Regards,
>>> >
>>> > Wheat
>>>
>>>
>>>
>>> --
>>> Best Regards,
>>>
>>> Wheat
>>>
>>>
>
>
>
> --
> Best Regards,
>
> Wheat
--
Best Regards,
Wheat