Hi Tomas:
   Thanks for checking.

On Fri, Sep 18, 2020 at 9:50 PM Tomas Vondra <tomas.von...@2ndquadrant.com>
wrote:

> >I recently tried something in this direction and the result looks
> >promising based on my limited test.
> >
> >Since the unit of every xxx_cost is "seq_page_cost", how to detect
> >seq_page_cost is important. In the cost model, the IO cost of a seqscan is
> >rel->pages * seq_page_cost; it doesn't consider any cache (file system
> >cache or shared buffer cache). However, it assumes the OS will prefetch
> >the IO. So to detect the seq_page_cost, I enabled the prefetch but avoided
> >the file system cache. I tested this by 1) dropping the file system cache
> >and 2) opening the test file without O_DIRECT so that prefetch can work.
> >
> >To detect the random page read, I read the file with pread at random
> >offsets. Since the random offsets may repeat during the test, dropping the
> >file system cache only at the beginning is not enough, so I open the file
> >with the O_DIRECT option.
> >
> >I also measured the cost of reading a page from the file system cache;
> >during my test, it is about 10% of a sequential read from disk.
> >
> >After getting these basic numbers about the hardware capability, I let the
> >user provide a cache hit ratio (this is a place where we can further
> >improve if this is the right direction).
> >
> >Here is the test result on my hardware.
> >
> >fs_cache_lat = 0.832025us, seq_read_lat = 8.570290us, random_page_lat = 73.987732us
> >
> >cache hit ratio: 1.000000 random_page_cost 1.000000
> >cache hit ratio: 0.900000 random_page_cost 5.073692
> >cache hit ratio: 0.500000 random_page_cost 7.957589
> >cache hit ratio: 0.100000 random_page_cost 8.551591
> >cache hit ratio: 0.000000 random_page_cost 8.633049
> >
> >
> >Then I tested the suggested value with the 10GB TPCH workload. I compared
> >the plans with 2 different settings of random_page_cost: 1) 4, the default
> >value, and 2) 8.6, the value suggested for a cache hit ratio of 0. 11 out
> >of the 22 queries generated a different plan. I then dropped the cache
> >(both the file system cache and shared_buffers) before running each query
> >and ran those 11 queries under the 2 different settings. The execution
> >times are below.
> >
> >
> >|     | random_page_cost=4 | random_page_cost=8.6 |
> >|-----+--------------------+----------------------|
> >| Q1  |           1425.964 |             1121.928 |
> >| Q2  |           2553.072 |             2567.450 |
> >| Q5  |           4397.514 |             1475.343 |
> >| Q6  |          12576.985 |             4622.503 |
> >| Q7  |           3459.777 |             2987.241 |
> >| Q8  |           8360.995 |             8415.311 |
> >| Q9  |           4661.842 |             2930.370 |
> >| Q11 |           4885.289 |             2348.541 |
> >| Q13 |           2610.937 |             1497.776 |
> >| Q20 |          13218.122 |            10985.738 |
> >| Q21 |            264.639 |              262.350 |
> >
> >
> >The attached main.c is the program I used to detect the random_page_cost.
> >result.tar.gz is the test result; you can run git log first to see the
> >differences in the plans or execution stats.
> >
> >Any feedback is welcome. Thanks!
> >
>
> That seems pretty neat. What kind of hardware have you done these tests
> on?


The following is my hardware info.

I have 12 SSDs behind an MR9271-8i RAID controller, which has a 1GB buffer. [1]

root#  lshw -short -C disk
H/W path            Device          Class          Description
==============================================================
/0/100/2/0/2.0.0    /dev/sda        disk           2398GB MR9271-8i
/0/100/2/0/2.1.0    /dev/sdb        disk           5597GB MR9271-8i <-- my data location


/opt/MegaRAID/MegaCli/MegaCli64 -AdpAllInfo -aALL

Adapter #0

Memory Size      : 1024MB
RAID Level          : Primary-5, Secondary-0, RAID Level Qualifier-3
..
Current Cache Policy: WriteBack, ReadAheadNone, Direct, Write Cache OK if Bad BBU
...
                Device Present
                ================
Virtual Drives    : 2
  Degraded        : 0
  Offline         : 0
Physical Devices  : 14
  Disks           : 12
  Critical Disks  : 0
  Failed Disks    : 0


root# /opt/MegaRAID/MegaCli/MegaCli64  -LdPdInfo -a0 | egrep 'Media Type|Raw Size'
Raw Size: 745.211 GB [0x5d26ceb0 Sectors]
Media Type: Solid State Device
Raw Size: 745.211 GB [0x5d26ceb0 Sectors]
Media Type: Solid State Device
Raw Size: 745.211 GB [0x5d26ceb0 Sectors]
Media Type: Solid State Device
Raw Size: 745.211 GB [0x5d26ceb0 Sectors]
Media Type: Solid State Device
Raw Size: 745.211 GB [0x5d26ceb0 Sectors]
Media Type: Solid State Device
Raw Size: 745.211 GB [0x5d26ceb0 Sectors]
Media Type: Solid State Device
Raw Size: 745.211 GB [0x5d26ceb0 Sectors]
Media Type: Solid State Device
Raw Size: 745.211 GB [0x5d26ceb0 Sectors]
Media Type: Solid State Device
Raw Size: 745.211 GB [0x5d26ceb0 Sectors]
Media Type: Solid State Device
Raw Size: 745.211 GB [0x5d26ceb0 Sectors]
Media Type: Solid State Device
Raw Size: 745.211 GB [0x5d26ceb0 Sectors]
Media Type: Solid State Device
Raw Size: 745.211 GB [0x5d26ceb0 Sectors]
Media Type: Solid State Device

CPU: Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz, 32 processors.
Memory: 251 GB
Linux: 3.10.0-327
fs: ext4. mount options: defaults,noatime,nodiratime,nodelalloc,barrier=0
Physical machine.

> It's probably worth testing on various other storage systems to see
> how that applies to those.
>

Yes, I can test more on new hardware once I get it; that is still in
progress. However, right now I can only get a physical machine with SSDs or
a virtual machine with SSDs; other storage types are hard for me to access.


> Have you tried existing I/O testing tools, e.g. fio? If your idea is to
> propose some built-in tool (similar to pg_test_fsync) then we probably
> should not rely on external tools, but I wonder if we're getting the
> same numbers.
>

Thanks for the hint; I found more interesting things during the comparison.

I defined the fio jobs as below.

random_page_cost.job:
[global]
blocksize=8k
size=1Gi
filesize=1Gi
ioengine=sync
directory=/u01/yizhi/data/fio

[random_page_cost]
direct=1
readwrite=randread


Even though this is direct IO, the device cache still plays an important
part: it gets filled while the test data file is being prepared. I
invalidated the device cache by writing a new dummy file. With that, the
average latency is 148 us.


seq.job

[global]
blocksize=8k
size=1Gi
filesize=1Gi
ioengine=sync
directory=/u01/yizhi/data/fio

[seq_page_cost]
buffered=1
readwrite=read

For the sequential read we need buffered IO so that prefetching works, but
we also need to bypass the file system cache and the device cache. fio has
no control over those caches, so I did the following:

1). Run fio to generate the test file.
2). Invalidate the device cache first with dd if=/dev/zero of=a_dummy_file bs=1048576 count=1024.
3). Drop the file system cache (see the sketch after this list).
4). Run fio again.
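
For step 3, dropping the file system cache on Linux amounts to sync plus
writing 3 to /proc/sys/vm/drop_caches; a minimal sketch in C, assuming Linux
and root privileges (this is not part of the attached main.c):

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    sync();   /* flush dirty pages before dropping the caches */

    FILE *f = fopen("/proc/sys/vm/drop_caches", "w");
    if (f == NULL)
    {
        perror("open /proc/sys/vm/drop_caches");
        return 1;
    }
    fputs("3", f);   /* 3 = free the page cache plus dentries and inodes */
    fclose(f);
    return 0;
}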

The final avg latency is ~12 us.

This is a 1.5x ~ 2x difference from my previous result (seq_read_lat =
8.570290us, random_page_lat = 73.987732us).

Here are the results with some changes to my detection program.

|                                                 | seq_read_lat (us) | random_read_lat (us) |
|-------------------------------------------------+-------------------+----------------------|
| FIO                                             |                12 |                  148 |
| Previous main.c                                 |               8.5 |                   74 |
| invalidate_device_cache before each testing     |                 9 |                  150 |
| prepare the test data file with O_DIRECT option |                15 |                  150 |

In invalidate_device_cache, I just create another 1GB data file and read it
(see the invalidate_device_cache function); this is similar to the previous
fio setup.
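
A minimal sketch of what such an invalidate_device_cache function can look
like (an illustration of the idea only, not the exact code in the attached
main.c; the file name and chunk size are arbitrary):

#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define DUMMY_FILE "dummy_invalidate_cache.dat"
#define DUMMY_SIZE (1024L * 1024 * 1024)   /* 1GB, the size of the controller cache */
#define CHUNK_SIZE (1024 * 1024)

/*
 * Push ~1GB of unrelated data through the RAID controller so that the
 * blocks of the real test file are evicted from the controller's cache.
 */
void invalidate_device_cache(void)
{
    char *buf = malloc(CHUNK_SIZE);
    int   fd;
    long  n;

    memset(buf, 'x', CHUNK_SIZE);

    /* write the dummy file ... */
    fd = open(DUMMY_FILE, O_CREAT | O_WRONLY | O_TRUNC, 0600);
    for (n = 0; n < DUMMY_SIZE; n += CHUNK_SIZE)
        if (write(fd, buf, CHUNK_SIZE) != CHUNK_SIZE)
            break;
    fsync(fd);
    close(fd);

    /* ... and read it back, as described above */
    fd = open(DUMMY_FILE, O_RDONLY);
    while (read(fd, buf, CHUNK_SIZE) > 0)
        ;
    close(fd);

    free(buf);
    unlink(DUMMY_FILE);
}

int main(void)
{
    invalidate_device_cache();
    return 0;
}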

"Prepare the test data file with O_DIRECT option" means the following:
previously I prepared the test file with buffered IO, and before testing I
invalidated the device cache and the file system cache, yet the
buffered-prepared file still got better performance. I have no explanation
for that. Since I don't want any cache, I now prepare the file with the
O_DIRECT option, and the seq_read_lat changed from 9us to 15us. I still
can't explain the ~25% difference from the FIO result (12 us vs 9 us).
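
For completeness, preparing the file with O_DIRECT just means writing it
through a suitably aligned buffer with the O_DIRECT flag set; a rough sketch
(the file name, sizes, and alignment are placeholders, error handling
trimmed):

#define _GNU_SOURCE             /* for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define TEST_FILE  "test_data.dat"
#define FILE_SIZE  (1024L * 1024 * 1024)   /* 1GB test file */
#define BLOCK_SIZE 8192                    /* one PostgreSQL-sized page */

int main(void)
{
    void *buf;
    int   fd;
    long  n;

    /* O_DIRECT requires the IO buffer to be aligned to the logical block
     * size; posix_memalign takes care of that. */
    if (posix_memalign(&buf, 4096, BLOCK_SIZE) != 0)
        return 1;
    memset(buf, 'x', BLOCK_SIZE);

    fd = open(TEST_FILE, O_CREAT | O_WRONLY | O_TRUNC | O_DIRECT, 0600);
    if (fd < 0)
        return 1;
    for (n = 0; n < FILE_SIZE; n += BLOCK_SIZE)
        if (write(fd, buf, BLOCK_SIZE) != BLOCK_SIZE)
            break;
    fsync(fd);
    close(fd);
    free(buf);
    return 0;
}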

In the end, the suggested random_page_cost happens to not change very much.

/u/y/g/fdirect> sudo ./main
fs_cache_lat = 0.569031us, seq_read_lat = 18.901749us, random_page_lat = 148.650589us

cache hit ratio: 1.000000 random_page_cost 1.000000
cache hit ratio: 0.900000 random_page_cost 6.401019
cache hit ratio: 0.500000 random_page_cost 7.663772
cache hit ratio: 0.100000 random_page_cost 7.841498
cache hit ratio: 0.000000 random_page_cost 7.864383
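
For reference, the suggested values above follow from the three measured
latencies by blending both the random and the sequential latency with the
assumed cache hit ratio. A sketch of that calculation, reconstructed from
the printed numbers (it may differ from the exact code in the attached
program):

#include <stdio.h>

/*
 * Blend the cached and uncached latencies for both the "random" and the
 * "sequential" page according to the assumed hit ratio, then take the
 * ratio of the two; at a 100% hit ratio this is exactly 1.0.
 */
double suggested_random_page_cost(double hit_ratio, double fs_cache_lat,
                                  double seq_read_lat, double random_page_lat)
{
    double random_lat = hit_ratio * fs_cache_lat + (1.0 - hit_ratio) * random_page_lat;
    double seq_lat    = hit_ratio * fs_cache_lat + (1.0 - hit_ratio) * seq_read_lat;

    return random_lat / seq_lat;
}

int main(void)
{
    /* with the latencies above, a 0.9 hit ratio gives ~6.40 as printed */
    printf("%f\n", suggested_random_page_cost(0.9, 0.569031, 18.901749, 148.650589));
    return 0;
}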

This result looks very different from the common advice of "use 1.1 ~ 1.5 for SSD".

Attached is the modified detection program.


[1]
https://www.cdw.com/product/lsi-megaraid-sas-9271-8i-storage-controller-raid-sas-pcie-3.0-x8/4576538#PO

-- 
Best Regards
Andy Fan

Attachment: main.c
Description: Binary data
