Hi Tomas,

Thanks for checking.

On Fri, Sep 18, 2020 at 9:50 PM Tomas Vondra <tomas.von...@2ndquadrant.com>
wrote:
> > I recently tried something in this direction and the result looks
> > promising based on my limited test.
> >
> > Since the unit of a xxx_cost is "seq_page_cost", how to detect
> > seq_page_cost is important. In the cost model, the IO cost of a seqscan
> > is rel->pages * seq_page_cost; it doesn't consider any cache (file
> > system cache or shared buffer cache), but it does assume the OS will
> > prefetch the IO. So to detect the seq_page_cost, I enabled the prefetch
> > but avoided the file system cache. I tested this with 1). dropping the
> > cache on the file system, and 2). opening the test file without
> > O_DIRECT so that the prefetch can work.
> >
> > To detect the random page read, I read with pread at a random offset.
> > Since the random offsets may repeat during the test, dropping the file
> > system cache only at the beginning is not enough, so I open the file
> > with the O_DIRECT option.
> >
> > I also measured the cost of reading a page from the file system cache;
> > during my test it is about 10% of a sequential read.
> >
> > After I get the basic numbers about the hardware capability, I let the
> > user provide a cache hit ratio (this is a place where we can further
> > improve if this is the right direction).
> >
> > Here is the test result on my hardware.
> >
> > fs_cache_lat = 0.832025us, seq_read_lat = 8.570290us, random_page_lat = 73.987732us
> >
> > cache hit ratio: 1.000000 random_page_cost 1.000000
> > cache hit ratio: 0.900000 random_page_cost 5.073692
> > cache hit ratio: 0.500000 random_page_cost 7.957589
> > cache hit ratio: 0.100000 random_page_cost 8.551591
> > cache hit ratio: 0.000000 random_page_cost 8.633049
> >
> > Then I tested the suggested value with the 10GB TPCH workload. I
> > compared the plans under 2 different settings of random_page_cost:
> > 1). 4, the default value, and 2). 8.6, the value for cache hit ratio
> > = 0. 11 out of the 22 queries generated a different plan. I then
> > dropped the cache (both the file system cache and shared_buffers)
> > before running each query and ran the 11 queries under the 2 settings.
> > The execution time is below.
> >
> > |     | random_page_cost=4 | random_page_cost=8.6 |
> > |-----+--------------------+----------------------|
> > | Q1  |           1425.964 |             1121.928 |
> > | Q2  |           2553.072 |             2567.450 |
> > | Q5  |           4397.514 |             1475.343 |
> > | Q6  |          12576.985 |             4622.503 |
> > | Q7  |           3459.777 |             2987.241 |
> > | Q8  |           8360.995 |             8415.311 |
> > | Q9  |           4661.842 |             2930.370 |
> > | Q11 |           4885.289 |             2348.541 |
> > | Q13 |           2610.937 |             1497.776 |
> > | Q20 |          13218.122 |            10985.738 |
> > | Q21 |            264.639 |              262.350 |
> >
> > The attached main.c is the program I used to detect the
> > random_page_cost. result.tar.gz is the test result; you can run a git
> > log first to see the differences in plans and execution stats.
> >
> > Any feedback is welcome. Thanks!
> >
>
> That seems pretty neat. What kind of hardware have you done these tests
> on?

The following is my hardware info. I have 12 SSDs behind an MR9271-8i RAID
controller which has a 1GB buffer. [1]

root# lshw -short -C disk
H/W path             Device      Class   Description
==============================================================
/0/100/2/0/2.0.0     /dev/sda    disk    2398GB MR9271-8i
/0/100/2/0/2.1.0     /dev/sdb    disk    5597GB MR9271-8i  <-- my data location

/opt/MegaRAID/MegaCli/MegaCli64 -AdpAllInfo -aALL

Adapter #0
Memory Size          : 1024MB
RAID Level           : Primary-5, Secondary-0, RAID Level Qualifier-3
..
Current Cache Policy : WriteBack, ReadAheadNone, Direct, Write Cache OK if Bad BBU
...
Device Present
================
Virtual Drives    : 2
  Degraded        : 0
  Offline         : 0
Physical Devices  : 14
  Disks           : 12
  Critical Disks  : 0
  Failed Disks    : 0

root# /opt/MegaRAID/MegaCli/MegaCli64 -LdPdInfo -a0 | egrep 'Media Type|Raw Size'
Raw Size: 745.211 GB [0x5d26ceb0 Sectors]
Media Type: Solid State Device
(the same two lines repeat for each of the 12 disks)

CPU: Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz, 32 processors.
Memory: 251 GB
Linux: 3.10.0-327
fs: ext4. mount options: defaults,noatime,nodiratime,nodelalloc,barrier=0
Physical machine.

> It's probably worth testing on various other storage systems to see
> how that applies to those.

Yes, I can test more on new hardware once I get it; that is still in
progress. However, right now I can only get a physical machine with SSD
or a virtual machine with SSD; other storage types are hard for me to get.

> Have you tried existing I/O testing tools, e.g. fio? If your idea is to
> propose some built-in tool (similar to pg_test_fsync) then we probably
> should not rely on external tools, but I wonder if we're getting the
> same numbers.

Thanks for this hint, I found more interesting stuff during the comparison.

I defined the FIO jobs as below.

random_page_cost.job:

[global]
blocksize=8k
size=1Gi
filesize=1Gi
ioengine=sync
directory=/u01/yizhi/data/fio

[random_page_cost]
direct=1
readwrite=randread

Even though this is direct IO, the device cache still plays an important
part: it gets filled while preparing the test data file. I invalidate the
device cache by writing a new dummy file. With that, the avg latency is
148 us.

seq.job:

[global]
blocksize=8k
size=1Gi
filesize=1Gi
ioengine=sync
directory=/u01/yizhi/data/fio

[seq_page_cost]
buffered=1
readwrite=read

For the sequential read we need buffered IO so that prefetch works, but we
also need to bypass the file system cache and the device cache. fio has no
control over those caches, so I did the following (steps 2 and 3 are
sketched in C right after the list):

1). Run fio to generate the test file.
2). Invalidate the device cache with
    dd if=/dev/zero of=a_dummy_file bs=1048576 count=1024.
3). Drop the file system cache.
4). Run fio again.
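To illustrate, steps 2). and 3). can also be done from C; here is a rough
sketch of the same idea (it assumes Linux and root, and the dummy file
path and the 1GB size are arbitrary choices, not taken from the attached
program):

#include <fcntl.h>
#include <unistd.h>

/*
 * Invalidate caches before a measurement: first write 1GB of new data so
 * it passes through (and displaces) the RAID controller's 1GB cache, then
 * ask the kernel to drop the page cache.  Returns 0 on success; writing
 * /proc/sys/vm/drop_caches needs root.
 */
int
invalidate_caches(const char *dummy_path)
{
	static char buf[1024 * 1024];	/* 1MB of zeroes (static => zero-filled) */
	int			fd;

	fd = open(dummy_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0)
		return -1;
	for (int i = 0; i < 1024; i++)	/* 1024 x 1MB = 1GB */
		if (write(fd, buf, sizeof(buf)) != (ssize_t) sizeof(buf))
			return -1;
	fsync(fd);
	close(fd);

	sync();						/* flush dirty pages before dropping the cache */
	fd = open("/proc/sys/vm/drop_caches", O_WRONLY);
	if (fd < 0)
		return -1;
	if (write(fd, "3", 1) != 1)
		return -1;
	close(fd);
	return 0;
}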
The final avg latency is ~12 us. That is a 1.5 ~ 2x difference from my
previous result (seq_read_lat = 8.570290us, random_page_lat = 73.987732us).

Here are the numbers after some changes to my detection program:

|                                                  | seq_read_lat (us) | random_read_lat (us) |
|--------------------------------------------------+-------------------+----------------------|
| FIO                                              |                12 |                  148 |
| Previous main.c                                  |               8.5 |                   74 |
| invalidate_device_cache before each test         |                 9 |                  150 |
| prepare the test data file with O_DIRECT option  |                15 |                  150 |

In invalidate_device_cache, I just create another 1GB data file and read
it (see the invalidate_device_cache function); this is similar to the fio
setup above.

"Prepare the test data file with O_DIRECT option" means that previously I
prepared the test file with buffered IO. Even though I invalidated the
device cache and the file system cache before testing, the
buffered-prepared file still got better performance, and I have no
explanation for that. Since I don't want any cache at all, I use the
O_DIRECT option at last; the seq_read_lat changed from 9us to 15us.

I still can't find the reason for the ~25% difference from the FIO result
(12 us vs 9 us).

In the end, the computed random_page_cost happens not to change very much:

/u/y/g/fdirect> sudo ./main
fs_cache_lat = 0.569031us, seq_read_lat = 18.901749us, random_page_lat = 148.650589us

cache hit ratio: 1.000000 random_page_cost 1.000000
cache hit ratio: 0.900000 random_page_cost 6.401019
cache hit ratio: 0.500000 random_page_cost 7.663772
cache hit ratio: 0.100000 random_page_cost 7.841498
cache hit ratio: 0.000000 random_page_cost 7.864383

This result looks much different from "we should use 1.1 ~ 1.5 for SSD".
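To spell out how these random_page_cost numbers follow from the three
measured latencies and the supplied cache hit ratio, here is a simplified
sketch of the calculation (the hit ratio blends the cached and uncached
latency for both access patterns, and seq_page_cost stays at 1.0; this is
only an illustration, not the exact code of main.c):

#include <stdio.h>

/*
 * With cache hit ratio h, both a random page and a sequential page are
 * assumed to be served from the file system cache with probability h and
 * to pay the measured uncached latency otherwise; random_page_cost is the
 * ratio of the two expected latencies.
 */
double
estimated_random_page_cost(double fs_cache_lat, double seq_read_lat,
						   double random_page_lat, double h)
{
	double		rand_lat = h * fs_cache_lat + (1.0 - h) * random_page_lat;
	double		seq_lat = h * fs_cache_lat + (1.0 - h) * seq_read_lat;

	return rand_lat / seq_lat;
}

int
main(void)
{
	double		hits[] = {1.0, 0.9, 0.5, 0.1, 0.0};

	/* reproduces the numbers printed by ./main above */
	for (int i = 0; i < 5; i++)
		printf("cache hit ratio: %f random_page_cost %f\n", hits[i],
			   estimated_random_page_cost(0.569031, 18.901749,
										  148.650589, hits[i]));
	return 0;
}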
The attached is the modified detection program.
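For anyone who does not want to open the attachment, the random-read part
of the measurement described earlier (pread at random offsets on a file
opened with O_DIRECT) has roughly this shape; the file path, file size,
and iteration count below are illustrative only, not the ones used in
main.c:

#define _GNU_SOURCE				/* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define BLCKSZ 8192

int
main(void)
{
	/* illustrative path and sizes; O_DIRECT requires an aligned buffer */
	const char *path = "/u01/yizhi/data/fio/test_file";
	long		npages = (1024L * 1024 * 1024) / BLCKSZ;	/* 1GB test file */
	int			iterations = 10000;
	int			fd = open(path, O_RDONLY | O_DIRECT);
	char	   *buf;
	struct timespec start, end;

	if (fd < 0 || posix_memalign((void **) &buf, 4096, BLCKSZ) != 0)
		return 1;

	clock_gettime(CLOCK_MONOTONIC, &start);
	for (int i = 0; i < iterations; i++)
	{
		off_t		offset = (off_t) (random() % npages) * BLCKSZ;

		if (pread(fd, buf, BLCKSZ, offset) != BLCKSZ)
			return 1;
	}
	clock_gettime(CLOCK_MONOTONIC, &end);

	printf("random_page_lat = %fus\n",
		   ((end.tv_sec - start.tv_sec) * 1e6 +
			(end.tv_nsec - start.tv_nsec) / 1e3) / iterations);

	close(fd);
	return 0;
}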
[1] https://www.cdw.com/product/lsi-megaraid-sas-9271-8i-storage-controller-raid-sas-pcie-3.0-x8/4576538#PO

-- 
Best Regards
Andy Fan