On Fri, Sep 25, 2020 at 5:15 PM Ashutosh Bapat <ashutosh.bapat....@gmail.com> wrote:
> On Tue, Sep 22, 2020 at 10:57 AM Andy Fan <zhihui.fan1...@gmail.com> wrote:
> >
> > My tools set the random_page_cost to 8.6, but based on the fio data it should be
> > set to 12.3 on the same hardware, and I do see the better plan as well with 12.3.
> > Looks too smooth to believe it is true..
> >
> > The attached result_fio_mytool.tar.gz is my test result. You can use git show HEAD^^
> > to see the original plan with 8.6, git show HEAD^ to see the plan changes after we
> > changed the random_page_cost, and git show HEAD to see the run time statistics
> > changes for these queries. I also uploaded the test tool [1] for your double check.
>
> The scripts seem to start and stop the server and drop caches for every
> query. That's where you are seeing that setting random_page_cost to the
> fio based ratio provides better plans. But in practice, these costs
> need to be set on a server where the queries are run concurrently and
> repeatedly. That's where the caching behaviour plays an important role.
> Can we write a tool which can recommend costs for that scenario?

I totally agree with this. Actually the first thing I did was to define a proper IO workload. At the very beginning I used DIRECT_IO for both sequential read and random read on my SSD, and the result was pretty bad per testing (random_page_cost = ~1.6). Then I realized PostgreSQL relies on prefetch, which is disabled by DIRECT_IO. After I fixed this, I tested again with the above scenario (cache hit ratio = 0) to verify my IO model, and it looks good. I am also wondering whether random_page_cost = 1.1 fails to give a good result on my SSD because it ignores the prefetch of sequential reads.

After I was OK with my IO model, I tested in the way you saw above, but I also measure the latency of a file system cache hit, which is handled by get_fs_cache_latency_us in my code (I ignore shared buffer hits for now), and I let the user provide a cache_hit_ratio. The final random_page_cost = real_random_lat / real_seq_lat, where real_xxx_lat = cache_hit_ratio * fs_cache_lat + (1 - cache_hit_ratio) * xxx_lat. See the functions cal_real_lat and cal_random_page_cost; a small sketch of this calculation is at the end of this mail.

As for testing with the cache considered, I found it hard to estimate the cache hit ratio, and also hard to control a hit ratio for testing. Recently I have been thinking about a method where we get page_reads and shared_buffer_hit from PostgreSQL and the real IO (excluding file system cache hits) at the OS level (just as iotop/pidstat do). Then we would know the shared buffer hit ratio and the file system cache hit ratio (assuming they become stable after a long run) and could run the test with those; a rough sketch of the PostgreSQL half of that is also at the end of this mail. However this needs another round of manual work and I still have not got it done.

I'd rather not share too many details, but "luckily" many of my cases have no file system cache, which makes things a bit easier. What I am doing right now is to calculate the random_page_cost with the above algorithm with only the shared buffer considered, and to test the real benefit with a real workload to see how it works. If that works well, I think the only thing left is to handle the file system cache. The testing is time consuming since I have to cooperate with many site engineers, so any improvement on the design would be very helpful.

> How do the fio based costs perform when the queries are run repeatedly?

That probably would not be good, since I have 280G+ of file system cache and I would have to prepare much more than 280G of data for the test.

--
Best Regards
Andy Fan
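
Here is the sketch of the cost calculation I mentioned above. It only shows the formula; the real tool measures seq_lat, random_lat and fs_cache_lat (with fio and get_fs_cache_latency_us) rather than taking them as arguments, and the numbers below are made up just to illustrate:

def cal_real_lat(cache_hit_ratio, fs_cache_lat, device_lat):
    # Blend the file system cache latency with the device latency
    # according to the assumed cache hit ratio.
    return cache_hit_ratio * fs_cache_lat + (1 - cache_hit_ratio) * device_lat

def cal_random_page_cost(cache_hit_ratio, fs_cache_lat, seq_lat, random_lat):
    # random_page_cost = real_random_lat / real_seq_lat
    real_seq_lat = cal_real_lat(cache_hit_ratio, fs_cache_lat, seq_lat)
    real_random_lat = cal_real_lat(cache_hit_ratio, fs_cache_lat, random_lat)
    return real_random_lat / real_seq_lat

# With cache_hit_ratio = 0 this degenerates to random_lat / seq_lat.
# Made-up latencies in microseconds:
print(cal_random_page_cost(0.0, fs_cache_lat=10, seq_lat=80, random_lat=984))  # ~12.3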
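
And the rough sketch for reading the shared buffer counters from PostgreSQL itself. It assumes psycopg2 and a working local connection, and it only gives the shared buffer hit ratio; the file system cache hit ratio still has to come from OS level counters (pidstat/iotop), which is the part I have not automated yet:

import psycopg2

# blks_hit  : reads satisfied from shared buffers
# blks_read : reads handed to the kernel (real disk IO or file system cache)
conn = psycopg2.connect("dbname=postgres")
cur = conn.cursor()
cur.execute("SELECT sum(blks_hit), sum(blks_read) FROM pg_stat_database")
blks_hit, blks_read = cur.fetchone()
shared_buffer_hit_ratio = blks_hit / (blks_hit + blks_read)
print("shared buffer hit ratio:", shared_buffer_hit_ratio)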