On 2/15/26 09:08, Alexandre Felipe wrote:
> Another quick response, I am impressed
> 
>     I'm quite confused by the scripts you shared, it seems incomplete. The
>     run_regression.py is meant to call purge_cache.sh (which is missing),
>     and the run_benchmark tries to call all sorts of missing .sql scripts.
> 
> 
> That was a mix-up on my part: I uploaded a run_benchmark that was just
> a nonsensical AI-generated script from my work area, and I didn't
> upload purge_cache.sh. Note that you can still run the Python script
> with evict mode off,pg. The missing script is attached to my previous
> message. 
>  

The purge_cache.sh script does not use sudo anywhere, so on Linux it has
no effect at all. I have no idea whether "purge" works like this on
macOS.
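
Since the script wasn't posted, here is a hypothetical sketch of what it
needs to do; the helper name and the structure are my invention, but the
underlying commands (drop_caches on Linux, purge on macOS) are the usual
way to evict the OS page cache, and both require root:

```python
# Hypothetical replacement for the missing purge_cache.sh logic, as a
# Python helper that run_regression.py could call via subprocess.
import platform

def purge_command():
    """Return the OS page-cache eviction command for the current platform."""
    system = platform.system()
    if system == "Linux":
        # Writing to drop_caches requires root; without sudo the write
        # silently fails and the benchmark runs against a warm cache.
        return ["sudo", "sh", "-c", "sync; echo 3 > /proc/sys/vm/drop_caches"]
    if system == "Darwin":
        # macOS ships a dedicated tool, which also needs root privileges.
        return ["sudo", "purge"]
    raise NotImplementedError(f"no cache purge known for {system}")
```

The caller would then do subprocess.run(purge_command(), check=True)
before each cold run.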

> 
>     A table that is just 24MB and fits into buffers is a bit useless. It
>     means that even with random pattern (which is generally about the best
>     for prefetching), only about ~1/30 of pages will require I/O. Each page
>     has ~32 items, but only the first item from each page will incur an
>     I/O. 
> 
> 
> OK, I will start it with different parameters, for development I needed
> something that was not too slow, to be able to catch the bugs.
> 
>     On what kind of hardware? How much variance is in the results?
> 
> It is a Mac mini M1. Please check the .PNG; it shows the confidence
> intervals of the difference in execution time with the parameter ON
> and OFF, under different settings.
> 

I don't have access to an M1 machine (and the message also does not say
what type of storage it is using, which seems pretty important for a
patch aiming to improve I/O behavior). But I tried running this on my
Ryzen machine with local SSDs (in RAID0), and with the 100k rows (and
fixed handling of the page cache) I get this:

  column_name io_method  evict  n  master_ms  off_ms  on_ms  effect_pct
     periodic    worker    off 10       35.8    35.1   36.5         2.0
     periodic    worker     os 10       49.4    49.9   58.8         8.1
     periodic    worker     pg 10       39.5    39.9   47.1         8.3
       random    worker    off 10       35.9    35.6   35.7         0.2
       random    worker     os 10       49.0    49.0   42.6        -7.0
       random    worker     pg 10       39.6    39.9   40.9         1.2
   sequential    worker    off 10       28.2    27.9   27.7        -0.4
   sequential    worker     os 10       39.3    39.2   34.8        -6.0
   sequential    worker     pg 10       30.1    30.1   29.4        -1.3

  column_name io_method  evict  n  master_ms  off_ms  on_ms  effect_pct
     periodic  io_uring    off 10       35.9    35.8   35.8        -0.1
     periodic  io_uring     os 10       49.3    49.9   50.0         0.1
     periodic  io_uring     pg 10       40.1    39.8   41.7         2.4
       random  io_uring    off 10       35.6    35.2   35.7         0.8
       random  io_uring     os 10       49.1    48.9   46.1        -3.0
       random  io_uring     pg 10       39.8    40.1   42.6         3.1
   sequential  io_uring    off 10       28.0    27.8   28.0         0.4
   sequential  io_uring     os 10       39.8    39.1   40.7         1.9
   sequential  io_uring     pg 10       30.2    30.0   29.6        -0.8

This is on default config with io_workers=12 and data_checksums=off. I'm
not showing results for parallel query, because it's irrelevant.

This also has timings for master, for both worker and io_uring (which
you could not get on the M1, at least not on macOS). For "worker" the
differences are much smaller (within 10% in the worst case), and almost
non-existent for io_uring. Which suggests this is likely due to the
"signal" overhead associated with worker, which can be annoying for
certain data patterns (where we end up issuing an I/O for individual
blocks at distance 1).

That is not something the index prefetching can "fix"; it affects all
places using the read stream facility (though it may be true that index
scans are more likely to generate such "poor" data patterns). The above
is about the best we can do right now. In the future we might tweak the
read stream heuristics, or invent a special mode for index scans, not
sure.
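
The shape of that overhead can be illustrated with a toy cost model (all
constants below are invented for illustration, not measurements of the
actual worker signalling cost):

```python
# Toy model: each I/O request pays the device latency, plus (for the
# "worker" method) a fixed IPC/signalling cost. A distance-1 pattern
# issues one request per block, so it pays that overhead the maximum
# number of times; batching amortizes it away.

IO_LATENCY_US = 80     # assumed latency of one 8kB read (made up)
SIGNAL_COST_US = 10    # assumed per-request signalling cost (made up)

def scan_cost_us(n_blocks, blocks_per_request, signal_cost_us=0):
    n_requests = -(-n_blocks // blocks_per_request)  # ceil division
    return n_requests * (IO_LATENCY_US + signal_cost_us)

blocks = 10_000
worker_distance1 = scan_cost_us(blocks, 1, SIGNAL_COST_US)
worker_batched = scan_cost_us(blocks, 16, SIGNAL_COST_US)
uring_distance1 = scan_cost_us(blocks, 1)

# With these made-up numbers the signalling adds 10/80 = ~12% to a
# distance-1 scan, the same ballpark as the worst-case gap above.
```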

As already suggested, the whole point of index prefetching is improving
cases where we have to do a fair amount of I/O. And the 100k rows is far
too small for that - 24MB can get easily cached in various places
(shared buffers, kernel, the drive itself, ...). The best we can hope
for in this case is not causing regressions. Which I'd say we're mostly
achieving, per the results I just shared.
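
For what it's worth, the "only the first item from each page will incur
an I/O" point from the quoted text is simple arithmetic (the 24MB and
~32 items-per-page figures are taken from the thread):

```python
# Back-of-the-envelope check of why the 100k-row table sees little I/O:
# once the table fits in memory, each heap page faults at most once, so
# under random access only the first item read from a page costs an I/O.

table_bytes = 24 * 1024 * 1024   # ~24MB table, per the thread
page_bytes = 8192                # PostgreSQL block size
items_per_page = 32              # ~32 heap items per page, per the thread

pages = table_bytes // page_bytes      # number of heap pages
items = pages * items_per_page         # total items
accesses_per_io = items // pages       # at most one fault per page

print(f"{pages} pages, {items} items, I/O on ~1/{accesses_per_io} of accesses")
```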

But let's look at a larger data set, with 1M rows and fillfactor=10.
That inflates the table to 2.6GB, which is a bit more sensible.
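
The 2.6GB figure is consistent with the page math (row width inferred
from the earlier 24MB / 100k-row numbers, so treat this as a rough
estimate):

```python
# Estimate the heap size of the 1M-row, fillfactor=10 table. With
# fillfactor=100 the thread's numbers imply ~32 items per 8kB page;
# fillfactor=10 leaves only ~10% of each page for rows.

page_bytes = 8192
items_per_page_ff100 = 32
fillfactor = 0.10

items_per_page = int(items_per_page_ff100 * fillfactor)  # ~3 rows/page
pages = -(-1_000_000 // items_per_page)                  # ceil division
size_bytes = pages * page_bytes

print(f"{pages} pages, ~{size_bytes / 1e9:.1f} GB")
```

That lands at roughly 2.7GB, close to the observed 2.6GB.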

  column_name io_method  evict  n  master_ms  off_ms  on_ms    pct
     periodic    worker    off 10      737.5   747.3  766.3    1.3
     periodic    worker     os 10     2949.1  3000.1 1148.1  -44.6
     periodic    worker     pg 10      745.3   740.1  757.0    1.1
       random    worker    off 10     1491.4  1503.2 1103.8  -15.3
       random    worker     os 10    11236.7 12100.0 2363.9  -67.3
       random    worker     pg 10     1523.0  1489.6 1122.2  -14.1
   sequential    worker    off 10      627.6   629.9  377.8  -25.0
   sequential    worker     os 10      963.4   975.0  588.6  -24.7
   sequential    worker     pg 10      630.8   631.4  372.2  -25.8

  column_name io_method  evict  n  master_ms  off_ms  on_ms    pct
     periodic  io_uring    off 10      752.3   753.1  858.3    6.5
     periodic  io_uring     os 10     2956.4  2800.6 1210.0  -39.7
     periodic  io_uring     pg 10      738.5   750.0  858.3    6.7
       random  io_uring    off 10     1502.2  1506.3 1691.6    5.8
       random  io_uring     os 10     1905.9 11159.4 2756.7  -60.4
       random  io_uring     pg 10     1535.6  1512.0 1680.4    5.3
   sequential  io_uring    off 10      628.5   630.6  584.4   -3.8
   sequential  io_uring     os 10      993.1   962.2  937.6   -1.3
   sequential  io_uring     pg 10      625.7   634.6  587.1   -3.9

Clearly there are significant benefits, but also some interesting
differences in behavior between worker and io_uring. I haven't looked
into the details, but my guess would be it's due to the different
"parallelism" of the two methods (io_uring is confined to a single
process, while worker can spread the work over multiple processes). That
does not surprise me; there is no single "best" I/O method, and there
never will be.

Another thing to consider is that the 2.6GB table is much larger than
shared_buffers, which means it will never be "fully cached" in PG.


regards

-- 
Tomas Vondra


