Hi,

On 2026-02-15 14:34:07 -0500, Andres Freund wrote:
> debug_io_direct=data, enable_indexscan_prefetch=1, w/ stream->distance * 2 + 1
>  Index Scan using idx_periodic_100000 on prefetch_test_data_100000  (cost=0.29..15351.09 rows=100000 width=8) (actual time=0.316..176.703 rows=100000.00 loops=1)
>    Index Searches: 1
>    Prefetch: distance=707.476 count=11158 stalls=88503 skipped=0 resets=0 pauses=26 ungets=0 forwarded=0
>              histogram [2,4) => 5, [4,8) => 11, [8,16) => 26, [16,32) => 30, [32,64) => 54, [64,128) => 109, [128,256) => 221, [256,512) => 428, [512,1024) => 10274
>    Buffers: shared hit=96875 read=3400
>    I/O Timings: shared read=33.874
>  Planning:
>    Buffers: shared hit=78 read=21
>    I/O Timings: shared read=2.772
>  Planning Time: 3.065 ms
>  Execution Time: 182.959 ms
>
> The stall stats are bogus, because they get incremented even when we are
> correctly not prefetching due to everything being in shared buffers. I think
> the
>   if (distance == 1) stats.nstalls++
> would need to be just before the WaitReadBuffers().
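Roughly like this, as a sketch only (the field names are modeled on the read
stream code and the patch, and may not match it exactly):

  /*
   * Count a stall only when we are actually about to block on an IO.
   * Incrementing wherever distance collapses to 1 also fires when every
   * block was found in shared buffers and no IO was needed at all, which
   * is what inflates the stalls=88503 figure above.
   */
  if (stream->distance == 1)
      stream->stats.nstalls++;
  WaitReadBuffers(&stream->ios[stream->oldest_io_index].op);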

The histogram and distance are also somewhat misleading: they measure what the
distance is at the time the next block is determined, but that's not really
informative, as the distance can be much bigger than the IO we are actually
doing (to allow for IO combining etc.).  In a case with random-ish IOs, the
limit on the number of in-flight IOs will be the limiting factor, and that is
also what really matters for performance.
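Concretely, that would mean sampling the in-flight IO count at that point,
e.g. (just a sketch, the stats struct and its field names are invented for
illustration; pg_leftmost_one_pos32() picks the power-of-two bucket used in
the histograms below):

  /*
   * Accumulate the prefetch stats from the number of IOs currently in
   * flight instead of from the look-ahead distance.
   */
  stream->stats.distance_sum += stream->ios_in_progress;
  stream->stats.distance_samples++;

  /* buckets are [1,2), [2,4), [4,8), ...; zero in flight isn't sampled */
  if (stream->ios_in_progress > 0)
      stream->stats.histogram[pg_leftmost_one_pos32((uint32) stream->ios_in_progress)]++;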

For a query with lots of random prefetching, run with
effective_io_concurrency=8, compare:

stats using stream->distance:

 Limit  (cost=0.43..150968.68 rows=1000000 width=8) (actual time=0.138..5425.851 rows=1000000.00 loops=1)
   Buffers: shared hit=701118 read=301757
   I/O Timings: shared read=3919.597
   ->  Index Scan using idx_random_10000000 on prefetch_test_data_10000000  (cost=0.43..1509682.87 rows=10000000 width=8) (actual time=0.137..5331.243 rows=1000000.00 loops=1)
         Index Searches: 1
         Prefetch: distance=143.999 count=1297237 stalls=6 skipped=3 resets=0 pauses=0 ungets=297462 forwarded=2
                   histogram [16,32) => 9, [32,64) => 2, [64,128) => 2, [128,256) => 1297224
         Buffers: shared hit=701118 read=301757
         I/O Timings: shared read=3919.597
 Planning:
   Buffers: shared hit=96 read=7
   I/O Timings: shared read=0.812
 Planning Time: 1.310 ms
 Execution Time: 5496.622 ms


stats using stream->ios_in_progress:

 Limit  (cost=0.43..150968.68 rows=1000000 width=8) (actual time=0.157..5427.138 rows=1000000.00 loops=1)
   Buffers: shared hit=701118 read=301757
   I/O Timings: shared read=3919.857
   ->  Index Scan using idx_random_10000000 on prefetch_test_data_10000000  (cost=0.43..1509682.87 rows=10000000 width=8) (actual time=0.156..5332.607 rows=1000000.00 loops=1)
         Index Searches: 1
         Prefetch: distance=6.864 count=1297237 stalls=6 skipped=3 resets=0 pauses=0 ungets=297462 forwarded=106
                   histogram [1,2) => 1164, [2,4) => 12634, [4,8) => 1283335
         Buffers: shared hit=701118 read=301757
         I/O Timings: shared read=3919.857
 Planning:
   Buffers: shared hit=96 read=7
   I/O Timings: shared read=0.809
 Planning Time: 1.305 ms
 Execution Time: 5502.332 ms


The former sure makes it look like the bottleneck isn't the number of IOs in
flight; after all, the distance is quite deep.  Yet, if you increase
effective_io_concurrency:

stats using stream->distance:

 Limit  (cost=0.43..150968.68 rows=1000000 width=8) (actual time=0.138..2769.847 rows=1000000.00 loops=1)
   Buffers: shared hit=701960 read=301786
   I/O Timings: shared read=1100.819
   ->  Index Scan using idx_random_10000000 on prefetch_test_data_10000000  (cost=0.43..1509682.87 rows=10000000 width=8) (actual time=0.137..2667.780 rows=1000000.00 loops=1)
         Index Searches: 1
         Prefetch: distance=1038.301 count=1293137 stalls=0 skipped=3 resets=0 pauses=694 ungets=291801 forwarded=2
                   histogram [16,32) => 16, [32,64) => 18, [64,128) => 33, [128,256) => 2, [256,512) => 3, [512,1024) => 125045, [1024,2048) => 1168020
         Buffers: shared hit=701960 read=301786
         I/O Timings: shared read=1100.819
 Planning:
   Buffers: shared hit=96 read=7
   I/O Timings: shared read=0.819
 Planning Time: 1.346 ms
 Execution Time: 2845.891 ms


stats using stream->ios_in_progress:

 Limit  (cost=0.43..150968.68 rows=1000000 width=8) (actual time=0.136..2767.839 rows=1000000.00 loops=1)
   Buffers: shared hit=701960 read=301786
   I/O Timings: shared read=1102.115
   ->  Index Scan using idx_random_10000000 on prefetch_test_data_10000000  (cost=0.43..1509682.87 rows=10000000 width=8) (actual time=0.135..2665.838 rows=1000000.00 loops=1)
         Index Searches: 1
         Prefetch: distance=61.054 count=1293134 stalls=0 skipped=3 resets=0 pauses=694 ungets=291798 forwarded=4
                   histogram [1,2) => 1, [2,4) => 2, [4,8) => 4, [8,16) => 9, [16,32) => 8737, [32,64) => 1284379
         Buffers: shared hit=701960 read=301786
         I/O Timings: shared read=1102.115
 Planning:
   Buffers: shared hit=96 read=7
   I/O Timings: shared read=0.819
 Planning Time: 1.334 ms
 Execution Time: 2846.017 ms


If you want to measure in the domain of blocks, rather than IOs, I'd suggest
at least using the number of currently pinned buffers rather than the
distance.  But I think IOs are a much more useful measure.
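That would just be a different field at the same sampling spot, roughly (again
only a sketch with invented stats names):

  /* block-domain alternative: sample the currently pinned buffers instead */
  if (stream->pinned_buffers > 0)
      stream->stats.histogram[pg_leftmost_one_pos32((uint32) stream->pinned_buffers)]++;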

E.g. for prefetching an index scan on a sequential column, with:

effective_io_concurrency=128 io_combine_limit=16

stats using stream->distance:
   Prefetch: distance=18.872 count=320174 stalls=30736 skipped=9687128 resets=0 pauses=7684 ungets=0 forwarded=0
             histogram [1,2) => 3416, [2,4) => 23052, [4,8) => 53788, [8,16) => 115244, [16,32) => 124654, [32,64) => 20
   Buffers: shared hit=27325 read=312500
   I/O Timings: shared read=1794.060


stats using stream->pinned_buffers:
   Prefetch: distance=8.259 count=320174 stalls=30736 skipped=9687128 resets=0 pauses=7684 ungets=0 forwarded=29897
             histogram [1,2) => 23052, [2,4) => 58070, [4,8) => 84497, [8,16) => 105873, [16,32) => 18785
   Buffers: shared hit=27325 read=312500
   I/O Timings: shared read=887.850


stats using stream->ios_in_progress:
   Prefetch: distance=0.059 count=320174 stalls=30736 skipped=9687128 resets=0 pauses=7684 ungets=0 forwarded=301389
             histogram [1,2) => 18785
   Buffers: shared hit=27325 read=312500
   I/O Timings: shared read=881.723


Only the ios_in_progress one really shows that we are basically not doing any
readahead, due to heapam pausing the stream because it thinks we're far enough
ahead - but due to IO combining, the batchdistance limit is reached before we
can issue multiple IOs.


effective_io_concurrency=128 io_combine_limit=4:

stats using stream->ios_in_progress:
   Prefetch: distance=4.200 count=328882 stalls=1 skipped=9687128 resets=0 pauses=16391 ungets=0 forwarded=9
             histogram [1,2) => 8, [2,4) => 129775, [4,8) => 199090
   Buffers: shared hit=27325 read=312500
   I/O Timings: shared read=300.284


FWIW, if I change the batchdistance <= 2 check to <= 8, I get good perf even
with io_combine_limit=16:

stats using stream->ios_in_progress:
   Prefetch: distance=2.605 count=315526 stalls=3 skipped=9687128 resets=0 pauses=3035 ungets=0 forwarded=50
             histogram [1,2) => 72679, [2,4) => 170115, [4,8) => 72682
   Buffers: shared hit=27325 read=312500
   I/O Timings: shared read=125.902

but that was just an experiment.
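Expressed as a sketch (only the "batchdistance <= 2" comparison is taken from
the patch, the surrounding names are placeholders):

  /*
   * Keep reading ahead only while we are at most this many batches ahead.
   * Because one combined IO can cover up to io_combine_limit blocks, with
   * io_combine_limit=16 the limit is reached before multiple IOs are in
   * flight, which is why the readahead collapses above.
   */
  if (batchdistance <= 2)          /* the experiment raises this to 8 */
      continue_prefetching();      /* placeholder */
  else
      pause_prefetching();         /* placeholder */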

Greetings,

Andres Freund

