Hi,

On 2026-02-17 15:16:06 -0500, Peter Geoghegan wrote:
> Passing down a LIMIT N hint has proven to be a good idea -- and it
> doesn't really require applying any information related to the read
> stream.

Yea, that seems like something that should obviously be done.


> That's enough to prevent problems in the really extreme cases
> (e.g., nested loop antijoins with a LIMIT 1 on the inner side). The
> problematic merge join I showed you is a not-so-extreme case, which
> makes it trickier. ISTM that taking into consideration the number of
> "consumed" batches will not help that particular merge join query,
> precisely because it's not-so-extreme: the inner index scan consumes
> plenty of batches, but is nevertheless significantly regressed (at
> least when we don't yield at all).

Interestingly, I can't reproduce a regression compared to index prefetching
being disabled.

If I evict just prefetch_customers between runs, I see prefetch + no-yield
being the fastest by a good amount.

If I evict prefetch_customers as well as prefetch_customers_pkey, yielding
wins, but only barely. Which I guess makes sense: the index reads are
synchronous random reads, and we do more of those if we prefetch too
aggressively.


I ran the queries both with pgbench (in a script that evicts the buffers, but
then just looks at the per-statement time for the SELECT, over 30 iterations)
and, separately, interactively with EXPLAIN ANALYZE to get IO stats.


This is with debug_io_direct=data; were you measuring this without DIO? If so,
was the data in the kernel page cache, or did you evict it from there?


We really should add a function to pg_prewarm (or pg_buffercache, or ...) that
evicts pages in a targeted way from the kernel page cache... Flushing the
entire kernel page cache leads to undesirable noise, because it also evicts
filesystem metadata (on some filesystems at least) etc.
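
Just to sketch the rough shape: per file this presumably boils down to little
more than a posix_fadvise(POSIX_FADV_DONTNEED) call. The helper below is purely
illustrative (name and error handling made up), a real version would obviously
have to loop over the relation's segment files:

#include <fcntl.h>
#include <unistd.h>

/*
 * Illustrative only: drop one file's pages from the kernel page cache in a
 * targeted way, instead of flushing the entire page cache.
 */
static int
evict_file_from_kernel_cache(const char *path)
{
	int			fd = open(path, O_RDONLY);

	if (fd < 0)
		return -1;

	/*
	 * Write back any dirty pages first, POSIX_FADV_DONTNEED only reliably
	 * drops clean pages.
	 */
	if (fsync(fd) != 0)
	{
		close(fd);
		return -1;
	}

	/* offset 0, length 0 means the whole file */
	if (posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED) != 0)
	{
		close(fd);
		return -1;
	}

	close(fd);
	return 0;
}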


evicting prefetch_customers:

index_prefetch = 0:

    pgbench: 23.970

        ->  Index Scan using prefetch_customers_pkey on prefetch_customers c (actual time=0.066..21.465 rows=1858.00 loops=1)
              Filter: (region_id = 4)
              Rows Removed by Filter: 35285
              Index Searches: 1
              Buffers: shared hit=103 read=237
              I/O Timings: shared read=16.696

(the variability here is huge for some reason, anything between 11.5 and
27ms, despite cpuidle etc. being disabled; the above is the average)


index_prefetch = 1, yield:

    pgbench: 20.829

        ->  Index Scan using prefetch_customers_pkey on prefetch_customers c (actual time=0.070..13.147 rows=1858.00 loops=1)
              Filter: (region_id = 4)
              Rows Removed by Filter: 35285
              Index Searches: 1
              Prefetch: distance=7.895 count=266 stalls=1 skipped=31356 resets=0 pauses=33 ungets=0 forwarded=0
                        histogram [2,4) => 2, [4,8) => 4, [8,16) => 260
              Buffers: shared hit=103 read=237
              I/O Timings: shared read=8.302


index_prefetch = 1, no yield:

    pgbench: 17.384

        ->  Index Scan using prefetch_customers_pkey on prefetch_customers c (actual time=0.070..7.354 rows=1858.00 loops=1)
              Filter: (region_id = 4)
              Rows Removed by Filter: 35285
              Index Searches: 1
              Prefetch: distance=295.456 count=419 stalls=1 skipped=59537 resets=0 pauses=38 ungets=0 forwarded=0
                        histogram [2,4) => 2, [4,8) => 4, [8,16) => 8, [16,32) => 16, [32,64) => 32, [64,128) => 48, [128,256) => 80, [256,512) => 42, [512,10>
              Buffers: shared hit=166 read=385
              I/O Timings: shared read=1.571



evicting prefetch_customers, prefetch_customers_pkey:


index_prefetch = 0:

    pgbench: 31.248

        ->  Index Scan using prefetch_customers_pkey on prefetch_customers c (actual time=0.067..21.231 rows=1858.00 loops=1)
              Filter: (region_id = 4)
              Rows Removed by Filter: 35285
              Index Searches: 1
              Buffers: shared hit=2 read=338
              I/O Timings: shared read=16.396


index_prefetch = 1, yield:

    pgbench: 25.823

        ->  Index Scan using prefetch_customers_pkey on prefetch_customers c (actual time=0.070..16.397 rows=1858.00 loops=1)
              Filter: (region_id = 4)
              Rows Removed by Filter: 35285
              Index Searches: 1
              Prefetch: distance=7.895 count=266 stalls=1 skipped=31356 resets=0 pauses=33 ungets=0 forwarded=0
                        histogram [2,4) => 2, [4,8) => 4, [8,16) => 260
              Buffers: shared hit=2 read=338
              I/O Timings: shared read=11.524


index_prefetch = 1, no yield:

    pgbench: 25.923

        ->  Index Scan using prefetch_customers_pkey on prefetch_customers c (actual rows=1858.00 loops=1)
              Filter: (region_id = 4)
              Rows Removed by Filter: 35285
              Index Searches: 1
              Prefetch: distance=295.456 count=419 stalls=1 skipped=59537 resets=0 pauses=38 ungets=0 forwarded=0
                        histogram [2,4) => 2, [4,8) => 4, [8,16) => 8, [16,32) => 16, [32,64) => 32, [64,128) => 48, [128,256) => 80, [256,512) => 42, [512,1024) => 187
              Buffers: shared hit=2 read=549
              I/O Timings: shared read=13.195



But it's not hard to see that doing another 211 blocks' worth of IO (549 vs
338 reads) could hurt noticeably. Particularly if it's non-prefetchable IO, as,
I think, is the case for 63 of those 211 blocks (the additional index page
accesses that showed up as 166 vs 103 shared hits in the first scenario, and
which are read synchronously)...




> > I assume that there are no mark & restores in the query, given that 
> > presumably
> > the inner side is unique?
>
> Right; this particular query doesn't use mark and restore.
>
> I do have another test query that does use mark and restore (a
> self-join + range conditions), but so far that doesn't seem to be too
> much of a problem. We have to reset the read stream when we restore a
> mark, which creates noticeable overhead. But (it seems) usually not
> enough added overhead for it to really matter.

I guess you'd have to have oddly many duplicate rows for the prefetch on the
inner side, with some of those filtered out on the inner side, so that you
actually ramp up to a significant distance, and then waste some of that effort
once a matching row is found for the anti-join.  Probably not a huge issue.


> FWIW when the inner side of a merge join is an index-only scan, and we
> have to mark + restore a lot, the patch is quite a lot faster than
> master -- even when everything is cached. We don't have to repeatedly
> do the same VM lookups on the inner side (we can just use our local VM
> cache instead).

Hah.


FWIW, I got a crash in a mark-restore query. I think I see the problem:

        /*
         * Release all currently loaded batches, being sure to avoid freeing
         * markBatch (unless called with complete, where we're supposed to)
         */
	for (uint8 i = batchringbuf->headBatch; i != batchringbuf->nextBatch; i++)
        {
                IndexScanBatch batch = index_scan_batch(scan, i);

                if (complete || batch != markBatch)
                {
                        markBatchFreed = (batch == markBatch);
                        tableam_util_free_batch(scan, batch);
                }
        }

        if (complete && markBatch != NULL && !markBatchFreed)
        {
                /*
		 * We didn't free markBatch because it was no longer loaded in ring
                 * buffer.  Do so now instead.
                 */
                tableam_util_free_batch(scan, markBatch);
        }

If, in the loop, there's a batch after markBatch in the ring, it'll reset
markBatchFreed to false, which then leads to markBatch being freed a second
time by the code after the loop.
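
A minimal fix, assuming the intent is for markBatchFreed to record whether
markBatch was freed at any point during the loop (this is just a sketch
against the snippet above, not against the actual patch), would be to only
ever set the flag instead of overwriting it for every freed batch:

	for (uint8 i = batchringbuf->headBatch; i != batchringbuf->nextBatch; i++)
	{
		IndexScanBatch batch = index_scan_batch(scan, i);

		if (complete || batch != markBatch)
		{
			/* latch the flag, don't clear it again for later batches */
			if (batch == markBatch)
				markBatchFreed = true;

			tableam_util_free_batch(scan, batch);
		}
	}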

Greetings,

Andres Freund

