On 8/15/25 17:09, Andres Freund wrote:
> Hi,
>
> On 2025-08-14 19:36:49 -0400, Andres Freund wrote:
>> On 2025-08-14 17:55:53 -0400, Peter Geoghegan wrote:
>>> On Thu, Aug 14, 2025 at 5:06 PM Peter Geoghegan <p...@bowt.ie> wrote:
>>>>> We can optimize that by deferring the StartBufferIO() if we're
>>>>> encountering a buffer that is undergoing IO, at the cost of some
>>>>> complexity. I'm not sure real-world queries will often encounter
>>>>> the pattern of the same block being read in by a read stream
>>>>> multiple times in close proximity sufficiently often to make that
>>>>> worth it.
>>>>
>>>> We definitely need to be prepared for duplicate prefetch requests
>>>> in the context of index scans.
>>>
>>> Can you (or anybody else) think of a quick and dirty way of working
>>> around the problem on the read stream side? I would like to
>>> prioritize getting the patch into a state where its overall
>>> performance profile "feels right". From there we can iterate on
>>> fixing the underlying issues in more principled ways.
>>
>> I think I can see a way to fix the issue, below read stream.
>> Basically, whenever AsyncReadBuffers() finds a buffer that has
>> ongoing IO, instead of waiting, as we do today, copy the wref to the
>> ReadBuffersOperation() and set a new flag indicating that we are
>> waiting for an IO that was not started by the wref. Then, in
>> WaitReadBuffers(), we wait for such foreign started IOs. That has to
>> be somewhat different code from today, because we have to deal with
>> the fact of the "foreign" IO potentially having failed.
>>
>> I'll try writing a prototype for that tomorrow. I think to actually
>> get that into a committable shape we need a test harness (probably a
>> read stream controlled by an SQL function that gets an array of
>> buffers).
>
> Attached is a prototype of this approach. It does seem to fix this
> issue.
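
Just to check that I'm reading the approach correctly, the shape of the
change seems to be roughly the following. This is only a sketch, with
stand-in types and stub helpers so it's self-contained - the
READ_BUFFERS_FOREIGN_IO flag and the helper names are invented, and the
retry loop is my guess at how the failed-foreign-IO case has to be
handled:

----------------------------------------------------------------------
/*
 * Stand-in definitions; the real types live in PostgreSQL's aio.h and
 * bufmgr.h and are more complex than this.
 */
#include <stdbool.h>
#include <stdint.h>

typedef struct PgAioWaitRef     /* stand-in for the AIO wait reference */
{
    uint64_t aio_index;
} PgAioWaitRef;

typedef struct ReadBuffersOperation /* stand-in, heavily trimmed */
{
    int          flags;
    PgAioWaitRef io_wref;       /* IO to wait for in WaitReadBuffers() */
} ReadBuffersOperation;

#define READ_BUFFERS_FOREIGN_IO 0x01    /* invented flag name */

/* stub helpers, trivial bodies just so the sketch compiles */
static bool io_in_progress(int buf, PgAioWaitRef *wref_out)
{ (void) buf; (void) wref_out; return false; }
static bool buffer_valid(int buf) { (void) buf; return true; }
static void submit_read(ReadBuffersOperation *op, int buf)
{ (void) op; (void) buf; }
static void wait_for_wref(PgAioWaitRef *wref) { (void) wref; }

/*
 * AsyncReadBuffers(), the relevant part: on hitting a buffer whose IO
 * somebody else already started, don't wait synchronously (that's the
 * stall); remember the in-flight IO's wref and flag the operation.
 */
static void
async_read_one(ReadBuffersOperation *op, int buf)
{
    PgAioWaitRef wref;

    if (io_in_progress(buf, &wref))
    {
        op->io_wref = wref;
        op->flags |= READ_BUFFERS_FOREIGN_IO;
        return;                 /* defer the wait to WaitReadBuffers() */
    }

    submit_read(op, buf);       /* normal path: start our own IO */
}

/*
 * WaitReadBuffers(), the relevant part: a foreign IO may have failed
 * without us ever seeing the error, so after waiting we recheck the
 * buffer and restart the read ourselves if it's still not valid.
 * (Waiting for the IO we submitted ourselves is omitted here.)
 */
static void
wait_read_one(ReadBuffersOperation *op, int buf)
{
    while (op->flags & READ_BUFFERS_FOREIGN_IO)
    {
        wait_for_wref(&op->io_wref);
        op->flags &= ~READ_BUFFERS_FOREIGN_IO;

        if (!buffer_valid(buf))
            async_read_one(op, buf);    /* foreign IO failed, retry */
    }
}
----------------------------------------------------------------------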
Thanks. Based on the testing so far, the patch seems to be a
substantial improvement. What's needed to make this prototype
committable?

I assume this is a PG19+ improvement, right? It probably affects PG18
too, but it's harder to hit there / the impact is not as bad as on
PG19.

On a related note, my test that generates random datasets / queries
and compares index prefetching with different io_method values found a
pretty massive difference between worker and io_uring. I wonder if
this might be some issue in io_method=worker.

Consider this synthetic dataset:

----------------------------------------------------------------------
create unlogged table t (a bigint, b text) with (fillfactor = 20);

insert into t
select 1 * a, b
  from (
    select r, a, b, generate_series(0,2-1) AS p
      from (
        select row_number() over () AS r, a, b
          from (
            select i AS a, md5(i::text) AS b
              from generate_series(1, 5000000) s(i)
             order by (i + 16 * (random() - 0.5))
          ) foo
      ) bar
  ) baz
 ORDER BY ((r * 2 + p) + 8 * (random() - 0.5));

create index idx on t(a ASC) with (deduplicate_items=false);

vacuum freeze t;
analyze t;

SELECT * FROM t WHERE a BETWEEN 16150 AND 4540437 ORDER BY a ASC;
----------------------------------------------------------------------

On master (or with index prefetching disabled), this gets executed
like this (cold caches):

                              QUERY PLAN
----------------------------------------------------------------------
 Index Scan using idx on t (actual rows=9048576.00 loops=1)
   Index Cond: ((a >= 16150) AND (a <= 4540437))
   Index Searches: 1
   Buffers: shared hit=2577599 read=455610
 Planning:
   Buffers: shared hit=82 read=21
 Planning Time: 5.982 ms
 Execution Time: 1691.708 ms
(8 rows)

while with index prefetching (with the aio prototype patch) it looks
like this:

                              QUERY PLAN
----------------------------------------------------------------------
 Index Scan using idx on t (actual rows=9048576.00 loops=1)
   Index Cond: ((a >= 16150) AND (a <= 4540437))
   Index Searches: 1
   Prefetch Distance: 2.032
   Prefetch Count: 868165
   Prefetch Stalls: 2140228
   Prefetch Skips: 6039906
   Prefetch Resets: 0
   Stream Ungets: 0
   Stream Forwarded: 4
   Prefetch Histogram: [2,4) => 855753, [4,8) => 12412
   Buffers: shared hit=2577599 read=455610
 Planning:
   Buffers: shared hit=78 read=26 dirtied=1
 Planning Time: 1.032 ms
 Execution Time: 3150.578 ms
(16 rows)

So it's about 2x slower. The prefetch distance collapses because there
are a lot of cache hits (about 50% of the requests seem to be hits on
already-visited blocks). I think that's a problem with how we adjust
the distance, but I'll post about that separately.

Let's try simply setting io_method=io_uring:

                              QUERY PLAN
----------------------------------------------------------------------
 Index Scan using idx on t (actual rows=9048576.00 loops=1)
   Index Cond: ((a >= 16150) AND (a <= 4540437))
   Index Searches: 1
   Prefetch Distance: 2.032
   Prefetch Count: 868165
   Prefetch Stalls: 2140228
   Prefetch Skips: 6039906
   Prefetch Resets: 0
   Stream Ungets: 0
   Stream Forwarded: 4
   Prefetch Histogram: [2,4) => 855753, [4,8) => 12412
   Buffers: shared hit=2577599 read=455610
 Planning:
   Buffers: shared hit=78 read=26
 Planning Time: 2.212 ms
 Execution Time: 1837.615 ms
(16 rows)

That's much closer to master (and the difference could be mostly
noise).

I'm not sure what's causing this, but almost all the regressions my
script is finding look like this: always io_method=worker, with a
prefetch distance close to 2.0. Is this some inherent io_method=worker
overhead?


regards

-- 
Tomas Vondra