On 8/15/25 17:09, Andres Freund wrote:
> Hi,
> 
> On 2025-08-14 19:36:49 -0400, Andres Freund wrote:
>> On 2025-08-14 17:55:53 -0400, Peter Geoghegan wrote:
>>> On Thu, Aug 14, 2025 at 5:06 PM Peter Geoghegan <p...@bowt.ie> wrote:
>>>>> We can optimize that by deferring the StartBufferIO() if we're
>>>>> encountering a buffer that is undergoing IO, at the cost of some
>>>>> complexity.  I'm not sure real-world queries will often encounter the
>>>>> pattern of the same block being read in by a read stream multiple
>>>>> times in close proximity sufficiently often to make that worth it.
>>>>
>>>> We definitely need to be prepared for duplicate prefetch requests in
>>>> the context of index scans.
>>>
>>> Can you (or anybody else) think of a quick and dirty way of working
>>> around the problem on the read stream side? I would like to prioritize
>>> getting the patch into a state where its overall performance profile
>>> "feels right". From there we can iterate on fixing the underlying
>>> issues in more principled ways.
>>
>> I think I can see a way to fix the issue, below read stream. Basically,
>> whenever AsyncReadBuffers() finds a buffer that has ongoing IO, instead of
>> waiting, as we do today, copy the wref to the ReadBuffersOperation() and
>> set a new flag indicating that we are waiting for an IO that was not
>> started by the wref. Then, in WaitReadBuffers(), we wait for such foreign
>> started IOs. That has to be somewhat different code from today, because
>> we have to deal with the fact of the "foreign" IO potentially having
>> failed.
>>
>> I'll try writing a prototype for that tomorrow. I think to actually get that
>> into a committable shape we need a test harness (probably a read stream
>> controlled by an SQL function that gets an array of buffers).
> 
> Attached is a prototype of this approach. It does seem to fix this issue.
> 

Thanks. Based on the testing so far, the patch seems to be a substantial
improvement. What's needed to make this prototype committable?
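
As for the test harness Andres mentioned, I imagine it as a small test
module exposing an SQL function that feeds an explicit list of block
numbers into a read stream, roughly like this (the function name and
signature below are made up, nothing like this exists yet):

  -- hypothetical test function: builds a read stream on the given relation
  -- and consumes the listed block numbers in order, so duplicate blocks
  -- (and blocks with IO already in flight) can be exercised deterministically
  SELECT read_stream_test_read('t'::regclass, ARRAY[0, 1, 1, 2, 1]);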

I assume this is a PG19+ improvement, right? The underlying issue probably
affects PG18 too, but it's harder to hit there / the impact is not as bad
as on PG19.


On a related note, my test script that generates random datasets / queries
and compares index prefetching with different io_method values found a
pretty massive difference between worker and io_uring. I wonder if this
might be an issue in io_method=worker itself.

Consider this synthetic dataset:

----------------------------------------------------------------------
create unlogged table t (a bigint, b text) with (fillfactor = 20);

insert into t
select 1 * a, b from (
  select r, a, b, generate_series(0,2-1) AS p
    from (select row_number() over () AS r, a, b from (
      select i AS a, md5(i::text) AS b
        from generate_series(1, 5000000) s(i)
        order by (i + 16 * (random() - 0.5))
      ) foo
  ) bar
) baz ORDER BY ((r * 2 + p) + 8 * (random() - 0.5));

create index idx on t(a ASC) with (deduplicate_items=false);

vacuum freeze t;
analyze t;

SELECT * FROM t WHERE a BETWEEN 16150 AND 4540437 ORDER BY a ASC;
----------------------------------------------------------------------
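
For reference, the plans below were collected roughly like this (per-node
timing disabled to keep the instrumentation overhead out of the way):

  EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF, BUFFERS)
  SELECT * FROM t WHERE a BETWEEN 16150 AND 4540437 ORDER BY a ASC;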

On master (or with index prefetching disabled), this gets executed like
this (cold caches):

                                QUERY PLAN
  ----------------------------------------------------------------------
   Index Scan using idx on t  (actual rows=9048576.00 loops=1)
     Index Cond: ((a >= 16150) AND (a <= 4540437))
     Index Searches: 1
     Buffers: shared hit=2577599 read=455610
   Planning:
     Buffers: shared hit=82 read=21
   Planning Time: 5.982 ms
   Execution Time: 1691.708 ms
  (8 rows)

while with index prefetching (with the aio prototype patch), it looks
like this:

                                QUERY PLAN
  ----------------------------------------------------------------------
   Index Scan using idx on t (actual rows=9048576.00 loops=1)
     Index Cond: ((a >= 16150) AND (a <= 4540437))
     Index Searches: 1
     Prefetch Distance: 2.032
     Prefetch Count: 868165
     Prefetch Stalls: 2140228
     Prefetch Skips: 6039906
     Prefetch Resets: 0
     Stream Ungets: 0
     Stream Forwarded: 4
     Prefetch Histogram: [2,4) => 855753, [4,8) => 12412
     Buffers: shared hit=2577599 read=455610
   Planning:
     Buffers: shared hit=78 read=26 dirtied=1
   Planning Time: 1.032 ms
   Execution Time: 3150.578 ms
  (16 rows)

So it's about 2x slower. The prefetch distance collapses because there are
a lot of cache hits (about 50% of the requests seem to be hits on blocks
we've already visited). I think that's a problem with how we adjust the
distance, but I'll post about that separately.

Let's try simply setting io_method=io_uring:

                                QUERY PLAN
  ----------------------------------------------------------------------
   Index Scan using idx on t  (actual rows=9048576.00 loops=1)
     Index Cond: ((a >= 16150) AND (a <= 4540437))
     Index Searches: 1
     Prefetch Distance: 2.032
     Prefetch Count: 868165
     Prefetch Stalls: 2140228
     Prefetch Skips: 6039906
     Prefetch Resets: 0
     Stream Ungets: 0
     Stream Forwarded: 4
     Prefetch Histogram: [2,4) => 855753, [4,8) => 12412
     Buffers: shared hit=2577599 read=455610
   Planning:
     Buffers: shared hit=78 read=26
   Planning Time: 2.212 ms
   Execution Time: 1837.615 ms
  (16 rows)

That's much closer to master (and the difference could be mostly noise).
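
(For completeness: io_method is a postmaster-level GUC, so each of these
runs needs a server restart, roughly:

  ALTER SYSTEM SET io_method = 'io_uring';  -- or back to 'worker', the default
  -- followed by a restart, e.g. pg_ctl restart

so there's no way to flip between the methods within a session.)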

I'm not sure what's causing this, but almost all regressions my script
is finding look like this - always io_method=worker, with distance close
to 2.0. Is this some inherent io_method=worker overhead?


regards

-- 
Tomas Vondra


