Re: index prefetching

Tomas Vondra Tue, 12 Aug 2025 09:53:36 -0700

On 8/12/25 13:22, Nazir Bilal Yavuz wrote:
> Hi,
> 
> On Tue, 12 Aug 2025 at 08:07, Thomas Munro <[email protected]> wrote:
>>
>> On Tue, Aug 12, 2025 at 11:42 AM Peter Geoghegan <[email protected]> wrote:
>>> On Mon, Aug 11, 2025 at 5:07 PM Tomas Vondra <[email protected]> wrote:
>>>> I can do some tests with forward vs. backwards scans. Of course, the
>>>> trouble with finding these weird cases is that they may be fairly rare.
>>>> So hitting them is a matter or luck or just happening to generate the
>>>> right data / query. But I'll give it a try and we'll see.
>>>
>>> I was talking more about finding "performance bugs" through a
>>> semi-directed process of trying random things while looking out for
>>> discrepancies. Something like that shouldn't require the usual
>>> "benchmarking rigor", since suspicious inconsistencies should be
>>> fairly obvious once encountered. I expect similar queries to have
>>> similar performance, regardless of superficial differences such as
>>> scan direction, DESC vs ASC column order, etc.
>>
>> I'd be interested to hear more about reverse scans.  Bilal was
>> speculating about backwards I/O combining in read_stream.c a while
>> back, but we didn't have anything interesting to use it yet.  You'll
>> probably see a flood of uncombined 8KB IOs in the pg_aios view while
>> travelling up the heap with cache misses today.  I suspect Linux does
>> reverse sequential prefetching with buffered I/O (less sure about
>> other OSes) which should help but we'd still have more overheads than
>> we could if we combined them, not to mention direct I/O.
> 
> If I remember correctly, I didn't continue working on this as I didn't
> see performance improvement. Right now, my changes don't apply cleanly
> to the current HEAD but I can give it another try if you see value in
> this.
> 
>> Not tested, but something like this might do it:
>>
>>                 /* Can we merge it with the pending read? */
>> -               if (stream->pending_read_nblocks > 0 &&
>> -                       stream->pending_read_blocknum +
>> stream->pending_read_nblocks == blocknum)
>> +               if (stream->pending_read_nblocks > 0)
>>                 {
>> -                       stream->pending_read_nblocks++;
>> -                       continue;
>> +                       if (stream->pending_read_blocknum +
>> stream->pending_read_nblocks ==
>> +                               blocknum)
>> +                       {
>> +                               stream->pending_read_nblocks++;
>> +                               continue;
>> +                       }
>> +                       else if (stream->pending_read_blocknum ==
>> blocknum + 1 &&
>> +                                        stream->forwarded_buffers == 0)
>> +                       {
>> +                               stream->pending_read_blocknum--;
>> +                               stream->pending_read_nblocks++;
>> +                               continue;
>> +                       }
>>                 }
> 
> Unfortunately this doesn't work. We need to handle backwards I/O
> combining in the StartReadBuffersImpl() function too as buffer indexes
> won't have correct blocknums. Also, I think buffer forwarding of split
> backwards I/O should be handled in a couple of places.
>


I'm running some tests looking for these weird changes, not just with
the patches, but on master too. And I don't think b4212231 changed the
situation very much.

FWIW this issue is not caused by the index prefetching patches, I can
reproduce it with master (on b227b0bb4e032e19b3679bedac820eba3ac0d1cf
from yesterday). So maybe we should split this into a separate thread.

Consider for example the dataset built by create.sql - it's randomly
generated, but the idea is that it's correlated, but not perfectly. The
table is ~3.7GB, and it's a cold run - caches dropped + restart).

Anyway, a simple range query look like this:

EXPLAIN (ANALYZE, COSTS OFF)
SELECT * FROM t WHERE a BETWEEN 16336 AND 49103 ORDER BY a ASC;

                                QUERY PLAN
------------------------------------------------------------------------
 Index Scan using idx on t
   (actual time=0.584..433.208 rows=1048576.00 loops=1)
   Index Cond: ((a >= 16336) AND (a <= 49103))
   Index Searches: 1
   Buffers: shared hit=7435 read=50872
   I/O Timings: shared read=332.270
 Planning:
   Buffers: shared hit=78 read=23
   I/O Timings: shared read=2.254
 Planning Time: 3.364 ms
 Execution Time: 463.516 ms
(10 rows)

EXPLAIN (ANALYZE, COSTS OFF)
SELECT * FROM t WHERE a BETWEEN 16336 AND 49103 ORDER BY a DESC;

                                QUERY PLAN
------------------------------------------------------------------------
 Index Scan Backward using idx on t
   (actual time=0.566..22002.780 rows=1048576.00 loops=1)
   Index Cond: ((a >= 16336) AND (a <= 49103))
   Index Searches: 1
   Buffers: shared hit=36131 read=50872
   I/O Timings: shared read=21217.995
 Planning:
   Buffers: shared hit=82 read=23
   I/O Timings: shared read=2.375
 Planning Time: 3.478 ms
 Execution Time: 22231.755 ms
(10 rows)

That's a pretty massive difference ... this is on my laptop, and the
timing changes quite a bit, but it's always a multiple of the first
query with forward scan.

I did look into pg_aios, but there's only 8kB requests in both cases. I
didn't have time to look closer yet.


regards

--
Tomas Vondra

create.sql
Description: application/sql

Re: index prefetching

Reply via email to