On 7/16/25 16:07, Peter Geoghegan wrote:
> On Wed, Jul 16, 2025 at 9:58 AM Tomas Vondra <to...@vondra.me> wrote:
>>> The "simple" patch has _bt_readpage reset the read stream. That
>>> doesn't make any sense to me. Though it does explain why the "complex"
>>> patch does so many more fadvise calls.
>>>
>>
>> Why it doesn't make sense? The reset_stream_reset() restarts the stream
>> after it got "terminated" on the preceding leaf page (by returning
>> InvalidBlockNumber).
> 
> Resetting the prefetch distance at the end of _bt_readpage doesn't
> make any sense to me. Why there? It makes about as much sense as doing
> so every 7th index tuple. Reaching the end of _bt_readpage isn't
> meaningful -- since it in no way signifies that the scan has been
> terminated (it might have been, but you're not checking that at all).
> 

Again, resetting the prefetch distance is merely a side-effect (and I
agree it's not desirable). The "reset" merely says the stream is able to
produce blocks again - call the "next" callback etc.

>> It'd be better to "pause" the stream somehow, but
>> there's nothing like that yet. We have to terminate it and start again.
> 
> I don't follow.
> 

The read stream can only return blocks generated by the "next" callback.
When we return the block for the last item on a leaf page, we can only
return "InvalidBlockNumber" which means "no more blocks in the stream".
And once we advance to the next leaf, we say "hey, there's more blocks".
Which is what read_stream_reset() does.

It's a bit like what rescan does.

In an ideal world we'd have a function that'd "pause" the stream,
without resetting the distance etc. But we don't have that, and the
reset thing was suggested to me as a workaround.

>> Te pattern of fadvise+pread for the same block seems a bit silly. And
>> this is not just about "sync" method, the other methods will have a
>> similar issue with no starting the I/O earlier. The fadvise is just
>> easier to trace/inspect.
> 
> It's not at all surprising that you're seeing duplicate prefetch
> requests. I have no reason to believe that it's important to suppress
> those ourselves, rather than leaving it up to the OS (though I also
> have no reason to believe that the opposite is true).
> 

True, but in practice those duplicate calls are fairly expensive. Even
just calling fadvise() on data you already have in page cache costs
something (not much, but it's clearly visible for cached queries).


regards

-- 
Tomas Vondra



Reply via email to