Hi,

On 2026-03-24 21:34:51 -0400, Peter Geoghegan wrote:
> On Tue, Mar 24, 2026 at 1:27 PM Andres Freund <[email protected]> wrote:
> > > But that means that it won't be triggered when we don't enter the "if
> > > (hscan->xs_blk != ItemPointerGetBlockNumber(tid))" block that contains
> > > all this code. Besides, it just doesn't seem possible that
> > > heap_page_prune_opt would release its caller's pin.
> >
> > I was more concerned about read_stream_next_buffer() returning the wrong
> > block, due to prefetching somehow "desynchronizing" with the scan position
> > and catching that when it's clear that we just read a new block, rather
> > than in a place where it could be either the continuation of a scan on the
> > same page or a new page.
>
> Then I don't follow. The existing assertions will catch that (I should
> know, they've failed enough times during development).
>
> Basically, I don't get the concern about heap_page_prune_opt releasing
> its caller's pin. Even if that happened, the existing assertions would
> still catch it.

My point wasn't that heap_page_prune_opt() would release the pin, but that an
assertion failure in the "new block" case tells you that it was definitely
prefetching that resulted in you looking at the wrong block, rather than some
state confusion leading to continuing on the last page when that isn't
correct.  An assertion placed after the if (changed block) doesn't tell you
which of those two went wrong.

But it really doesn't matter enough to continue discussing :)


> > After replacing the pause with an error I found that it's surprisingly
> > easy to hit on slow storage (or on fast storage if you set
> > needed_wait=true in read_stream_next_buffer()).  I've not done any
> > performance validation on whether that means the limit is too low.
>
> It's been a while since I last validated performance to justify the
> current maximum number of batches. I used buffered I/O for that. I'm
> sure that a higher maximum with very slow storage and a very high
> effective_io_concurrency will provide some benefit. But perfectly
> handling that isn't essential for the first committed version of index
> prefetching.

Agreed.


> I must admit I'm unsure how to evaluate the maximum number of batches.
> It can make sense to pursue diminishing returns. But up to what point,
> and according to what principle?

I think the theoretical amount of required IO concurrency can be calculated
based on the storage latency and IOPS.  IIRC it is

iops_qd1 = 1000 / latency_ms
queue_depth = IOPS / iops_qd1
            = IOPS / (1000 / latency_ms)

Of course that's overly simplistic, as typically latency increases the more IO
you issue, which in turn increases the queue depth required to actually fully
utilize the available IOPS. But it's a good approximation.


IIRC the upper end of "SSD like" cloud storage latency is around 3.5ms, and
such volumes tend to top out around 20k IOPS. Those are unfortunately
per-volume limits; on larger instances you can stripe multiple volumes to get
higher IOPS.

Faster tiers are ~0.25ms and capped at ~250k-400k IOPS. I think those are
typically instance limits, but I'm not sure about that.

So on the upper end of the cheaper tiers you need a queue depth of about 70,
and on the upper end of the (very expensive) faster tiers you need about 100.
That's to utilize all IOPS of course; if you have multiple processes doing
IO...

For local NVMe SSDs latencies are around 8-30us and IOPS top out at about 1.5M
(although the latter is extremely hard to sustain; it's definitely only
possible with DIO etc).  That leaves you with QDs of < 50 or so.
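Plugging the rough tier figures above into the formula (a back-of-the-envelope
sketch; all numbers are the approximate values quoted, not measurements):

```python
# queue_depth = IOPS / (1000 / latency_ms), using the ballpark tier figures:
cheap_cloud = 20_000 / (1000 / 3.5)        # ~3.5 ms, 20k IOPS   -> ~70
fast_cloud = 400_000 / (1000 / 0.25)       # ~0.25 ms, 400k IOPS -> ~100
local_nvme = 1_500_000 / (1000 / 0.030)    # ~30 us, 1.5M IOPS   -> ~45
print(cheap_cloud, fast_cloud, local_nvme)
```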


So, to be able to fully utilize current hardware with one query, we need to be
able to reach queue depths in the low hundreds, in the case of striped cheap
cloud SSDs. That's when a backend *just* does IO, nothing else.

Something like an index scan will have its own limit on how much it can
process in a second. If we can only do 100k IOPS while searching the index,
fetching the heap tuples and processing them, we don't need to support queue
depths sufficient for 1M IOPS within one backend.

That's something that can presumably be quite easily experimentally
ballparked:

A fully cached, completely uncorrelated index scan seems to be able to do
about 1.5M page fetches per second on my ~6 year old server CPU with turbo
boost disabled, when never looking at the results (i.e. using OFFSET) or
immediately filtering away the rows. So I'd guess the limit on newer CPUs, in
SKUs optimized for clock speed and with boost enabled, is north of 2.5M
pages/sec - higher than I'd have thought!  That's without doing any IO though.

With correlated scans the limit is much lower, maybe 150k pages/sec, just
because there are so many more tuples per page (and processing them trivially
becomes the bottleneck).


So, to actually utilize the full IO capability, we need to allow for enough
batches to keep a few hundred IOs in flight at the very extreme end.  I'd
assume you have a much better idea of how many batches that translates to?


Just testing a read stream of random 4kB IOs, I can fully saturate all the
SSDs I have, up to ~700k IOPS of random IO.  IIRC I tried this with a few
striped SSDs in the past and got a bit higher than that.


Greetings,

Andres Freund
