Hi all,
I ran some experiments on a few of the questions raised in this thread,
in the hope that the results are useful.
DISTANCE CONTROL
I tested different strategies for growing the distance: 2*d, 2*d+1, d+2,
d+4, and so on. Intuitively, d + io_combine_limit would make sense, but in
the end 2*d gives the best results across different patterns, e.g.
(h{200}m{200}), which seems a reasonable pattern since previous scans
would have loaded data in blocks. These strategies are fundamentally the
same: as I posted earlier with a Markov model, the distance limit
converges to something like max_distance * sigmoid(h * (p - p0)); what
changes is the transient when we enter or leave a cached region.
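To make the steady-state formula concrete, here is a small Python sketch of
my reading of it. The parameter names (h as the steepness, p0 as the break-even
miss probability, max_distance as the configured cap) and the interpretation of
p as the observed miss probability are my assumptions, not something fixed by
the patch:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def steady_state_distance(miss_prob, max_distance=256, h=10.0, p0=0.5):
    """Distance the grow/shrink feedback loop converges to, modeled as
    max_distance * sigmoid(h * (p - p0)).  Hypothetical parameters:
    h controls how sharply the distance reacts around the break-even
    miss probability p0."""
    return max_distance * sigmoid(h * (miss_prob - p0))
```

At p = p0 this yields exactly half of max_distance, and it saturates toward 0
or max_distance as the hit/miss pattern becomes stable, which matches the
transient-only difference between the growth strategies.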
LIMITING PREFETCH
To avoid prefetch waste with a Limit node, wouldn't it make sense for the
executor to send down an estimate of how many rows will still be required?
A strict lower bound would be limit (+ offset) - returned, but a
selectivity factor becomes important when most rows are filtered out. A
good start would be (limit - filtered) * selectivity, with a simple model
for the selectivity being (unfiltered + 1) / (filtered + 1). The +1 terms
account for the uncertainty about whether the next row will be filtered,
avoid a division by zero on the first row, and naturally cap the distance
to the limit when the query starts.
I/O REORDERING
I did an experiment reordering the heap accesses, following a zig-zag
pattern, using constant memory like this:

    fill heap-up with W blocks
    direction = up
    read-block = pop from heap-up
    for each new-block
        if new-block > read-block
            push new-block into heap-up
        else
            push new-block into heap-down
        if heap-up is empty
            direction = down
        if heap-down is empty
            direction = up
        if direction = up
            read-block = pop from heap-up
        else
            read-block = pop from heap-down

The heap-up and heap-down together contain at most W elements, so they can
be stored in complementary regions of the same array.
With a small adjustment (moving one element to the trailing heap when the
new-block is ahead) we ensure that the delay added to any block doesn't
exceed the buffer length, and we can restore the index order with another
heap. This gives an algorithm with O(W) memory and O(W * log(W))
complexity, asymptotically constant in the table size, and it reduces the
total seek distance to roughly 2/W of the original.
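To sanity-check the seek-distance claim, here is a small Python sketch of the
zig-zag reordering (function names are mine; it uses two Python heaps instead
of the shared array, and omits the bounded-delay adjustment):

```python
import heapq
import random

def zigzag_reorder(blocks, W):
    """Reorder a stream of block numbers with at most W buffered entries,
    sweeping alternately ascending (heap-up, a min-heap) and descending
    (heap-down, a max-heap stored negated)."""
    it = iter(blocks)
    up, down = [], []
    for b in it:                        # fill heap-up with W blocks
        heapq.heappush(up, b)
        if len(up) == W:
            break
    if not up:
        return []
    direction_up = True
    out = [heapq.heappop(up)]
    for b in it:
        if b > out[-1]:
            heapq.heappush(up, b)
        else:
            heapq.heappush(down, -b)    # negate to get a max-heap
        if not up:
            direction_up = False
        if not down:
            direction_up = True
        out.append(heapq.heappop(up) if direction_up else -heapq.heappop(down))
    while up or down:                   # drain both heaps at end of stream
        if direction_up and up:
            out.append(heapq.heappop(up))
        elif not direction_up and down:
            out.append(-heapq.heappop(down))
        else:
            direction_up = not direction_up
    return out

def seek_distance(order):
    """Total absolute block-to-block travel, as in the SQL query below."""
    return sum(abs(b - a) for a, b in zip(order, order[1:]))
```

On a random block stream this gives a seek-distance reduction of the same
order as the 78281/1242 ratio measured below.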
Creating a ~400MB table
CREATE TABLE zigzag_test (
id serial, random_key int, data text,
pad1 char(2000), pad2 char(2000), pad3 char(2000)
);
SELECT setseed(0.42);
INSERT INTO zigzag_test (random_key, data, pad1, pad2, pad3)
SELECT (random() * 100000)::int, 'data_' || g, 'x', 'x', 'x'
FROM generate_series(1, 5000) g;
Then accessing 5% of it using an index scan (or the zigzag order that
breaks the index order):
SELECT sum(id), sum(abs(blk - prev_blk)) AS seek_dist FROM (
  SELECT id, (ctid::text::point)[0] AS blk,
         lag((ctid::text::point)[0]) OVER () AS prev_blk
  FROM zigzag_test WHERE random_key < 50000
) AS t;
Measuring time +/- std-dev, using the I(ndex)-order access and the
Z(igzag) order (not exactly the algorithm described above):

 Disk  Prefetch  I order (ms)      Z order (ms)     Diff %   Seek (I/Z)
 HDD   on        250.4 +/- 227.8   168.7 +/- 83.0   -32.6%   78281/1242
 HDD   off       185.8 +/- 269.9   141.9 +/- 77.4   -23.6%   78281/1242
 SSD   on        148.2 +/- 69.7    117.6 +/- 39.6   -20.7%   78281/1242
 SSD   off        66.8 +/- 34.1    121.5 +/- 39.7   +81.8%   78281/1242
The tricky datapoint (and the one with the highest statistical
significance) here is SSD without prefetch in index order.
On small rows and not-so-sparse indices this might help with
pinning/unpinning overhead (??), but for that case we already have bitmap
scans.
Regards,
Alexandre
On Wed, Feb 18, 2026 at 3:39 PM Tomas Vondra <[email protected]> wrote:
>
>
> On 2/18/26 05:21, Andres Freund wrote:
> > Hi,
> >
> > On 2026-02-17 22:36:53 +0100, Tomas Vondra wrote:
> >> On 2/17/26 21:16, Peter Geoghegan wrote:
> >>> On Tue, Feb 17, 2026 at 2:27 PM Andres Freund <[email protected]>
> wrote:
> >>>> On 2026-02-17 12:16:23 -0500, Peter Geoghegan wrote:
> >>>>> On Mon, Feb 16, 2026 at 11:48 AM Andres Freund <[email protected]>
> wrote:
> >>>>> I agree that the current heuristics (which were invented recently)
> are
> >>>>> too conservative. I overfit the heuristics to my current set of
> >>>>> adversarial queries, as a stopgap measure.
> >>>>
> >>>> Are you doing any testing on higher latency storage? I found it to
> be quite
> >>>> valuable to use dm_delay to have a disk with reproducible (i.e. not
> cloud)
> >>>> higher latency (i.e. not just a local SSD).
> >>>
> >>> I sometimes use dm_delay (with the minimum 1ms delay) when testing,
> >>> but don't do so regularly. Just because it's inconvenient to do so
> >>> (perhaps not a great reason).
> >>>
> >>>> Low latency NVMe can reduce the
> >>>> penalty of not enough readahead so much that it's hard to spot
> problems...
> >>>
> >>> I'll keep that in mind.
> >>>
> >>
> >> So, what counts as "higher latency" in this context? What delays should
> >> we consider practical/relevant for testing?
> >
> > 0.5-4ms is the range I've seen in various clouds across their reasonable
> > storage products (i.e. not spinning disks or other ver bulk oriented
> things).
> >
> > Unfortunately dm_delay doesn't support < 1ms delays, but it's still much
> > better than nothing.
> >
> > I've been wondering about teaching AIO to delay IOs (by adding a sleep to
> > workers and linking a IORING_OP_TIMEOUT submission with the actually
> intended
> > IO) to allow testing smaller delays.
> >
>
> Could be useful testing facility, if it's done in a way that does not
> limit the IO concurrency (i.e. the delay should probably be when
> consuming the IO, depending on the timestamp of the IO start).
>
> >
> >>> That would make sense. You can already tell when that's happened by
> >>> comparing the details shown by EXPLAIN ANALYZE against the same query
> >>> execution on master, but that approach is inconvenient. Automating my
> >>> microbenchmarks has proven to be important with this project. There's
> >>> quite a few competing considerations, and it's too easy to improve one
> >>> query at the cost of regressing another.
> >>>
> >>
> >> What counts as "unconsumed IO"? The IOs the stream already started, but
> >> then did not consume? That shouldn't be hard, I think.
> >
> > Yes, the number of IOs that were started but not consumed. Or, even
> better,
> > the number of IOs that completed but were not consumed - but that'd be
> harder
> > to get right now.
> >
> > I agree that started-but-not-consumed should be pretty easy.
> >
>
> I'll try to add it to the EXPLAIN.
>
>
> regards
>
> --
> Tomas Vondra
>
>