Happy St. Patrick's day!

(this was sitting in my drafts)

Based on what I said in previous emails, I see three alternative
proposals:

#1 Make it simpler by not changing the index access methods.

#2 Make it optimal by not using generic index searches
and not keeping multiple open index scans.

and
#3 Follow the pragmatic approach.
The objective is to minimize the number of heap fetches,
staying as high-level as possible and reusing existing functions
instead of writing custom code where possible.



Ants Aasma & Tomas Vondra

> > The workarounds I have proposed to users have been either to rewrite
> > the query as a UNION ALL of a set of single-value prefix queries
> > wrapped in an ORDER BY ... LIMIT. This gives exactly the needed merge
> > append plan shape. But repeating the query N times can get unwieldy
> > when the number of values grows, so the fallback is:
> >
> > SELECT * FROM unnest(:friends) id, LATERAL (
> >     SELECT * FROM posts
> >     WHERE user_id = id
> >     ORDER BY tstamp DESC LIMIT 100)
> > ORDER BY tstamp DESC LIMIT 100;
> >
> > The downside of this formulation is that we still have to fetch a
> > batch worth of items from scans where we otherwise would have only had
> > to look at one index tuple.
> >


> True. It's useful to think about the query this way, and it may be
> better than full select + sort, but it has issues too.
>

An issue with this query is generality: if it is joined with other
queries, we cannot determine the limit in advance.


> > The main problem I can see is that at planning time the cardinality of
> > the prefix array might not be known, and in theory could be in the
> > millions. Having millions of index scans open at the same time is not
> > viable, so the method needs to somehow degrade gracefully. The idea I
> > had is to pick some limit, based on work_mem and/or benchmarking, and
> > once the limit is hit, populate the first batch and then run the next
> > batch of index scans, merging with the first result. Or something like
> > that, I can imagine a few different ways to handle it with different
> > tradeoffs.
> >
>
> Doesn't the proposed merge scan have a similar issue? Because that will
> also have to keep all the index scans open (even if only internally).
> Indeed, it needs to degrade gracefully, in some way.


That is true, but I think we can trust the planner here;
the problem scales similarly for a Memoize node.
Is ~24kB for each open index scan a good guess?
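For scale, taking that ~24kB guess at face value (a back-of-envelope
sketch, not a measured number):

```python
# Back-of-envelope memory cost of keeping many index scans open,
# assuming ~24 kB per scan (a guess, not a measured figure).
SCAN_BYTES = 24 * 1024

def scans_memory_mib(num_scans):
    """Total memory in MiB for num_scans concurrently open index scans."""
    return num_scans * SCAN_BYTES / (1024 * 1024)

print(scans_memory_mib(1000))       # ~23 MiB: fine
print(scans_memory_mib(1_000_000))  # ~23,000 MiB: clearly not viable
```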

ALTERNATIVE #1 - More efficient

Or, to avoid having N open index scans, we could (??):
(1) find the index page for the head of each prefix.
(2) for each prefix:
  (2.a) load tuples from its head page.
  (2.b) if we consume the last tuple in a page, save a pointer
        to the next page.
  (2.c) check whether tuples for the next prefix are in the same page.
  (2.d) release the page.
(3) produce tuples in suffix order:
  (3.a) when the tuples for a prefix are exhausted, load the
        page saved in (2.b).
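A toy sketch of those steps, with made-up in-memory page structures
standing in for btree leaf pages (the names and page layout are
illustrative only, not the actual nbtree format):

```python
import heapq
from itertools import islice

# Toy leaf pages: page_id -> (sorted (prefix, suffix) tuples, next page).
PAGES = {
    "p0": ([("a", 1), ("a", 4)], "p1"),
    "p1": ([("a", 7), ("b", 2)], "p2"),
    "p2": ([("b", 5), ("b", 6)], None),
}

def scan_prefix(prefix, head_page):
    """Steps (2.a)-(2.d): yield suffixes for one prefix, following the
    saved next-page pointer only when the prefix ran off a page's end."""
    page = head_page
    while page is not None:
        tuples, next_page = PAGES[page]  # (2.a) load the page
        for p, s in tuples:
            if p == prefix:
                yield s
        # (2.b)/(2.d): continue only if the page ended inside our prefix
        page = next_page if tuples[-1][0] == prefix else None

def merge_prefixes(prefix_heads, limit):
    """Step (3): produce tuples in global suffix order, pulling lazily
    so a saved page is only loaded when its prefix is actually needed."""
    streams = [scan_prefix(p, h) for p, h in prefix_heads]
    return list(islice(heapq.merge(*streams), limit))

print(merge_prefixes([("a", "p0"), ("b", "p1")], 4))  # [1, 2, 4, 5]
```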


Matthias van de Meent, Feb 3
> btree index skip scan infrastructure efficiently prevents new index
> descents into the index when the selected SAOP key ranges are directly
> adjacent, while merge scan would generally do at least one index
> descent for each of its N scan heads (*) - which in the proposed
> prototype patch guarantees O(index depth * num scan heads) buffer
> accesses.

This could also be addressed if we do a custom descent. I did not
worry about that depth factor because, with only a few random prefixes,
doing so would probably save accesses only at the top levels anyway.


I would prefer to start with a very conceptual implementation
that can already provide a 1000x speedup, but if you think this
way is better, I am open to trying it. I think it can be done
without affecting the planner logic or the PrefixJoin node.


> I'm afraid the
> proposed batched execution will be rather complex, so I'd say v1 should
> simply have a threshold, and do the full scan + sort for more items.


Do you mean an executor node that performs the query as if it were written:

ALTERNATIVE #2 - Simpler (??)
for each _prefix in prefixes:
  result += (SELECT * FROM table
        WHERE prefix = _prefix AND qual(*)
        ORDER BY suffix
        LIMIT N)
return SELECT * FROM result
    ORDER BY suffix
    LIMIT N

This query may have to produce N * len(prefixes) rows, while the
original proposal would produce only N + len(prefixes) - 1.
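The N + len(prefixes) - 1 figure comes from the merge having to prime
one tuple per scan up front, then fetch one replacement per emitted row
except the last. A small counting sketch, with hypothetical integer
streams standing in for the per-prefix index scans:

```python
import heapq

def tuples_fetched_by_merge(streams, n):
    """Count tuples pulled from per-prefix scans by a merge append
    that stops after emitting n rows."""
    pulls = 0
    heap = []
    for i, it in enumerate(map(iter, streams)):
        v = next(it, None)
        pulls += 1            # every scan is primed with its first tuple
        if v is not None:
            heap.append((v, i, it))
    heapq.heapify(heap)
    emitted = 0
    while heap and emitted < n:
        v, i, it = heapq.heappop(heap)
        emitted += 1
        if emitted == n:      # no need to refill after the last row
            break
        nxt = next(it, None)  # one replacement fetch per emitted row
        pulls += 1
        if nxt is not None:
            heapq.heappush(heap, (nxt, i, it))
    return pulls

streams = [range(i, 1000, 5) for i in range(5)]   # 5 prefixes
print(tuples_fetched_by_merge(streams, 100))      # 104 == 100 + 5 - 1
```

ALTERNATIVE #2 would instead fetch up to N rows from every branch,
i.e. 500 in this example.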

Alexandre Felipe, Feb 6
> | Method     | Shared Hit | Shared Read | Exec Time |
> |------------|-----------:|------------:|----------:|
> | Merge      |         13 |         119 |     13 ms |
> | IndexScan  |     15,308 |     525,310 |  3,409 ms |

This Prefix Batch Scan approach:
 hit=62, read=773, Execution Time: 80.815 ms

> > I can imagine that this would really nicely benefit from
> > ReadStream'ification.
>
> Not sure, maybe.
>

Actually, as I was watching the index prefetch development I was
quite uncertain about how this would play with it, but we can
probably simply give each stream a budget.


> > One other connection I see is with block nested loops. In a perfect
> > future PostgreSQL could run the following as a set of merged index
> > scans that terminate early:
> >
> > SELECT posts.*
> > FROM follows f
> >     JOIN posts p ON f.followed_id = p.user_id
> > WHERE f.follower_id = :userid
> > ORDER BY p.tstamp DESC LIMIT 100;
> >
> > In practice this is not a huge issue - it's not that hard to transform
> > this to array_agg and = ANY subqueries.
> >
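For reference, the manual transformation mentioned there would look
something like this (a sketch against the quoted schema, not tested):

```sql
-- Collapse the join into a single = ANY() scan key over posts;
-- the inner scalar subquery aggregates the followed ids into an array.
SELECT p.*
FROM posts p
WHERE p.user_id = ANY ((SELECT array_agg(f.followed_id)
                        FROM follows f
                        WHERE f.follower_id = :userid))
ORDER BY p.tstamp DESC LIMIT 100;
```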

> Automating that transformation seems quite non-trivial (to me).
>

Well, not trivial. To give a rough idea.

wc -l *.patch
     113 v2-0001-Test-the-baseline.patch
     614 v2-0002-Access-method.patch
     850 v2-0003-Planner-integration.patch
    1958 v2-0004-Multi-column.patch
    2439 v2-0005-Joins.patch

It is missing some important details, like prefix deduplication,
but for the scenario where the values in the other table
are known to be unique it works well.

The multi-column patch accepts things like A IN (...) AND B IN (...)
and computes the cartesian product, or (A, B) IN (...)
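i.e. for A IN (...) AND B IN (...) the composite scan keys are just the
cartesian product of the two value lists (illustrative values):

```python
from itertools import product

# A IN (1, 2) AND B IN ('x', 'y') expands to the composite keys below,
# equivalent to (A, B) IN ((1,'x'), (1,'y'), (2,'x'), (2,'y')).
keys = list(product([1, 2], ["x", "y"]))
print(keys)  # [(1, 'x'), (1, 'y'), (2, 'x'), (2, 'y')]
```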


Regards,
Alexandre
