On Tue, 25 Jul 2023 at 03:34, Peter Geoghegan <p...@bowt.ie> wrote:
>
> I've been working on a variety of improvements to nbtree's native
> ScalarArrayOpExpr execution. This builds on Tom's work in commit
> 9e8da0f7.
Cool.

> Attached patch is still at the prototype stage. I'm posting it as v1 a
> little earlier than I usually would because there has been much back
> and forth about it on a couple of other threads involving Tomas Vondra
> and Jeff Davis -- seems like it would be easier to discuss with
> working code available.
>
> The patch adds two closely related enhancements to ScalarArrayOp
> execution by nbtree:
>
> 1. Execution of quals with ScalarArrayOpExpr clauses during nbtree
> index scans (for equality-strategy SK_SEARCHARRAY scan keys) can now
> "advance the scan's array keys locally", which sometimes avoids
> significant amounts of unneeded pinning/locking of the same set of
> index pages.
>
> SAOP index scans become capable of eliding primitive index scans for
> the next set of array keys in line in cases where it isn't truly
> necessary to descend the B-Tree again. Index scans are now capable of
> "sticking with the existing leaf page for now" when it is determined
> that the end of the current set of array keys is physically close to
> the start of the next set of array keys (the next set in line to be
> materialized by the _bt_advance_array_keys state machine). This is
> often possible.
>
> Naturally, we still prefer to advance the array keys in the
> traditional way ("globally") much of the time. That means we'll
> perform another _bt_first/_bt_search descent of the index, starting a
> new primitive index scan. Whether we try to skip pages on the leaf
> level or stick with the current primitive index scan (by advancing
> array keys locally) is likely to vary a great deal, even during the
> same index scan. Everything is decided dynamically, which is the only
> approach that really makes sense.
>
> This optimization can significantly lower the number of buffers pinned
> and locked in cases with significant locality, and/or with many array
> keys with no matches. The savings (when measured in buffers
> pinned/locked) can be as high as 10x, 100x, or even more.
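To check my understanding of the local-vs-global decision described
above, here is a toy model in plain Python (all names are made up by
me, and the comparison against a single high key is a gross
simplification of what the patch presumably does):

```python
# Toy model (NOT the patch's actual logic) of choosing between
# advancing the array keys "locally" on the current leaf page versus
# starting a new primitive index scan with a fresh B-Tree descent.

def advance_array_keys(page_high_key, array_keys, current_idx):
    """array_keys is sorted; current_idx points at the key just exhausted.

    Returns ("local", next_idx) when the next array key in line still
    falls within the current leaf page's key space, so the scan can
    keep reading the same page; ("global", next_idx) when it should
    descend the tree again via _bt_first/_bt_search; ("done", next_idx)
    when no array keys remain.
    """
    next_idx = current_idx + 1
    if next_idx >= len(array_keys):
        return ("done", next_idx)
    if array_keys[next_idx] <= page_high_key:
        return ("local", next_idx)   # stick with the existing leaf page
    return ("global", next_idx)      # new primitive index scan


# Example: leaf page covers keys up to 150; array keys are 100, 140, 900.
print(advance_array_keys(150, [100, 140, 900], 0))  # ('local', 1)
print(advance_array_keys(150, [100, 140, 900], 1))  # ('global', 2)
```

The point being that adjacent array keys landing on the same leaf page
never pay for a second descent, which matches the "significant
locality" cases you mention.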
> Benchmarking
> has shown that transaction throughput for variants of "pgbench -S"
> designed to stress the implementation (hundreds of array constants)
> under concurrent load can have up to 5.5x higher transaction
> throughput with the patch. Less extreme cases (10 array constants,
> spaced apart) see about a 20% improvement in throughput. There are
> similar improvements to latency for the patch, in each case.

Considering that it caches/reuses the page across SAOP operations, can
(or does) this also improve performance for index scans on the outer
side of a join if the order of join columns matches the order of the
index? That is, I believe this caches (leaf) pages across scan keys,
but can (or does) it also reuse these already-cached leaf pages across
restarts of the index scan/across multiple index lookups in the same
plan node, so that retrieval of nearby index values does not need to
do a full index traversal?

> [...]
>
> Skip Scan
> =========
>
> MDAM encompasses something that people tend to call "skip scan" --
> terminology with a great deal of baggage. These days I prefer to call
> it "filling in missing key predicates", per the paper. That's much
> more descriptive, and makes it less likely that people will conflate
> the techniques with InnoDB-style "loose index scans" -- the latter is
> a much more specialized/targeted optimization. (I now believe that
> these are very different things, though I was thrown off by the
> superficial similarities for a long time. It's pretty confusing.)

I'm not sure I understand. MDAM seems to work on an index level to
return full ranges of values, while "skip scan" seems to try to allow
systems to signal to the index to skip to some other index condition
based on arbitrary cutoffs.
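To illustrate my understanding of the MDAM side of that distinction,
here is a toy model in Python: a sorted list stands in for the index
on (a, b), and bisection stands in for B-Tree descents. For "WHERE
b = 5" the missing predicate on the leading column a is "filled in" by
enumerating a's distinct values and probing each one (all names and
structure are mine, purely illustrative):

```python
from bisect import bisect_left, bisect_right

# Toy model of MDAM-style "filling in missing key predicates": for a
# query like "WHERE b = 5" on an index over (a, b), enumerate the
# distinct values of the omitted leading column a and probe (a, 5) for
# each, rather than scanning every index entry.

def mdam_scan(index, b_value):
    """index is a sorted list of (a, b) tuples; returns matching rows."""
    results, pos = [], 0
    while pos < len(index):
        a = index[pos][0]                      # next distinct value of a
        lo = bisect_left(index, (a, b_value))  # probe (a, b_value) ...
        hi = bisect_right(index, (a, b_value))
        results.extend(index[lo:hi])           # ... collecting the full range
        # skip past every remaining entry with this value of a
        pos = bisect_right(index, (a, float("inf")))
    return results

index = [(1, 4), (1, 5), (2, 3), (2, 5), (2, 7), (3, 5)]
print(mdam_scan(index, 5))  # [(1, 5), (2, 5), (3, 5)]
```

Note that every decision here is derived from the scan keys and the
index contents alone -- which is exactly what the skip scan I describe
next does not have.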
Skip scan would usually be used for cutoffs whose information is not
stored in the index itself, such as "SELECT user_id FROM orders GROUP
BY user_id HAVING COUNT(*) > 10", where the scan would go through the
user_id index and skip to the next user_id value once it has seen
enough rows for a matching result (where "enough" is determined above
the index AM's plan node, or is otherwise impossible to determine with
only the scan key info in the index AM). I'm not sure how this could
work without specifically adding skip scan-related index AM
functionality, and I don't see how it fits in with this MDAM/SAOP
system.

> [...]
>
> Thoughts?

MDAM seems to require exponential storage for "scan key operations"
for conditions on N columns (to be precise, the product of the number
of distinct conditions on each column); e.g. an index on mytable
(a,b,c,d,e,f,g,h) with conditions "a IN (1, 2) AND b IN (1, 2) AND ...
AND h IN (1, 2)" would require 2^8 entries. If 4 conditions were used
for each column, that'd be 4^8, etc. With an index column limit of 32,
that's quite a lot of memory potentially needed to execute the
statement.

So this raises the question: does this patch have the same issue? Does
it fail with OOM, does it gracefully fall back to the old behaviour
when the clauses are too complex to linearize/compose/fold into the
btree ordering clauses, or are scan keys constructed dynamically,
using just-in-time or generator patterns?

Kind regards,

Matthias van de Meent
Neon (https://neon.tech/)
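PS: for reference, the back-of-the-envelope arithmetic behind the
storage concern above, in trivial Python (just the product of
per-column condition counts, nothing patch-specific):

```python
from math import prod

# Number of scan key combinations MDAM would have to materialize up
# front if it enumerates the full cartesian product: the product of
# the number of distinct conditions on each indexed column.

def mdam_combinations(conds_per_column):
    return prod(conds_per_column)

print(mdam_combinations([2] * 8))   # 2^8  = 256
print(mdam_combinations([4] * 8))   # 4^8  = 65536
print(mdam_combinations([4] * 32))  # 4^32 = 2^64, clearly too many
```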