On Tue, 25 Jul 2023 at 03:34, Peter Geoghegan <p...@bowt.ie> wrote:
>
> I've been working on a variety of improvements to nbtree's native
> ScalarArrayOpExpr execution. This builds on Tom's work in commit
> 9e8da0f7.
Cool.

> Attached patch is still at the prototype stage. I'm posting it as v1 a
> little earlier than I usually would because there has been much back
> and forth about it on a couple of other threads involving Tomas Vondra
> and Jeff Davis -- seems like it would be easier to discuss with
> working code available.
>
> The patch adds two closely related enhancements to ScalarArrayOp
> execution by nbtree:
>
> 1. Execution of quals with ScalarArrayOpExpr clauses during nbtree
> index scans (for equality-strategy SK_SEARCHARRAY scan keys) can now
> "advance the scan's array keys locally", which sometimes avoids
> significant amounts of unneeded pinning/locking of the same set of
> index pages.
>
> SAOP index scans become capable of eliding primitive index scans for
> the next set of array keys in line in cases where it isn't truly
> necessary to descend the B-Tree again. Index scans are now capable of
> "sticking with the existing leaf page for now" when it is determined
> that the end of the current set of array keys is physically close to
> the start of the next set of array keys (the next set in line to be
> materialized by the _bt_advance_array_keys state machine). This is
> often possible.
>
> Naturally, we still prefer to advance the array keys in the
> traditional way ("globally") much of the time. That means we'll
> perform another _bt_first/_bt_search descent of the index, starting a
> new primitive index scan. Whether we try to skip pages on the leaf
> level or stick with the current primitive index scan (by advancing
> array keys locally) is likely to vary a great deal, even during the
> same index scan. Everything is decided dynamically, which is the only
> approach that really makes sense.
>
> This optimization can significantly lower the number of buffers pinned
> and locked in cases with significant locality, and/or with many array
> keys with no matches. The savings (when measured in buffers
> pinned/locked) can be as high as 10x, 100x, or even more.
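To check my understanding of the local-vs-global decision described
above, here is a toy model in plain Python (all names are made up by
me, and the comparison against a single high key is a gross
simplification of what the patch presumably does):

```python
# Toy model (NOT the patch's actual logic) of choosing between
# advancing the array keys "locally" on the current leaf page versus
# starting a new primitive index scan with a fresh B-Tree descent.

def advance_array_keys(page_high_key, array_keys, current_idx):
    """array_keys is sorted; current_idx points at the key just exhausted.

    Returns ("local", next_idx) when the next array key in line still
    falls within the current leaf page's key space, so the scan can
    keep reading the same page; ("global", next_idx) when it should
    descend the tree again via _bt_first/_bt_search; ("done", next_idx)
    when no array keys remain.
    """
    next_idx = current_idx + 1
    if next_idx >= len(array_keys):
        return ("done", next_idx)
    if array_keys[next_idx] <= page_high_key:
        return ("local", next_idx)   # stick with the existing leaf page
    return ("global", next_idx)      # new primitive index scan


# Example: leaf page covers keys up to 150; array keys are 100, 140, 900.
print(advance_array_keys(150, [100, 140, 900], 0))  # ('local', 1)
print(advance_array_keys(150, [100, 140, 900], 1))  # ('global', 2)
```

The point being that adjacent array keys landing on the same leaf page
never pay for a second descent, which matches the "significant
locality" cases you mention.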
> Benchmarking
> has shown that transaction throughput for variants of "pgbench -S"
> designed to stress the implementation (hundreds of array constants)
> under concurrent load can have up to 5.5x higher transaction
> throughput with the patch. Less extreme cases (10 array constants,
> spaced apart) see about a 20% improvement in throughput. There are
> similar improvements to latency for the patch, in each case.

Considering that it caches/reuses the page across SAOP operations, can
(or does) this also improve performance for index scans on the outer
side of a join if the order of join columns matches the order of the
index? That is, I believe this caches (leaf) pages across scan keys,
but can (or does) it also reuse these already-cached leaf pages across
restarts of the index scan/across multiple index lookups in the same
plan node, so that retrieval of nearby index values does not need to
do a full index traversal?

> [...]
>
> Skip Scan
> =========
>
> MDAM encompasses something that people tend to call "skip scan" --
> terminology with a great deal of baggage. These days I prefer to call
> it "filling in missing key predicates", per the paper. That's much
> more descriptive, and makes it less likely that people will conflate
> the techniques with InnoDB-style "loose index scans" -- the latter is
> a much more specialized/targeted optimization. (I now believe that
> these are very different things, though I was thrown off by the
> superficial similarities for a long time. It's pretty confusing.)

I'm not sure I understand. MDAM seems to work on an index level to
return full ranges of values, while "skip scan" seems to try to allow
systems to signal to the index to skip to some other index condition
based on arbitrary cutoffs.
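To illustrate my understanding of the MDAM side of that distinction,
here is a toy model in Python: a sorted list stands in for the index
on (a, b), and bisection stands in for B-Tree descents. For "WHERE
b = 5" the missing predicate on the leading column a is "filled in" by
enumerating a's distinct values and probing each one (all names and
structure are mine, purely illustrative):

```python
from bisect import bisect_left, bisect_right

# Toy model of MDAM-style "filling in missing key predicates": for a
# query like "WHERE b = 5" on an index over (a, b), enumerate the
# distinct values of the omitted leading column a and probe (a, 5) for
# each, rather than scanning every index entry.

def mdam_scan(index, b_value):
    """index is a sorted list of (a, b) tuples; returns matching rows."""
    results, pos = [], 0
    while pos < len(index):
        a = index[pos][0]                      # next distinct value of a
        lo = bisect_left(index, (a, b_value))  # probe (a, b_value) ...
        hi = bisect_right(index, (a, b_value))
        results.extend(index[lo:hi])           # ... collecting the full range
        # skip past every remaining entry with this value of a
        pos = bisect_right(index, (a, float("inf")))
    return results

index = [(1, 4), (1, 5), (2, 3), (2, 5), (2, 7), (3, 5)]
print(mdam_scan(index, 5))  # [(1, 5), (2, 5), (3, 5)]
```

Note that every decision here is derived from the scan keys and the
index contents alone -- which is exactly what the skip scan I describe
next does not have.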
Skip scan would usually be used for cutoffs whose information is not
stored in the index itself, such as "SELECT user_id FROM orders GROUP
BY user_id HAVING COUNT(*) > 10", where the scan would go through the
user_id index and skip to the next user_id value once it has seen
enough rows for a matching result (where "enough" is determined above
the index AM's plan node, or is otherwise impossible to determine with
only the scan key info in the index AM). I'm not sure how this could
work without specifically adding skip scan-related index AM
functionality, and I don't see how it fits in with this MDAM/SAOP
system.

> [...]
>
> Thoughts?

MDAM seems to require exponential storage for "scan key operations"
for conditions on N columns (to be precise, the product of the number
of distinct conditions on each column); e.g. an index on mytable
(a,b,c,d,e,f,g,h) with conditions "a IN (1, 2) AND b IN (1, 2) AND ...
AND h IN (1, 2)" would require 2^8 entries. If 4 conditions were used
for each column, that'd be 4^8, etc. With an index column limit of 32,
that's quite a lot of memory potentially needed to execute the
statement.

So this raises the question: does this patch have the same issue? Does
it fail with OOM, does it gracefully fall back to the old behaviour
when the clauses are too complex to linearize/compose/fold into the
btree ordering clauses, or are scan keys constructed dynamically,
using just-in-time or generator patterns?

Kind regards,

Matthias van de Meent
Neon (https://neon.tech/)
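PS: for reference, the back-of-the-envelope arithmetic behind the
storage concern above, in trivial Python (just the product of
per-column condition counts, nothing patch-specific):

```python
from math import prod

# Number of scan key combinations MDAM would have to materialize up
# front if it enumerates the full cartesian product: the product of
# the number of distinct conditions on each indexed column.

def mdam_combinations(conds_per_column):
    return prod(conds_per_column)

print(mdam_combinations([2] * 8))   # 2^8  = 256
print(mdam_combinations([4] * 8))   # 4^8  = 65536
print(mdam_combinations([4] * 32))  # 4^32 = 2^64, clearly too many
```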