On 03/01/2025 21:43, Peter Geoghegan wrote:
> The newly revised "skipskip" optimization seems to get the regressions
> down to only a 5% - 10% increase in runtime across a wide variety of
> unsympathetic cases -- I'm now validating performance against a test
> suite based on the adversarial cases presented by Masahiro Ikeda on
> this thread. Although I think that I'll end up tuning the "skipskip"
> mechanism some more (I may have been too conservative in marginal
> cases that actually do benefit from skipping), I deem these
> regressions to be acceptable. They're only seen in the most
> unsympathetic cases, where an omitted leading column has groupings of
> no more than about 50 index tuples, making skipping pretty hopeless.
>
> On my laptop, this is the worst case I could come up with:
>
> create table skiptest as select g / 10 as a, g%10 as b from
> generate_series(1, 10000000) g;
> vacuum freeze skiptest;
> create index on skiptest (a, b);
> set enable_seqscan=off; set max_parallel_workers_per_gather=0;
> \timing on
>
> After repeating a few times, to warm the cache:
>
> postgres=# select count(*) from skiptest where b=1;
>   count
> ---------
>  1000000
> (1 row)
>
> Time: 202.501 ms
>
> And after 'set skipscan_prefix_cols=0':
>
> select count(*) from skiptest where b=1;
>   count
> ---------
>  1000000
> (1 row)
>
> Time: 164.762 ms
>
> EXPLAIN ANALYZE confirms that it uses an Index Only scan in both cases.
>
> I knew from the outset that the hardest part of this project would be
> avoiding regressions in highly unsympathetic cases. The regressions
> that are still there seem very difficult to minimize any further; the
> overhead that remains comes from the simple need to maintain the
> scan's skip arrays once per page, before leaving the page. Once a scan
> decides to apply the "skipskip" optimization, it tends to stick with
> it for all future leaf pages -- leaving only the overhead of checking
> the high key while advancing the scan's arrays. I've cut just about
> all that I can reasonably cut from the hot code paths that are at
> issue with the regressed cases.
Hmm, looking at the code and profiling with perf, a lot of code is
executed per tuple: function calls, argument passing, flag checks, etc.
I suspect you could shave off some cycles by structuring the code in a
more compiler- and branch-prediction-friendly way, or just with more
aggressive inlining. I'm not sure what exactly to suggest, but the code
of _bt_readpage() and all the subroutines it calls is complicated.
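To illustrate the general idea (this is a made-up sketch, not the actual
nbtree code -- the names and data layout here are invented for
illustration), forcing the tiny per-tuple helper inline and annotating
the rarely-taken branch can keep the hot loop tight:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helpers, similar in spirit to Postgres's likely()/unlikely() */
#define likely(x)   __builtin_expect((x) != 0, 1)
#define unlikely(x) __builtin_expect((x) != 0, 0)

/* Forced inlining avoids per-tuple call overhead in the hot loop */
static inline bool
key_matches(int tupval, int keyval)
{
    return tupval == keyval;
}

/* Count matching entries in a sorted run; the "past the key" exit is rare */
static int
count_matches(const int *tuples, size_t ntuples, int keyval)
{
    int nmatches = 0;

    for (size_t i = 0; i < ntuples; i++)
    {
        if (likely(key_matches(tuples[i], keyval)))
            nmatches++;
        else if (unlikely(tuples[i] > keyval))
            break;              /* rare: sorted input has moved past the key */
    }
    return nmatches;
}
```

With the branch hints and the guaranteed inlining, the compiler can lay
out the loop so the common "tuple matches" path falls through without a
call or a mispredicted jump.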
Aside from the performance of those routines, the code doesn't feel very
intuitive. _bt_checkkeys() not only checks the current keys, it can
also read ahead to later tuples on the same page and update the array keys.
One little thing I noticed while stepping through with a debugger is that it calls
index_getattr() twice for the same tuple and attribute. First in
_bt_check_compare(), and if that sets *continuescan=false, again in
_bt_tuple_before_array_skeys(). The first index_getattr() call is
certainly pretty expensive because that's where you get the cache miss
on the tuple when scanning. Not sure if the second call matters much,
but it feels like a bad smell.
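The usual cure is to fetch the attribute once and hand the value to both
checks. A schematic sketch (the types and function names here are
invented, not the nbtree API) of the difference:

```c
#include <assert.h>
#include <stdbool.h>

/* Schematic stand-ins; not the actual Postgres structures */
typedef struct { int a; } FakeTuple;

static int nfetches = 0;        /* counts expensive attribute fetches */

static int
get_attr(const FakeTuple *tup)  /* stands in for index_getattr() */
{
    nfetches++;                 /* pretend this is the cache-missing load */
    return tup->a;
}

/* Anti-pattern: each check fetches the attribute independently */
static bool
check_twice(const FakeTuple *tup, int key)
{
    if (get_attr(tup) == key)
        return true;
    return get_attr(tup) < key; /* second fetch of the same attribute */
}

/* Better: fetch once, reuse the value in both comparisons */
static bool
check_once(const FakeTuple *tup, int key)
{
    int val = get_attr(tup);

    if (val == key)
        return true;
    return val < key;
}
```

Even if the second fetch usually hits cache, passing the already-fetched
value along keeps the slow path from re-doing the tuple deforming work.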
The comment on _bt_tuple_before_array_skeys() says:
* readpagetup callers must only call here when _bt_check_compare already set
* continuescan=false. We help these callers deal with _bt_check_compare's
* inability to distinguish between the < and > cases (it uses equality
* operator scan keys, whereas we use 3-way ORDER procs)
That raises the question: why doesn't _bt_check_compare() call the 3-way
ORDER proc in the first place? That would avoid the overhead of the
second call here.
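The point being that a single 3-way comparison already distinguishes all
three outcomes, whereas an equality-operator check needs a second
comparison to learn which side a non-matching tuple falls on. A toy
sketch over plain ints (the names are made up; Postgres's actual ORDER
support procs work through fmgr calls):

```c
#include <assert.h>

/* 3-way ORDER-style comparison: negative, zero, or positive */
static int
cmp3(int tupval, int keyval)
{
    return (tupval > keyval) - (tupval < keyval);
}

typedef enum { TUP_BEFORE, TUP_MATCH, TUP_AFTER } TupPos;

/* One 3-way call classifies the tuple; no follow-up comparison needed */
static TupPos
classify(int tupval, int keyval)
{
    int c = cmp3(tupval, keyval);

    if (c < 0)
        return TUP_BEFORE;
    if (c == 0)
        return TUP_MATCH;
    return TUP_AFTER;
}
```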
Aside from these micro-optimizations, I wonder about the "look-ahead"
logic in _bt_checkkeys_look_ahead. It feels pretty simplistic. Could you
use Exponential Search
(https://en.wikipedia.org/wiki/Exponential_search) instead of a plain
binary search on the page?
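For concreteness, the galloping pattern would look roughly like this
over a plain sorted array (a generic sketch, not code operating on
actual index pages): double the probe distance from the current
position until you overshoot, then binary-search only the narrowed
window. That costs O(log d) comparisons when the target is d elements
away, instead of O(log n) for the whole page.

```c
#include <assert.h>
#include <stddef.h>

/* Return the first index i in sorted arr[0..n) with arr[i] >= key,
 * using exponential ("galloping") search. */
static size_t
exp_search_lower_bound(const int *arr, size_t n, int key)
{
    size_t lo = 0;
    size_t hi = 1;

    /* Gallop: double the probe distance until we overshoot the key */
    while (hi < n && arr[hi] < key)
    {
        lo = hi;
        hi *= 2;
    }
    if (hi > n)
        hi = n;

    /* Standard binary search within the narrowed window [lo, hi) */
    while (lo < hi)
    {
        size_t mid = lo + (hi - lo) / 2;

        if (arr[mid] < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}
```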
--
Heikki Linnakangas
Neon (https://neon.tech)