Hi,
here are some benchmark results for GiST patch:
https://gist.github.com/mngr777/88ae200c9c30ba5656583d92e8d2cf9e
Code used for benchmarking: https://github.com/mngr777/pg_index_bm2,
see README for the list of test queries.
Results show query performance close to indexes built with no
sortsupport function, with index creation time reduced approx. by half.
On 1/8/22 10:20 PM, Aliaksandr Kalenik wrote:
After further testing, here is v2 where the issue that rightlink can be
set when an index page is already flushed is fixed.
On Sat, Dec 25, 2021 at 4:35 PM Andrey Borodin <x4...@yandex-team.ru
<mailto:x4...@yandex-team.ru>> wrote:
Hi Aliaksandr!
Thanks for working on this!
> Benchmark summary:
>
> create index roads_rdr_idx on roads_rdr using gist (geom);
>
> with sort support before patch / CREATE INDEX 76.709 ms
>
> with sort support after patch / CREATE INDEX 225.238 ms
>
> without sort support / CREATE INDEX 446.071 ms
>
> select count(*) from roads_rdr a, roads_rdr b where a.geom && b.geom;
>
> with sort support before patch / SELECT 5766.526 ms
>
> with sort support after patch / SELECT 2646.554 ms
>
> without sort support / SELECT 2721.718 ms
>
> index size
>
> with sort support before patch / IDXSIZE 2940928 bytes
>
> with sort support after patch / IDXSIZE 4956160 bytes
>
> without sort support / IDXSIZE 5447680 bytes
The numbers are impressive, newly build index is actually performing
better!
I've conducted some tests over points, not PostGIS geometry. For
points build is 2x slower now, but this is the cost of faster scans.
Some strong points of this index building technology.
The proposed algorithm is not randomly chosen as anything that
performs better than the original sorting build. We actually
researched every idea we knew from literature and intuition.
Although we consciously limited the search area by existing GiST API.
Stuff like penalty-based choose-subtree and split in equal halves
seriously limit possible solutions. If anyone knows an any better
way to build GiST faster or with better scan performance - please
let us know.
The proposed algorithm contains the current algorithm as a special
case: there is a parameter - the number of buffers accumulated
before calling Split. If this parameter is 1 proposed algorithm will
produce exactly the same output.
At this stage, we would like to hear some feedback from Postgres and
PostGIS community. What other performance aspects should we test?
Current patch implementation has some known deficiencies:
1. We need a GUC instead of the hard-coded buffer of 8 pages.
2. Is GiST sorting build still deterministic? If not - we should add
a fixed random seed into pageinspect tests.
3. It would make sense to check the resulting indexes with amcheck
[0], although it's not committed.
4. We cannot make an exact fillfactor due to the behavior of
picksplit. But can we improve anything here? I think if not - it's
still OK.
5. GistSortedBuildPageState is no more page state. It's Level state
or something like that.
6. The patch desperately needs comments.
Thanks!
Best regards, Andrey Borodin.
[0]
https://www.postgresql.org/message-id/flat/59D0DA6B-1652-4D44-B0EF-A582D5824F83%40yandex-team.ru
<https://www.postgresql.org/message-id/flat/59D0DA6B-1652-4D44-B0EF-A582D5824F83%40yandex-team.ru>