On Thu, 2 May 2024 at 17:19, Tomas Vondra <tomas.von...@enterprisedb.com> wrote:
>
> Hi,
>
> In PG17 we shall have parallel CREATE INDEX for BRIN indexes, and back
> when working on that I was thinking how difficult it would be to do
> something similar for other index types, like GIN. I even had that on
> my list of ideas to pitch to potential contributors, as I was fairly
> sure it's doable and reasonably isolated / well-defined.
>
> However, I was not aware of any takers, so a couple days ago on a slow
> weekend I took a stab at it. And yes, it's doable - attached is a fairly
> complete, tested and polished version of the feature, I think. It turned
> out to be a bit more complex than I expected, for reasons that I'll get
> into when discussing the patches.

This is great. I've been thinking about approximately the same issue
recently, too, but haven't had time to discuss/implement any of this
yet. I think some solutions may even be portable to the btree parallel
build: it also has key deduplication (though to a much smaller degree)
and could benefit from deduplication during the scan/ssup load phase,
rather than only during insertion.

> First, let's talk about the benefits - how much faster is that than the
> single-process build we have for GIN indexes? I do have a table with the
> archive of all our mailing lists - it's ~1.5M messages, table is ~21GB
> (raw dump is about 28GB). This does include simple text data (message
> body), JSONB (headers) and tsvector (full-text on message body).

Sidenote: Did you include the tsvector in the table to reduce time
spent during index creation? I would have used an expression in the
index definition, rather than a direct column.

> If I do CREATE INDEX with different numbers of workers (0 means serial
> build), I get these timings (in seconds):

[...]

> This shows the benefits are pretty nice, depending on the opclass. For
> most indexes it's maybe ~3-4x faster, which is nice, and I don't think
> it's possible to do much better - the actual index inserts can happen
> from a single process only, which is the main limit.

Can we really not insert with multiple processes? It seems to me that
GIN could be very suitable for that, with its clear distinction between
the entry tree and the posting trees, which should result in few buffer
conflicts if different backends work on known-to-be-very-different keys.
We'd probably need multiple read heads on the shared tuplesort, and a
way to join the generated top-level subtrees, but I don't think that is
impossible. Maybe it's work for a later effort, though.
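
To make the idea concrete, a strawman (all names are hypothetical,
nothing like this exists today): give each worker a disjoint key range
of the shared sort, let it build a private subtree, and have the leader
graft the finished subtrees into the entry tree afterwards.

    /*
     * Hypothetical sketch only -- none of these helpers exist.  A
     * worker consumes just its own key range from the shared sort,
     * so subtrees built by different workers cannot conflict.
     */
    static void
    worker_build_subtree(Tuplesortstate *shared_sort, GinKeyRange *range)
    {
        GinTuple   *tup;

        /* read_next_gin_tuple() returns NULL past the range's end */
        while ((tup = read_next_gin_tuple(shared_sort, range)) != NULL)
            insert_into_private_subtree(tup);   /* no root contention */

        /*
         * The leader later fixes up downlinks and sibling pointers
         * when it links each finished subtree under the entry tree.
         */
    }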

Have you tested and/or benchmarked this with multi-column GIN indexes?

> For some of the opclasses it can regress (like the jsonb_path_ops). I
> don't think that's a major issue. Or more precisely, I'm not surprised
> by it. It'd be nice to be able to disable the parallel builds in these
> cases somehow, but I haven't thought about that.

Do you know why it regresses?

> I do plan to do some tests with btree_gin, but I don't expect that to
> behave significantly differently.
>
> There are small variations in the index size, when built in the serial
> way and the parallel way. It's generally within ~5-10%, and I believe
> it's due to the serial build adding the TIDs incrementally, while the
> build adds them in much larger chunks (possibly even in one chunk with
> all the TIDs for the key).

I assume that was '[...] while the [parallel] build adds them [...]', right?

> I believe the same size variation can happen
> if the index gets built in a different way, e.g. by inserting the data
> in a different order, etc. I did a number of tests to check if the index
> produces the correct results, and I haven't found any issues. So I think
> this is OK, and neither a problem nor an advantage of the patch.
>
>
> Now, let's talk about the code - the series has 7 patches, with 6
> non-trivial parts doing changes in focused and easier to understand
> pieces (I hope so).

The following comments are generally based on the descriptions; I
haven't really looked much at the patches yet except to validate some
assumptions.

> 1) v20240502-0001-Allow-parallel-create-for-GIN-indexes.patch
>
> This is the initial feature, adding the "basic" version, implemented as
> pretty much 1:1 copy of the BRIN parallel build and minimal changes to
> make it work for GIN (mostly about how to store intermediate results).
>
> The basic idea is that the workers do the regular build, but instead of
> flushing the data into the index after hitting the memory limit, it gets
> written into a shared tuplesort and sorted by the index key. And the
> leader then reads this sorted data, accumulates the TID for a given key
> and inserts that into the index in one go.

In the code, GIN insertions are still basically single btree
insertions, all starting from the top (but with many same-valued
tuples at once). Now that we have a tuplesort with the full table's
data, couldn't the code be adapted to do more efficient btree loading,
such as that seen in the nbtree code, where the rightmost pages are
cached and filled sequentially without requiring repeated searches
down the tree? I suspect we can gain a lot of time there.
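
Something like the following is what I have in mind (only the buffer
manager calls are real PostgreSQL API; the GIN-side helpers are made
up):

    /*
     * Sketch: the tuplesort hands us entries in index key order, so
     * the target leaf is almost always the page we touched last.
     * Keep it pinned and locked instead of descending from the root
     * for every key.  Helper functions are hypothetical.
     */
    static Buffer cached_leaf = InvalidBuffer;

    static void
    append_sorted_entry(Relation index, IndexTuple itup)
    {
        if (BufferIsValid(cached_leaf) &&
            entry_fits_after_last_item(cached_leaf, itup))
        {
            add_entry_to_page(cached_leaf, itup);   /* no descent */
            return;
        }

        /* page full (or first call): move to a fresh rightmost leaf */
        if (BufferIsValid(cached_leaf))
            UnlockReleaseBuffer(cached_leaf);
        cached_leaf = start_new_rightmost_leaf(index, itup);
        add_entry_to_page(cached_leaf, itup);
    }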

I don't need you to do that, but what's your opinion on this?

> 2) v20240502-0002-Use-mergesort-in-the-leader-process.patch
>
> The approach implemented by 0001 works, but there's a bit of an
> issue - if there are many distinct keys (e.g. for trigrams that can
> happen very easily), the workers will hit the memory limit with only
> very short TID lists for most keys. For serial build that means merging
> the data into a lot of random places, and in parallel build it means the
> leader will have to merge a lot of tiny lists from many sorted rows.
>
> Which can be quite annoying and expensive, because the leader does so
> using qsort() in the serial part. It'd be better to ensure most of the
> sorting happens in the workers, and the leader can do a mergesort. But
> the mergesort must not happen too often - merging many small lists is
> not cheaper than a single qsort (especially when the lists overlap).
>
> So this patch changes the workers to process the data in two phases. The
> first works as before, but the data is flushed into a local tuplesort.
> And then each worker sorts the results it produced, and combines them
> into results with much larger TID lists, and those results are written
> to the shared tuplesort. So the leader only gets very few lists to
> combine for a given key - usually just one list per worker.

Hmm, I was hoping we could implement the merging inside the tuplesort
itself during its own flush phase, as it could save significantly on
IO, and could help other users of tuplesort with deduplication, too.
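
As a strawman, with the caveat that no such hook exists in tuplesort
today (SortTuple is the real internal struct; everything else here is
made up):

    /*
     * A flush-time filter that tuplesort would invoke for each tuple
     * while writing a sorted run, letting us merge the posting lists
     * of adjacent tuples with equal keys so that duplicates never
     * reach the tape.
     */
    static SortTuple pending;           /* last tuple not yet written */
    static bool      have_pending = false;

    static void
    gin_flush_filter(Tuplesortstate *state, SortTuple *stup)
    {
        if (have_pending && gin_keys_equal(&pending, stup))
        {
            merge_posting_lists(&pending, stup);    /* absorb it */
            return;
        }
        if (have_pending)
            write_tuple_to_run(state, &pending);    /* hypothetical */
        pending = *stup;
        have_pending = true;
    }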

> 3) v20240502-0003-Remove-the-explicit-pg_qsort-in-workers.patch
>
> In 0002 the workers still do an explicit qsort() on the TID list before
> writing the data into the shared tuplesort. But we can do better - the
> workers can do a merge sort too. To help with this, we add the first TID
> to the tuplesort tuple, and sort by that too - it helps the workers to
> process the data in an order that allows simple concatenation instead of
> the full mergesort.
>
> Note: There's a non-obvious issue due to parallel scans always being
> "sync scans", which may lead to very "wide" TID ranges when the scan
> wraps around. More about that later.

As this note seems to imply, this has a strong built-in assumption
that the data received in parallel workers is always in TID order,
with at most one wraparound. Non-heap TAMs may break that assumption,
so what's the plan on that?
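
For reference, the invariant I'd expect the concatenation shortcut to
rely on, and the fallback a TAM-agnostic version needs (only the
ItemPointer calls are real API; merge_item_pointers() is hypothetical,
and 'acc' is assumed large enough to hold both lists):

    #include "postgres.h"
    #include "storage/itemptr.h"        /* ItemPointerCompare() */

    /*
     * Combine the next sorted TID list for the same key into the
     * accumulated one.  If the previous list ends before the next
     * one starts, appending is enough; otherwise (sync-scan
     * wraparound, or a TAM that isn't TID-ordered) we must fall
     * back to a real merge.
     */
    static void
    combine_tid_lists(ItemPointerData *acc, int *nacc,
                      ItemPointerData *next, int nnext)
    {
        if (*nacc == 0 ||
            ItemPointerCompare(&acc[*nacc - 1], &next[0]) < 0)
        {
            /* non-overlapping: simple concatenation */
            memcpy(&acc[*nacc], next, nnext * sizeof(ItemPointerData));
            *nacc += nnext;
        }
        else
            merge_item_pointers(acc, nacc, next, nnext);
    }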

> 4) v20240502-0004-Compress-TID-lists-before-writing-tuples-t.patch
>
> The parallel build passes data between processes using temporary files,
> which means it may need significant amount of disk space. For BRIN this
> was not a major concern, because the summaries tend to be pretty small.
>
> But for GIN that's not the case, and the two-phase processing introduced
> by 0002 makes it worse, because the worker essentially creates another
> copy of the intermediate data. It does not need to copy the key, so
> maybe it's not exactly 2x the space requirement, but in the worst case
> it's not far from that.
>
> But there's a simple way to improve this - the TID lists tend to be
> very compressible, and GIN already implements a very light-weight TID
> compression, so this patch does just that - when building the tuple to
> be written into the tuplesort, we just compress the TIDs.

See note on 0002: Could we do this in the tuplesort writeback, rather
than by moving the data around multiple times?
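
For what it's worth, I'd expect this to just reuse the varbyte encoder
from ginpostinglist.c, roughly like this (ginCompressPostingList() and
SizeOfGinPostingList() are the existing GIN API; the GinTuple helpers
and the segment loop are my guess at the patch's shape):

    #include "postgres.h"
    #include "access/gin_private.h"     /* ginCompressPostingList() */

    /*
     * Encode a sorted TID list into GIN's compressed posting list
     * segments before the tuple is written to the tuplesort.
     */
    static void
    write_compressed_tids(Tuplesortstate *sort, GinTuple *gtup,
                          ItemPointerData *tids, int ntids)
    {
        while (ntids > 0)
        {
            int             nwritten;
            GinPostingList *seg;

            /* encode as many TIDs as fit into one bounded segment */
            seg = ginCompressPostingList(tids, ntids, BLCKSZ, &nwritten);
            append_segment_to_tuple(gtup, seg, SizeOfGinPostingList(seg));
            pfree(seg);
            tids += nwritten;
            ntids -= nwritten;
        }
        write_gin_tuple(sort, gtup);    /* hypothetical */
    }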

[...]
> So 0007 does something similar - it tracks if the TID value goes
> backward in the callback, and if it does it dumps the state into the
> tuplesort before processing the first tuple from the beginning of the
> table. Which means we end up with two separate "narrow" TID lists, not
> one very wide one.

See note above: we may still need a merge phase, just to make sure we
handle all TAM parallel scans correctly, even if that merge phase
wouldn't get hit in vanilla PostgreSQL.
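
As I understand 0007, the detection itself is cheap (the callback
signature is the stock IndexBuildCallback; the buildstate field and
the dump helper are approximate):

    /*
     * Track the previous TID in the build callback.  With a heap
     * sync scan the sequence goes backwards exactly once, at the
     * wraparound, and the accumulated state is flushed there so
     * each dump stays "narrow".  A TAM whose parallel scan isn't
     * even piecewise TID-ordered would trip this repeatedly, hence
     * the need to keep a real merge as a fallback.
     */
    static void
    gin_build_callback(Relation index, ItemPointer tid, Datum *values,
                       bool *isnull, bool tupleIsAlive, void *state)
    {
        GinBuildState *buildstate = (GinBuildState *) state;

        if (ItemPointerIsValid(&buildstate->prev_tid) &&
            ItemPointerCompare(tid, &buildstate->prev_tid) < 0)
            dump_accumulated_entries(buildstate);   /* approximate */
        buildstate->prev_tid = *tid;

        /* ... accumulate (values, tid) as before ... */
    }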

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

