On Thu, 9 May 2024 at 15:13, Tomas Vondra <tomas.von...@enterprisedb.com> wrote:
> Let me explain the relevant part of the patch, and how I understand the
> improvement suggested by Matthias. The patch does the work in three phases:
>
>
> 1) Worker gets data from table, split that into index items and add
> those into a "private" tuplesort, and finally sorts that. So a worker
> may see a key many times, with different TIDs, so the tuplesort may
> contain many items for the same key, with distinct TID lists:
>
>    key1: 1, 2, 3, 4
>    key1: 5, 6, 7
>    key1: 8, 9, 10
>    key2: 1, 2, 3
>    ...

This step is actually split in several components/phases, too.
As opposed to btree, which directly puts each tuple's data into the
tuplesort, this GIN approach actually buffers the tuples in memory to
generate these TID lists for data keys, and flushes these pairs of Key
+ TID list into the tuplesort when its own memory limit is exceeded.
That means we essentially double the memory used for this data: One
GIN deform buffer, and one in-memory sort buffer in the tuplesort.
This is fine for now, but feels duplicative, hence my "let's allow
tuplesort to merge the key+TID pairs into pairs of key+TID list"
comment.

> The trouble with (2) is that it "just copies" data from one tuplesort
> into another, increasing the disk space requirements. In an extreme
> case, when nothing can be combined, it pretty much doubles the amount of
> disk space, and makes the build longer.
>
> What I think Matthias is suggesting, is that this "TID list merging"
> could be done directly as part of the tuplesort in step (1). So instead
> of just moving the "sort tuples" from the appropriate runs, it could
> also do an optional step of combining the tuples and writing this
> combined tuple into the tuplesort result (for that worker).

Yes, but with a slightly more extensive approach than that even, see above.

> Matthias also mentioned this might be useful when building btree indexes
> with key deduplication.
>
> AFAICS this might work, although it probably requires for the "combined"
> tuple to be smaller than the sum of the combined tuples (in order to fit
> into the space).

*must not be larger than the sum; not "must be smaller than the sum" [^0].
For btree tuples with posting lists this is guaranteed to be true: The
added size of a btree tuple with a posting list (containing at least 2
values) vs one without is the maxaligned size of 2 TIDs, or 16 bytes
(12 on 32-bit systems). The smallest btree tuple with data is also 16
bytes (or 12 bytes on 32-bit systems), so this works out nicely.

> But at least in the GIN build in the workers this is
> likely true, because the TID lists do not overlap (and thus not hurting
> the compressibility).
>
> That being said, I still see this more as an optimization than something
> required for the patch,

Agreed.

> and I don't think I'll have time to work on this
> anytime soon. The patch is not extremely complex, but it's not trivial
> either. But if someone wants to take a stab at extending tuplesort to
> allow this, I won't object ...

Same here: While I do have some ideas on where and how to implement
this, I'm not planning on working on that soon.


Kind regards,

Matthias van de Meent

[^0] There's some overhead in the tuplesort serialization too, so
there is some leeway there, too.


Reply via email to