Re: Parallel CREATE INDEX for GIN indexes

Tomas Vondra Sun, 07 Jul 2024 09:26:25 -0700

On 7/5/24 21:45, Matthias van de Meent wrote:
> On Wed, 3 Jul 2024 at 20:36, Matthias van de Meent
> <boekewurm+postg...@gmail.com> wrote:
>>
>> On Mon, 24 Jun 2024 at 02:58, Tomas Vondra
>> <tomas.von...@enterprisedb.com> wrote:
>>> So I've switched this to use the regular data-type comparisons, with
>>> SortSupport etc. There's a bit more cleanup remaining and testing
>>> needed, but I'm not aware of any bugs.
> 
> I've hit assertion failures in my testing of the combined patches, in
> AssertCheckItemPointers: it assumes it's never called when the buffer
> is empty and uninitialized, but that's wrong: we don't initialize the
> items array until the first tuple, which will cause the assertion to
> fire. By updating the first 2 assertions of AssertCheckItemPointers, I
> could get it working.
>


Yeah, sorry. I think I broke this assert while doing the recent
cleanups. The assert used to be called only when doing a sort, and then
it certainly can't be empty. But then I moved it to be called from the
generic GinTuple assert function, and that broke this assumption.

>> ---
>>> +++ b/src/backend/utils/sort/tuplesortvariants.c
>>
>> I was thinking some more about merging tuples inside the tuplesort. I
>> realized that this could be implemented by allowing buffering of tuple
>> writes in writetup. This would require adding a flush operation at the
>> end of mergeonerun to store the final unflushed tuple on the tape, but
>> that shouldn't be too expensive. This buffering, when implemented
>> through e.g. a GinBuffer in TuplesortPublic->arg, could allow us to
>> merge the TID lists of same-valued GIN tuples while they're getting
>> stored and re-sorted, thus reducing the temporary space usage of the
>> tuplesort by some amount with limited overhead for other
>> non-deduplicating tuplesorts.
>>
>> I've not yet spent the time to get this to work though, but I'm fairly
>> sure it'd use less temporary space than the current approach with the
>> 2 tuplesorts, and could have lower overall CPU overhead as well
>> because the number of sortable items gets reduced much earlier in the
>> process.
> 
> I've now spent some time on this. Attached the original patchset, plus
> 2 incremental patches, the first of which implement the above design
> (patch no. 8).
> 
> Local tests show it's significantly faster: for the below test case
> I've seen reindex time reduced from 777455ms to 626217ms, or ~20%
> improvement.
> 
> After applying the 'reduce the size of GinTuple' patch, index creation
> time is down to 551514ms, or about 29% faster total. This all was
> tested with a fresh stock postgres configuration.
> 
> """
> CREATE UNLOGGED TABLE testdata
> AS SELECT sha256(i::text::bytea)::text
>     FROM generate_series(1, 15000000) i;
> CREATE INDEX ON testdata USING gin (sha256 gin_trgm_ops);
> """
> 

Those results look really promising. I certainly would not have expected
such improvements - 20-30% speedup on top of the already parallel run
seems great. I'll do a bit more testing to see how much this helps for
the "more realistic" data sets.

I can't say I 100% understand how the new stuff in tuplesortvariants.c
works, but that's mostly because my knowledge of tuplesort internals is
fairly limited (and I managed to survive without that until now).

What makes me a bit unsure/uneasy is that this seems to "inject" custom
code fairly deep into the tuplesort logic. I'm not quite sure if this is
a good idea ...


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Parallel CREATE INDEX for GIN indexes

Reply via email to