Re: WIP: parallel GiST index builds

Tomas Vondra Sun, 21 Jul 2024 13:42:38 -0700

On 7/21/24 21:31, Andrey M. Borodin wrote:
> Hi Tomas!
> 
>> On 7 Jun 2024, at 20:41, Tomas Vondra
>> <tomas.von...@enterprisedb.com> wrote:
>> 
>> After looking into parallel builds for BRIN and GIN indexes, I was 
>> wondering if there's a way to do parallel builds for GiST too. I
>> knew next to nothing about how GiST works, but I gave it a shot and
>> here's what I have - the attached patch allows parallel GiST builds
>> for the "unsorted" case (i.e. when the opclass does not include
>> sortsupport), and does not support buffered builds.
> 
> I think this totally makes sense. I've took a look into tuples
> partitioning (for sorted build) in your Github and I see that it's
> complicated feature. So, probably, we can do it later. I'm trying to
> review the patch as it is now. Currently I have some questions about
> code.


OK. I'm not even sure partitioning is the right approach for sorted
builds. Or how to do it, exactly.

> 
> 1. Do I get it right that is_parallel argument for gistGetFakeLSN()
> is only needed for assertion? And this assertion can be ensured just
> by inspecting code. Is it really necessary?

Yes, in the patch it's only used for an assert. But it's actually
incorrect - just as I speculated in my initial message (in the section
about gistGetFakeLSN), it sometimes fails to detect a page split. I
noticed that while testing the patch adding GiST to amcheck, and I
reported that in that thread [1]. But I apparently forgot to post an
updated version of this patch :-(

I'll post a new version tomorrow, but in short it needs to update the
fake LSN even if (lastlsn != currlsn) if is_parallel=true. It's a bit
annoying this means we generate a new fake LSN on every call, and I'm
not sure that's actually necessary. But I've been unable to come up with
a better condition when to generate a new LSN.

[1]
https://www.postgresql.org/message-id/79622955-6d1a-4439-b358-ec2b6a9e7cbf%40enterprisedb.com

> 2. gistBuildParallelCallback() updates indtuplesSize, but it seems to be
> not used anywhere. AFAIK it's only needed to buffered build. 3. I
> think we need a test that reliably triggers parallel and serial
> builds.
> 

Yeah, it's possible the variable is unused. Agreed on the testing.

> As far as I know, there's a well known trick to build better GiST
> over PostGIS data: randomize input. I think parallel scan is just
> what is needed, it will shuffle tuples enough...
> 

I'm not sure I understand this comment. What do you mean by "better
GiST" or what does that mean for this patch?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: WIP: parallel GiST index builds

Reply via email to