On 01/17/2016 10:03 PM, Jeff Janes wrote:
On Fri, Jan 15, 2016 at 3:29 PM, Peter Geoghegan <p...@heroku.com> wrote:
On Fri, Jan 15, 2016 at 2:38 PM, Constantin S. Pan <kva...@gmail.com> wrote:
I have a draft implementation which divides the whole process between
N parallel workers, see the patch attached. Instead of a full scan of
the relation, I give each worker a range of blocks to read.

I am currently working on a patch that allows B-Tree index builds to
be performed in parallel. I think I'm a week or two away from posting

Even without parallelism, wouldn't it be better if GIN indexes were
built using tuplesort? I know way way less about the gin am than the
nbtree am, but I imagine that a prominent cost for GIN index builds is
constructing the main B-Tree (the one that's constructed over key
values) itself. Couldn't tuplesort.c be adapted to cover this case?
That would be much faster in general, particularly with the recent
addition of abbreviated keys, while also leaving a clear path forward
to performing the build in parallel.

I think it would take a lot of changes to tuple sort to make this be a
almost-always win.

In the general case each GIN key occurs in many tuples, and the
in-memory rbtree is good at compressing the tid list for each key to
maximize the amount of data that can be in memory (although perhaps it
could be even better, as it doesn't use varbyte encoding on the tid
list the way the posting lists on disk do--it could do so in the bulk
build case, where it receives the tid in order, but not feasibly in
the pending-list clean-up case, where it doesn't get them in order)

When I was testing building an index on a column of text identifiers
where there were a couple million identifiers occurring a few dozen
times each, building it as a gin index (using btree_gin so that plain
text columns could be indexed) was quite a bit faster than building it
as a regular btree index using tuple sort.  I didn't really
investigate this difference, but I assume it was due to the better
memory usage leading to less IO.

I've been wondering about this too, using tuplesort to build GIN indexes, for a long time. Surely the highly-optimized quicksort+merge code in tuplesort.c is faster than maintaining a red-black tree? Or so I thought.

I wrote a quick prototype of using tuplesort.c for GIN index build. I tested it with a 500 MB table of integer arrays, so that the sort fit completely in memory. It's a lot slower than the current code. It turns out eliminating the duplicates early is really really important.

So we want to keep the red-black tree, to eliminate the duplicates. Or add that capability to tuplesort.c, which might speed up Sort+Unique and aggregates in general, but that's a big effort.

However, I still wonder, if we shouldn't use a merge approach when the tree doesn't fit in memory, like tuplesort.c does. Currently, when the tree is full, we flush it out to the index, by inserting all the entries. That can get quite expensive, I/O-wise. It also generates more WAL, compared to writing each page only once.

If we flushed the tree to a tape instead, then we could perhaps use the machinery that Peter's parallel B-tree patch is adding to tuplesort.c, to merge the tapes. I'm not sure if that works out, but I think it's worth some experimentation.

But I do think this with aggregation would be worth it despite the
amount of work involved.  In addition to automatically benefiting from
parallel sorts and any other future sort improvements, I think it
would also generate better indexes.  I have a theory that a problem
with very large gin indexes is that repeatedly building
maintenance_work_mem worth of rbtree entries and then merging them to
disk yields highly fragmented btrees, in which logical adjacent
key-space does not occupy physically nearby leaf pages, which then can
causes problems either with access that follows a pattern (like range
scans, except gin indexes can't really do those anyway) or further
bulk operations.

Yeah, there's that too.

- Heikki

Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:

Reply via email to