[PATCHES] GIN improvements

Teodor Sigaev Fri, 30 May 2008 20:56:11 -0700


Improvements of GIN indexes were presented on PGCon 2008. Presentation:
 http://www.sigaev.ru/gin/fastinsert_and_multicolumn_GIN.pdf


1) multicolumn GIN

This patch ( http://www.sigaev.ru/misc/multicolumn_gin-0.2.gz ) adds multicolumnsupport to GIN. The basic idea is: keys (entries in GIN terminology) extractedfrom values are stored in separated tuples along with their column number. Inthat case, multicolumn clause is just AND of column's clauses. Unlike otherindexes, the performance of search doesn't depends on what column of index(first, last, any subset) is used in search clause. This property can be used ingincostestimate, but I haven't looked on it yet.


2) fast insert into GIN

This patch ( http://www.sigaev.ru/misc/fast_insert_gin-0.4.gz ) implements anidea of using bulk insert technique, which used at index creation time. Insertedrows are stored in the linked list of pending pages and inserted to the regularstructure of GIN at vacuum time. The algorithm is shown in presentation, butinsert completion process (vacuum) was significantly reworkes to improveconcurrency. Now, the list of pending page is locked much lesser time - onlyduring deletion of pages from the list.


Open item:

what is a right time to call insert completion? Currently, it is called byginbulkdelete and ginvacuumcleanup, ginvacuumcleanup will call completion ifginbulkdelete wasn't called. That's not good, but works. Completion processshould started before ginbulkdelete because ginbulkdelete doesn't look on

pending pages at all.

Since insert completion (of any index if that method will exists, I think) runsfast if number of inserted tuples is a small because it doesn't go through thewhole index, so, IMHO, the existing statistic's fields should not be changed.That idea, discussed at PGCon, is to have trigger in vacuum which will be firedif number of inserted tuples becomes big. Now I don't think that the idea isuseful for two reason: for small number of tuples completion is a cheap and itshould be called before ginbulkdelete. IMHO, it's enough to add an optionalmethod to pg_am (aminsertcleanup, per Tom's suggestion). This method will becalled before ambulkdelete and amvacuumcleanup. Opinions, objections, suggestions?


On presentation some people were interested on how our changes affect the

search speed after rows insert. The tests are below: We use the same tables asin presentation and measure search times ( after insertion of some rows ) beforeand after vacuum. All times are in ms. Test tables contain 100000 rows, in thefirst table the number of elements in array is 100 with cardinality = 500,second - 100 and 500000, last - 1000 and 500.


Insert 10000 into table with 100000 rows (10%)
                 |    v && '{v1}'   |
-----------------+---------+--------+ found
                 | novac-d |  vac-d |  rows
-----------------+---------+--------+-------
n:100,  c:500    |   118   |    35  | 19909
n:100,  c:500000 |    95   |   0.7  |    25
n:1000, c:500    |   380   |   79   | 95211


Insert 1000 into table with 100000 rows (1%)
                 |    v && '{v1}'   |
-----------------+---------+--------+ found
                 | novac-d |  vac-d |  rows
-----------------+---------+--------+-------
n:100,  c:500    |    40   |    31  | 18327
n:100,  c:500000 |    13   |   0.5  |    26
n:1000, c:500    |   102   |    71  | 87499

Insert 100 into table with 100000 rows (0.1%)
                 |    v && '{v1}'   |
-----------------+---------+--------+ found
                 | novac-d |  vac-d |  rows
-----------------+---------+--------+-------
n:100,  c:500    |    32   |    31  | 18171
n:100,  c:500000 |   1.7   |   0.5  |    20
n:1000, c:500    |    74   |    71  | 87499

Looking at result it's easy to conclude that:
 - time of search pending list is O(number of inserted rows), i.e., search time
   is equal to (time of search in GIN) + K1 * (number of inserted rows after the
   last vacuum).
 - search time is O(average length of indexed columns). Observations made above
   is also applicable here.
 - significant performance gap starts around 5-10% of inserts or near 500-1000
   inserts.  This is very depends on specific dataset.

Notice, that insert performance to GIN was increased up to 10 times. See
exact results in presentation.

Do we need to add option to control this (fast insertion) feature?
If so, what is a default value? It's not clear to me.

Note: These patches are mutually exclusive because they touch the same pieces
of code and I'm too lazy to manage several depending patches. I don't see any
problem to join patches to one, but IMHO it will be difficult to review.

--
Teodor Sigaev                                   E-mail: [EMAIL PROTECTED]
                                                   WWW: http://www.sigaev.ru/

--
Sent via pgsql-patches mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches

[PATCHES] GIN improvements

Reply via email to