full-scan cases

David E . Wheeler Fri, 14 Jan 2011 17:39:49 -0800

On Jan 8, 2011, at 9:41 PM, Tom Lane wrote:

> "David E. Wheeler" <da...@kineticode.com> writes:
>> On Jan 7, 2011, at 4:19 PM, Tom Lane wrote:
>>> Well, actually, I just committed it.  If you want to test, feel free.
>>> Note that right now only the anyarray && <@ @> operators are genuinely
>>> fixed ... I plan to hack on tsearch and contrib pretty soon though.
> 
>> Hrm, the queries I wrote for this sort of thing use intarray:
>>    WHERE blah @@ '(12|14)'::query_int
>> That's not done yet though, right?
> 
> intarray is done now, feel free to test ...


Tom,

Well, I regret to say that what I found is…all over the place.

So I have a rather wide table with two GIST indexes, one on an integer[] column 
and one on a tsvector column. I duped the table and replaced those indexes with 
GIN indexes. And the results of my testing don't make much sense.

Well, first the good news: I got no NULL-related errors at all. There are a lot 
of rows with an empty array in the integer[] column. And I got the same results 
for my queries against the table with the GIN indexes as the one with the GiST 
indexes. So all that's good.

One of the reasons our client wants GIN for the integer[] column so bad is 
because recreating the GiST integer[] index is quite painful. Before I duped 
the table, I was just dropping and recreating the index on the original table. 
It was great to create the GIN index; for 400K rows, it took 1300 ms. After my 
initial round of testing, I dropped it and created the GiST index. That ran 
for…well, *hours*. Four at least, maybe six or seven (I forgot \timing and was 
letting it run on screen while I did some iOS hacking). I think something might 
be really wrong with GiST index creation for integer arrays, because the 
difference was just appalling. 

As a sanity check, I did the same thing today with the same table(s) on their 
ts_vector columns. Creating the GiST index took just under 7 seconds; the GIN 
index took 23.4 seconds. On a second attempt, GiST took 16371 ms and GIN 30452. 
I had expected GIN to be faster here, but both are within the realms of the 
acceptable, compared to the time it took to create the GiST index on the 
integer[] column.

As for the queries, here too I was surprised. As I said, the integer[] column 
had a lot of empty arrays. And the indexes look like so:

  "idx_gist_features" gist (features) WHERE deleted_at IS NULL AND status = 1
  "idx_gist_textsearch" gist (ts_index_col) WHERE deleted_at IS NULL AND status 
= 1

Just s/gist/gin/ for the gin indexes. For the integer[] column, I ran a bunch 
of queries like so (again, just s/gist/gin/ for the gin versions):

    explain analyze SELECT count(*) FROM gist_listings
      WHERE features @@ '(1369043)'::query_int
        AND deleted_at IS NULL AND mls_status_id = 1;

This integer had pretty high selectvity, 86 out of 406K rows.
Gist: 117.444 on the first run, around 3.2 ms thereafter
GIN:  Around 325 ms on all runs

    explain analyze SELECT count(*) FROM gist_listings
      WHERE features @@ '(1368798|1369043)'::query_int
        AND deleted_at IS NULL AND mls_status_id = 1;

Rows selected: 91528.
Gist: 4030.282 ms on the first run, around 210 ms thereafter.
GIN:  4309.259 ms on the first run, around 400 ms thereafter

    explain analyze SELECT count(*) FROM gist_listings
      WHERE features @@ '(1368799&1368800&1369043)'::query_int
        AND deleted_at IS NULL AND mls_status_id = 1;

Rows selected: 91528
Gist: 1738.568 ms on the first run, around  24 ms thereafter.
GIN:  4427.517 ms on the first run, around 340 ms thereafter

These numbers are a bit crazy-making, but the upshot is that Gist is slow out 
of the gate, but with data cached, it's pretty speedy. With indexscan and 
bitmapscan disabled, these queries all took 300-400 ms. So GIN was never better 
performing than a table scan. So while GIN is a big win for re-indexing (this 
database gets its intarray Gist indexes updated quite frequently, as they get 
quited bloated in a hurry), it's not a win at all for querying.

So, thinking that there might be something funky with intarray GIN support, we 
wanted to test performance of GIST vs. GIN on the tsquery column. Here GIN was 
a much bigger win. With a query like this:

    SELECT l.* FROM gist_listings
     WHERE ts_index_col @@ to_tsquery(regexp_replace(
         plainto_tsquery('english', '1 Infinite Loop')::text,
         '''(?=[[:space:]]|$)', ''':B', 'g'
     )) and deleted_at IS NULL AND status = 1;

With zero rows returned, GIN consistently executed in 20 ms. Gist took 838.274 
for the first run and 25-30 ms on subsequent runs. So GIN is the clear winner 
here, except for index creation as noted above.

And here's one that selects a single row:

    SELECT l.* FROM gist_listings
     WHERE ts_index_col @@ to_tsquery(regexp_replace(
         plainto_tsquery('english', 'volkswagon')::text,
         '''(?=[[:space:]]|$)', ''':B', 'g'
     )) and deleted_at IS NULL AND status = 1;

GiST: 495.867 first run, 380 thereafter
GIN:  83.980 first run, 330 thereafter

Again, GIN is the clear winner here, though it's negligible when the data is in 
the cache.

So some questions:

* Is something seriously wrong with GiST index creation on integer[] columns?

* Why does GIN performance appear to be no better than table scans on integer[] 
columns?

* Why does it take 3-4x longer to create the GIN than the GiST index on 
tsvector? I thought that GIN was supposed to be faster to update

Hope this is helpful, and please do let me know if there are any other tests 
you'd like me to run against this data.

Best,

David
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Fixing GIN for empty/null/full-scan cases

Reply via email to