On Fri, Nov 15, 2013 at 6:57 PM, Rod Taylor <p...@rbt.ca> wrote:

> I tried again this morning using gin-packed-postinglists-16.patch and
> gin-fast-scan.6.patch. No crashes.
>
> It is about a 0.1% random sample of production data (10,000,000 records)
> with the below structure. Pg was compiled with debug enabled in both cases.
>
>       Table "public.kp"
>  Column |  Type   | Modifiers
> --------+---------+-----------
>  id     | bigint  | not null
>  string | text    | not null
>  score1 | integer |
>  score2 | integer |
>  score3 | integer |
>  score4 | integer |
> Indexes:
>     "kp_pkey" PRIMARY KEY, btree (id)
>     "kp_string_key" UNIQUE CONSTRAINT, btree (string)
>     "textsearch_gin_idx" gin (to_tsvector('simple'::regconfig, string)) WHERE score1 IS NOT NULL
>
>
>
> This is the query tested. All data is in the Pg buffer cache for these timings.
> Words like "the" and "and" are very common (~9% of entries, each) and a
> word like "hotel" is much less common (~0.2% of entries).
>
>   SELECT id,string
>     FROM kp
>    WHERE score1 IS NOT NULL
>      AND to_tsvector('simple', string) @@ to_tsquery('simple', ?)
>              -- ? is substituted with the query strings
> ORDER BY score1 DESC, score2 ASC
> LIMIT 1000;
>
>  Limit  (cost=56.04..56.04 rows=1 width=37) (actual time=250.010..250.032 rows=142 loops=1)
>    ->  Sort  (cost=56.04..56.04 rows=1 width=37) (actual time=250.008..250.017 rows=142 loops=1)
>          Sort Key: score1, score2
>          Sort Method: quicksort  Memory: 36kB
>          ->  Bitmap Heap Scan on kp  (cost=52.01..56.03 rows=1 width=37) (actual time=249.711..249.945 rows=142 loops=1)
>                Recheck Cond: ((to_tsvector('simple'::regconfig, string) @@ '''hotel'' & ''and'' & ''the'''::tsquery) AND (score1 IS NOT NULL))
>                ->  Bitmap Index Scan on textsearch_gin_idx  (cost=0.00..52.01 rows=1 width=0) (actual time=249.681..249.681 rows=142 loops=1)
>                      Index Cond: (to_tsvector('simple'::regconfig, string) @@ '''hotel'' & ''and'' & ''the'''::tsquery)
>  Total runtime: 250.096 ms
>
>
>
> Times are from \timing on.
>
> MASTER
> =======
> the:               888.436 ms   926.609 ms   885.502 ms
> and:               944.052 ms   937.732 ms   920.050 ms
> hotel:              53.992 ms    57.039 ms    65.581 ms
> and & the & hotel: 260.308 ms   248.275 ms   248.098 ms
>
> These numbers roughly match what we get with Pg 9.2. The time savings
> between 'the' and 'and & the & hotel' is mostly heap lookups for the score
> and the final sort.
>
>
>
> The size of the index on disk is about 2% smaller in the patched version.
>
> PATCHED
> =======
> the:               1055.169 ms  1081.976 ms  1083.021 ms
> and:                912.173 ms   949.364 ms   965.261 ms
> hotel:               62.591 ms    64.341 ms    62.923 ms
> and & the & hotel:  268.577 ms   259.293 ms   257.408 ms
> hotel & and & the:  253.574 ms   258.071 ms   250.280 ms
>
> I was hoping that the 'and & the & hotel' case would improve with this
> patch to be closer to the 'hotel' search, as I thought that was the kind of
> thing it targeted. Unfortunately, it did not. I actually applied the
> patches, compiled, initdb/load data, and ran it again thinking I made a
> mistake.
>
> Reordering the terms 'hotel & and & the' doesn't change the result.
>

Oh, in this patch the new consistent method isn't implemented for the
tsvector opclass, only for arrays. That will be fixed soon.
BTW, was the index 2% smaller or 2 times smaller? If it's 2% smaller, then I
need to know more about your dataset :)
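
To illustrate the kind of saving hoped for in the 'and & the & hotel' case, here is a rough sketch in Python. It is not the patch code and says nothing about how GIN stores or scans its posting lists; it only assumes each term has a sorted list of item ids, and the function name and sample data are made up for the example. The idea is to drive the scan from the rarest term and skip ahead in the frequent ones instead of reading them in full.

```python
from bisect import bisect_left


def intersect_skipping(posting_lists):
    """Intersect sorted lists of item ids, driving from the shortest list.

    Hypothetical helper for illustration only; not taken from the GIN patches.
    """
    if not posting_lists:
        return []
    # Order by length so the rarest term decides which ids are candidates.
    lists = sorted(posting_lists, key=len)
    rare, others = lists[0], lists[1:]
    positions = [0] * len(others)  # current offset into each frequent list
    result = []
    for item in rare:
        matched = True
        for i, plist in enumerate(others):
            # Jump to the first entry >= item rather than stepping one by one.
            positions[i] = bisect_left(plist, item, positions[i])
            if positions[i] == len(plist) or plist[positions[i]] != item:
                matched = False
                break
        if matched:
            result.append(item)
    return result


# Made-up data: 'hotel' is rare, 'the' and 'and' are frequent.
hotel = [3, 40, 77, 90]
the = list(range(0, 100, 2))
and_ = list(range(0, 100, 5))
print(intersect_skipping([the, and_, hotel]))  # [40, 90]
```

With lists like these, the work is bounded by the length of the 'hotel' list times a binary search in each frequent list, which is why such a query could in principle cost closer to 'hotel' alone than to 'the' alone.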

------
With best regards,
Alexander Korotkov.
