2%.

The data is essentially sentence fragments, 1 to 5 words in length, so I
wasn't expecting the index to be much smaller.
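
For anyone reproducing the size comparison, something along these lines
works (index name as in the schema quoted below; a sketch, not necessarily
the exact command used):

  -- on-disk size of the GIN index; run against both builds and compare
  SELECT pg_size_pretty(pg_relation_size('textsearch_gin_idx'));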

10 recent value selections:

 white vinegar reduce color running
 vinegar cure uti
 cane vinegar acidity depends parameter
 how remedy fir clogged shower
 use vinegar sensitive skin
 home remedies removing rust heating
 does non raw apple cider
 home remedies help maintain healthy
 can vinegar mess up your
 apple cide vineger ph balance

regards,

Rod



On Fri, Nov 15, 2013 at 12:51 PM, Alexander Korotkov
<aekorot...@gmail.com> wrote:

> On Fri, Nov 15, 2013 at 6:57 PM, Rod Taylor <p...@rbt.ca> wrote:
>
>> I tried again this morning using gin-packed-postinglists-16.patch and
>> gin-fast-scan.6.patch. No crashes.
>>
>> It is about a 0.1% random sample of production data (10,000,000 records)
>> with the below structure. Pg was compiled with debug enabled in both cases.
>>
>>       Table "public.kp"
>>  Column |  Type   | Modifiers
>> --------+---------+-----------
>>  id     | bigint  | not null
>>  string | text    | not null
>>  score1 | integer |
>>  score2 | integer |
>>  score3 | integer |
>>  score4 | integer |
>> Indexes:
>>     "kp_pkey" PRIMARY KEY, btree (id)
>>     "kp_string_key" UNIQUE CONSTRAINT, btree (string)
>>     "textsearch_gin_idx" gin (to_tsvector('simple'::regconfig, string))
>> WHERE score1 IS NOT NULL
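
For reference, a DDL sketch reconstructed from the \d output above; not the
original creation script, just an equivalent:

  CREATE TABLE kp (
      id     bigint  NOT NULL PRIMARY KEY,
      string text    NOT NULL UNIQUE,
      score1 integer,
      score2 integer,
      score3 integer,
      score4 integer
  );

  -- partial GIN index: only rows with a non-NULL score1 are indexed
  CREATE INDEX textsearch_gin_idx ON kp
      USING gin (to_tsvector('simple', string))
      WHERE score1 IS NOT NULL;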
>>
>>
>>
>> This is the query tested. All data is in the Pg buffer cache for these timings.
>> Words like "the" and "and" are very common (~9% of entries, each) and a
>> word like "hotel" is much less common (~0.2% of entries).
>>
>>   SELECT id,string
>>     FROM kp
>>    WHERE score1 IS NOT NULL
>>      AND to_tsvector('simple', string) @@ to_tsquery('simple', ?)
>>              -- ? is substituted with the query strings
>> ORDER BY score1 DESC, score2 ASC
>> LIMIT 1000;
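
The '?' above, filled in for the three-word case, parses to the tsquery shown
in the Index Cond of the plan below; a quick check:

  SELECT to_tsquery('simple', 'hotel & and & the');
  --        to_tsquery
  -- -------------------------
  --  'hotel' & 'and' & 'the'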
>>
>>  Limit  (cost=56.04..56.04 rows=1 width=37) (actual time=250.010..250.032
>> rows=142 loops=1)
>>    ->  Sort  (cost=56.04..56.04 rows=1 width=37) (actual
>> time=250.008..250.017 rows=142 loops=1)
>>          Sort Key: score1, score2
>>          Sort Method: quicksort  Memory: 36kB
>>          ->  Bitmap Heap Scan on kp  (cost=52.01..56.03 rows=1 width=37)
>> (actual time=249.711..249.945 rows=142 loops=1)
>>                Recheck Cond: ((to_tsvector('simple'::regconfig, string)
>> @@ '''hotel'' & ''and'' & ''the'''::tsquery) AND (score1 IS NOT NULL))
>>                ->  Bitmap Index Scan on textsearch_gin_idx
>> (cost=0.00..52.01 rows=1 width=0) (actual time=249.681..249.681 rows=142
>> loops=1)
>>                      Index Cond: (to_tsvector('simple'::regconfig,
>> string) @@ '''hotel'' & ''and'' & ''the'''::tsquery)
>>  Total runtime: 250.096 ms
>>
>>
>>
>> Times are from \timing on.
>>
>> MASTER
>> =======
>> the:               888.436 ms   926.609 ms   885.502 ms
>> and:               944.052 ms   937.732 ms   920.050 ms
>> hotel:              53.992 ms    57.039 ms    65.581 ms
>> and & the & hotel: 260.308 ms   248.275 ms   248.098 ms
>>
>> These numbers roughly match what we get with Pg 9.2. The time saved going
>> from 'the' to 'and & the & hotel' is mostly fewer heap lookups for the
>> score columns and a smaller final sort.
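
One way to see where that time goes is buffer counts; a sketch along these
lines (not output from this run), comparing the single-word and three-word
cases:

  EXPLAIN (ANALYZE, BUFFERS)
  SELECT id, string
    FROM kp
   WHERE score1 IS NOT NULL
     AND to_tsvector('simple', string) @@ to_tsquery('simple', 'the')
  ORDER BY score1 DESC, score2 ASC
  LIMIT 1000;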
>>
>>
>>
>> The size of the index on disk is about 2% smaller in the patched version.
>>
>> PATCHED
>> =======
>> the:              1055.169 ms 1081.976 ms  1083.021 ms
>> and:               912.173 ms  949.364 ms   965.261 ms
>> hotel:              62.591 ms   64.341 ms    62.923 ms
>> and & the & hotel: 268.577 ms  259.293 ms   257.408 ms
>> hotel & and & the: 253.574 ms  258.071 ms  250.280 ms
>>
>> I was hoping the 'and & the & hotel' case would improve with this patch to
>> be closer to the 'hotel' search, as I thought that was the kind of case it
>> targeted. Unfortunately, it did not. I even re-applied the patches,
>> recompiled, ran initdb, reloaded the data, and ran it again, thinking I had
>> made a mistake.
>>
>> Reordering the terms 'hotel & and & the' doesn't change the result.
>>
>
> Oh, in this patch the new consistent method isn't implemented for the
> tsvector opclass, only for the array opclass. It will be fixed soon.
> BTW, was the index 2% smaller or 2 times smaller? If it's 2% smaller then I
> need to know more about your dataset :)
>
> ------
> With best regards,
> Alexander Korotkov.
>
>
