I tried again this morning using gin-packed-postinglists-16.patch and gin-fast-scan.6.patch. No crashes.
It is about a 0.1% random sample of production data (10,000,000 records) with the below structure. Pg was compiled with debug enabled in both cases. Table "public.kp" Column | Type | Modifiers --------+---------+----------- id | bigint | not null string | text | not null score1 | integer | score2 | integer | score3 | integer | score4 | integer | Indexes: "kp_pkey" PRIMARY KEY, btree (id) "kp_string_key" UNIQUE CONSTRAINT, btree (string) "textsearch_gin_idx" gin (to_tsvector('simple'::regconfig, string)) WHERE score1 IS NOT NULL This is a query tested. All data is in Pg buffer cache for these timings. Words like "the" and "and" are very common (~9% of entries, each) and a word like "hotel" is much less common (~0.2% of entries). SELECT id,string FROM kp WHERE score1 IS NOT NULL AND to_tsvector('simple', string) @@ to_tsquery('simple', ?) -- ? is substituted with the query strings ORDER BY score1 DESC, score2 ASC LIMIT 1000; Limit (cost=56.04..56.04 rows=1 width=37) (actual time=250.010..250.032 rows=142 loops=1) -> Sort (cost=56.04..56.04 rows=1 width=37) (actual time=250.008..250.017 rows=142 loops=1) Sort Key: score1, score2 Sort Method: quicksort Memory: 36kB -> Bitmap Heap Scan on kp (cost=52.01..56.03 rows=1 width=37) (actual time=249.711..249.945 rows=142 loops=1) Recheck Cond: ((to_tsvector('simple'::regconfig, string) @@ '''hotel'' & ''and'' & ''the'''::tsquery) AND (score1 IS NOT NULL)) -> Bitmap Index Scan on textsearch_gin_idx (cost=0.00..52.01 rows=1 width=0) (actual time=249.681..249.681 rows=142 loops=1) Index Cond: (to_tsvector('simple'::regconfig, string) @@ '''hotel'' & ''and'' & ''the'''::tsquery) Total runtime: 250.096 ms Times are from \timing on. MASTER ======= the: 888.436 ms 926.609 ms 885.502 ms and: 944.052 ms 937.732 ms 920.050 ms hotel: 53.992 ms 57.039 ms 65.581 ms and & the & hotel: 260.308 ms 248.275 ms 248.098 ms These numbers roughly match what we get with Pg 9.2. The time savings between 'the' and 'and & the & hotel' is mostly heap lookups for the score and the final sort. The size of the index on disk is about 2% smaller in the patched version. PATCHED ======= the: 1055.169 ms 1081.976 ms 1083.021 ms and: 912.173 ms 949.364 ms 965.261 ms hotel: 62.591 ms 64.341 ms 62.923 ms and & the & hotel: 268.577 ms 259.293 ms 257.408 ms hotel & and & the: 253.574 ms 258.071 ms 250.280 ms I was hoping that the 'and & the & hotel' case would improve with this patch to be closer to the 'hotel' search, as I thought that was the kind of thing it targeted. Unfortunately, it did not. I actually applied the patches, compiled, initdb/load data, and ran it again thinking I made a mistake. Reordering the terms 'hotel & and & the' doesn't change the result. On Fri, Nov 15, 2013 at 1:51 AM, Alexander Korotkov <aekorot...@gmail.com>wrote: > On Fri, Nov 15, 2013 at 3:25 AM, Rod Taylor <r...@simple-knowledge.com>wrote: > >> I checked out master and put together a test case using a small >> percentage of production data for a known problem we have with Pg 9.2 and >> text search scans. >> >> A small percentage in this case means 10 million records randomly >> selected; has a few billion records. >> >> >> Tests ran for master successfully and I recorded timings. >> >> >> >> Applied the patch included here to master along with >> gin-packed-postinglists-14.patch. >> Run make clean; ./configure; make; make install. >> make check (All 141 tests passed.) >> >> initdb, import dump >> >> >> The GIN index fails to build with a segfault. >> > > Thanks for testing. See fixed version in thread about packed posting lists. > > ------ > With best regards, > Alexander Korotkov. >