Depending on your needs, you might consider putting the data into a dedicated text search engine like Lucene and having it return the integer IDs, which can then be used for row lookups in PG.
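As a sketch of that pattern (the IDs and array literal below are hypothetical — in practice they would be whatever the Lucene query returned):

```sql
-- Hypothetical: the external search engine returned ids 101, 202, 303
-- for the fuzzy query. Fetch the matching rows from PG in one round trip:
select *
from entities
where id = any ('{101,202,303}'::int[]);
```

The fuzzy matching then happens entirely outside the database, and PG only does cheap primary-key lookups.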
On Thu, Sep 22, 2011 at 11:40 AM, Jonathan Bartlett <jonathan.l.bartl...@gmail.com> wrote:

> I am working on a fuzzy search of a large dataset. Basically, it is a list
> of all of the songs, albums, artists, movies, and celebrities exported from
> Freebase. Anyway, I was hoping to get a fuzzy search that was nearly as
> fast as the full-text search with the new nearest-neighbor GiST indexes,
> but, while it is improved from 9.0, it is still taking some time. The table
> has about 16 million rows, each with a "name" column that is usually 2-10
> words.
>
> My query using full-text search is this:
>
> explain analyze select * from entities where
>     to_tsvector('unaccented_english', entities.name) @@
>     plainto_tsquery('unaccented_english', 'bon jovi');
>
>                                  QUERY PLAN
> ----------------------------------------------------------------------------
>  Bitmap Heap Scan on entities  (cost=42.64..1375.62 rows=340 width=1274)
>      (actual time=0.422..0.617 rows=109 loops=1)
>    Recheck Cond: (to_tsvector('unaccented_english'::regconfig, (name)::text)
>        @@ '''bon'' & ''jovi'''::tsquery)
>    ->  Bitmap Index Scan on entity_unaccented_name_gin_index
>        (cost=0.00..42.56 rows=340 width=0)
>        (actual time=0.402..0.402 rows=109 loops=1)
>          Index Cond: (to_tsvector('unaccented_english'::regconfig,
>              (name)::text) @@ '''bon'' & ''jovi'''::tsquery)
>  Total runtime: 0.728 ms
> (5 rows)
>
> My new query using trigrams is this:
>
> explain analyze select * from entities where name % 'bon jovi';
>
>                                  QUERY PLAN
> ----------------------------------------------------------------------------
>  Bitmap Heap Scan on entities  (cost=913.73..46585.14 rows=13615 width=1274)
>      (actual time=7769.380..7772.739 rows=326 loops=1)
>    Recheck Cond: ((name)::text % 'bon jovi'::text)
>    ->  Bitmap Index Scan on tmp_entity_name_trgm_gist_idx
>        (cost=0.00..910.33 rows=13615 width=0)
>        (actual time=7769.307..7769.307 rows=326 loops=1)
>          Index Cond: ((name)::text % 'bon jovi'::text)
>  Total runtime: 7773.008 ms
>
> If I put a limit on it, it gets better, but is still pretty bad:
>
> explain analyze select * from entities where name % 'bon jovi' limit 50;
>
>                                  QUERY PLAN
> ----------------------------------------------------------------------------
>  Limit  (cost=0.00..200.14 rows=50 width=1274)
>      (actual time=1.246..1226.146 rows=50 loops=1)
>    ->  Index Scan using tmp_entity_name_trgm_gist_idx on entities
>        (cost=0.00..54498.48 rows=13615 width=1274)
>        (actual time=1.243..1226.016 rows=50 loops=1)
>          Index Cond: ((name)::text % 'bon jovi'::text)
>  Total runtime: 1226.261 ms
> (4 rows)
>
> And if I try to get the "best" matches, the performance goes completely
> down the tubes, even with a limit:
>
> explain analyze select * from entities where name % 'bon jovi'
>     order by name <-> 'bon jovi' limit 50;
>
>                                  QUERY PLAN
> ----------------------------------------------------------------------------
>  Limit  (cost=0.00..200.39 rows=50 width=1274)
>      (actual time=421.811..8058.877 rows=50 loops=1)
>    ->  Index Scan using tmp_entity_name_trgm_gist_idx on entities
>        (cost=0.00..54566.55 rows=13615 width=1274)
>        (actual time=421.808..8058.766 rows=50 loops=1)
>          Index Cond: ((name)::text % 'bon jovi'::text)
>          Order By: ((name)::text <-> 'bon jovi'::text)
>  Total runtime: 8060.760 ms
>
> Anyway, this may just be a limitation of the trigram indexing, but my hope
> was to get a fuzzy search that at least approached the performance of the
> full-text searching. Am I missing something, or am I just bumping into the
> limits? I also noticed that different strings get radically different
> performance. Searching for 'hello' drops the search time down to 310ms!
> But searching for 'hello my friend' brings the search time to 9616ms!
>
> explain analyze select * from entities where name % 'hello'
>     order by name <-> 'hello' limit 50;
>
>                                  QUERY PLAN
> ----------------------------------------------------------------------------
>  Limit  (cost=0.00..200.39 rows=50 width=1274)
>      (actual time=5.056..309.492 rows=50 loops=1)
>    ->  Index Scan using tmp_entity_name_trgm_gist_idx on entities
>        (cost=0.00..54566.55 rows=13615 width=1274)
>        (actual time=5.053..309.393 rows=50 loops=1)
>          Index Cond: ((name)::text % 'hello'::text)
>          Order By: ((name)::text <-> 'hello'::text)
>  Total runtime: 309.637 ms
>
> explain analyze select * from entities where name % 'hello my friend'
>     order by name <-> 'hello my friend' limit 50;
>
>                                  QUERY PLAN
> ----------------------------------------------------------------------------
>  Limit  (cost=0.00..200.39 rows=50 width=1274)
>      (actual time=76.358..9616.066 rows=50 loops=1)
>    ->  Index Scan using tmp_entity_name_trgm_gist_idx on entities
>        (cost=0.00..54566.55 rows=13615 width=1274)
>        (actual time=76.356..9615.968 rows=50 loops=1)
>          Index Cond: ((name)::text % 'hello my friend'::text)
>          Order By: ((name)::text <-> 'hello my friend'::text)
>  Total runtime: 9616.203 ms
>
> For reference, here is my table structure:
>
> \d entities
>                               Table "public.entities"
>         Column        |            Type             |         Modifiers
> ----------------------+-----------------------------+-------------------------------------------------------
>  id                   | integer                     | not null default nextval('entities_id_seq'::regclass)
>  name                 | character varying(255)      |
>  disambiguation       | character varying(255)      |
>  description          | text                        |
>  entity_basic_type    | character varying(255)      |
>  entity_extended_type | character varying(255)      |
>  primary              | boolean                     | default true
>  semantic_world_id    | integer                     |
>  calc_completed       | boolean                     | default true
>  source               | text                        |
>  source_entity_id     | integer                     |
>  created_at           | timestamp without time zone |
>  updated_at           | timestamp without time zone |
>  data_import_id       | integer                     |
>  validated            | boolean                     | default true
>  weight               | integer                     |
>  description_source   | text                        |
>  description_url      | text                        |
>  rating               | text                        |
> Indexes:
>     "entities_pkey" PRIMARY KEY, btree (id)
>     "entity_lower_name_idx" btree (lower(name::text) text_pattern_ops)
>     "entity_name_gin_index" gin (to_tsvector('english'::regconfig, name::text))
>     "entity_unaccented_name_gin_index" gin (to_tsvector('unaccented_english'::regconfig, name::text))
>     "index_entities_on_data_import_id" btree (data_import_id)
>     "index_entities_on_name" btree (name)
>     "index_entities_on_source" btree (source)
>     "tmp_entity_name_trgm_gist_idx" gist (name gist_trgm_ops)
>
> Thanks for the help!
>
> Jon
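A couple of pg_trgm knobs worth trying before going outside the database (untested against your data, so treat this as a sketch). The per-string variation you saw is expected: % compares trigram sets, and 'hello my friend' has roughly three times the trigrams of 'hello', so the GiST scan has far more index entries to consider. Raising the similarity threshold with set_limit() shrinks the candidate set, and since you appear to be on 9.1 you could also try its new GIN opclass for the % filter (though the <-> KNN ordering still requires GiST):

```sql
-- Inspect the trigram sets pg_trgm actually compares:
select show_trgm('hello');            -- 6 trigrams
select show_trgm('hello my friend');  -- many more trigrams to match

-- Raise the similarity cutoff from the default 0.3 so fewer rows
-- qualify for the % operator (tune to your data):
select set_limit(0.5);
select show_limit();

-- 9.1's GIN opclass for pg_trgm; often faster than GiST for plain
-- % filtering, but it cannot serve "order by name <-> ..." (KNN needs GiST):
create index entity_name_trgm_gin_idx
    on entities using gin (name gin_trgm_ops);
```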