Depending on your needs, you might consider putting the data into a dedicated
inverted-index text search engine like Lucene and having it return the integer
IDs, which can then be used for row lookups in PG.
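A minimal sketch of that two-step pattern, with sqlite3 as a stand-in database
and a hard-coded id list in place of the engine's results (the table contents
and matched ids here are hypothetical):

```python
import sqlite3

# Hypothetical result set: in practice these ids would come back from the
# external fuzzy-search engine (e.g. Lucene) for a query like "bon jovi".
matched_ids = [2, 3]

# Stand-in for the real entities table (sqlite3 so the sketch runs anywhere).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entities (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO entities VALUES (?, ?)",
                 [(1, "Hello"), (2, "Bon Jovi"), (3, "Jon Bon Jovi")])

# Step 2: cheap primary-key lookups for only the ids the engine returned.
placeholders = ",".join("?" * len(matched_ids))
rows = conn.execute(
    f"SELECT id, name FROM entities WHERE id IN ({placeholders})",
    matched_ids,
).fetchall()
print(sorted(rows))  # [(2, 'Bon Jovi'), (3, 'Jon Bon Jovi')]
```

In Postgres the second step would be the same shape, e.g. `WHERE id = ANY($1)`,
which goes straight to the primary-key index.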

On Thu, Sep 22, 2011 at 11:40 AM, Jonathan Bartlett <
jonathan.l.bartl...@gmail.com> wrote:

> I am working on a fuzzy search of a large dataset.  Basically, it is a list
> of all of the songs, albums, artists, movies, and celebrities exported from
> Freebase.  Anyway, I was hoping to get a fuzzy search that was nearly as
> fast as the full-text search with the new nearest-neighbor GIST indexes,
> but, while it is improved from 9.0, it is still taking some time.  The table
> has about 16 million rows, each with a "name" column that is usually 2-10
> words.
>
> My query using full-text search is this:
> explain analyze select * from entities where
> to_tsvector('unaccented_english', entities.name) @@
> plainto_tsquery('unaccented_english', 'bon jovi');
>                                                                 QUERY PLAN
> ----------------------------------------------------------------------------------------------------------------------------------------------
>  Bitmap Heap Scan on entities  (cost=42.64..1375.62 rows=340 width=1274) (actual time=0.422..0.617 rows=109 loops=1)
>    Recheck Cond: (to_tsvector('unaccented_english'::regconfig, (name)::text) @@ '''bon'' & ''jovi'''::tsquery)
>    ->  Bitmap Index Scan on entity_unaccented_name_gin_index  (cost=0.00..42.56 rows=340 width=0) (actual time=0.402..0.402 rows=109 loops=1)
>          Index Cond: (to_tsvector('unaccented_english'::regconfig, (name)::text) @@ '''bon'' & ''jovi'''::tsquery)
>  Total runtime: 0.728 ms
> (5 rows)
>
> My new query using trigrams is this:
> explain analyze select * from entities where name % 'bon jovi';
>                                                                 QUERY PLAN
> ----------------------------------------------------------------------------------------------------------------------------------------------------
>  Bitmap Heap Scan on entities  (cost=913.73..46585.14 rows=13615 width=1274) (actual time=7769.380..7772.739 rows=326 loops=1)
>    Recheck Cond: ((name)::text % 'bon jovi'::text)
>    ->  Bitmap Index Scan on tmp_entity_name_trgm_gist_idx  (cost=0.00..910.33 rows=13615 width=0) (actual time=7769.307..7769.307 rows=326 loops=1)
>          Index Cond: ((name)::text % 'bon jovi'::text)
>  Total runtime: 7773.008 ms
>
> If I put a limit on it, it gets better, but is still pretty bad:
> explain analyze select * from entities where name % 'bon jovi' limit 50;
>                                                                 QUERY PLAN
> -------------------------------------------------------------------------------------------------------------------------------------------------------------
>  Limit  (cost=0.00..200.14 rows=50 width=1274) (actual time=1.246..1226.146 rows=50 loops=1)
>    ->  Index Scan using tmp_entity_name_trgm_gist_idx on entities  (cost=0.00..54498.48 rows=13615 width=1274) (actual time=1.243..1226.016 rows=50 loops=1)
>          Index Cond: ((name)::text % 'bon jovi'::text)
>  Total runtime: 1226.261 ms
> (4 rows)
>
> And if I try to get the "best" matches, the performance goes completely
> down the tubes, even with a limit:
> explain analyze select * from entities where name % 'bon jovi' order by
> name <-> 'bon jovi' limit 50;
>                                                                 QUERY PLAN
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>  Limit  (cost=0.00..200.39 rows=50 width=1274) (actual time=421.811..8058.877 rows=50 loops=1)
>    ->  Index Scan using tmp_entity_name_trgm_gist_idx on entities  (cost=0.00..54566.55 rows=13615 width=1274) (actual time=421.808..8058.766 rows=50 loops=1)
>          Index Cond: ((name)::text % 'bon jovi'::text)
>          Order By: ((name)::text <-> 'bon jovi'::text)
>  Total runtime: 8060.760 ms
>
> Anyway, this may just be a limitation of the trigram indexing, but my hope
> was to get a fuzzy search that at least approached the performance of the
> full-text searching.  Am I missing something, or am I just bumping into the
> limits?  I also noticed that different strings get radically different
> performance.  Searching for "hello" drops the search time down to 310ms!
>  But searching for 'hello my friend' brings the search time to 9616ms!
> explain analyze select * from entities where name % 'hello' order by name
> <-> 'hello' limit 50;
>                                                                 QUERY PLAN
> ------------------------------------------------------------------------------------------------------------------------------------------------------------
>  Limit  (cost=0.00..200.39 rows=50 width=1274) (actual time=5.056..309.492 rows=50 loops=1)
>    ->  Index Scan using tmp_entity_name_trgm_gist_idx on entities  (cost=0.00..54566.55 rows=13615 width=1274) (actual time=5.053..309.393 rows=50 loops=1)
>          Index Cond: ((name)::text % 'hello'::text)
>          Order By: ((name)::text <-> 'hello'::text)
>  Total runtime: 309.637 ms
>
> explain analyze select * from entities where name % 'hello my friend' order
> by name <-> 'hello my friend' limit 50;
>                                                                 QUERY PLAN
> --------------------------------------------------------------------------------------------------------------------------------------------------------------
>  Limit  (cost=0.00..200.39 rows=50 width=1274) (actual time=76.358..9616.066 rows=50 loops=1)
>    ->  Index Scan using tmp_entity_name_trgm_gist_idx on entities  (cost=0.00..54566.55 rows=13615 width=1274) (actual time=76.356..9615.968 rows=50 loops=1)
>          Index Cond: ((name)::text % 'hello my friend'::text)
>          Order By: ((name)::text <-> 'hello my friend'::text)
>  Total runtime: 9616.203 ms
>
> For reference, here is my table structure:
> \d entities
>                                           Table "public.entities"
>         Column        |            Type             |                       Modifiers
> ----------------------+-----------------------------+-------------------------------------------------------
>  id                   | integer                     | not null default nextval('entities_id_seq'::regclass)
>  name                 | character varying(255)      |
>  disambiguation       | character varying(255)      |
>  description          | text                        |
>  entity_basic_type    | character varying(255)      |
>  entity_extended_type | character varying(255)      |
>  primary              | boolean                     | default true
>  semantic_world_id    | integer                     |
>  calc_completed       | boolean                     | default true
>  source               | text                        |
>  source_entity_id     | integer                     |
>  created_at           | timestamp without time zone |
>  updated_at           | timestamp without time zone |
>  data_import_id       | integer                     |
>  validated            | boolean                     | default true
>  weight               | integer                     |
>  description_source   | text                        |
>  description_url      | text                        |
>  rating               | text                        |
> Indexes:
>     "entities_pkey" PRIMARY KEY, btree (id)
>     "entity_lower_name_idx" btree (lower(name::text) text_pattern_ops)
>     "entity_name_gin_index" gin (to_tsvector('english'::regconfig, name::text))
>     "entity_unaccented_name_gin_index" gin (to_tsvector('unaccented_english'::regconfig, name::text))
>     "index_entities_on_data_import_id" btree (data_import_id)
>     "index_entities_on_name" btree (name)
>     "index_entities_on_source" btree (source)
>     "tmp_entity_name_trgm_gist_idx" gist (name gist_trgm_ops)
>
> Thanks for the help!
>
> Jon
>
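One way to see the "hello" vs "hello my friend" gap in the quoted timings:
pg_trgm breaks the query string into trigrams, and longer strings produce more
of them. A simplified model of pg_trgm's extraction (lowercase, split on
whitespace, pad each word with two leading blanks and one trailing blank; real
pg_trgm also strips non-word characters) shows the count roughly tripling:

```python
def trigrams(text):
    # Simplified model of pg_trgm extraction: lowercase, split into words,
    # pad each word as "  word ", then take every 3-character window.
    grams = set()
    for word in text.lower().split():
        padded = "  " + word + " "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams

print(len(trigrams("hello")))            # 6
print(len(trigrams("hello my friend")))  # 16
```

More query trigrams mean more index entries to probe and more candidate rows to
recheck, which is consistent with the longer string being much slower above.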
