Re: [GENERAL] Full Text Search combined with Fuzzy
On 03.03.2017 16:17, Nicolas Paris wrote:
> Nice ! I do have the 9.6 version. Could this kind of index handle more
> than 20M large texts ? The recheck condition looks resource-consuming.

You are right. I think pg_trgm will not be good for such large texts,
unfortunately.

> Is the full text index + phrase search + synonym dictionary the only
> other alternative for dealing with typo-tolerant phrase mining ?

I suppose there are no other options now. Though, prefix search may help
you [1].

> Is there any possibility of adding typo tolerance to the full text
> search road map in the future ?

As far as I know, there are no plans in the near future to add similarity
full text search.

1. https://www.postgresql.org/docs/current/static/textsearch-dictionaries.html

-- 
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
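[Editor's sketch of the prefix-search suggestion above, together with the
9.6 phrase-search function mentioned in the question. The sample text and
the 'english' configuration are illustrative assumptions, not from the
thread.]

```sql
-- Prefix matching inside a tsquery: 'wor:*' matches word, world, words, ...
SELECT to_tsvector('english', 'blah blah hello world blah')
       @@ to_tsquery('english', 'hello & wor:*');

-- Phrase search (PostgreSQL 9.6): terms must be adjacent and in order.
SELECT to_tsvector('english', 'blah blah hello world blah')
       @@ phraseto_tsquery('english', 'hello world');
```

Note that neither tolerates typos by itself, which is why the thread
combines them with synonym dictionaries.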
Re: [GENERAL] Full Text Search combined with Fuzzy
On 3 March 2017 at 14:08, Artur Zakirov wrote:
> On 03.03.2017 15:49, Nicolas Paris wrote:
> > [...]
> > Conclusion: if I'd say 0.4 is my threshold, would this methodology
> > meet my requirements ?
>
> Hello,
>
> If you use PostgreSQL 9.6, then word_similarity() can help you [1].
> For example:
>
> postgres=# SELECT texts, word_similarity('hello word', texts)
>            FROM test_trgm WHERE 'hello word' <% texts;
>                    texts                   | word_similarity
> -------------------------------------------+-----------------
>  blah blah blah hello world blah blah blah |        0.818182
>  blah blah blah hello word blah blah blah  |               1
> (2 rows)
>
> 1. https://www.postgresql.org/docs/9.6/static/pgtrgm.html

Nice ! I do have the 9.6 version. Could this kind of index handle more
than 20M large texts ? The recheck condition looks resource-consuming.

Is the full text index + phrase search + synonym dictionary the only
other alternative for dealing with typo-tolerant phrase mining ?

Is there any possibility of adding typo tolerance to the full text
search road map in the future ?

Thanks,
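[Editor's note on the cutoff question raised here: in 9.6 the <% operator
filters on the pg_trgm.word_similarity_threshold setting (default 0.6),
so the 0.4 threshold discussed earlier could be sketched as follows,
assuming the test_trgm table from the thread.]

```sql
-- Lower the cutoff that the <% operator applies (default 0.6):
SET pg_trgm.word_similarity_threshold = 0.4;

-- The GIN trigram index can still serve the <% filter:
SELECT texts, word_similarity('hello word', texts)
FROM test_trgm
WHERE 'hello word' <% texts
ORDER BY word_similarity('hello word', texts) DESC;
```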
Re: [GENERAL] Full Text Search combined with Fuzzy
On 03.03.2017 15:49, Nicolas Paris wrote:
> [...]
> postgres=# SELECT texts, similarity(texts, 'hello word') FROM test_trgm
>            WHERE texts % 'hello word';
>                    texts                   | similarity
> -------------------------------------------+------------
>  blah blah blah hello world blah blah blah |   0.473684
>  blah blah blah hello word blah blah blah  |     0.6875
> (2 rows)
> [...]
> Conclusion: if I'd say 0.4 is my threshold, would this methodology meet
> my requirements ?
>
> Thanks for the help !

Hello,

If you use PostgreSQL 9.6, then word_similarity() can help you [1]. For
example:

postgres=# SELECT texts, word_similarity('hello word', texts)
           FROM test_trgm WHERE 'hello word' <% texts;
                   texts                   | word_similarity
-------------------------------------------+-----------------
 blah blah blah hello world blah blah blah |        0.818182
 blah blah blah hello word blah blah blah  |               1
(2 rows)

1. https://www.postgresql.org/docs/9.6/static/pgtrgm.html

-- 
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
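[Editor's note for readers wondering where numbers like 0.473684 come
from: similarity() is the number of shared trigrams divided by the number
of distinct trigrams in either string, and show_trgm() exposes those
sets. A sketch, assuming pg_trgm is installed.]

```sql
-- pg_trgm pads each word with spaces before extracting trigrams;
-- 'hello word' yields 11 of them ("  h", " he", "hel", ..., "wor",
-- "ord", "rd ").
SELECT show_trgm('hello word');

-- similarity(a, b) = |trgm(a) ∩ trgm(b)| / |trgm(a) ∪ trgm(b)|.
-- The query shares 9 of its 11 trigrams with the 'hello world' row
-- ("ord" and "rd " have no counterpart there), and the union holds
-- 19 distinct trigrams, hence 9/19 ≈ 0.473684 as in the output above.
SELECT similarity('blah blah blah hello world blah blah blah',
                  'hello word');
```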
Re: [GENERAL] Full Text Search combined with Fuzzy
On 27 February 2017 at 10:32, Oleg Bartunov wrote:
> On Sun, Feb 26, 2017 at 3:52 PM, Nicolas Paris wrote:
> > AFAIK there is no built-in way to combine full text search and fuzzy
> > matching [...]
>
> Usually, https://www.postgresql.org/docs/current/static/pgtrgm.html is
> used for this.

Hi Oleg,

Thanks. I thought pg_trgm was not able to index my long texts because of
the limitation of 8191 bytes per index row for btree. Then I found out
it is possible to use pg_trgm over a GIN/GiST index.

My final use case is phrase mining in texts. I want my application to
return texts that approximately contain the user entry.

Eg: the user searches "Hello Word"; a text containing "blah blah blah
hello world blah blah blah" would be returned.

Test:

postgres=# CREATE TABLE test_trgm (texts text);
CREATE TABLE
postgres=# CREATE INDEX ON test_trgm USING GIN (texts gin_trgm_ops);
CREATE INDEX
postgres=# SET enable_seqscan = OFF;
SET
postgres=# INSERT INTO test_trgm VALUES ('blah blah blah hello world blah blah blah');
INSERT 0 1
postgres=# INSERT INTO test_trgm VALUES ('blah blah blah hello word blah blah blah');
INSERT 0 1
postgres=# SELECT texts, similarity(texts, 'hello word') FROM test_trgm
           WHERE texts % 'hello word';
                   texts                   | similarity
-------------------------------------------+------------
 blah blah blah hello world blah blah blah |   0.473684
 blah blah blah hello word blah blah blah  |     0.6875
(2 rows)

postgres=# EXPLAIN SELECT texts, similarity(texts, 'hello word')
           FROM test_trgm WHERE texts % 'hello word';
                                    QUERY PLAN
----------------------------------------------------------------------------------
 Bitmap Heap Scan on test_trgm  (cost=52.01..56.03 rows=1 width=32)
   Recheck Cond: (texts % 'hello word'::text)
   ->  Bitmap Index Scan on test_trgm_texts_idx  (cost=0.00..52.01 rows=1 width=0)
         Index Cond: (texts % 'hello word'::text)
(4 rows)

Conclusion: if I'd say 0.4 is my threshold, would this methodology meet
my requirements ?

Thanks for the help !

> [...]
> A dictionary is just a program, so it's up to the developer to write an
> efficient program to deal with billions of entries. Specifically, the
> synonym dictionary is not intended to work with a lot of entries. Btw,
> have a look at the contrib/dict_xsyn dictionary, which is more flexible
> than synonym.
Re: [GENERAL] Full Text Search combined with Fuzzy
On Sun, Feb 26, 2017 at 3:52 PM, Nicolas Paris wrote:
> AFAIK there is no built-in way to combine full text search and fuzzy
> matching
> (https://www.postgresql.org/docs/current/static/fuzzystrmatch.html).
> For example, phrase searching with typos in it.
>
> First, I don't know if postgresql competitors (lucene based...) are
> able to do so.

Usually, https://www.postgresql.org/docs/current/static/pgtrgm.html is
used for this.

> Second, is such a feature on the road map ?
>
> Third, I wonder if it is a good idea to use the postgresql synonyms
> feature for such a purpose
> (https://www.postgresql.org/docs/current/static/textsearch-dictionaries.html).
> I mean, building up a synonym dictionary containing typos. Eg:
>
> postgres    pgsql
> postgresql  pgsql
> postgrez    pgsql
> postgre     pgsql
> gogle       googl
> gooogle     googl
>
> There are multiple ways to build such a dictionary. But my question is
> about the implementation of dictionaries in postgresql: is postgresql
> supposed to take advantage of billion-entry dictionaries ?

A dictionary is just a program, so it's up to the developer to write an
efficient program to deal with billions of entries. Specifically, the
synonym dictionary is not intended to work with a lot of entries. Btw,
have a look at the contrib/dict_xsyn dictionary, which is more flexible
than synonym.

> Thanks in advance for your answers,
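[Editor's sketch of the contrib/dict_xsyn pointer above. The dictionary
name, the rules file name 'typo', and its contents are hypothetical, and
the file must exist as typo.rules under $SHAREDIR/tsearch_data.]

```sql
CREATE EXTENSION dict_xsyn;

-- typo.rules maps a canonical word to its variants, one group per line,
-- e.g.:  pgsql postgres postgrez postgre
-- MATCHSYNONYMS makes the misspellings on the right-hand side match too.
CREATE TEXT SEARCH DICTIONARY typo_xsyn (
    TEMPLATE = xsyn_template,
    RULES = 'typo',
    MATCHSYNONYMS = true
);

-- Check what the dictionary produces for a misspelling:
SELECT ts_lexize('typo_xsyn', 'postgrez');
```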
[GENERAL] Full Text Search combined with Fuzzy
Hello,

AFAIK there is no built-in way to combine full text search and fuzzy
matching
(https://www.postgresql.org/docs/current/static/fuzzystrmatch.html).
For example, phrase searching with typos in it.

First, I don't know if postgresql competitors (lucene based...) are able
to do so.

Second, is such a feature on the road map ?

Third, I wonder if it is a good idea to use the postgresql synonyms
feature for such a purpose
(https://www.postgresql.org/docs/current/static/textsearch-dictionaries.html).
I mean, building up a synonym dictionary containing typos. Eg:

postgres    pgsql
postgresql  pgsql
postgrez    pgsql
postgre     pgsql
gogle       googl
gooogle     googl

There are multiple ways to build such a dictionary. But my question is
about the implementation of dictionaries in postgresql: is postgresql
supposed to take advantage of billion-entry dictionaries ?

Thanks in advance for your answers,
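[Editor's sketch of the synonym-dictionary idea above. The file name
'typo_syn' is hypothetical and would have to exist as typo_syn.syn under
$SHAREDIR/tsearch_data, one "misspelling replacement" pair per line.]

```sql
-- typo_syn.syn might contain lines such as:
--   postgrez pgsql
--   gooogle  googl
CREATE TEXT SEARCH DICTIONARY typo_syn (
    TEMPLATE = synonym,
    SYNONYMS = 'typo_syn'
);

-- Check the mapping for a misspelling:
SELECT ts_lexize('typo_syn', 'postgrez');
```

As noted elsewhere in the thread, the synonym template is not designed
for very large numbers of entries, so this works for a curated typo list
rather than billions of rows.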