Re: [GENERAL] Full Text Search combined with Fuzzy

2017-03-03 Thread Artur Zakirov

On 03.03.2017 16:17, Nicolas Paris wrote:

Nice! I do have version 9.6.

Would this kind of index be able to handle more than 20M large texts? The
recheck condition looks resource-consuming.


You are right. I think pg_trgm will not be good for such large texts, 
unfortunately.




Is the full text index + phrase search + synonym dictionary the only
other alternative for typo-tolerant phrase mining?


I suppose there are no other options now. Though prefix search may 
help you [1].




Is there any possibility of adding typo tolerance to the full text
search road-map in the future?


As far as I know, there are no plans in the near future to add 
similarity-based full text search.


1. 
https://www.postgresql.org/docs/current/static/textsearch-dictionaries.html
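A minimal sketch of the prefix-search idea, for context (the table "docs" and its column "body" are hypothetical, not from this thread):

```sql
-- Hypothetical example: index documents for full text search, then match
-- on a word prefix with to_tsquery's ":*" notation.
CREATE INDEX ON docs USING GIN (to_tsvector('simple', body));

SELECT body
FROM docs
WHERE to_tsvector('simple', body) @@ to_tsquery('simple', 'postgr:*');
```

This matches "postgres", "postgresql", "postgrez", and so on, since they all share the indexed prefix.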


--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Full Text Search combined with Fuzzy

2017-03-03 Thread Nicolas Paris
On 03 Mar 2017 at 14:08, Artur Zakirov wrote:
> Hello,
> 
> If you use PostgreSQL 9.6, then word_similarity() can help you [1]. For
> example:
> 
> postgres=# SELECT texts, word_similarity('hello word', texts) FROM test_trgm
> WHERE 'hello word' <% texts;
>                    texts                   | word_similarity
> -------------------------------------------+-----------------
>  blah blah blah hello world blah blah blah |        0.818182
>  blah blah blah hello word blah blah blah  |               1
> (2 rows)
> 
> 1. https://www.postgresql.org/docs/9.6/static/pgtrgm.html
> 

Nice! I do have version 9.6.

Would this kind of index be able to handle more than 20M large texts? The
recheck condition looks resource-consuming.

Is the full text index + phrase search + synonym dictionary the only
other alternative for typo-tolerant phrase mining?

Is there any possibility of adding typo tolerance to the full text
search road-map in the future?

Thanks,



Re: [GENERAL] Full Text Search combined with Fuzzy

2017-03-03 Thread Artur Zakirov

On 03.03.2017 15:49, Nicolas Paris wrote:


Hi Oleg,

Thanks. I thought pg_trgm was not able to index my long texts because of
the limitation of 8191 bytes per index row for btree.

Then I found out it is possible to use pg_trgm with a GIN/GiST index.
My final use case is phrase mining in texts.

I want my application to return texts that contain approximately the
user's entry:

E.g.: the user searches "Hello Word";
a text containing "blah blah blah hello world blah blah blah" would be
returned.

Test:
postgres=# CREATE TABLE test_trgm (texts text);
CREATE TABLE
postgres=# CREATE INDEX ON test_trgm USING GIN (texts gin_trgm_ops);
CREATE INDEX
postgres=# SET enable_seqscan = OFF;
SET
postgres=# INSERT INTO test_trgm VALUES ('blah blah blah hello world blah blah blah');
INSERT 0 1
postgres=# INSERT INTO test_trgm VALUES ('blah blah blah hello word blah blah blah');
INSERT 0 1
postgres=# SELECT texts, similarity(texts, 'hello word') FROM test_trgm WHERE texts % 'hello word';
                   texts                   | similarity
-------------------------------------------+------------
 blah blah blah hello world blah blah blah |   0.473684
 blah blah blah hello word blah blah blah  |     0.6875
(2 rows)

postgres=# EXPLAIN SELECT texts, similarity(texts, 'hello word') FROM test_trgm WHERE texts % 'hello word';
                                     QUERY PLAN
------------------------------------------------------------------------------------
 Bitmap Heap Scan on test_trgm  (cost=52.01..56.03 rows=1 width=32)
   Recheck Cond: (texts % 'hello word'::text)
   ->  Bitmap Index Scan on test_trgm_texts_idx  (cost=0.00..52.01 rows=1 width=0)
         Index Cond: (texts % 'hello word'::text)
(4 rows)

Conclusion: if I set 0.4 as my threshold, would this methodology meet
my requirements?

Thanks for the help!



Hello,

If you use PostgreSQL 9.6, then word_similarity() can help you [1]. For 
example:


postgres=# SELECT texts, word_similarity('hello word', texts) FROM test_trgm WHERE 'hello word' <% texts;
                   texts                   | word_similarity
-------------------------------------------+-----------------
 blah blah blah hello world blah blah blah |        0.818182
 blah blah blah hello word blah blah blah  |               1
(2 rows)

1. https://www.postgresql.org/docs/9.6/static/pgtrgm.html
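A possible follow-up to the example above, sketched under the assumption that the pg_trgm.word_similarity_threshold parameter (the cutoff used by the <% operator, default 0.6) behaves as described in the 9.6 pg_trgm docs:

```sql
-- Lower the <% cutoff so weaker word matches also qualify,
-- then rerun the query from the example above.
SET pg_trgm.word_similarity_threshold = 0.4;

SELECT texts, word_similarity('hello word', texts)
FROM test_trgm
WHERE 'hello word' <% texts;
```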

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company




Re: [GENERAL] Full Text Search combined with Fuzzy

2017-03-03 Thread Nicolas Paris
On 27 Feb 2017 at 10:32, Oleg Bartunov wrote:
> 
> 
> On Sun, Feb 26, 2017 at 3:52 PM, Nicolas Paris  wrote:
> 
> Hello,
> 
> AFAIK there is no built-in way to combine full text search and fuzzy
> matching
> (https://www.postgresql.org/docs/current/static/fuzzystrmatch.html).
> For example, phrase searching with typos in it.
> 
> First, I don't know whether PostgreSQL's competitors (Lucene-based...)
> are able to do so.
> 
> 
> Usually, https://www.postgresql.org/docs/current/static/pgtrgm.html is used 
> for
> this.
> 

Hi Oleg,

Thanks. I thought pg_trgm was not able to index my long texts because of
the limitation of 8191 bytes per index row for btree.

Then I found out it is possible to use pg_trgm with a GIN/GiST index.
My final use case is phrase mining in texts.

I want my application to return texts that contain approximately the
user's entry:

E.g.: the user searches "Hello Word";
a text containing "blah blah blah hello world blah blah blah" would be
returned.

Test:
postgres=# CREATE TABLE test_trgm (texts text);
CREATE TABLE
postgres=# CREATE INDEX ON test_trgm USING GIN (texts gin_trgm_ops);
CREATE INDEX
postgres=# SET enable_seqscan = OFF;
SET
postgres=# INSERT INTO test_trgm VALUES ('blah blah blah hello world blah blah blah');
INSERT 0 1
postgres=# INSERT INTO test_trgm VALUES ('blah blah blah hello word blah blah blah');
INSERT 0 1
postgres=# SELECT texts, similarity(texts, 'hello word') FROM test_trgm WHERE texts % 'hello word';
                   texts                   | similarity
-------------------------------------------+------------
 blah blah blah hello world blah blah blah |   0.473684
 blah blah blah hello word blah blah blah  |     0.6875
(2 rows)

postgres=# EXPLAIN SELECT texts, similarity(texts, 'hello word') FROM test_trgm WHERE texts % 'hello word';
                                     QUERY PLAN
------------------------------------------------------------------------------------
 Bitmap Heap Scan on test_trgm  (cost=52.01..56.03 rows=1 width=32)
   Recheck Cond: (texts % 'hello word'::text)
   ->  Bitmap Index Scan on test_trgm_texts_idx  (cost=0.00..52.01 rows=1 width=0)
         Index Cond: (texts % 'hello word'::text)
(4 rows)

Conclusion: if I set 0.4 as my threshold, would this methodology meet
my requirements?

Thanks for the help!
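The 0.4 threshold from the conclusion can be applied to the % operator; a sketch using pg_trgm's set_limit()/show_limit() functions:

```sql
-- Set the similarity cutoff used by the % operator to 0.4; with the data
-- above, both rows (0.473684 and 0.6875) then pass the % filter.
SELECT set_limit(0.4);
SELECT show_limit();  -- confirm the new cutoff
```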


> 
> Second, is such a feature on the road map?
> 
> Third, I wonder if it is a good idea to use the PostgreSQL synonym
> feature for such a purpose (https://www.postgresql.org/docs/current/static/
> textsearch-dictionaries.html).
> I mean, building up a synonym dictionary containing typos, e.g.:
> 
> postgres        pgsql
> postgresql      pgsql
> postgrez        pgsql
> postgre         pgsql
> gogle           googl
> gooogle         googl
> 
> There are multiple ways to build such a dictionary. But my question is about
> the implementation of dictionaries in PostgreSQL: is PostgreSQL
> able to take advantage of billion-entry dictionaries?
> 
> 
> A dictionary is just a program, so it's up to the developer how to write an
> efficient program to deal with a billion entries. The synonym dictionary
> specifically is not intended to work with a lot of entries. BTW, have a look
> at the contrib/dict_xsyn dictionary, which is more flexible than synonym.
> 
> 
> Thanks in advance for your answers,
>
> 




Re: [GENERAL] Full Text Search combined with Fuzzy

2017-02-27 Thread Oleg Bartunov
On Sun, Feb 26, 2017 at 3:52 PM, Nicolas Paris  wrote:

> Hello,
>
> AFAIK there is no built-in way to combine full text search and fuzzy
> matching
> (https://www.postgresql.org/docs/current/static/fuzzystrmatch.html).
> For example, phrase searching with typos in it.
>
> First, I don't know whether PostgreSQL's competitors (Lucene-based...)
> are able to do so.
>

Usually, https://www.postgresql.org/docs/current/static/pgtrgm.html is used
for this.
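A minimal sketch of trying the pg_trgm suggestion (assumes the contrib extension is installed on the server):

```sql
-- Enable pg_trgm and inspect how it decomposes strings into trigrams.
CREATE EXTENSION IF NOT EXISTS pg_trgm;

SELECT show_trgm('hello word');                  -- the trigram set of the query
SELECT similarity('hello word', 'hello world');  -- trigram-overlap score in [0, 1]
```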

>
> Second, is such a feature on the road map?
>
> Third, I wonder if it is a good idea to use the PostgreSQL synonym
> feature for such a purpose (https://www.postgresql.org/docs/current/
> static/textsearch-dictionaries.html).
> I mean, building up a synonym dictionary containing typos, e.g.:
>
> postgres    pgsql
> postgresql  pgsql
> postgrez    pgsql
> postgre     pgsql
> gogle       googl
> gooogle     googl
>
> There are multiple ways to build such a dictionary. But my question is about
> the implementation of dictionaries in PostgreSQL: is PostgreSQL
> able to take advantage of billion-entry dictionaries?
>

A dictionary is just a program, so it's up to the developer how to write an
efficient program to deal with a billion entries. The synonym dictionary
specifically is not intended to work with a lot of entries. BTW, have a
look at the contrib/dict_xsyn dictionary, which is more flexible than synonym.
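A sketch of wiring up dict_xsyn; the rules file name "typo_rules" is hypothetical (it would live in $SHAREDIR/tsearch_data/typo_rules.rules, with each line listing a word followed by its synonyms):

```sql
-- Hypothetical dict_xsyn setup: map misspellings to a canonical word.
CREATE EXTENSION dict_xsyn;

CREATE TEXT SEARCH DICTIONARY typo_dict (
    TEMPLATE = xsyn_template,
    RULES = 'typo_rules'
);

-- ts_lexize shows what the dictionary produces for a given token.
SELECT ts_lexize('typo_dict', 'postgrez');
```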

>
> Thanks in advance for your answers,
>
>


[GENERAL] Full Text Search combined with Fuzzy

2017-02-26 Thread Nicolas Paris
Hello,

AFAIK there is no built-in way to combine full text search and fuzzy matching
(https://www.postgresql.org/docs/current/static/fuzzystrmatch.html).
For example, phrase searching with typos in it.

First, I don't know whether PostgreSQL's competitors (Lucene-based...)
are able to do so.

Second, is such a feature on the road map?

Third, I wonder if it is a good idea to use the PostgreSQL synonym
feature for such a purpose
(https://www.postgresql.org/docs/current/static/textsearch-dictionaries.html).
I mean, building up a synonym dictionary containing typos, e.g.:

postgres    pgsql
postgresql  pgsql
postgrez    pgsql
postgre     pgsql
gogle       googl
gooogle     googl

There are multiple ways to build such a dictionary. But my question is about
the implementation of dictionaries in PostgreSQL: is PostgreSQL
able to take advantage of billion-entry dictionaries?

Thanks in advance for your answers,

