Re: [GENERAL] store key name pattern search

2017-04-04 Thread Oleg Bartunov
On Tue, Apr 4, 2017 at 11:41 AM, Armand Pirvu (home)  wrote:

> Hi
>
> I have the following case
>
>
> select * from foo;
>                                  col1
> ------------------------------------------------------------------
>  "show_id"=>"1", "group_id"=>"32", "group_name"=>"slb",
> "group_add_by"=>"557651"
>  "show_id"=>"2", "group_id"=>"33", "group_name"=>"slc",
> "item_add_by"=>"557652"
> (2 rows)
>
> Is there anyway I can do a pattern search by hstore key name something like
>
> select * from foo where skeys(col1) like '%add_by%';
>
> I looked on the doc but did not see anything , or did I miss it ?
>

No. You may convert the column to jsonb and use jsquery:
https://github.com/postgrespro/jsquery
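If installing an extension is not an option, a plain-SQL workaround is
possible. A minimal sketch, assuming col1 is of type hstore and using
hstore's skeys() set-returning function (needs 9.3+ for the lateral
reference to foo.col1):

-- expand each row's keys and pattern-match them in a correlated EXISTS
SELECT *
FROM foo
WHERE EXISTS (
    SELECT 1
    FROM skeys(foo.col1) AS k
    WHERE k LIKE '%add_by%'
);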

>
> Thanks
>
> Armand
>
>
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general
>


Re: [GENERAL] Incremental / Level -1 backup in PG

2017-03-22 Thread Oleg Bartunov
On Wed, Mar 22, 2017 at 3:27 AM, Rakesh Kumar 
wrote:

> PG does not have a concept of incremental  backup. The way it works in
> Oracle and other RDBMS is that incremental backup only backups up changed
> blocks since the last full backup.  So if only 10% of blocks changed since
> the last full backup, incremental backup will be only for 10%.
> I am wondering whether it is technically feasible to implement it like
> this:
>

Have you seen pg_probackup? It's in the development stage and needs somebody
to test it.



>
> 1 - At the time of full backup, note the last modified time of each data
> file in a repository.
> 2 - Next time when incremental backup runs, for every data file it will
> check the last modified time of it with the one in the repository to
> determine whether it has changed since last full backup. If yes, back it up.
>
> Now on to restore:
>
> 1 - First restore full backup.
> 2 - Restore incremental backup.
>
> My question: Will it work in PG?
>
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general
>


Re: [GENERAL] Full Text Search combined with Fuzzy

2017-02-27 Thread Oleg Bartunov
On Sun, Feb 26, 2017 at 3:52 PM, Nicolas Paris  wrote:

> Hello,
>
> AFAIK there is no built-in way to combine full text search and fuzzy
> matching
> (https://www.postgresql.org/docs/current/static/fuzzystrmatch.html).
> For example, phrase searching with typos in it.
>
> First I don't know if postgresql concurrents (lucene based...) are able
> to do so.
>

Usually, https://www.postgresql.org/docs/current/static/pgtrgm.html is used
for this.
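A minimal sketch of the usual pg_trgm approach (the table docs and its
column title are hypothetical):

CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- trigram index to speed up similarity search
CREATE INDEX docs_title_trgm_idx ON docs USING gin (title gin_trgm_ops);

-- find titles that roughly match a misspelled word;
-- % is pg_trgm's similarity operator
SELECT title, similarity(title, 'postgrez') AS sml
FROM docs
WHERE title % 'postgrez'
ORDER BY sml DESC
LIMIT 10;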

>
> Second, is such feature is in the road map ?
>
> Third, I wonder if it is a good idea to use the postgresql synonyms
> feature for such a purpose (https://www.postgresql.org/docs/current/
> static/textsearch-dictionaries.html).
> I mean, building up a synonyms dictionary containing typos. E.g.:
>
> postgres    pgsql
> postgresql  pgsql
> postgrez    pgsql
> postgre pgsql
> gogle   googl
> gooogle googl
>
> There are multiple ways to build such a dictionary. But my question is about
> the implementation of dictionaries in postgresql: Is postgresql
> supposed to take advantage of billion-entry dictionaries ?
>

A dictionary is just a program, so it's up to the developer to write an
efficient program that can deal with billions of entries. The synonym
dictionary specifically is not intended to work with a lot of entries. Btw,
have a look at contrib/dict_xsyn, which is more flexible than synonym.
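A minimal dict_xsyn sketch (the dictionary and rules-file names are
hypothetical; each line of the .rules file in $SHAREDIR/tsearch_data maps
the first word to the synonyms that follow it):

CREATE EXTENSION dict_xsyn;

-- assumes a file my_typos.rules containing, e.g.:
--   pgsql postgres postgresql postgrez postgre
CREATE TEXT SEARCH DICTIONARY my_typos (
    TEMPLATE = xsyn_template,
    RULES = 'my_typos',
    KEEPORIG = true
);

SELECT ts_lexize('my_typos', 'pgsql');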

>
> Thanks by advance for you answers,
>
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general
>


Re: [GENERAL] Searching array for multiple items

2017-01-25 Thread Oleg Bartunov
On Wed, Jan 25, 2017 at 11:29 AM, Alex Magnum  wrote:

> Hi,
> I can search an array with 1 = ANY('{1,3,4,7}'::int[])
>
> I need to check for one or multiple items in the array.
>
> e.g. '1,7,3'  = ANY('{1,3,4,7}'::int[]
>
> I do need to check if
> a) all items exist in the array
> b) at least one item exists in the array
>
> Is there an operator that allows me to do these two?
> Does the order of left and right side matter?
> Right now I have a small function but I guess there is a more efficient
> way.
>

Look at contrib/intarray.
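A minimal sketch with the core array operators, both of which intarray can
index (e.g. with the gin__int_ops opclass):

-- a) all items must exist in the array: containment
SELECT ARRAY[1,3,4,7] @> ARRAY[1,7,3];   -- true

-- b) at least one item must exist in the array: overlap
SELECT ARRAY[1,3,4,7] && ARRAY[1,7,3];   -- true

For @> the order of the sides matters (the left operand must contain the
right one); && is symmetric.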


>
> Thanks for any help.
>
> A
>
>
>


Re: [GENERAL] Incorrect information about GIN-index in RUM's docs

2016-11-14 Thread Oleg Bartunov
On Tue, Nov 15, 2016 at 10:37 AM, Andreas Joseph Krogh 
wrote:

> This item isn't valid, is it?
>
>
>- There is no phrase search with a GIN index. This problem relates to the
>previous problem. Position information is needed to perform phrase search.
>
> RUM being an extension, and having index access methods as extensions being
> new in 9.6, where phrase search was introduced, this isn't really true, or
> am I missing something?
>

RUM is very good for phrase search, since it has access to the positions
stored in the index; check the slides from PGConf.EU (
http://www.sai.msu.su/~megera/postgres/talks/pgconfeu-fts-2016.pdf). Where
did you find that citation?
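A minimal sketch of phrase search over a rum index (the table docs and its
tsvector column fts are hypothetical; phraseto_tsquery is new in 9.6):

CREATE EXTENSION rum;
CREATE INDEX docs_fts_rum_idx ON docs USING rum (fts rum_tsvector_ops);

-- <-> in the resulting tsquery requires adjacent lexemes, in order
SELECT * FROM docs
WHERE fts @@ phraseto_tsquery('english', 'long live fts');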



>
> --
> *Andreas Joseph Krogh*
> CTO / Partner - Visena AS
> Mobile: +47 909 56 963
> andr...@visena.com
> www.visena.com
> 
>


Re: [GENERAL] Trigram is slow when 10m rows

2016-11-13 Thread Oleg Bartunov
On Sun, Nov 13, 2016 at 5:25 PM, Aaron Lewis <the.warl0ck.1...@gmail.com>
wrote:

> Thanks Oleg.
>
> I've increased work_mem to 128MB, now the query falls down to 1.7s,
> faster but still not good enough.
>
> Is there any other thing I can do about it?
>

Your query 'x264' is short in terms of the number of trigrams, so a trigram
index isn't a good fit. Have you tried text_pattern_ops with a btree?
Something like:

create index title_btree_idx on mytable
    using btree (title text_pattern_ops);
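Since the example pattern contains no wildcards, the ILIKE amounts to a
case-insensitive equality test, so a minimal alternative sketch is a plain
btree over lower(title):

CREATE INDEX title_lower_idx ON mytable (lower(title));

SELECT * FROM mytable WHERE lower(title) = lower('x264');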

>
> test=# explain analyze select * from mytable where title ilike 'x264';
>                               QUERY PLAN
> ------------------------------------------------------------------
>  Bitmap Heap Scan on mytable  (cost=462.69..5639.67 rows=1380
> width=83) (actual time=1754.656..1754.656 rows=0 loops=1)
>Recheck Cond: (title ~~* 'x264'::text)
>Rows Removed by Index Recheck: 1220793
>Heap Blocks: exact=197567
>->  Bitmap Index Scan on title_trgm_idx  (cost=0.00..462.35
> rows=1380 width=0) (actual time=346.663..346.663 rows=1220793 loops=1)
>  Index Cond: (title ~~* 'x264'::text)
>  Planning time: 1.168 ms
>  Execution time: 1755.944 ms
>
>
> On Sun, Nov 13, 2016 at 10:04 PM, Oleg Bartunov <obartu...@gmail.com>
> wrote:
> >
> >
> > On Sun, Nov 13, 2016 at 2:54 PM, Aaron Lewis <the.warl0ck.1...@gmail.com
> >
> > wrote:
> >>
> >> I have a simple table with Trigram index,
> >>
> >> create table mytable(hash char(40), title text);
> >> create index title_trgm_idx on mytable using gin(title gin_trgm_ops);
> >>
> >> When I run a query with 10m rows, it uses the Trigram index, but takes
> >> 3s to execute, very slow.
> >> (I have 80m rows, but only inserted 10m for testing purpose)
> >>
> >> test=# select count(*) from mytable;
> >>   count
> >> --
> >>  13971887
> >> (1 row)
> >>
> >> test=# explain analyze select * from mytable where title ilike 'x264';
> >>                           QUERY PLAN
> >> ----------------------------------------------------------------
> >>  Bitmap Heap Scan on mytable  (cost=462.69..5639.67 rows=1380
> >> width=83) (actual time=2937.308..2937.308 rows=0 loops=1)
> >>Recheck Cond: (title ~~* 'x264'::text)
> >>Rows Removed by Index Recheck: 11402855
> >>Heap Blocks: exact=39557 lossy=158010
> >>->  Bitmap Index Scan on title_trgm_idx  (cost=0.00..462.35
> >> rows=1380 width=0) (actual time=342.440..342.440 rows=1220793 loops=1)
> >>  Index Cond: (title ~~* 'x264'::text)
> >>  Planning time: 0.611 ms
> >>  Execution time: 2937.729 ms
> >> (8 rows)
> >>
> >> Any ideas to speed things up?
> >
> >
> >Rows Removed by Index Recheck: 11402855
> >Heap Blocks: exact=39557 lossy=158010
> >
> > You need to increase work_mem
> >>
> >>
> >> --
> >> Best Regards,
> >> Aaron Lewis - PGP: 0x13714D33 - http://pgp.mit.edu/
> >> Finger Print:   9F67 391B B770 8FF6 99DC  D92D 87F6 2602 1371 4D33
> >>
> >>
> >> --
> >> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> >> To make changes to your subscription:
> >> http://www.postgresql.org/mailpref/pgsql-general
> >
> >
>
>
>
> --
> Best Regards,
> Aaron Lewis - PGP: 0x13714D33 - http://pgp.mit.edu/
> Finger Print:   9F67 391B B770 8FF6 99DC  D92D 87F6 2602 1371 4D33
>


Re: [GENERAL] Why is this query not using GIN index?

2016-11-13 Thread Oleg Bartunov
On Sun, Nov 13, 2016 at 6:05 PM, Aaron Lewis <the.warl0ck.1...@gmail.com>
wrote:

> Sigh, didn't notice that. Thanks for the heads up.
>
> It takes 500ms with 10m rows, could it be faster?
>

Sure. The recheck with a function call is pretty expensive, so I'd
recommend against the functional index: just create a separate column of
type tsvector (materialize the to_tsvector result) and create a gin index
on it. You'll be surprised.
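A minimal sketch of the materialized-column approach (trigger-based, using
the built-in tsvector_update_trigger):

ALTER TABLE mytable ADD COLUMN title_tsv tsvector;
UPDATE mytable SET title_tsv = to_tsvector('english', title);
CREATE INDEX title_tsv_idx ON mytable USING gin (title_tsv);

-- keep the column up to date on INSERT/UPDATE
CREATE TRIGGER title_tsv_update
    BEFORE INSERT OR UPDATE ON mytable
    FOR EACH ROW EXECUTE PROCEDURE
    tsvector_update_trigger(title_tsv, 'pg_catalog.english', title);

-- queries against title_tsv then skip the per-row to_tsvector() call
SELECT * FROM mytable WHERE title_tsv @@ to_tsquery('x264');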


> I've increased work_mem to 256MB
>
> test=# explain analyze select * from mytable where
> to_tsvector('english', title) @@ 'x264'::tsquery limit 1000 offset 10;
>                               QUERY PLAN
> ------------------------------------------------------------------
>  Limit  (cost=684.06..2949.42 rows=1000 width=83) (actual
> time=348.506..536.483 rows=1000 loops=1)
>->  Bitmap Heap Scan on mytable  (cost=661.41..158917.22 rows=69859
> width=83) (actual time=345.354..536.199 rows=1010 loops=1)
>  Recheck Cond: (to_tsvector('english'::regconfig, title) @@
> '''x264'''::tsquery)
>  Rows Removed by Index Recheck: 12242
>  Heap Blocks: exact=20 lossy=186
>  ->  Bitmap Index Scan on name_fts  (cost=0.00..643.95
> rows=69859 width=0) (actual time=333.703..333.703 rows=1044673
> loops=1)
>Index Cond: (to_tsvector('english'::regconfig, title)
> @@ '''x264'''::tsquery)
>  Planning time: 0.144 ms
>  Execution time: 537.212 ms
> (9 rows)
>
> On Sun, Nov 13, 2016 at 10:33 PM, Julien Rouhaud
> <julien.rouh...@dalibo.com> wrote:
> > On 13/11/2016 15:26, Aaron Lewis wrote:
> >> Hi Oleg,
> >>
> >> Can you elaborate on the title column? I don't get it.
> >>
> >
> >>>> create table mytable(hash char(40), title varchar(500));
> >>>> create index name_fts on mytable using gin(to_tsvector('english',
> >>>> 'title'));
> >
> > You created an index on the text 'title', not on the title column, so
> > the index is useless.
> >
> > Drop the existing index and create this one instead:
> >
> > create index name_fts on mytable using gin(to_tsvector('english',
> title));
> >
> >> On Sun, Nov 13, 2016 at 10:10 PM, Oleg Bartunov <obartu...@gmail.com>
> wrote:
> >>>
> >>>
> >>> On Sun, Nov 13, 2016 at 2:50 PM, Aaron Lewis <
> the.warl0ck.1...@gmail.com>
> >>> wrote:
> >>>>
> >>>> I have a simple table, and a gin index,
> >>>>
> >>>> create table mytable(hash char(40), title varchar(500));
> >>>> create index name_fts on mytable using gin(to_tsvector('english',
> >>>> 'title'));
> >>>
> >>>
> >>>
> >>> ^
> >>>
> >>>>
> >>>> create unique index md5_uniq_idx on mytable(hash);
> >>>>
> >>>> When I execute a query with tsquery, the GIN index was not in use:
> >>>>
> >>>> test=# explain analyze select * from mytable where
> >>>> to_tsvector('english', title) @@ 'abc | def'::tsquery limit 10;
> >>>>                          QUERY PLAN
> >>>> --------------------------------------------------------------
> >>>>  Limit  (cost=0.00..277.35 rows=10 width=83) (actual
> >>>> time=0.111..75.549 rows=10 loops=1)
> >>>>->  Seq Scan on mytable  (cost=0.00..381187.45 rows=13744 width=83)
> >>>> (actual time=0.110..75.546 rows=10 loops=1)
> >>>>  Filter: (to_tsvector('english'::regconfig, (title)::text) @@
> >>>> '''abc'' | ''def'''::tsquery)
> >>>>  Rows Removed by Filter: 10221
> >>>>  Planning time: 0.176 ms
> >>>>  Execution time: 75.564 ms
> >>>> (6 rows)
> >>>>
> >>>> Any ideas?
> >>>>
> >
> > --
> > Julien Rouhaud
> > http://dalibo.com - http://dalibo.org
>
>
>
> --
> Best Regards,
> Aaron Lewis - PGP: 0x13714D33 - http://pgp.mit.edu/
> Finger Print:   9F67 391B B770 8FF6 99DC  D92D 87F6 2602 1371 4D33
>


Re: [GENERAL] Why is this query not using GIN index?

2016-11-13 Thread Oleg Bartunov
On Sun, Nov 13, 2016 at 2:50 PM, Aaron Lewis 
wrote:

> I have a simple table, and a gin index,
>
> create table mytable(hash char(40), title varchar(500));
> create index name_fts on mytable using gin(to_tsvector('english',
> 'title'));
>


                                                          ^
(note the quotes: the index was built on the string literal 'title',
not on the title column, so no query on the column can ever use it)
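The fix, as a sketch, is to rebuild the index on the column itself:

DROP INDEX name_fts;
CREATE INDEX name_fts ON mytable
    USING gin (to_tsvector('english', title));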


> create unique index md5_uniq_idx on mytable(hash);
>
> When I execute a query with tsquery, the GIN index was not in use:
>
> test=# explain analyze select * from mytable where
> to_tsvector('english', title) @@ 'abc | def'::tsquery limit 10;
>                               QUERY PLAN
> ------------------------------------------------------------------
>  Limit  (cost=0.00..277.35 rows=10 width=83) (actual
> time=0.111..75.549 rows=10 loops=1)
>->  Seq Scan on mytable  (cost=0.00..381187.45 rows=13744 width=83)
> (actual time=0.110..75.546 rows=10 loops=1)
>  Filter: (to_tsvector('english'::regconfig, (title)::text) @@
> '''abc'' | ''def'''::tsquery)
>  Rows Removed by Filter: 10221
>  Planning time: 0.176 ms
>  Execution time: 75.564 ms
> (6 rows)
>
> Any ideas?
>
>
> --
> Best Regards,
> Aaron Lewis - PGP: 0x13714D33 - http://pgp.mit.edu/
> Finger Print:   9F67 391B B770 8FF6 99DC  D92D 87F6 2602 1371 4D33
>
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general
>


Re: [GENERAL] Trigram is slow when 10m rows

2016-11-13 Thread Oleg Bartunov
On Sun, Nov 13, 2016 at 2:54 PM, Aaron Lewis 
wrote:

> I have a simple table with Trigram index,
>
> create table mytable(hash char(40), title text);
> create index title_trgm_idx on mytable using gin(title gin_trgm_ops);
>
> When I run a query with 10m rows, it uses the Trigram index, but takes
> 3s to execute, very slow.
> (I have 80m rows, but only inserted 10m for testing purpose)
>
> test=# select count(*) from mytable;
>   count
> --
>  13971887
> (1 row)
>
> test=# explain analyze select * from mytable where title ilike 'x264';
>                               QUERY PLAN
> ------------------------------------------------------------------
>  Bitmap Heap Scan on mytable  (cost=462.69..5639.67 rows=1380
> width=83) (actual time=2937.308..2937.308 rows=0 loops=1)
>Recheck Cond: (title ~~* 'x264'::text)
>Rows Removed by Index Recheck: 11402855
>Heap Blocks: exact=39557 lossy=158010
>->  Bitmap Index Scan on title_trgm_idx  (cost=0.00..462.35
> rows=1380 width=0) (actual time=342.440..342.440 rows=1220793 loops=1)
>  Index Cond: (title ~~* 'x264'::text)
>  Planning time: 0.611 ms
>  Execution time: 2937.729 ms
> (8 rows)
>
> Any ideas to speed things up?
>

   Rows Removed by Index Recheck: 11402855
   Heap Blocks: exact=39557 lossy=158010

You need to increase work_mem
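The lossy heap blocks mean the bitmap no longer fits in work_mem, so the
heap pages have to be rechecked row by row. A hedged sketch (the exact
value is workload-dependent):

SET work_mem = '128MB';
EXPLAIN ANALYZE SELECT * FROM mytable WHERE title ILIKE 'x264';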

>
> --
> Best Regards,
> Aaron Lewis - PGP: 0x13714D33 - http://pgp.mit.edu/
> Finger Print:   9F67 391B B770 8FF6 99DC  D92D 87F6 2602 1371 4D33
>
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general
>


Re: [GENERAL] Gin indexes on intarray is fast when value in array does not exists, and slow, when value exists

2016-11-08 Thread Oleg Bartunov
On Tue, Nov 8, 2016 at 8:43 PM, otar shavadze  wrote:

> I have table with 500 000 rows, I have int[] column "my_array" in this
> table, this array column contains minimum 1 and maximum 5 different values.
>

You didn't show us your Postgres version.


>
> I have GIN index on my_array column:
>
> * "CREATE INDEX idx ON table_name USING GIN(my_array gin__int_ops)"*
>
> Then I use this query: "*SELECT * FROM  table_name WHERE my_array @>
> '{3}'::integer[]  ORDER BY id LIMIT 50"*
>
> Execution time of this query is approximately 500-1000 ms. Then if I drop
> gin index "*idx*", query works extremely fast, less than 20 ms.
>

EXPLAIN ANALYZE output would help us see the problem.
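Something like the following, run for both the fast and the slow value,
would show where the time goes:

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM table_name
WHERE my_array @> '{3}'::integer[]
ORDER BY id
LIMIT 50;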



>
> But if I search for a value which does not exist at all, for example no
> array contains the number "77" and I search: "WHERE my_array @>
> '{77}'::integer[]", then using the gin index is much better and fast (less
> than 20 ms), but without the index the query takes 500-1000 ms.
>
>
> So, what to do? For values which are not in any row, using the index is
> much better, but for values which are in at least several rows, using the
> index slows down performance.
>
> Can I somehow make the search always fast (both when the value exists in
> an array and when not)?
>
>
>


Re: [GENERAL] ZSON, PostgreSQL extension for compressing JSONB

2016-10-04 Thread Oleg Bartunov
On Tue, Oct 4, 2016 at 4:20 PM, Simon Riggs  wrote:

> On 30 September 2016 at 16:58, Aleksander Alekseev
>  wrote:
>
> > I've just uploaded ZSON extension on GitHub:
> >
> > https://github.com/afiskon/zson
> >
> > ZSON learns on your common JSONB documents and creates a dictionary
> > with strings that are frequently used in all documents. After that you
> > can use ZSON type to compress documents using this dictionary. When
> > documents schema evolve and compression becomes inefficient you can
> > re-learn on new documents. New documents will be compressed with a new
> > dictionary, old documents will be decompressed using old dictionary.
> >
> > In some cases ZSON can save half of your disk space and give you about
> > 10% more TPS. Everything depends on your data and workload though.
> > Memory is saved as well. For more details see README.md.
> >
> > Please don't hesitate to ask any questions. Any feedback and pull
> > requests are welcome too!
>
> Very good. Oleg had mentioned that dictionary compression was being
> considered.
>
> It would be useful to be able to define compression dictionaries for
> many use cases.
>


Dictionary compression is a different project, we'll publish it after
testing.


>
> Will you be submitting this to core?
>
> --
> Simon Riggshttp://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general
>


Re: [GENERAL] Results in ERROR: syntax error in tsquery:

2016-09-19 Thread Oleg Bartunov
On Mon, Sep 19, 2016 at 4:19 AM, Oleg Bartunov <obartu...@gmail.com> wrote:
> On Mon, Sep 19, 2016 at 3:46 AM, Kiran <bangalore.ki...@gmail.com> wrote:
>> Dear All,
>>
>> I have a ts_vector column in question table called weighted_tsv.
>> I am trying to search using ts_query as follows
>>
>> SELECT *
>>
>> FROM question
>>
>> WHERE weighted_tsv @@ to_tsquery('Hur ofta');
>
> You should use to_tsquery('Hur & ofta')! Read
> https://www.postgresql.org/docs/9.5/static/datatype-textsearch.html

or use plainto_tsquery('Hur ofta')
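A minimal sketch of the difference:

-- explicit tsquery syntax: & is required between lexemes
SELECT to_tsquery('Hur & ofta');      -- 'hur' & 'ofta'

-- plainto_tsquery parses plain text and ANDs the words itself
SELECT plainto_tsquery('Hur ofta');   -- 'hur' & 'ofta'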

>
>>
>>
>> But the query results in an error as follows:
>>
>> ERROR:  function ts_query(unknown) does not exist
>>
>> LINE 1: ...* from question where weighted_tsv @@ ts_query('...
>>
>> HINT:  No function matches the given name and argument types. You might need
>> to add explicit type casts.
>>
>>
>> What is the reason I am getting this error I am not able to figure it out.
>>
>> It would be of great help if anyone point me into right direction.
>>
>>
>> Thanks
>>
>> Kiran
>>
>>
>>
>>


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Results in ERROR: syntax error in tsquery:

2016-09-19 Thread Oleg Bartunov
On Mon, Sep 19, 2016 at 3:46 AM, Kiran  wrote:
> Dear All,
>
> I have a ts_vector column in question table called weighted_tsv.
> I am trying to search using ts_query as follows
>
> SELECT *
>
> FROM question
>
> WHERE weighted_tsv @@ to_tsquery('Hur ofta');

You should use to_tsquery('Hur & ofta')! Read
https://www.postgresql.org/docs/9.5/static/datatype-textsearch.html

>
>
> But the query results in an error as follows:
>
> ERROR:  function ts_query(unknown) does not exist
>
> LINE 1: ...* from question where weighted_tsv @@ ts_query('...
>
> HINT:  No function matches the given name and argument types. You might need
> to add explicit type casts.
>
>
> What is the reason I am getting this error I am not able to figure it out.
>
> It would be of great help if anyone point me into right direction.
>
>
> Thanks
>
> Kiran
>
>
>
>


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Predicting query runtime

2016-09-13 Thread Oleg Bartunov
On Tue, Sep 13, 2016 at 2:54 PM, Istvan Soos  wrote:
> On Tue, Sep 13, 2016 at 2:06 AM, Vinicius Segalin  
> wrote:
>> 2016-09-12 18:22 GMT-03:00 Istvan Soos :
>>> At Heap we have non-trivial complexity in our analytical queries, and
>>> some of them can take a long time to complete. We did analyze features
>>> like the query planner's output, our query properties (type,
>>> parameters, complexity) and tried to automatically identify factors
>>> that contribute the most into the total query time. It turns out that
>>> you don't need to use machine learning for the basics, but at this
>>> point we were not aiming for predictions yet.
>>
>> And how did you do that? Manually analyzing some queries?
>
> In this case, it was automatic analysis and feature discovery. We were
> generating features out of our query parameters, out of the SQL
> string, and also out of the explain analyze output. For each of these
> features, we have examined the P(query is slow | feature is present),
> and measured its statistical properties (precision, recall,
> correlations...).
>
> With these we have built a decision tree-based partitioning, where our
> feature-predicates divided the queries into subsets. Such a tree could
> be used for predictions, or if we would like to be fancy, we could use
> the feature vectors to train a neural network.

FYI, please check https://pgconf.ru/2016/89977

>
> Hope this helps for now,
>   Istvan
>
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


[GENERAL] Fwd: [Snowball-discuss] Greek stemmer

2016-08-25 Thread Oleg Bartunov
This is a chance to add a default text search configuration for the Greek
language, if somebody with good knowledge could follow this development.

Oleg

-- Forwarded message --
From: Oleg Smirnov 
Date: Thu, Aug 25, 2016 at 5:26 PM
Subject: [Snowball-discuss] Greek stemmer
To: "snowball-discu." 


Hi all,

 I have implemented a stemmer for Modern Greek language [1] based on a
thesis by G. Ntais [2] with improvements proposed by S. Saroukos [3]

 I'm pretty new to Snowball so it will be great if someone could
review my code. Any feedback is much appreciated.

 1. https://github.com/snowballstem/snowball/pull/44
 2. http://sais.se/mthprize/2007/ntais2007.pdf
 3. http://tampub.uta.fi/bitstream/handle/10024/80480/gradu03463.pdf

--
Regards,
Oleg Smirnov

___
Snowball-discuss mailing list
snowball-disc...@lists.tartarus.org
http://lists.tartarus.org/mailman/listinfo/snowball-discuss


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Updated RUM-index and support for bigint as part of index

2016-08-25 Thread Oleg Bartunov
Andreas,

sorry for the delay.
It looks like a bug to me; could you please share your dataset with me, so
I can reproduce the behaviour?

Regards,
Oleg

On Sun, Aug 7, 2016 at 11:05 AM, Andreas Joseph Krogh <andr...@visena.com>
wrote:

> På søndag 07. august 2016 kl. 08:27:06, skrev Oleg Bartunov <
> obartu...@gmail.com>:
>
> [snip]
> have you considered <=| and |=> operators ? <=> in ORDER BY works like KNN.
>
>
> I don't get how these operators should work. Neither give me the expected
> results.
>
> Using <=>
>
>
> SELECT del.entity_id,
>        del.folder_id,
>        del.received_timestamp
> FROM origo_email_delivery del
> WHERE del.fts_all @@ to_tsquery('simple', 'andreas:*:*')
> ORDER BY '2000-01-01' :: TIMESTAMP <=> del.received_timestamp
> LIMIT 10;
>
>  entity_id | folder_id |   received_timestamp
> -----------+-----------+-------------------------
>   1224278 |   1068087 | 2015-08-17 23:53:26
>   1224382 |   1068087 | 2015-08-18 03:07:55
>   1224404 |   1068087 | 2015-08-18 03:49:02
>   1505713 | 48496 | 2015-10-27 14:51:45
>142132 | 66658 | 2012-12-03 14:14:05.488
>122565 | 90115 | 2012-11-20 15:41:04.936
>200744 | 66655 | 2013-01-28 21:47:44.561
>   1445927 |888665 | 2015-09-29 00:26:56
>123671 | 83509 | 2012-11-21 14:16:26.448
>   1129928 | 66658 | 2015-05-09 08:39:14.128
> (10 rows)
>
>
> Using <=|
>
> SELECT del.entity_id,
>        del.folder_id,
>        del.received_timestamp
> FROM origo_email_delivery del
> WHERE del.fts_all @@ to_tsquery('simple', 'andreas:*:*')
> ORDER BY '2000-01-01' :: TIMESTAMP <=| del.received_timestamp
> LIMIT 10;
>
>
>  entity_id | folder_id |   received_timestamp
> -----------+-----------+-------------------------
>   1224278 |   1068087 | 2015-08-17 23:53:26
>   1224382 |   1068087 | 2015-08-18 03:07:55
>   1224404 |   1068087 | 2015-08-18 03:49:02
>   1505713 | 48496 | 2015-10-27 14:51:45
>142132 | 66658 | 2012-12-03 14:14:05.488
>122565 | 90115 | 2012-11-20 15:41:04.936
>200744 | 66655 | 2013-01-28 21:47:44.561
>   1445927 |888665 | 2015-09-29 00:26:56
>123671 | 83509 | 2012-11-21 14:16:26.448
>   1129928 | 66658 | 2015-05-09 08:39:14.128
> (10 rows)
>
>
> Neither are ordered by received_timestamp
>
> Can you explain how to get ORDER BY received_timestamp DESC?
>
> Thanks.
>
> --
> *Andreas Joseph Krogh*
> CTO / Partner - Visena AS
> Mobile: +47 909 56 963
> andr...@visena.com
> www.visena.com
> <https://www.visena.com>
>
>


Re: [GENERAL] Updated RUM-index and support for bigint as part of index

2016-08-07 Thread Oleg Bartunov
On Tue, Aug 2, 2016 at 9:08 PM, Andreas Joseph Krogh 
wrote:

> Hi.
>
> I see the RUM-index is updated, which is great!
>
> I wonder, to be able to sort by timestamp one has to create the index like
> this:
>
>
> CREATE INDEX rumidx ON origo_email_delivery USING rum (fts_all 
> rum_tsvector_timestamp_ops, received_timestamp)
> WITH (attach = 'received_timestamp', TO = 'fts_all', order_by_attach = 
> TRUE );
>
> Then, to be able to use the index for sorting by the
> "received_timestamp"-column one has to issue a query like this:
>
> EXPLAIN ANALYZE SELECT del.entity_id,
> del.subject,
> del.received_timestamp,
> fts_all <=> to_tsquery('simple', 'andreas') AS rank
> FROM origo_email_delivery del
> WHERE del.fts_all @@ to_tsquery('simple', 'andreas')
> ORDER BY '2000-01-01' :: TIMESTAMP <=> del.received_timestamp
> LIMIT 10;
>
>                               QUERY PLAN
> ------------------------------------------------------------------
>  Limit  (cost=14.40..26.47 rows=10 width=89) (actual time=10.908..10.952 
> rows=10 loops=1)
>->  Index Scan using rumidx on origo_email_delivery del  
> (cost=14.40..3221.22 rows=2657 width=89) (actual time=10.906..10.947 rows=10 
> loops=1)
>  Index Cond: (fts_all @@ '''andreas'' & ''kr'''::tsquery)
>  Order By: (received_timestamp <=> '2000-01-01 00:00:00'::timestamp 
> without time zone)
>  Planning time: 0.491 ms
>  Execution time: 11.010 ms
> (6 rows)
>
>
> The ORDER BY part seems strange; It seems one has to find a value
> "lower than any other value" to use as a kind of base, why is this
> necessary? It also seems that in order to be able to sort DESC one has to
> provide a timestamp value "higher than any other value", is this correct?
>

Have you considered the <=| and |=> operators? <=> in ORDER BY works like KNN.


>
> It would be great if the docs explained this.
>
> I really miss the opportunity to include a BIGINT as part of the index, so
> that the WHERE-clause could be like this:
>
> WHERE del.fts_all @@ to_tsquery('simple', 'andreas') AND del.folder_id IN
> (1,2,3)
>
> Having this would be perfect for my use-case searching in email in
> folders, sorted by received_date, and having it use ONE index.
>
> Will this be supported?
>
> Thanks.
>
> --
> *Andreas Joseph Krogh*
> CTO / Partner - Visena AS
> Mobile: +47 909 56 963
> andr...@visena.com
> www.visena.com
> 
>


Re: [GENERAL] jsonb search

2016-06-28 Thread Oleg Bartunov
On Tue, Jun 28, 2016 at 5:15 PM, Armand Pirvu (home)
 wrote:
> Hi
>
> In my quest of JSONB querying and searching without having to actually cast
> into a text, I found JSQuery
>
> I do admit my JSONB knowledge shortcoming and I am not a developer but a
> DBA. As such some examples would be greatly appreciated since I tend to
> understand better
>
> I compiled and installed the extension
>
> 1 - Exact matching without knowing the hierarchy, just the key and element,
> I built a set like
>
> col1 |   col2
> --+--
>1 | {"Home Email": {"EmailAddress": "1...@yahoo.com"}}
>2 | {"Home Email": {"EmailAddress": "2...@yahoo.com"}}
>3 | {"Home Email": {"EmailAddress": "3...@yahoo.com"}}
>
>
> JSQuqery is super
>
> SELECT * FROM test1 WHERE col2 @@ '*.EmailAddress = "1...@yahoo.com"';
>
> Now I can do a performance boost using
>
> CREATE INDEX idx1 ON test1 USING GIN (col2 jsonb_value_path_ops);
>
> I see this yield
>
> from
>
> testdb=# explain analyze SELECT * FROM test1 WHERE col2 @@ '*.EmailAddress
> = "1...@yahoo.com"';
> Seq Scan on test1  (cost=0.00..12423.00 rows=500 width=68) (actual
> time=0.016..160.777 rows=1 loops=1)
>   Filter: (col2 @@ '*."EmailAddress" = "1...@yahoo.com"'::jsquery)
>   Rows Removed by Filter: 49
> Planning time: 0.042 ms
> Execution time: 160.799 ms
> (5 rows)
>
>
> to
>
> testdb-# SELECT * FROM test1 WHERE col2 @@ '*.EmailAddress =
> "1...@yahoo.com"';
> Bitmap Heap Scan on test1  (cost=31.88..1559.32 rows=500 width=68) (actual
> time=0.018..0.019 rows=1 loops=1)
>   Recheck Cond: (col2 @@ '*."EmailAddress" = "1...@yahoo.com"'::jsquery)
>   Heap Blocks: exact=1
>   ->  Bitmap Index Scan on idx1  (cost=0.00..31.75 rows=500 width=0) (actual
> time=0.011..0.011 rows=1 loops=1)
> Index Cond: (col2 @@ '*."EmailAddress" = "1...@yahoo.com"'::jsquery)
> Planning time: 0.039 ms
> Execution time: 0.038 ms
> (7 rows)
>
> A whopping 4000 times improvement
>
>
>
>
> But I also noticed a vodka index
>
>
> testdb=# CREATE INDEX idx2 ON
> testdb-# test1 USING vodka (col2);
> ERROR:  access method "vodka" does not exist
>
> What am I missing ?
>
> 2 - Is there anyway I can accomplish a pattern and/or case insensitive
> search using JSQuery similar to
>
>
> select * from test2 where upper((col2 -> 'Home Email') ->> 'EmailAddress')
> ilike '%3%YAH%';
>
> select * from test2 where (col2 -> 'Home Email') ->> 'EmailAddress' like
> '%3%yah%';
>
>
> If so what indexing strategy can be used to have similar gains as above ?
>
>
> Many thanks for any help

Vodka is our experimental prototype of a next-generation access method,
and it doesn't exist in production-ready form. You can
check our presentation
http://www.sai.msu.su/~megera/postgres/talks/highload-2014-vodka.pdf
to understand jsquery's limitations and why we stopped its development.
Also, 2 years ago I wrote (in Russian)
http://obartunov.livejournal.com/179422.html about the jsonb query
language and our plans. Google Translate might help:

https://translate.google.com/translate?sl=auto&tl=en&u=http%3A%2F%2Fobartunov.livejournal.com%2F179422.html
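Regarding question 2: as far as I know, jsquery has no pattern or
case-insensitive matching, but a trigram index over the extracted text can
serve such LIKE/ILIKE queries. A minimal sketch:

CREATE EXTENSION IF NOT EXISTS pg_trgm;

CREATE INDEX idx_email_trgm ON test2
    USING gin (((col2 -> 'Home Email') ->> 'EmailAddress') gin_trgm_ops);

-- this can now use the trigram index:
SELECT * FROM test2
WHERE (col2 -> 'Home Email') ->> 'EmailAddress' ILIKE '%3%yah%';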


>
>
> Armand
>


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] questions about how to implement a gist index

2016-06-27 Thread Oleg Bartunov
On Tue, Jun 28, 2016 at 12:44 AM, Riccardo Vianello
 wrote:
> Hi all,
>
> I'm trying to contribute some improvements to the implementation of a gist
> index that is part of an open source project and it would be really nice if
> anyone could help me answer some questions.
>
> I would like to use different data structures to represent the internal and
> leaf entries. I think I have found several examples in the postgresql source
> code where a variable length array is used to pack different data
> consistently with the value of a "type" attribute. Is this the suggested way
> to proceed, or would it be also possible to use two actually different data
> structures? I at first considered this latter approach, but some erroneous
> behavior suggested that maybe I am not fully clear about which data
> structure should be produced or expected by some parts of the
> implementation.
>
> Could you please also help me understand the difference (if any) between
> using the GIST_LEAF macro or the leafkey attribute of the GISTENTRY data
> structure?

Yes, this is confusing.

GIST_LEAF is TRUE if the entry is on a leaf page.

The bool leafkey, when TRUE, indicates that the key holds a value taken
directly from the heap (rather than a union key built by the index).

>
> Thank you very much in advance,
>
> Riccardo
>
>


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Updated RUM?

2016-06-27 Thread Oleg Bartunov
On Fri, Jun 24, 2016 at 2:36 PM, Andreas Joseph Krogh 
wrote:

> Hi.
>
> @PostgresPro; Will there be any updates to the rum git-repo any time soon?
>

Sure.


>
> I understand you're working hard on this, care to share some
> vision/roadmap? I'm specifically interested in if RUM will support the same
> datatypes as GIN supports (arrays, JSON etc.), and what opclasses you plan
> to implement (for equalty and sorting). Will there be some kind of
> btree_rum like there is for btree_gin, to add RUM-support for the types
> there exists Btree-support for?
>

We are testing the fts+timestamp combination right now.  The challenge is to
have the posting list/tree sorted by addinfo (the timestamp), not by item
pointer. It works already, but we are not quite satisfied with the generated
WAL traffic and are researching some options for improving posting tree
building and generic WAL.

We have in mind some sort of generalization like anyarray, but don't have
any real ideas yet.



>
> Thanks.
>
> --
> *Andreas Joseph Krogh*
> CTO / Partner - Visena AS
> Mobile: +47 909 56 963
> andr...@visena.com
> www.visena.com
> 
>


Re: [GENERAL] Question about RUM-index

2016-06-17 Thread Oleg Bartunov
On Fri, Jun 17, 2016 at 4:50 PM, Andreas Joseph Krogh <andr...@visena.com>
wrote:

> På fredag 17. juni 2016 kl. 15:47:08, skrev Oleg Bartunov <
> obartu...@gmail.com>:
>
>
>
> On Fri, Jun 17, 2016 at 3:52 PM, Andreas Joseph Krogh <andr...@visena.com>
> wrote:
>>
>> På fredag 17. juni 2016 kl. 13:53:34, skrev Oleg Bartunov <
>> obartu...@gmail.com>:
>>
>>
>>
>> On Fri, Jun 17, 2016 at 2:10 PM, Oleg Bartunov <obartu...@gmail.com>
>> wrote:
>>>
>>>
>>>
>>> On Fri, Jun 17, 2016 at 9:32 AM, Andreas Joseph Krogh <
>>> andr...@visena.com> wrote:
>>>>
>>>> På torsdag 16. juni 2016 kl. 00:50:45, skrev Jeff Janes <
>>>> jeff.ja...@gmail.com>:
>>>>
>>>> On Wed, Jun 15, 2016 at 3:56 AM, Andreas Joseph Krogh <
>>>> andr...@visena.com> wrote:
>>>>>
>>>>> Hi.
>>>>>
>>>>> First; Is this the correct forum to ask questions about the Postgres
>>>>> Pro's new RUM-index?
>>>>>
>>>>> If not, please point me to the right forum.
>>>>>
>>>>
>>>> I think that https://github.com/postgrespro/rum/issues might be the
>>>> best forum.
>>>>
>>>>
>>>> Oleg and friends; Should we use GitHub-issues as forum (one issue per
>>>> question/thread?), pgsql-general or something else?
>>>>
>>>
>>> Andreas,
>>>
>>> we are hardly working on our internal version of rum and will open it
>>> after resolving some issues. I think the best place to discuss it is
>>> -hackers.
>>>
>>
>> Ah, as someone corrected me, we are working hard !
>>
>>
>> He he, I figured that was what you meant:-)
>>
>> Ok, so basically - will RUM-index support the same indexing-properties as
>> GIN (being able to index tsvector, BIGINT-arrays, JSONB etc.) *and* be
>> able to use index for sorting on ie. timestamp, tsrank or some
>> BIGINT-column?
>>
>> Like my example, will it be possible to issue a query like this:
>>
>>
>> SELECT del.id
>> , del.sent
>> FROM delivery del
>> WHERE 1 = 1  AND del.fts_all @@ 
>> to_tsquery('simple', 'hi:*')
>>   AND del.folder_id = ANY(ARRAY[2,3]::BIGINT[])  
>>   ORDER BY  del.sent DESC LIMIT 101 OFFSET 0;
>>
>>
>> and have it use *one* RUM-index so the whole operation, including
>> sorting, is as efficient as possible?
>>
>>
>
> we have implementation for timestamp. One need to write opclass to deal
> with arrays, it shouldn't be difficult.
>
>
> Will the opclass dealing with bigint-arrays also handle the single-element
> case, that is only one bigint?
>

This is up to the opclass author.



>
> --
> *Andreas Joseph Krogh*
> CTO / Partner - Visena AS
> Mobile: +47 909 56 963
> andr...@visena.com
> www.visena.com
> <https://www.visena.com>
>
>


Re: [GENERAL] Question about RUM-index

2016-06-17 Thread Oleg Bartunov
On Fri, Jun 17, 2016 at 3:52 PM, Andreas Joseph Krogh <andr...@visena.com>
wrote:

> På fredag 17. juni 2016 kl. 13:53:34, skrev Oleg Bartunov <
> obartu...@gmail.com>:
>
>
>
> On Fri, Jun 17, 2016 at 2:10 PM, Oleg Bartunov <obartu...@gmail.com>
> wrote:
>>
>>
>>
>> On Fri, Jun 17, 2016 at 9:32 AM, Andreas Joseph Krogh <andr...@visena.com
>> > wrote:
>>>
>>> På torsdag 16. juni 2016 kl. 00:50:45, skrev Jeff Janes <
>>> jeff.ja...@gmail.com>:
>>>
>>> On Wed, Jun 15, 2016 at 3:56 AM, Andreas Joseph Krogh <
>>> andr...@visena.com> wrote:
>>>>
>>>> Hi.
>>>>
>>>> First; Is this the correct forum to ask questions about the Postgres
>>>> Pro's new RUM-index?
>>>>
>>>> If not, please point me to the right forum.
>>>>
>>>
>>> I think that https://github.com/postgrespro/rum/issues might be the
>>> best forum.
>>>
>>>
>>> Oleg and friends; Should we use GitHub-issues as forum (one issue per
>>> question/thread?), pgsql-general or something else?
>>>
>>
>> Andreas,
>>
>> we are hardly working on our internal version of rum and will open it
>> after resolving some issues. I think the best place to discuss it is
>> -hackers.
>>
>
> Ah, as someone corrected me, we are working hard !
>
>
> He he, I figured that was what you meant:-)
>
> Ok, so basically - will RUM-index support the same indexing-properties as
> GIN (being able to index tsvector, BIGINT-arrays, JSONB etc.) *and* be
> able to use index for sorting on ie. timestamp, tsrank or some
> BIGINT-column?
>
> Like my example, will it be possible to issue a query like this:
>
>
> SELECT del.id
> , del.sent
> FROM delivery del
> WHERE 1 = 1  AND del.fts_all @@ 
> to_tsquery('simple', 'hi:*')
>   AND del.folder_id = ANY(ARRAY[2,3]::BIGINT[])   
>  ORDER BY  del.sent DESC LIMIT 101 OFFSET 0;
>
>
> and have it use *one* RUM-index so the whole operation, including
> sorting, is as efficient as possible?
>
>

We have an implementation for timestamp. One needs to write an opclass to
deal with arrays; it shouldn't be difficult.



> Thanks.
>
> --
> *Andreas Joseph Krogh*
> CTO / Partner - Visena AS
> Mobile: +47 909 56 963
> andr...@visena.com
> www.visena.com
> <https://www.visena.com>
>
>


Re: [GENERAL] Question about RUM-index

2016-06-17 Thread Oleg Bartunov
On Fri, Jun 17, 2016 at 2:10 PM, Oleg Bartunov <obartu...@gmail.com> wrote:

>
>
> On Fri, Jun 17, 2016 at 9:32 AM, Andreas Joseph Krogh <andr...@visena.com>
> wrote:
>
>> På torsdag 16. juni 2016 kl. 00:50:45, skrev Jeff Janes <
>> jeff.ja...@gmail.com>:
>>
>> On Wed, Jun 15, 2016 at 3:56 AM, Andreas Joseph Krogh <andr...@visena.com
>> > wrote:
>>>
>>> Hi.
>>>
>>> First; Is this the correct forum to ask questions about the Postgres
>>> Pro's new RUM-index?
>>>
>>> If not, please point me to the right forum.
>>>
>>
>> I think that https://github.com/postgrespro/rum/issues might be the best
>> forum.
>>
>>
>> Oleg and friends; Should we use GitHub-issues as forum (one issue per
>> question/thread?), pgsql-general or something else?
>>
>
> Andreas,
>
> we are hardly working on our internal version of rum and will open it
> after resolving some issues. I think the best place to discuss it is
> -hackers.
>

Ah, as someone corrected me, we are working hard !



>
>
>
>>
>>
>> Note that GIN does almost what I want, except use the index when sorting
>>> by "sent"-timestamp.
>>>
>>> So I wonder if RUM can do any better?
>>> What I don't understand is how to have "folder_id" as part of the
>>> RUM-index so that I can search in *an array* of folders using the
>>> index, *AND* have the whole result sorted by "sent"-timestamp also
>>> using the RUM-index.
>>>
>>
>> I think you would have to implement an operator for integers for RUM much
>> like btree_gin does for GIN.  Sorry don't know how to do that, except to
>> say look in the RUM code to see how it does it for time-stamps.
>>
>>
>>>
>>> In the (limited) documentation sorting using timestamp is done like this:
>>>
>>> ORDER BY sent <-> '2000-01-01'::TIMESTAMP
>>>
>>> which I don't understand; Why must one specify a value here, and how
>>> does that value affect the result?
>>>
>>
>>
>> This is essentially identical to ORDER BY ABS(sent -
>> '2000-01-01'::TIMESTAMP);  except it can use the index.
>>
>> So maybe pick a constant outside the range of possible values, and use
>> that as one argument to <->.
>>
>>
>> This should be unnecessary and hidden from the user. Maybe some "ORDER BY
>> rum_timestamp(sent)" or something could abstract away stuff to make it much
>> clearer to the user?
>> --
>> *Andreas Joseph Krogh*
>> CTO / Partner - Visena AS
>> Mobile: +47 909 56 963
>> andr...@visena.com
>> www.visena.com
>> <https://www.visena.com>
>>
>>
>
>


Re: [GENERAL] Question about RUM-index

2016-06-17 Thread Oleg Bartunov
On Fri, Jun 17, 2016 at 9:32 AM, Andreas Joseph Krogh 
wrote:

> På torsdag 16. juni 2016 kl. 00:50:45, skrev Jeff Janes <
> jeff.ja...@gmail.com>:
>
> On Wed, Jun 15, 2016 at 3:56 AM, Andreas Joseph Krogh 
> wrote:
>>
>> Hi.
>>
>> First; Is this the correct forum to ask questions about the Postgres
>> Pro's new RUM-index?
>>
>> If not, please point me to the right forum.
>>
>
> I think that https://github.com/postgrespro/rum/issues might be the best
> forum.
>
>
> Oleg and friends; Should we use GitHub-issues as forum (one issue per
> question/thread?), pgsql-general or something else?
>

Andreas,

we are hardly working on our internal version of rum and will open it after
resolving some issues. I think the best place to discuss it is -hackers.



>
>
> Note that GIN does almost what I want, except use the index when sorting
>> by "sent"-timestamp.
>>
>> So I wonder if RUM can do any better?
>> What I don't understand is how to have "folder_id" as part of the
>> RUM-index so that I can search in *an array* of folders using the index,
>> *AND* have the whole result sorted by "sent"-timestamp also using the
>> RUM-index.
>>
>
> I think you would have to implement an operator for integers for RUM much
> like btree_gin does for GIN.  Sorry don't know how to do that, except to
> say look in the RUM code to see how it does it for time-stamps.
>
>
>>
>> In the (limited) documentation sorting using timestamp is done like this:
>>
>> ORDER BY sent <-> '2000-01-01'::TIMESTAMP
>>
>> which I don't understand; Why must one specify a value here, and how does
>> that value affect the result?
>>
>
>
> This is essentially identical to ORDER BY ABS(sent -
> '2000-01-01'::TIMESTAMP);  except it can use the index.
>
> So maybe pick a constant outside the range of possible values, and use
> that as one argument to <->.
>
>
> This should be unnecessary and hidden from the user. Maybe some "ORDER BY
> rum_timestamp(sent)" or something could abstract away stuff to make it much
> clearer to the user?
> --
> *Andreas Joseph Krogh*
> CTO / Partner - Visena AS
> Mobile: +47 909 56 963
> andr...@visena.com
> www.visena.com
> 
>
>


Re: [GENERAL] Slides for PGCon2016; "FTS is dead ? Long live FTS !"

2016-05-31 Thread Oleg Bartunov
On Sun, May 29, 2016 at 8:53 PM, Andreas Joseph Krogh <andr...@visena.com>
wrote:

> På søndag 29. mai 2016 kl. 19:49:06, skrev Oleg Bartunov <
> obartu...@gmail.com>:
>
> [snip]
>>
>> I want to run 9.6 beta in production right now because of this:-)
>>
>
> wait-wait :)  We'd be happy to have feedback from production, of course,
> but please, wait a bit. We are adding support of sorting posting list/tree
> not by item pointer as in gin, but make use of additional information, for
> example, timestamp, which will provide additional speedup to the existing
> one.
>
>
> Awesome!
>
>
>
> Also, we are sure there are some bugs :)
>
>
> He he, I reported 1st issue: https://github.com/postgrespro/rum/issues/1
>
> Would be cool to see this fixed so I actually could have a sip of the
> rum:-)
>


It's not easy to fix this. We don't want rum to depend on btree_gin, so
probably the easiest way is to have a separate <=> operator in rum.

>
> --
> *Andreas Joseph Krogh*
> CTO / Partner - Visena AS
> Mobile: +47 909 56 963
> andr...@visena.com
> www.visena.com
> <https://www.visena.com>
>
>


Re: [GENERAL] Slides for PGCon2016; "FTS is dead ? Long live FTS !"

2016-05-30 Thread Oleg Bartunov
On Sun, May 29, 2016 at 12:59 AM, Oleg Bartunov <obartu...@gmail.com> wrote:

>
>
> On Thu, May 26, 2016 at 11:26 PM, Andreas Joseph Krogh <andr...@visena.com
> > wrote:
>
>> Hi.
>>
>> Any news about when slides for $subject will be available?
>>
>
> I submitted slides to pgcon site, but it usually takes awhile, so you can
> download our presentation directly
> http://www.sai.msu.su/~megera/postgres/talks/pgcon-2016-fts.pdf
>
>
Please download the new version of the slides. I added CREATE INDEX
commands to the examples.



> There are some missing features in rum index, but I hope we'll update
> github repository really soon.
>
>
>>
>> --
>> *Andreas Joseph Krogh*
>> CTO / Partner - Visena AS
>> Mobile: +47 909 56 963
>> andr...@visena.com
>> www.visena.com
>> <https://www.visena.com>
>>
>
>


Re: [GENERAL] Slides for PGCon2016; "FTS is dead ? Long live FTS !"

2016-05-30 Thread Oleg Bartunov
On Sun, May 29, 2016 at 10:04 PM, Karsten Hilbert
 wrote:
>>> I submitted slides to pgcon site, but it usually takes awhile, so you can
>>> download our presentation directly
>>> http://www.sai.msu.su/~megera/postgres/talks/pgcon-2016-fts.pdf
>
> Looking at slide 39 (attached) I get the impression that I
> should be able to do the following:
>
>
> - turn a coding system (say, ICD-10) into a dictionary
>   by splitting the terms into single words
>
> say, "diabetes mellitus -> "diabetes", "mellitus"
>
> - define stop words like "left", "right", ...
>
> say, "fracture left ulna" -> the "left" doesn't
> matter as far as coding is concerned
>
> - also turn that coding system into queries by splitting
>   the terms into single words, concatenating them
>   with "&", and setting the ICD 10 code as tag on them
>
> say, "diabetes mellitus" -> "diabetes & mellitus [E11]"
>
> - run an inverse FTS (FQS) against a user supplied string
>   thereby finding queries (= tags = ICD10 codes) likely
>   relevant to the input
>
> say, to_tsvector("patient was suspected to suffer from diabetes 
> mellitus")
> -> tag = E11
>
>
> Possible, not possible, insane, unintended use ?

Why not? It's the same kind of usage I showed on slide #39.

create table icd10 (q tsquery, code text);
insert into icd10 values(to_tsquery('diabetes & mellitus'), '[E11]');
select * from icd10 where to_tsvector('patient was suspected to suffer
from diabetes mellitus') @@ q;
           q           | code
-----------------------+-------
 'diabet' & 'mellitus' | [E11]
(1 row)



>
> Thanks,
> Karsten
> --
> GPG key ID E4071346 @ eu.pool.sks-keyservers.net
> E167 67FD A291 2BEA 73BD  4537 78B9 A9F9 E407 1346
>
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general
>


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Slides for PGCon2016; "FTS is dead ? Long live FTS !"

2016-05-29 Thread Oleg Bartunov
On Sun, May 29, 2016 at 2:43 PM, Stefan Keller <sfkel...@gmail.com> wrote:

> Hi,
>
> Nice work from you postgrespro.ru guys! Especially the RUM index which
> demonstrates the power of 9.6 to let third party SW create access methods
> as extension: https://github.com/postgrespro/rum
>
> 1. I don't understand the benchmarks on slide 25 "20 mln descriptions"
> (and the one before "6.7 mln classifieds"): What does "Queries in 8 h 9.2
> +patch (9.6 rum)" mean?
>

We ran the queries for 8 hours and recorded the number of executed queries.
Four years ago, when Alexander and I developed an initial version of the
patch, we got the results marked "9.2+patch"; now we have run the same
queries on the same database and put the rum results in parentheses. I'd
not read too much into these numbers, since we used queries from the 6 mln
database. We'd be happy if somebody ran independent benchmarks.


>
> 2. What does R-U-M mean? (can't mean "Range Usage Metadata" which was
> finally coined range index BRIN)?
>


We chose RUM just because there are already GIN and VODKA :) But some
people have already suggested several meanings, like Really Useful iNdex :)
We are open to suggestions.


>
> :Stefan, co-organizer of Swiss PGDay
>
>
> 2016-05-29 11:29 GMT+02:00 Andreas Joseph Krogh <andr...@visena.com>:
>
>> På lørdag 28. mai 2016 kl. 23:59:55, skrev Oleg Bartunov <
>> obartu...@gmail.com>:
>>
>>
>>
>> On Thu, May 26, 2016 at 11:26 PM, Andreas Joseph Krogh <
>> andr...@visena.com> wrote:
>>>
>>> Hi.
>>>
>>> Any news about when slides for $subject will be available?
>>>
>>
>> I submitted slides to pgcon site, but it usually takes awhile, so you can
>> download our presentation directly
>> http://www.sai.msu.su/~megera/postgres/talks/pgcon-2016-fts.pdf
>>
>> There are some missing features in rum index, but I hope we'll update
>> github repository really soon.
>>
>>
>> This is simply amazing!
>>
>> I want to run 9.6 beta in production right now because of this:-)
>>
>> Hats off guys, congrats to PostgresPro, and huge thanks!!
>>
>> --
>> *Andreas Joseph Krogh*
>> CTO / Partner - Visena AS
>> Mobile: +47 909 56 963
>> andr...@visena.com
>> www.visena.com
>> <https://www.visena.com>
>>
>>
>
>


Re: [GENERAL] Slides for PGCon2016; "FTS is dead ? Long live FTS !"

2016-05-29 Thread Oleg Bartunov
On Sun, May 29, 2016 at 12:29 PM, Andreas Joseph Krogh <andr...@visena.com>
wrote:

> På lørdag 28. mai 2016 kl. 23:59:55, skrev Oleg Bartunov <
> obartu...@gmail.com>:
>
>
>
> On Thu, May 26, 2016 at 11:26 PM, Andreas Joseph Krogh <andr...@visena.com
> > wrote:
>>
>> Hi.
>>
>> Any news about when slides for $subject will be available?
>>
>
> I submitted slides to pgcon site, but it usually takes awhile, so you can
> download our presentation directly
> http://www.sai.msu.su/~megera/postgres/talks/pgcon-2016-fts.pdf
>
> There are some missing features in rum index, but I hope we'll update
> github repository really soon.
>
>
> This is simply amazing!
>
> I want to run 9.6 beta in production right now because of this:-)
>

wait-wait :)  We'd be happy to have feedback from production, of course,
but please, wait a bit. We are adding support of sorting posting list/tree
not by item pointer as in gin, but make use of additional information, for
example, timestamp, which will provide additional speedup to the existing
one. Also, we are sure there are some bugs :)


>
> Hats off guys, congrats to PostgresPro, and huge thanks!!
>
> --
> *Andreas Joseph Krogh*
> CTO / Partner - Visena AS
> Mobile: +47 909 56 963
> andr...@visena.com
> www.visena.com
> <https://www.visena.com>
>
>


Re: [GENERAL] Slides for PGCon2016; "FTS is dead ? Long live FTS !"

2016-05-28 Thread Oleg Bartunov
On Thu, May 26, 2016 at 11:26 PM, Andreas Joseph Krogh 
wrote:

> Hi.
>
> Any news about when slides for $subject will be available?
>

I submitted slides to pgcon site, but it usually takes awhile, so you can
download our presentation directly
http://www.sai.msu.su/~megera/postgres/talks/pgcon-2016-fts.pdf

There are some missing features in rum index, but I hope we'll update
github repository really soon.


>
> --
> *Andreas Joseph Krogh*
> CTO / Partner - Visena AS
> Mobile: +47 909 56 963
> andr...@visena.com
> www.visena.com
> 
>


Re: [GENERAL] Ascii Elephant for text based protocols - Final

2016-05-16 Thread Oleg Bartunov
On Mon, May 16, 2016 at 2:47 PM, Charles Clavadetscher
 wrote:
> SELECT row_dat FROM elephant ORDER BY row_num;


Very good ! Thanks !


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Insert only table and size of GIN index JSONB field.

2016-05-03 Thread Oleg Bartunov
On Tue, May 3, 2016 at 1:35 PM, Maxim Boguk  wrote:

>
>
> On Tue, May 3, 2016 at 8:09 PM, Maxim Boguk  wrote:
>
>> Hi,
>>
>> I started with empty table with index over
>>  custom_fields | jsonb
>> field
>> defined as:
>> "idx_learners_custom_fields" gin (custom_fields)
>> Globally gin_pending_list_limit set to 2MB.
>> Database version is 9.5.2.
>>
>> Now question:
>> If table populated with 1M records in single transaction then the final
>> size of the GIN index is:
>> 4265 MB
>> but after I performed reindex index idx_learners_custom_fields;
>> the index size had been reduced 15x to 295 MB.
>>
>> Is this behavior expected?
>>
>
> It's look easy to reproduce.
> I can send a sample dataset for analyze.
>

Sample data are always welcome!



>
> drop table if exists test;
> create table test (custom_fields jsonb);
> create index test_gin_key on test USING GIN(custom_fields);
> insert into test select custom_fields from public.learners;
> INSERT 0 100
> \di+ test_gin_key
>                           List of relations
>  Schema |     Name     | Type  |  Owner   | Table |  Size   | Description
> --------+--------------+-------+----------+-------+---------+-------------
>  public | test_gin_key | index | postgres | test  | 4211 MB |
>
> reindex index test_gin_key;
>
> \di+ test_gin_key
>                          List of relations
>  Schema |     Name     | Type  |  Owner   | Table |  Size  | Description
> --------+--------------+-------+----------+-------+--------+-------------
>  public | test_gin_key | index | postgres | test  | 295 MB |
>
>
> ​
>
>
> --
> Maxim Boguk
> Senior Postgresql DBA
> http://www.postgresql-consulting.com/
>
> Phone RU: +7 910 405 4718
> Phone AU: +61 45 218 5678
>
> LinkedIn: http://www.linkedin.com/pub/maksym-boguk/80/b99/b1b
> Skype: maxim.boguk
> Jabber: maxim.bo...@gmail.com
>
> "People problems are solved with people.
> If people cannot solve the problem, try technology.
> People will then wish they'd listened at the first stage."
>
>


Re: [GENERAL] postgresql & Fulltext & ranking & my own functions

2016-05-03 Thread Oleg Bartunov
On Tue, May 3, 2016 at 3:21 PM, Nicolas Paris  wrote:

> Hello,
>
> Documentation says : (
> http://www.postgresql.org/docs/9.5/static/textsearch-controls.html#TEXTSEARCH-RANKING
> )
> "The built-in ranking functions are only examples. You can write your own
> ranking functions and/or combine their results with additional factors to
> fit your specific needs."
>
> The build-in seems not enough for my use case. How to write my own ranking
> functions ?
>
> I would like to create something like  ts_rank_cd( tsv, 'foo & bar',
> my_new_normalization_integer)
>
>
You can look at src/backend/utils/adt/tsrank.c and write your own C
function. Google can help:
https://www.google.ru/search?q=how+to+write+c+function+postgresql


> 
> pg 9.5
>


Re: [GENERAL] JSONB performance enhancement for 9.6

2016-05-01 Thread Oleg Bartunov
On Sun, May 1, 2016 at 6:46 AM, Tom Smith  wrote:

> Hello:
>
> I'd like to bring this JSONB performance issue again.
> Below is a link of MySQL way of storing/retrieving Json key/value
>
> https://dev.mysql.com/doc/refman/5.7/en/json.html
>
> Instead of providing column indexing(like GIN for JSONB in Postgresql).
> it provides only internal data structure level indexing within each
> individual json object
> for fast retrieval.  compression is not used.
>
> Perhaps without implementing  complicated column level GIN indexing,
> implementing
> a new variant JSON type that only handle  individual json object indexing
> would be
> feasible?  Combined with current JSONB implementation,   both common use
> cases
> (one is global doc indexing, the other is fast retrieval of individual
> values)
> would work out and make postgresql unbeatable.
>

Isn't that what an expression index does?
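A minimal sketch (the key name is hypothetical): an expression index gives
fast retrieval of one individual value without a GIN index over the whole
document:

-- btree over a single frequently-queried key
CREATE INDEX json_tables_customer_idx
    ON json_tables ((data ->> 'customer_id'));

SELECT * FROM json_tables
WHERE data ->> 'customer_id' = '42';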


>
>
>
>
>
>
>
>
>
> On Tue, Jan 19, 2016 at 8:51 PM, Bruce Momjian  wrote:
>
>> On Mon, Jan 11, 2016 at 09:01:03PM -0500, Tom Smith wrote:
>> > Hi,
>> >
>> > Congrats on the official release of 9.5
>> >
>> > And I'd like bring up the issue again about if 9.6 would address the
>> jsonb
>> > performance issue
>> > with large number of top level keys.
>> > It is true that it does not have to use JSON format. it is about
>> serialization
>> > and fast retrieval
>> > of dynamic tree structure objects. (at top level, it might be called
>> dynamic
>> > columns)
>> > So if postgresql can have its own way, that would work out too as long
>> as it
>> > can have intuitive query
>> > (like what are implemented for json and jsonb) and fast retrieval of a
>> tree
>> > like object,
>> > it can be called no-sql data type. After all, most motivations of using
>> no-sql
>> > dbs like MongoDB
>> > is about working with dynamic tree object.
>> >
>> > If postgresql can have high performance on this, then many no-sql dbs
>> would
>> > become history.
>>
>> I can give you some backstory on this.  TOAST was designed in 2001 as a
>> way to store, in a data-type-agnostic way, long strings compressed and
>> any other long data type, e.g. long arrays.
>>
>> In all previous cases, _part_ of the value wasn't useful.  JSONB is a
>> unique case because it is one of the few types that can be processed
>> without reading the entire value, e.g. it has an index.
>>
>> We are going to be hesitant to do something data-type-specific for
>> JSONB.  It would be good if we could develop a data-type-agnostic
>> approach to how TOAST can be improved.  I know of no such work for 9.6,
>> and it is unlikely it will be done in time for 9.6.
>>
>> --
>>   Bruce Momjian  http://momjian.us
>>   EnterpriseDB http://enterprisedb.com
>>
>> + As you are, so once was I. As I am, so you will be. +
>> + Roman grave inscription +
>>
>
>


Re: [GENERAL] Shrinking TSvectors

2016-04-05 Thread Oleg Bartunov
On Tue, Apr 5, 2016 at 2:37 PM, Howard News 
wrote:

> Hi,
>
> does anyone have any pointers for shrinking tsvectors
>
> I have looked at the contents of some of these fields and they contain
> many details that are not needed. For example...
>
> "'+1':935,942 '-0500':72 '-0578':932 '-0667':938 '-266':937 '-873':944
> '-9972':945 '/partners/application.html':222
> '/partners/program/program-agreement.pdf':271
> '/partners/reseller.html':181,1073 '01756':50,1083 '07767':54,1087
> '1':753,771 '12':366 '14':66 (...)"
>

I am not interested in keeping the numbers or urls in the indexes.
>


select strip ('asd:23');
 strip
---
 'asd'
(1 row)
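
strip() removes only the positions and weights. To keep numbers and URLs
out of newly built tsvectors altogether, one possible sketch (the
configuration name is made up) is a copied configuration with those token
types unmapped:

create text search configuration no_urls ( copy = english );
alter text search configuration no_urls
    drop mapping for uint, int, float, sfloat, url, url_path, host, version;

The full list of token type names can be inspected with
ts_token_type('default').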



>
> Thanks,
>
> Howard.
>
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general
>


Re: [GENERAL] MongoDB 3.2 beating Postgres 9.5.1?

2016-03-13 Thread Oleg Bartunov
On Mar 11, 2016 4:40 PM, "Paul Jones"  wrote:
>
> I have been running the EDB benchmark that compares Postgres and MongoDB.
> I believe EDB ran it against PG 9.4 and Mongo 2.6.  I am running it
> against PG 9.5.1 and Mongo 3.2 with WiredTiger storage using 10,000,000
> JSON records generated by the benchmark.  It looks like Mongo is winning,
> and apparently because of its cache management.

Dmitry was working on the same benchmarks. I think the EDB benchmark is
broken by design. Better to use the YCSB benchmarks. I hope Dmitry will
share his results.

>
> The first queries on both run in ~30 min.  And, once PG fills its cache,
> it whips Mongo on repeats of the *same* query (vmstat shows no disk
> reads for PG).
>
> However, when different query on the same table is issued to both,
> vmstat shows that PG has to read the *entire* table again, and it takes
> ~30 min.  Mongo does a lot of reads initially but after about 5 minutes,
> it stops reading and completes the query, most likely because it is
> using its cache very effectively.
>
> Host:   Virtual Machine
> 4 CPUs
> 16 Gb RAM
> 200 Gb Disk
> RHEL 6.6
>
> PG: 9.5.1 compiled from source
> shared_buffers = 7GB
> effectve_cache_size = 12GB
>
> Mongo:  3.2 installed with RPM from Mongo
>
> In PG, I created the table by:
>
> CREATE TABLE json_tables
> (
> dataJSONB
> );
>
> After loading, it creates the index:
>
> CREATE INDEX json_tables_idx ON json_tables USING GIN (data
jsonb_path_ops);
>
> After a lot of experimentation, I discovered that the benchmark was not
> using PG's index, so I modified the four queries to be:
>
> SELECT data FROM json_tables WHERE data @> '{"brand": "ACME"}';
> SELECT data FROM json_tables WHERE data @> '{"name": "Phone Service Basic
Plan"}';
> SELECT data FROM json_tables WHERE data @> '{"name": "AC3 Case Red"}';
> SELECT data FROM json_tables WHERE data @> '{"type": "service"}';
>
> Here are two consecutive explain analyze for PG, for the same query.
> No functional difference in the plans that I can tell, but the effect
> of PG's cache on the second is dramatic.
>
> If anyone has ideas on how I can get PG to more effectively use the cache
> for subsequent queries, I would love to hear them.
>
> ---
>
> benchmark=# explain analyze select data from json_tables where data @>
'{"name": "AC3 Case Red"}';
>
>                                  QUERY PLAN
> -----------------------------------------------------------------------------
>
width=1261)
> (actual time=2157.118..1259550.327 rows=909091 loops=1)
>Recheck Cond: (data @> '{"name": "AC3 Case Red"}'::jsonb)
>Rows Removed by Index Recheck: 4360296
>Heap Blocks: exact=37031 lossy=872059
>->  Bitmap Index Scan on json_tables_idx  (cost=0.00..111.00
rows=1 width =0) (actual time=2141.250..2141.250 rows=909091 loops=1)
>  Index Cond: (data @> '{"name": "AC3 Case Red"}'::jsonb)
> Planning time: 291.932 ms
> Execution time: 1259886.920 ms
> (8 rows)
>
> Time: 1261191.844 ms
>
> benchmark=# explain analyze select data from json_tables where data @>
'{"name": "AC3 Case Red"}';
>                                  QUERY PLAN
> -----------------------------------------------------------------------------
>
> Bitmap Heap Scan on json_tables  (cost=113.50..37914.64 rows=1
width=1261) (actual time=779.261..29815.262 rows=909091 loops=1)
>Recheck Cond: (data @> '{"name": "AC3 Case Red"}'::jsonb)
>Rows Removed by Index Recheck: 4360296
>Heap Blocks: exact=37031 lossy=872059
>->  Bitmap Index Scan on json_tables_idx  (cost=0.00..111.00
rows=1 width =0) (actual time=769.081..769.081 rows=909091 loops=1)
>  Index Cond: (data @> '{"name": "AC3 Case Red"}'::jsonb)
> Planning time: 33.967 ms
> Execution time: 29869.381 ms
>
> (8 rows)
>
> Time: 29987.122 ms
>
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Replacement for Oracle Text

2016-02-19 Thread Oleg Bartunov
On Fri, Feb 19, 2016 at 8:28 PM, Josh berkus  wrote:

> On 02/19/2016 05:49 AM, s d wrote:
>
>> On 19 February 2016 at 14:19, Bruce Momjian > > wrote:
>>
>> I wonder if PLPerl could be used to extract the words from a PDF
>> document and create a tsvector column from it.
>>
>>
>>   I don't know about PLPerl(I'm pretty sure it could be used for this
>> purpose, though.).  On the other hand I've written code for this in
>> Python which should be easy to adapt for PLPython, if necessary.
>>
>
> I'd swear someone already built something to do this.  All you need is a
> library which reads PDF and transforms it into text, and then you can FTS
> it.  I know there's a module for OpenOffice docs somewhere as well, but
> heck if I can remember where.
>

I used pdftotext for that.
I think it'd be useful to have extension(s) which can be used to convert
anything to text. I remember someone indexed chemical formulae, TeX/LaTeX,
and DOC files.


>
> --
> --
> Josh Berkus
> Red Hat OSAS
> (any opinions are my own)
>
>
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general
>


Re: [GENERAL] Query plan not updated after dropped index

2016-02-18 Thread Oleg Bartunov
On Thu, Feb 18, 2016 at 11:17 AM, Victor Blomqvist  wrote:

> Hello!
>
> We just had a major issue on our databases, after a index was replaced a
> user defined function didnt change its query plan to use the new index. At
> least this is our theory, since the function in question became much slower
> and as a result brought our system to a halt.
>
> Basically it went:
> 1. create new index (a simple btree on a bigint column index)
> 2. drop old index
> 3. rename new index to old index name
>

Why did you do this!?


> 3. analyze table
>
> After these steps normally our functions will update their plans and use
> the new index just fine. However this time the function (only one function
> use this particular index) seemed to take forever to complete. This is a
> 40GB table so querying for something not indexed would take a long time.
> Therefor my suspicion is that the function didnt start to use the new index.
>
> Adding to the strangeness is that if I ran the function manually it was
> fast, only when called from our application through pg_bouncer it was slow.
> I should also say that the function is only used on our 3 read slaves setup
> to our database.
>
> Things we tried to fix this:
> 1. Analyze table
> 2. Restart our application
> 3. Recreate the function
> 4. Kill the slow running queries with pg_cancel_backend()
>
> These things did not help.
>
> Instead what helped in the end was to replace the function with an extra
> useless where clause (in the hope that it would force it to create a new
> plan)
>
> So, the function only have a single SELECT inside:
> RETURN QUERY
> SELECT * FROM table
>   WHERE bigint_column = X
>   LIMIT 100 OFFSET 0;
>
> And this is my modification that made it work again:
> RETURN QUERY
> SELECT * FROM table
>   WHERE bigint_column = X AND 1=1
>   LIMIT 100 OFFSET 0;
>
>
> Obviously we are now worried why this happened and how we can avoid it in
> the future? We run Postgres 9.3 on CentOS 6.
>
> Thanks!
> Victor
>


Re: [GENERAL] pgDay Asia / talks / lightning talks

2016-02-17 Thread Oleg Bartunov
It's a pity I'll miss this conference, since I learned about it too late.

On Wed, Feb 17, 2016 at 8:46 AM, Satoshi Nagayasu  wrote:

> Hi Dan,
>
> 2016-02-16 20:43 GMT+09:00 Daniel Pocock :
> > Is this the place to ask questions about pgDay Asia[1] or is there
> > another mailing list for it?  The mailing list link on the pgDay Asia
> > web site just takes me to a marketing list[2].  The seasiapug list[3]
> > looks very quiet.
>
> > 2.
> http://uptime.us2.list-manage.com/subscribe/post?u=8b6e2840d44be26e9f646b9f9=128a96a18a
>
> This is the list I have set up, and it is intended to provide some
> update information
> about the conference and the web site, especially for the attendees.
>
> And if you have any question or comments, please send to pgday-asia
> [at] googlegroups.co.jp
>
> Regards,
> --
> Satoshi Nagayasu 
>
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general
>


Re: [GENERAL] fast refresh materialized view

2016-02-11 Thread Oleg Bartunov
It looks like you understand Russian!

The community will accept your work only under the BSD license. If anything
is unclear, write to me.

Best regards,
Oleg

On Thu, Feb 11, 2016 at 7:56 PM, Nguyễn Trần Quốc Vinh  wrote:

> We would like to contribute to the PostgreSQL community. What can you
> recommend us?
>
> Thank you.
>
> Dr. Nguyễn Trần Quốc Vinh
> ---
> Head of the Faculty of Informatics
> University of Education - University of Danang
> Website: http://it.ued.vn ; http://www.ued.vn
> ; http://www.ued.udn.vn
> CV: http://scv.ued.vn/~ntquocvinh
> Tel: 0511.6-512-586
> Mobile: 0914.78-08-98
> 
> Nguyen Tran Quoc Vinh, PhD
> Dean
> Faculty of Information Technology
> Danang University of Education
> Website: http://it.ued.udn.vn; http://www.ued.vn 
> ; http://www.ued.udn.vn
> SCV: http://scv.ued.vn/~ntquocvinh 
> Phone: (+84) 511.6-512-586
> Mobile: (+84) 914.78-08-98
>
> On Tue, Feb 9, 2016 at 7:51 PM, Michael Paquier  > wrote:
>
>> On Tue, Feb 9, 2016 at 12:32 AM, Nguyễn Trần Quốc Vinh
>>  wrote:
>> >
>> > Thank you very much. We did n't think about that. We would like to
>> choose
>> > APACHE LICENSE. We apologize for late reply.
>>
>> And that would be incompatible with the PostgreSQL license I guess,
>> per concerns with patents and similar stuff.
>> --
>> Michael
>>
>
>


Re: [GENERAL] BSD initdb without ICU support and switch later

2016-01-29 Thread Oleg Bartunov
On Fri, Jan 29, 2016 at 5:46 PM, Maeldron T.  wrote:

> Hello,
>
> the ICU patch isn’t ready for PostgreSQL on FreeBSD.
>
> Is there any risk (more than 0) in executing the initdb without ICU
> support and recompiling PostgreSQL later when the ICU patch is ready? I
> mean any risk without making a dump and import before the switch.
>

Teodor has made a patch for ICU and PostgreSQL 9.5.


>
> If this is okay for sure, what should I do later when the ICU is
> available? Do I have to reindex everything with the ICU patched database?
>

I think so, yes.
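
A minimal sketch of that step (database and index names assumed):

reindex database mydb;
-- or, to limit locking, per affected index:
reindex index my_text_idx;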


>
> Thank you.
>
> M.
>
>
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general
>


Re: [GENERAL] Using a german affix file for compound words

2016-01-28 Thread Oleg Bartunov
On Thu, Jan 28, 2016 at 6:04 PM, Wolfgang Winkler <
wolfgang.wink...@digital-concepts.com> wrote:

> Hi!
>
> We have a problem with importing a compound dictionary file for german.
>
> I downloaded the files here:
>
>
> http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/dicts/ispell/ispell-german-compound.tar.gz
>
> and converted them to utf-8 with iconv. The affix file seems ok when
> opened with an editor.
>
> When I try to create or alter a dictionary to use this affix file, I get
> the following error:
>
> alter TEXT SEARCH DICTIONARY german_ispell (
>   DictFile = german,
>   AffFile = german,
>   StopWords = german
> );
> ERROR:  syntax error
> CONTEXT:  line 224 of configuration file
> "/usr/local/pgsql/share/tsearch_data/german.affix": "   ABE > -ABE,äBIN
> "
>
> This is the first occurrence of an umlaut character in the file. I've
> found a few postings where the same file is used, e.g.:
>
>
> http://www.postgresql.org/message-id/flat/556c1411.4010...@tbz-pariv.de#556c1411.4010...@tbz-pariv.de
>
> This user has been able to import the file. Am I missing something
> obvious?
>

Arthur Zakirov could help you.


>
> ww
>
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general
>


Re: [GENERAL] JSONB performance enhancement for 9.6

2016-01-20 Thread Oleg Bartunov
On Wed, Jan 20, 2016 at 4:51 AM, Bruce Momjian  wrote:

> On Mon, Jan 11, 2016 at 09:01:03PM -0500, Tom Smith wrote:
> > Hi,
> >
> > Congrats on the official release of 9.5
> >
> > And I'd like bring up the issue again about if 9.6 would address the
> jsonb
> > performance issue
> > with large number of top level keys.
> > It is true that it does not have to use JSON format. it is about
> serialization
> > and fast retrieval
> > of dynamic tree structure objects. (at top level, it might be called
> dynamic
> > columns)
> > So if postgresql can have its own way, that would work out too as long
> as it
> > can have intuitive query
> > (like what are implemented for json and jsonb) and fast retrieval of a
> tree
> > like object,
> > it can be called no-sql data type. After all, most motivations of using
> no-sql
> > dbs like MongoDB
> > is about working with dynamic tree object.
> >
> > If postgresql can have high performance on this, then many no-sql dbs
> would
> > become history.
>
> I can give you some backstory on this.  TOAST was designed in 2001 as a
> way to store, in a data-type-agnostic way, long strings compressed and
> any other long data type, e.g. long arrays.
>
> In all previous cases, _part_ of the value wasn't useful.  JSONB is a
> unique case because it is one of the few types that can be processed
> without reading the entire value, e.g. it has an index.
>
> We are going to be hesitant to do something data-type-specific for
> JSONB.  It would be good if we could develop a data-type-agnostic
> approach to how TOAST can be improved.  I know of no such work for 9.6,
> and it is unlikely it will be done in time for 9.6.
>

I'm looking at this from time to time.


>
> --
>   Bruce Momjian  http://momjian.us
>   EnterpriseDB http://enterprisedb.com
>
> + As you are, so once was I. As I am, so you will be. +
> + Roman grave inscription +
>
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general
>


Re: [GENERAL] WIP: CoC V4

2016-01-12 Thread Oleg Bartunov
On Tue, Jan 12, 2016 at 7:50 PM, Joshua D. Drake 
wrote:

> Tl;dr;
>
> * Removed specific examples and reworded #3 to be more concise
>
> PostgreSQL Global Development Group (PGDG) Code of Conduct (CoC):
>
> 1. The CoC is to provide community guidelines for creating and enforcing a
> safe, respectful, productive, and collaborative place for any person who is
> willing to contribute in a safe, respectful, productive and collaborative
> way.
>
> 2. The CoC is not about being offended. As with any diverse community,
> anyone can get offended at anything.
>
> 3. A safe, respectful, productive and collaborative environment is free of
> personal attacks and disparaging remarks of any kind.
>
> 4. Any sustained disruption of the collaborative space (mailing lists, IRC
> etc..) or other PostgreSQL events shall be construed as a violation of the
> CoC and appropriate action will be taken by the CoC committee.
>

Then we need to define what the "CoC committee" is.


>
> 5. The CoC is only about interaction with the PostgreSQL community. Your
> private and public lives outside of the PostgreSQL community are your own.
>


* We should specify the official home of the CoC here, say, a URL on postgresql.org
* We should translate it into other languages




>
> --
> Command Prompt, Inc. - http://www.commandprompt.com/  503-667-4564
> PostgreSQL Centered full stack support, consulting and development.
> Announcing "I'm offended" is basically telling the world you can't
> control your own emotions, so everyone else should do it for you.
>
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general
>


Re: [GENERAL] Code of Conduct: Is it time?

2016-01-10 Thread Oleg Bartunov
On Sun, Jan 10, 2016 at 9:44 PM, Regina Obe  wrote:

> > On 01/10/2016 08:07 AM, Bill Moran wrote:
>
> >> So, the purpose of a CoC is twofold:
> >>
> >> A) Define what "being excellent" means to this particular
> >> community.
> >> B) Provide a process for how to resolve things when "being
> >> excellent" doesn't happen.
> >>
> >> Without #1, nobody will want to do #2, as it's basically a
> >> job that can never be done correctly.
>
> > I agree with you completely. That is actually why I included the link to
> > the graphic in the last post. My point was, I have no intention of
> > having a CoC that is full of drivel. I would want a clear, concise,
> > no-B.S. CoC.
>
> > JD
>
> This may come as a big shock to many of you, but as a contributor
> I don't care if you are racist, sexist, transphobic or whatever as long as
> you
>
> 1) Are helpful when I ask a question
> 2) Stick to the topic
> 3) Don't get into petty etiquettes like "Please stop top posting"
> and if you really need to - A polite we prefer top posting would do
>
> 4) Are sensitive to people on other operating systems other than your
> preferred.
>

+1


>
>
> My other concern about CoCs is I fear someone is going to come and demand
> we change Master/Slave  to Leader/Follower, because Master is a male term
> and Slave is insensitive to grand-children of slaves.
>


Some people don't understand all these issues with she/he; for example, we
in Russia are not really concerned about this.



>
>
> Thanks,
> Regina
>
>
>
>
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general
>


Re: [GENERAL] Code of Conduct: Is it time?

2016-01-05 Thread Oleg Bartunov
On Wed, Jan 6, 2016 at 7:41 AM, Jim Nasby  wrote:

> On 1/5/16 10:03 PM, John R Pierce wrote:
>
>> On 1/5/2016 5:31 PM, Jim Nasby wrote:
>>
>>> IMHO, the real problem here is not simply a CoC, it is that the
>>> Postgres community doesn't focus on developing the community itself.
>>> The closest we come to "focus" is occasional talk on -hackers about
>>> how we need more developers. There is no formal
>>> discussion/leadership/coordination towards actively building and
>>> strengthening our community. Until that changes, I fear we will always
>>> have a lack of developers. More importantly, we will continue to lack
>>> all the other ways that people could contribute beyond writing code.
>>> IE: the talk shouldn't be about needing more developers, it should be
>>> about needing people who want to contribute time to growing the
>>> community.
>>>
>>
>>
>> That sounds like a bunch of modern marketing graduate mumbo jumbo to
>> me. The postgres community are the people who actually support it on
>> the email lists and IRC, as well as the core development teams, and
>> IMO, they are quite strong and effective. When you start talking
>> about social marketing and facebook and twitter and stuff, that's just a
>> bunch of feelgood smoke and mirrors. The project's output is what
>> supports it, not having people going out 'growing community'; that is
>> just a bunch of hot air. You actively 'grow community' when you're
>> pushing worthless products (soda pop, etc.) based on slick marketing
>> plans rather than actually selling something useful.
>>
>
> Then why is it that there is almost no contribution to the community other
> than code and mailing list discussion?
>
> Why is the infrastructure team composed entirely of highly experienced
> code contributors, of which there are ~200 on the planet, when there are
> literally 100s of thousands (if not millions) of people out there that
> could do that work (and could probably do it better if it's what they do
> for a living, no offense to the efforts of the infrastructure team).
>
> Why is there a lack of developers? And a serious lack of code reviewers?
>

I agree with Jim; something is wrong. I see our developer community isn't
growing and is getting older. There is no formal barrier to starting to
contribute, but the steep learning curve and lack of mentoring practice
scare people away.

Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
> Experts in Analytics, Data Architecture and PostgreSQL
> Data in Trouble? Get it in Treble! http://BlueTreble.com
>
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general
>


Re: [GENERAL] json indexing and data types

2015-12-12 Thread Oleg Bartunov
On Wed, Dec 2, 2015 at 11:48 AM, Kaare Rasmussen  wrote:

> Hi
>
> As json essentially only has three basic data types, string, int, and
> boolean, I wonder how much of this - to index, search, and sort on
> unstructured data -  is possible. I guess part of the answer would be
> 'jsquery and vodka', but let me describe the problem first.
>
> The basics is, that I have a column with what is essentially json data; a
> number of data structures of different depths. Perhaps 10 - 30 top levels,
> and probably no more than 3, max 4 levels deep. In total there are some
> hundred thousands of rows in each table. It would probably be best stored
> as jsonb. Right now it's text, because it's only used by the application
> itself.
>
> It would be incredibly useful to add an index to this column, and to be
> able to search, using the index, on arbitrary elements. This part seems
> already there, with jsquery.
>
> The hard part is that some of the data items really have another type.
> There are dates and floating points, as the most important ones. And the
> really hard part is that sorting and range searches are important,
> especially for these two types. Having dates is iso-format, and
> left-padding floats with zeros is a low tech solution, and especially the
> latter is not very efficient.
>
> The solution might be to add functional indexes for these data items, but
> it's cumbersome and not easily maintainable. If a one-stop solution is in
> the works, or already there, it could save a lot of time.
>
>
This is a known problem; that's why we stopped developing jsquery and are
working on an SQL-level query language for jsonb, so you'll have all the
power and extensibility of SQL. The idea is to use the power of subselects
and unnest to unroll jsonb to the SQL level.
There is a presentation from pgconf.eu on this:
https://wiki.postgresql.org/images/4/4e/Createam.pdf, see slide #27

But I'm afraid it won't come until 9.6.
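
A sketch of that unrolling idea with the functions already in 9.4 (table,
column and key names are invented):

select d.id
from docs d,
     lateral jsonb_array_elements(d.body -> 'items') t(item)
where (t.item ->> 'price')::numeric < 10;

This gives range comparisons with SQL semantics, though without index
support until the new machinery arrives.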




> /kaare
>
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general
>


Re: [GENERAL] JSON path wild cards?

2015-11-24 Thread Oleg Bartunov
On Tue, Nov 24, 2015 at 12:39 PM, Dennis  wrote:

> Is there a way to specify a wild card in a json path?
>
> For example I have the following json doc:
>
> [ {"a":1,"b": [ { "x": 7,"y":8,"z":9} ] },  {"a":2,"b": [ { "x":
> 4,"y":5,"z":6} ] }, ... ]
>
> How do I write a select clause that can return the values for all b x
> values something like [{b:x}] that would return all the b:x values in the
> array? e.g. 7 and 4 ...
>
> Also is there a definition of the syntax of a proper json path for use in
> postgres?
>

Not in postgres currently, but you may try our jsquery extension
https://github.com/postgrespro/jsquery.
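
For a document shaped like the one above, an untested sketch (assuming the
column is named js; in jsquery, # matches any array element - see the
README for the exact path syntax):

select * from docs where js @@ '#.b.#.x = 7'::jsquery;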


Oleg


>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general
>


Re: [GENERAL] [to_tsvector] German Compound Words

2015-05-28 Thread Oleg Bartunov
ts_debug() ?

=# select * from ts_debug('english', 'messages');
   alias   |   description   |  token   |  dictionaries  |  dictionary  |
lexemes
---+-+--++--+--
 asciiword | Word, all ASCII | messages | {english_stem} | english_stem |
{messag}


On Thu, May 28, 2015 at 2:05 PM, Sven R. Kunze srku...@tbz-pariv.de wrote:

 Hi everybody,

 what do I need to do in order to enable compound word handling in
 PostgreSQL tsvector implementation?

 I run an Ubuntu 14.04 machine, PostgreSQL 9.3, have installed package
 hunspell-de-de and already created a new dictionary as described here:
 http://www.postgresql.org/docs/9.3/static/textsearch-dictionaries.html#TEXTSEARCH-ISPELL-DICTIONARY

 CREATE TEXT SEARCH DICTIONARY german_hunspell (
 TEMPLATE = ispell,
 DictFile = de_de,
 AffFile = de_de,
 StopWords = german
 );

 Furthermore, created a new test text search configuration (copied from
 german) and updated all parser parts where the german_stem dictionary is
 used so that it uses german_hunspell first and then german_stem.

 However, ts_vector still does not work for the compound words such as:

 wasserkraft - wasserkraft, kraft
 schifffahrt - schifffahrt, fahrt
 blindflansch - blindflansch, flansch

 etc.


 What have I done wrong here?

 --
 Sven R. Kunze
 TBZ-PARIV GmbH, Bernsdorfer Str. 210-212, 09126 Chemnitz
 Tel: +49 (0)371 33714721, Fax: +49 (0)371 5347920
 e-mail: srku...@tbz-pariv.de
 web: www.tbz-pariv.de

 Geschäftsführer: Dr. Reiner Wohlgemuth
 Sitz der Gesellschaft: Chemnitz
 Registergericht: Chemnitz HRB 8543



 --
 Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-general



Re: [GENERAL] [tsvector] to_tsvector called multiple times

2015-05-26 Thread Oleg Bartunov
You can ask http://snowball.tartarus.org/ for the stemmer. Meanwhile,
you can have a small personal dictionary (before the stemmer) with such
exceptions; for example, use the synonym template:

system system
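
A sketch of wiring such an exceptions file into the German configuration
(the file name is made up; the .syn file lives in $SHAREDIR/tsearch_data
and contains lines like "system system"):

create text search dictionary german_syn (
    template = synonym,
    synonyms = german_exceptions
);
alter text search configuration german
    alter mapping for asciiword with german_syn, german_stem;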

Oleg


On Tue, May 26, 2015 at 11:18 AM, Sven R. Kunze srku...@tbz-pariv.de
wrote:

 Hi everybody,

 the following stemming results made me curious:

 select to_tsvector('german', 'systeme');  'system':1
 select to_tsvector('german', 'systemes');  'system':1
 select to_tsvector('german', 'systems');  'system':1
 select to_tsvector('german', 'systemen');  'system':1
 select to_tsvector('german', 'system');   'syst':1


 First of all, this seems to be a bug in the German stemmer. Where can I
 fix it?

 Second, and more importantly, as I understand it, the stemmed version of a
 word should be considered normalized. That is, all other versions of that
 stem should be mapped to it as well. The interesting problem here is that
 PostgreSQL maps the stem itself ('system') to a completely different stem
 ('syst').

 Should a stem not remain stable even when to_tsvector is called on it
 multiple times?

 --
 Sven R. Kunze
 TBZ-PARIV GmbH, Bernsdorfer Str. 210-212, 09126 Chemnitz
 Tel: +49 (0)371 33714721, Fax: +49 (0)371 5347920
 e-mail: srku...@tbz-pariv.de
 web: www.tbz-pariv.de

 Geschäftsführer: Dr. Reiner Wohlgemuth
 Sitz der Gesellschaft: Chemnitz
 Registergericht: Chemnitz HRB 8543



 --
 Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-general



Re: [GENERAL] Grouping By Similarity (using pg_trgm)?

2015-05-22 Thread Oleg Bartunov
Have you seen http://www.sai.msu.su/~megera/postgres/talks/pgcon-2012.pdf ?
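
The talk covers smlar, but with stock pg_trgm a rough sketch of the pairing
step looks like this (set_limit() tunes the % threshold; note this still
has the symmetric-duplicates problem the question raises):

select set_limit(0.5);
select a.my_value, count(*) as similar_cnt
from data a
join data b on a.my_value % b.my_value and a.my_value <> b.my_value
group by a.my_value;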

On Thu, May 14, 2015 at 9:58 PM, Cory Tucker cory.tuc...@gmail.com wrote:

 [pg version 9.3 or 9.4]

 Suppose I have a simple table:

 create table data (
   my_value  TEXT NOT NULL
 );
 CREATE INDEX idx_my_value ON data USING gin(my_value gin_trgm_ops);


 Now I would like to essentially do group by to get a count of all the
 values that are sufficiently similar.  I can do it using something like a
 CROSS JOIN to join the table on itself, but then I still am getting all the
 rows with duplicate counts.

 Is there a way to do a group by query and only return a single my_value
 column and a count of the number of times other values are similar while
 also not returning the included similar values in the output, too?




Re: [GENERAL] Find similar records (compare tsvectors)

2015-03-06 Thread Oleg Bartunov
On Fri, Mar 6, 2015 at 5:05 PM, Patrick Dung patrick_...@yahoo.com.hk wrote:
 Resend.

 How to quickly compare the similarity of two tsvector?


check http://www.sai.msu.su/~megera/postgres/talks/pgcon-2012.pdf



 On Monday, March 2, 2015 11:01 PM, Patrick Dung patrick_...@yahoo.com.hk
 wrote:


 Hello,

 I had a database with articles or attachment stored in bytea format.
 I also had a trigger: it insert/update the tsv column when a record is
 added/updated.
 The tsv column had a GIN index.
 With this setting, I can do very fast keyword search on the tsv.

 Suppose I had a specific record (id=10).
 How to list similar records based on ranking?
 In that case, I had to compare a tsvector with another tsvector.

 I had this SQL which make the original tsv as a text and then to tsquery,
 Then I can compare a tsv and a tsquery.
 SELECT ts_rank(i.tsv, replace(strip(original.tsv)::text, ' ', '|')::tsquery)
 as similarity, i.company, i.industry, i.post_timestamp, i.id FROM items i,
 (SELECT tsv, id FROM items WHERE id=10) AS original WHERE i.id !=
 original.id ORDER BY similarity;

 items table:
 id bigint
 company varchar
 industry varchar
 description varchar
 post_timestamp timestamp
 attachment bytea
 tsv tsvector

 The problem is that this is very slow.
 Any comment?

 Thank and regards,
 Patrick




-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] WAL supported extension

2015-01-18 Thread Oleg Bartunov
On Sun, Jan 18, 2015 at 9:15 AM, Michael Paquier michael.paqu...@gmail.com
wrote:

 On Sat, Jan 17, 2015 at 11:18 PM, Kouhei Sutou k...@cozmixng.org wrote:
  (Is this mailing list right mailing list for asking this
  question...?)
 Hackers would have been fine as well.

  Is there any plan to implement PostgreSQL API to implement
  WAL supported extension?
 Not that I know of, the last discussion I recall on the matter being this
 one:

 http://www.postgresql.org/message-id/capphfdsxwzmojm6dx+tjnpyk27kt4o7ri6x_4oswcbyu1rm...@mail.gmail.com
 --


We are eager for development of this API.


 Michael


 --
 Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-general



Re: [GENERAL] WAL supported extension

2015-01-18 Thread Oleg Bartunov
On Sun, Jan 18, 2015 at 2:01 PM, Michael Paquier michael.paqu...@gmail.com
wrote:

 Oleg Bartunov wrote:
  We are eager for development of this API.
 Yeah, me too actually :) Oleg, are there plans on your side to do
 something in this area for 9.6?


Yes, Alexander Korotkov will continue working on this. Are you coming to
Moscow for pgconf.ru? That would be a good time to discuss this.

Oleg


 --
 Michael



Re: [GENERAL] JSONB spaces in text presentation

2014-09-24 Thread Oleg Bartunov
Check slides 17-20 of
http://www.sai.msu.su/~megera/postgres/talks/hstore-dublin-2013.pdf to
understand what 'binary format' means. The slides describe binary storage
for nested hstore, not jsonb, but you'll get the idea.

On Wed, Sep 24, 2014 at 6:22 PM, Seref Arikan serefari...@gmail.com wrote:

 This is interesting. Most binary encoding methods I use produce smaller
 files than the text files for the same content.
 Having read your mail, I've realized that I have no reason to accept the
 same from the jsonb. I did a quick google search to see if it is wrong to
 expect binary encoding to decrease size and saw that I'm not alone (which
 still does not mean I'm being reasonable).
 This project: http://ubjson.org/#size is one of the hits which mentions
 some nice space gains thanks to binary encoding.

 The much larger part is a bit scary. Is this documented somewhere?

 Best regards
 Seref


 On Wed, Sep 24, 2014 at 2:44 PM, Merlin Moncure mmonc...@gmail.com
 wrote:

 On Wed, Sep 24, 2014 at 2:44 AM, Ilya I. Ashchepkov koc...@gmail.com
 wrote:
  I'm sorry about sending email several times. I haven't understand, was
 it
  sent by gmail or not.
 
 
  On Wed, Sep 24, 2014 at 2:30 PM, John R Pierce pie...@hogranch.com
 wrote:
 
  On 9/24/2014 12:23 AM, Ilya I. Ashchepkov wrote:
 
 
  Are spaces necessary in the text representation of JSONB?
  In my data the resulting text contains ~12% spaces.
 
 
  can you show us an example of this?
 
 
  One record
  # select data from events.data limit 1;
  {"can": {"lls": {"1": 76.4}, "mhs": 4674.85, "rpm": 168.888, "speed": 74,
  "runned": 166855895, "fuel_consumption": 74213.5}, "crc": 10084, "gps": 1,
  "gsm": {"signal": 100}, "lls": {"1": 733, "2": 717}, "used": 19, "speed":
  87.4, "valid": 1, "msg_id": 89, "runned": 72.75, "boot_no": 256, "digital":
  {"in": {"1": 1, "2": 0, "3": 0, "4": 0, "5": 0, "6": 0}, "out": {"1": 0,
  "2": 0}}, "visible": 20, "ignition": 1, "location": {"course": 265,
  "altitude": 143, "latitude": 55.127888997395836, "longitude":
  80.8046142578125}, "protocol": 4, "coldstart": 1, "timesource": "terminal",
  "receiver_on": 1, "external_power": 28.07, "internal_power": 4.19}
 
  Whitespace percentage in this record:
  # select array_length(regexp_split_to_array(data::text, text ' '), 1)*100./length(data::text) from events.data limit 1;
        ?column?
  ---------------------
   12.3417721518987342
 
  Whitespace in test data
   # select count(*), avg(array_length(regexp_split_to_array(data::text, text ' '), 1)*100./length(data::text)) from events.data;
    count |         avg
   -------+---------------------
       24 | 12.3649234646118312


 For jsonb (unlike json), data is not actually stored as json but in a
 binary format.  It will generally be much larger than the text
 representation in fact but in exchange for that many operations will
 be faster.  The spaces you see are generated when the jsonb type is
 converted to text for output.  I actually think it's pretty reasonable
 to want to redact all spaces from such objects in all cases where
  conversion to text happens (output functions, xxxto_json, etc)
 because ~12% savings are nothing to sneeze at when moving large
 documents in and out of the database.

 On the flip side, a more verbose prettification would be pretty nice
 too.  I wonder if a hypothetical GUC is the best way to control this
 behavior...

 merlin


 --
 Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-general





Re: [GENERAL] jsonb and comparison operators

2014-09-03 Thread Oleg Bartunov
jsquery (https://github.com/akorotkov/jsquery) should work for you.
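
For the example table below, a sketch (opclass and operator per the jsquery
README):

create index on test using gin (j jsonb_path_value_ops);
select * from test where j @@ 'a < 10'::jsquery;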


On Wed, Sep 3, 2014 at 8:38 AM, Joe Van Dyk j...@tanga.com wrote:

 Is it possible to get this query (or a similar one) to use an index?

 I want to return all rows that have a value of less than 10. I have
 arbitrary keys I want to check (not just 'a').

 drop table if exists test;

 create table test (j jsonb);

 insert into test select json_build_object('a', i)::jsonb from
 generate_series(1, 10) i;
 create index on test using gin(j);

 vacuum analyze test;



 select * from test where (j->'a')::int > 10;

 I tried
 select * from test where j->'a' > 10::json::jsonb;
 but didn't seem to use the index.



Re: [GENERAL] Force specific index disuse

2014-05-21 Thread Oleg Bartunov
We routinely use plantuner. http://www.sai.msu.su/~megera/wiki/plantuner
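
A minimal sketch of a session with it (index and table names assumed; the
GUC has been spelled plantuner.forbid_index in older releases and
plantuner.disable_index in newer ones):

load 'plantuner';
set plantuner.forbid_index = 'candidate_idx';
explain analyze select * from big_table where some_col = 42;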



On Tue, May 20, 2014 at 12:38 PM, Steve Crawford
scrawf...@pinpointresearch.com wrote:
 Is there a way to force a specific index to be removed from consideration in
 planning a single query?

 Specifically, on a 60-million-row table I have an index that is a candidate
 for removal. I have identified the sets of nightly queries that use the
 index but before dropping it I would like to run EXPLAIN and do timing tests
 on the queries to see the impact of not having that index available and
 rewrite the query to efficiently use other indexes if necessary.

 Cheers,
 Steve



 --
 Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-general


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Full-Text Search question

2014-05-13 Thread Oleg Bartunov
Easy: you need to extract the text fields from the json and construct a
tsvector from them (using concatenation, for example).
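
A sketch, assuming a table articles with a json column doc holding 'title'
and 'body' keys:

select to_tsvector('english',
       coalesce(doc ->> 'title', '') || ' ' || coalesce(doc ->> 'body', ''))
from articles;

create index on articles using gin (
    to_tsvector('english',
        coalesce(doc ->> 'title', '') || ' ' || coalesce(doc ->> 'body', '')));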

On Tue, May 13, 2014 at 7:38 PM, Jesus Rafael Sanchez Medrano
jesusraf...@gmail.com wrote:
 Can postgres do FTS (full text search) on a json column? If possible, please
 be so kind as to give some snippet/example.


 Att.
 ==
 Jesus Rafael Sanchez Medrano
 Life is a dream, of which all must wake up


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Full text: Ispell dictionary

2014-05-07 Thread Oleg Bartunov
btw, take a look at contrib/dict_xsyn; it's more powerful than the
synonym dictionary.
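
A sketch of a dict_xsyn setup (the rules file name is made up; see the
contrib docs for the rules file format and the KEEPORIG/KEEPSYNONYMS
options):

create text search dictionary my_xsyn (
    template = xsyn_template,
    rules = my_xsyn_rules
);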

On Sat, May 3, 2014 at 2:26 AM, Tim van der Linden t...@shisaa.jp wrote:
 Hi Oleg

 Haha, understood!

 Thanks for helping me on this one.

 Cheers
 Tim


 On May 3, 2014 7:24:08 AM GMT+09:00, Oleg Bartunov obartu...@gmail.com
 wrote:

 Tim,

 you did answer yourself - don't use ispell :)

 On Sat, May 3, 2014 at 1:45 AM, Tim van der Linden t...@shisaa.jp wrote:

  On Fri, 2 May 2014 21:12:56 +0400
  Oleg Bartunov obartu...@gmail.com wrote:

  Hi Oleg

  Thanks for the response!

  Yes, it's normal for ispell dictionary, think about morphological
 dictionary.


  Hmm, I see, that makes sense. I thought the morphological aspect of the
 Ispell only dealt with splitting up compound words, but it also deals with
 deriving the word to a more stem like form, correct?

  As a last question on this, is there a way to disable this dictionary to
 emit multiple lexemes?


 The reason I am asking is because in my (fairly new) understanding of
 PostgreSQL's full text it is always best to have as few lexemes as possible
 saved in the vector. This to get smaller indexes and faster matching
 afterwards. Also, if you run a tsquery afterwards to, you can still employ
 the power of these multiple lexemes to find a match.

  Or...probably answering my own question...if I do not desire this
 behavior I should maybe not use Ispell and simply use another dictionary :)

  Thanks again.

  Cheers,
  Tim

  On Fri, May 2, 2014 at 11:54 AM, Tim van der Linden t...@shisaa.jp
 wrote:

  Good morning/afternoon all

  I am currently writing a few articles about PostgreSQL's full text
 capabilities and have a question about the Ispell dictionary which I
 cannot seem to find an answer to. It is probably a very simple issue, so
 forgive my ignorance.

  In one article I am explaining about dictionaries and I have setup a
 sample configuration which maps most token categories to only use a Ispell
 dictionary (timusan_ispell) which has a default configuration:

  CREATE TEXT SEARCH DICTIONARY timusan_ispell (
  TEMPLATE = ispell,
  DictFile = en_us,
  AffFile = en_us,
  StopWords = english
  );

  When I run a simple query like SELECT
 to_tsvector('timusan-ispell','smiling') I get back the following 
 tsvector:

  'smile':1 'smiling':1

  As you can see I get two lexemes with the same pointer.
  The question here is: why does this happen?

  Is it normal behavior for the Ispell dictionary to emit multiple
 lexemes for a single token? And if so, is this efficient? I
 mean, why could it not simply save one lexeme 'smile' which (same as
 the snowball dictionary) would match 'smiling' as well if later matched 
 with
 the accompanying tsquery?

  Thanks!

  Cheers,
  Tim


  --
  Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
  To make changes to your subscription:
  http://www.postgresql.org/mailpref/pgsql-general



  --
  Tim van der Linden t...@shisaa.jp


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Full text: Ispell dictionary

2014-05-02 Thread Oleg Bartunov
Yes, it's normal for ispell dictionary, think about morphological dictionary.

On Fri, May 2, 2014 at 11:54 AM, Tim van der Linden t...@shisaa.jp wrote:
 Good morning/afternoon all

 I am currently writing a few articles about PostgreSQL's full text 
 capabilities and have a question about the Ispell dictionary which I cannot 
 seem to find an answer to. It is probably a very simple issue, so forgive my 
 ignorance.

 In one article I am explaining about dictionaries and I have setup a sample 
 configuration which maps most token categories to only use a Ispell 
 dictionary (timusan_ispell) which has a default configuration:

 CREATE TEXT SEARCH DICTIONARY timusan_ispell (
 TEMPLATE = ispell,
 DictFile = en_us,
 AffFile = en_us,
 StopWords = english
 );

 When I run a simple query like SELECT 
 to_tsvector('timusan-ispell','smiling') I get back the following tsvector:

 'smile':1 'smiling':1

 As you can see I get two lexemes with the same pointer.
 The question here is: why does this happen?

 Is it normal behavior for the Ispell dictionary to emit multiple lexemes for 
 a single token? And if so, is this efficient? I mean, why could it not simply 
 save one lexeme 'smile' which (same as the snowball dictionary) would match 
 'smiling' as well if later matched with the accompanying tsquery?

 Thanks!

 Cheers,
 Tim


 --
 Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-general


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Manipulating jsonb

2014-05-02 Thread Oleg Bartunov
Andreas,

take a look at https://github.com/erthalion/jsonbx. This is the place
where all hstore functionality will eventually be ported. See this
table: https://gist.github.com/erthalion/10890778
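
With the extension installed, the hstore-style manipulation from that table
looks roughly like this (a sketch; expected results follow hstore
semantics):

select '{"a": 1, "c": 3}'::jsonb || '{"b": 2, "d": 4}'::jsonb;
-- {"a": 1, "b": 2, "c": 3, "d": 4}
select '{"a": 1, "b": 2}'::jsonb - 'a';
-- {"b": 2}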

Oleg

On Fri, May 2, 2014 at 10:32 PM, Andreas Heiduk ashei...@gmail.com wrote:
 Hello,

 I'm playing with a 9.4 preview and the upcoming jsonb datatype. But
 I'm missing the basic modification syntax.

 Given the following situation:

 CREATE TABLE test(key int, jsonval jsonb);
 INSERT INTO test VALUES(1, '{"a": 1, "c": 3}');

 How can I UPDATE that row with '{"b": 2, "d": 4}' ? Something like
 this does not work:

 UPDATE test SET jsonval = jsonval || '{"a": 1, "c": 3}'::jsonb
 where key = 1;

 The result should be

 {"a": 1, "b": 2, "c": 3, "d": 4}

 The same goes for removing values.

 Did I overlook something obvious or is there really no way to do that right 
 now?

 Best regards,
 Andreas Heiduk


 --
 Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-general


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Full text: Ispell dictionary

2014-05-02 Thread Oleg Bartunov
Tim,

you did answer yourself - don't use ispell :)

On Sat, May 3, 2014 at 1:45 AM, Tim van der Linden t...@shisaa.jp wrote:
 On Fri, 2 May 2014 21:12:56 +0400
 Oleg Bartunov obartu...@gmail.com wrote:

 Hi Oleg

 Thanks for the response!

 Yes, it's normal for ispell dictionary, think about morphological dictionary.

 Hmm, I see, that makes sense. I thought the morphological aspect of the 
 Ispell only dealt with splitting up compound words, but it also deals with 
 deriving the word to a more stem like form, correct?

 As a last question on this, is there a way to disable this dictionary to emit 
 multiple lexemes?

 The reason I am asking is because in my (fairly new) understanding of 
 PostgreSQL's full text it is always best to have as few lexemes as possible 
 saved in the vector. This to get smaller indexes and faster matching 
 afterwards. Also, if you run a tsquery afterwards to, you can still employ 
 the power of these multiple lexemes to find a match.

 Or...probably answering my own question...if I do not desire this behavior I 
 should maybe not use Ispell and simply use another dictionary :)

 Thanks again.

 Cheers,
 Tim

 On Fri, May 2, 2014 at 11:54 AM, Tim van der Linden t...@shisaa.jp wrote:
  Good morning/afternoon all
 
  I am currently writing a few articles about PostgreSQL's full text 
  capabilities and have a question about the Ispell dictionary which I 
  cannot seem to find an answer to. It is probably a very simple issue, so 
  forgive my ignorance.
 
  In one article I am explaining about dictionaries and I have setup a 
  sample configuration which maps most token categories to only use a Ispell 
  dictionary (timusan_ispell) which has a default configuration:
 
  CREATE TEXT SEARCH DICTIONARY timusan_ispell (
  TEMPLATE = ispell,
  DictFile = en_us,
  AffFile = en_us,
  StopWords = english
  );
 
  When I run a simple query like SELECT 
  to_tsvector('timusan-ispell','smiling') I get back the following tsvector:
 
  'smile':1 'smiling':1
 
  As you can see I get two lexemes with the same pointer.
  The question here is: why does this happen?
 
  Is it normal behavior for the Ispell dictionary to emit multiple lexemes 
  for a single token? And if so, is this efficient? I mean, why could it not 
  simply save one lexeme 'smile' which (same as the snowball dictionary) 
  would match 'smiling' as well if later matched with the accompanying 
  tsquery?
 
  Thanks!
 
  Cheers,
  Tim
 
 
  --
  Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
  To make changes to your subscription:
  http://www.postgresql.org/mailpref/pgsql-general


 --
 Tim van der Linden t...@shisaa.jp


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Manipulating jsonb

2014-05-02 Thread Oleg Bartunov
No way, Andreas!

But we hope Dimitri will release his extension before 9.4, so anybody
can install it.

Oleg

On Sat, May 3, 2014 at 1:21 AM, Andreas Heiduk ashei...@gmail.com wrote:
 Hello Oleg,

 how are the odds that the '||' and '-' operators from jsonbx will be
 included in the public 9.4 release?

 Andreas


 On 2 May 2014 21:21, Oleg Bartunov obartu...@gmail.com wrote:
 Andreas,

 take a look at https://github.com/erthalion/jsonbx. This is the place
 where all hstore functionality will eventually be ported. See this
 table: https://gist.github.com/erthalion/10890778

 Oleg

 On Fri, May 2, 2014 at 10:32 PM, Andreas Heiduk ashei...@gmail.com wrote:
 Hello,

 I'm playing with a 9.4 preview and the upcoming jsonb datatype. But
 I'm missing the basic modification syntax.

 Given the following situation:

 CREATE TABLE test(key int, jsonval jsonb);
 INSERT INTO test VALUES(1, '{"a": 1, "c": 3}');

 How can I UPDATE that row with '{"b": 2, "d": 4}' ? Something like
 this does not work:

 UPDATE test SET jsonval = jsonval || '{"a": 1, "c": 3}'::jsonb
 where key = 1;

 The result should be

 {"a": 1, "b": 2, "c": 3, "d": 4}

 The same goes for removing values.

 Did I overlook something obvious or is there really no way to do that right 
 now?

 Best regards,
 Andreas Heiduk


 --
 Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-general


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] hstore - jsonb

2014-04-04 Thread Oleg Bartunov
On Fri, Apr 4, 2014 at 12:20 PM, Armand Turpel
armand.turpel.m...@gmail.com wrote:
 Hi,

 A few questions about jsonb and hstore:

 1. does jsonb replace hstore?

no, it's a different data type

 2. compatibility of jsonb  hstore?

hstore is a simple key-value model, while jsonb is a richer, document-based model.

 3. will the development of hstore continue?

I don't believe so; we froze nested hstore to concentrate development
resources on jsonb.  Nested hstore is still available from
http://www.sigaev.ru/git/gitweb.cgi?p=hstore.git;a=summary
It should be compatible with 9.3.


 4. is it recommended to use jsonb when planning new projects?

yes, we are working on jsonb support.


 Thanks
 Armand


 --
 Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-general


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] hstore - jsonb

2014-04-04 Thread Oleg Bartunov
We'll work on contrib/jsonxtra with all operators ported from hstore
and release it after 9.4 as a separate extension.

On Fri, Apr 4, 2014 at 4:32 PM, Thom Brown t...@linux.com wrote:
 On 4 April 2014 13:04, Oleg Bartunov obartu...@gmail.com wrote:
 On Fri, Apr 4, 2014 at 12:20 PM, Armand Turpel
 armand.turpel.m...@gmail.com wrote:
 Hi,

 A few questions about jsonb and hstore:

 1. does jsonb replace hstore?

 no, it's a different data type

 2. compatibility of jsonb  hstore?

 hstore is a simple key-value model, while jsonb is a richer, document-based
 model.

 3. will the development of hstore continue?

 I don't believe so; we froze nested hstore to concentrate development
 resources on jsonb.  Nested hstore is still available from
 http://www.sigaev.ru/git/gitweb.cgi?p=hstore.git;a=summary
 It should be compatible with 9.3.


 4. is it recommended to use jsonb when planning new projects?

 yes, we are working on jsonb support.

 One major advantage of hstore over json/jsonb at the moment is data
 manipulation, which could make json/jsonb a non-starter for some.

 For example, in hstore one can do:

 -- remove a key/value pair by key
 UPDATE mytable SET hcolumn = hcolumn - 'mykey'::text;

 or:

 -- remove a key/value pair by key/value
 UPDATE mytable SET hcolumn = hcolumn - 'mykey=>myvalue'::hstore;

 or:

 -- add/replace a key/value pair
 UPDATE mytable SET hcolumn = hcolumn || 'newkey=>newvalue'::hstore;


 You can't do something like that with json/jsonb at the moment, and
 that's not going to be possible in the upcoming version either.  You'd
 probably have to resort to application-side modification, or use
 something like pl/v8.

 --
 Thom


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] hstore - jsonb

2014-04-04 Thread Oleg Bartunov
On Fri, Apr 4, 2014 at 7:17 PM, Thom Brown t...@linux.com wrote:
 On 4 April 2014 16:15, Oleg Bartunov obartu...@gmail.com wrote:
 We'll work on contrib/jsonxtra with all operators ported from hstore
 and release it after 9.4 as a separate extension.

 That would be useful. :)

 Would there be an aim of getting that in-core for 9.5?

I hope so.


 --
 Thom


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Need Help to implement Proximity search feature

2014-01-15 Thread Oleg Bartunov
Check KNN search: http://www.sai.msu.su/~megera/postgres/talks/pgcon-2010-1.pdf
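
A minimal sketch of a KNN query on plain point data (table and column names
invented; PostGIS geometry supports the same <-> ordering):

create index on places using gist (location);
select * from places order by location <-> point '(37.62, 55.75)' limit 10;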

On Wed, Jan 15, 2014 at 8:04 PM, itishree sukla
itishree.su...@gmail.com wrote:
 Dear all,

 In my project, there is a requirement to implement a proximity search feature.
 We are running a mobile app, for which proximity search is required. Can any
 one guide me how I can achieve this using postgis, or is there any other way
 I can achieve this?

 We are using postgresql 9.2.

 Thanks in advance .

 Regards,
 Itishree


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] [tsearch2] Problem with case sensitivity (or with creating own dictionary)

2013-08-05 Thread Oleg Bartunov

Please,

take a look at contrib/dict_int and create your own dict_noop.
It should be easy.  I think you could document it and share it
with people (wiki.postgresql.org?), since there have been other people
interested in a noop dictionary. Also, don't forget to modify
your configuration; use ts_debug(), it will help you.

Regards,
Oleg

On Sat, 3 Aug 2013, Krzysztof xaru Rajda wrote:


Hello,

I encountered a problem. My goal is to extract links from a text using
tsearch2. Everything seemed to be fine until I got some youtube links -
there are lowercase and uppercase letters inside, and the tsearch parser
lowercases everything (from http://youtube.com/Y6dsHDX I got
http://youtube.com/y6dshdx, which does not work). I went through the
PostgreSQL docs, and it seems that each of the default dictionaries
(simple, ispell, snowball) lowercases lexemes during normalization, and
there is no option to disable it.


I started to look for tutorials on how to create my own dictionary, or
modify an existing one (I'm talking about a dictionary like snowball, with
my own source code - not just a dictionary created by a 'CREATE
DICTIONARY ...' query), but all I found is really out-of-date and uses
mechanisms that are deprecated in the latest versions of Postgres (I'm
working on v9.2) - like 'contrib/gendict' here:
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/custom-dict.html

So now I have no idea what to do with my case-sensitivity problem... Is
there any other way to overcome it, apart from creating my own dictionary?
If not - how do I create one on Postgres 9.2?


Regards,
xaru





Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: o...@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Analyzing last run query in psql

2013-07-03 Thread Oleg Bartunov

It was my dream to have something we already have in shell -

explain analyze !$

I think it should be not very difficult.
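
One workaround available today is a psql variable (the table here is just
an example):

\set ea 'EXPLAIN ANALYZE'
:ea SELECT * FROM t WHERE id = 1;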

Oleg
On Tue, 2 Jul 2013, Joe Van Dyk wrote:


I frequently need to analyze the last query in psql:
   select * from table where id = 1;
   explain analyze select * from table where id = 1;

It would be nice to be able to do this:
   explain analyze $LAST

(or can I do something like that already?)

I'm not using psql interactively, I pipe files into psql (if it matters).

Joe



Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: o...@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] How to do a full-text search words within some proximity of each other?

2012-09-24 Thread Oleg Bartunov

something like this ?

http://www.sai.msu.su/~megera/wiki/2009-08-12
http://www.sai.msu.su/~megera/postgres/talks/algebra-fts.pdf

Unfortunately, we got no support for this work, so we stopped maintaining
the phrase-search patch. I am even thinking about kickstarter.com to get money

for this project :)
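
Built-in phrase operators eventually landed in PostgreSQL 9.6; on such versions a proximity restriction can be written directly, e.g. a minimal sketch:

SELECT to_tsvector('english', 'large white house')
       @@ to_tsquery('english', 'large <2> house');
-- true: 'house' appears exactly 2 positions after 'large'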

Oleg
On Mon, 24 Sep 2012, W. Matthew Wilson wrote:


I noticed in elastic search (ES), you can do queries like

   "a b"~4

I think this query will match stuff like "a b" and "a x x b" but not
something like "a x x x x x x x x b".

I'm not sure if this kind of thing is possible with postgresql full
text search.   Is it possible?

I understand that I can do a query and rank the results by how closely
the words are to each other, but I want to exclude any matches where
the words are not within two words of each other.

Thanks in advance!

Matt




Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: o...@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Simplifying the tsvector format for simple glossaries

2012-01-29 Thread Oleg Bartunov

Always use strip(to_tsvector()) and you'll be happy. Stop words affect
tsvector size, so if you don't search them, don't store them.

The strip() function is described in the docs.
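
A minimal sketch of that approach, with hypothetical table and column names:

-- strip() drops positions and weights, shrinking the stored tsvector
UPDATE error_logs SET tsv = strip(to_tsvector('english', message));
-- lexeme-only membership tests still work:
SELECT count(*) FROM error_logs WHERE tsv @@ to_tsquery('english', 'timeout');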

Oleg

On Sun, 29 Jan 2012, Marc Mamin wrote:



Hello,

We have a text search on data from error logs, and our application
offers a rather simple search on lexemes only (no weighting, no
neighbouring ...).
This works quite well, except when the applications generating the logs
get mad and we have to handle millions of messages per day :-)
We also have an ETL (perl) tool, that first transform the logs to CSV
files for COPY

My idea is to let perl create a list of single words for each message,
and run the search only on these glossaries.
Going further, I'd like to import these lists directly as tsvectors to
save a processing step within Postgres.

The standard tsvector representation in CSV would then look like

'lex_1':1 'lex_2':2 'lex_3':3 ...

when casting from text to tsvector, I've noticed with 9.1 that this simpler
format is valid too:

'lex_1 lex_2 lex_3 ...'

So my questions:
Is it safe to define tsvectors that way, or should I expect problems
with future releases being stricter with the tsvector format?

Do I have to respect the lexeme ordering within a tsvector (and using which
NLS format)?

Is it an issue if some tsvectors contain stop words, or is it just
annoying noise?

For the case when this simplification is fine, I'd suggest adding a
description of this possible tsvector representation to the docs.

best regards,

Marc Mamin



Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: o...@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Full Text Search, avoiding lexemes search

2012-01-27 Thread Oleg Bartunov

Daniel,

just use a different fts configuration for search, one which doesn't
include stemmers.
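
A hypothetical sketch of such a configuration (the name exact_es is assumed), mapping all word token types to the simple dictionary so no stemming happens:

CREATE TEXT SEARCH CONFIGURATION exact_es ( COPY = spanish );
ALTER TEXT SEARCH CONFIGURATION exact_es
  ALTER MAPPING FOR asciiword, asciihword, hword_asciipart, word, hword, hword_part
  WITH simple;
-- 'gato' no longer matches 'gatos':
SELECT to_tsvector('exact_es', 'gatos') @@ to_tsquery('exact_es', 'gato');  -- false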

Regards,
Oleg
On Fri, 27 Jan 2012, Daniel V?zquez wrote:


Hi guys!

Full text search searches by lexemes; this means that if you search for
the word gato you are really searching for the {gat} lexeme.
If you construct vectors for the words gato, gatos, gata, gatas, all
have the same lexeme {gat}.
Then the search gato (that is to say, the search {gat}) matches all the
previous vectors.

Is there some way (configuration, query) to match only gato and
avoid gatos, gata, gatas with FTS?
Or to match only gato and gatos but not gata and gatas?

Tnks!



Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: o...@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Full text search - How to build a filtering dictionary

2012-01-18 Thread Oleg Bartunov

Antonio,

you can see contrib/unaccent dictionary, which is a filtering 
dictionary. I have a page about it - http://mira.sai.msu.su/~megera/wiki/unaccent
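
On 9.1+ the module can be tried directly; a quick sketch:

CREATE EXTENSION unaccent;
SELECT ts_lexize('unaccent', 'Hôtel');  -- {Hotel}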





Oleg
On Wed, 18 Jan 2012, Antonio Franzoso wrote:


Hi all,
I need to build a synonym dictionary that performs a normalization of
tokens just like a filtering dictionary does. I've searched for a filtering
dictionary template but I haven't found one. Where can I find it? Or, if there
isn't such a template, how can I build a simple filter dictionary that
simply maps one term to another (in a synonym dict-like way)?


Thanks in advance,
Antonio




Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: o...@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] text search synonym dictionary anomaly with numbers

2011-11-27 Thread Oleg Bartunov

Richard,

you should check your mapping - '1st' belongs to 'numword' and may be processed
in a different way than 'first' or '1'.

Oleg
On Sat, 26 Nov 2011, Richard Greenwood wrote:


I am working with street address data in which 'first st' has been
entered as '1 st' and so on. So I have created a text search
dictionary with entries:
first  1
1st  1
And initially it seems to be working properly:

SELECT ts_lexize('rwg_synonym','first');
ts_lexize
---
{1}


SELECT ts_lexize('rwg_synonym','1st');
ts_lexize
---
{1}

But my queries on '1st' are not returning the expected results:

SELECT count(*) FROM parcel_attrib WHERE txtsrch @@ to_tsquery('1');
count
---
  403  - this is what I want

SELECT count(*) FROM parcel_attrib WHERE txtsrch @@ to_tsquery('first');
count
---
  403  - this is also good

SELECT count(*) FROM parcel_attrib WHERE txtsrch @@ to_tsquery('1st');
count
---
4  - this is not good. There are 4 records that do have '1st',
but why am I not getting 403 records?

Thanks for reading,
Rich




Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: o...@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Searching for bare letters

2011-10-02 Thread Oleg Bartunov
I don't see the problem - you can have a dictionary which does all the work
of recognizing bare letters and outputs several versions. Have you seen the
unaccent dictionary?
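
A sketch of wiring it into a configuration, so both the bare and the accented form normalize to the same lexeme (the configuration name mycfg is assumed):

ALTER TEXT SEARCH CONFIGURATION mycfg
  ALTER MAPPING FOR asciiword, word, hword, hword_part WITH unaccent, simple;
SELECT to_tsvector('mycfg', 'Muñoz') @@ to_tsquery('mycfg', 'Munoz');  -- true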

Oleg
On Sun, 2 Oct 2011, Uwe Schroeder wrote:


Hi, everyone.  Uwe wrote:

What kind of client are the users using?  I assume you will have some
kind of user interface. For me this is a typical job for a user
interface. The number of letters with equivalents in different
languages are extremely limited, so a simple matching routine in the
user interface should give you a way to issue the proper query.


The user interface will be via a Web application.  But we need to store
the data with the European characters, such as ñ, so that we can display
them appropriately.  So much as I like your suggestion, we need to do
the opposite of what you're saying -- namely, take a bare letter, and
then search for letters with accents and such on them.

I am beginning to think that storing two versions of each name, one bare
and the other not, might be the easiest way to go.   But hey, I'm open
to more suggestions.

Reuven



That still doesn't hinder you from using a matching algorithm. Here is a simple
example (to my understanding of the problem):
You have texts stored in the db containing both a n and a ñ. Now a client
enters n on the website. What you want to do is look for both variations, so
n translates into n or ñ.
There you have it. In the routine that receives the request you have a
matching method that matches on n (or any of the few other characters with
equivalents) and the routine will issue a query with a xx like %n% or xx
like %ñ% (personally I would use ilike, since that eliminates the case
problem).

Since you're referring to a name, I sure don't know the specifics of the
problem or data layout, but by what I know I think you can tackle this with a
rather primitive match - translate to kind of algorithm.

One thing I'd not do: store duplicate versions. There's always a way to deal
with data the way it is. In my opinion storing different versions of the same
data just bloats a database in favor of a smarter way to deal with the initial
data.

Uwe






Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: o...@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] ts_rank seems very slow (140 ranked documents / second on my machine)

2011-07-13 Thread Oleg Bartunov

On Tue, 12 Jul 2011, Nicolas Grilly wrote:


On Tue, Jul 12, 2011 at 22:25, Oleg Bartunov o...@sai.msu.su wrote:

I don't see your query uses index :)


Yes, I know. :)

I ran VACUUM ANALYZE and re-ran the query but the output of EXPLAIN
ANALYZE stays exactly the same: no index used.

Any idea why?


there is a problem with estimating the cost of scanning a gin index in pre-9.1 versions,
so you can
set enable_seqscan=off;

or try 9.1, which is in beta3 now.



By the way, is ts_rank supposed to use a GIN index when it's available?


no, I see no benefit :)

Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: o...@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] ts_rank seems very slow (140 ranked documents / second on my machine)

2011-07-13 Thread Oleg Bartunov

I didn't notice - reading 40K tuples in random order takes a long time, and this
is a problem for any database. Can you measure the time to read all the documents found?
:( The only solution I see is to store enough
information for ranking in the index.
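
One way to measure just the heap-read cost, a sketch using the table from the thread (EXPLAIN's BUFFERS option needs 9.0+):

EXPLAIN (ANALYZE, BUFFERS)
SELECT sum(length(document_vector)) FROM posts_100
WHERE to_tsquery('english', 'crare') @@ document_vector;
-- length(tsvector) forces each matching tsvector to be detoasted and read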

Oleg
On Wed, 13 Jul 2011, Nicolas Grilly wrote:


Hi Oleg and all,

On Wed, Jul 13, 2011 at 08:16, Oleg Bartunov o...@sai.msu.su wrote:

there is a problem with estimating the cost of scanning a gin index in pre-9.1
versions,
so you can set enable_seqscan=off;
or try 9.1, which is in beta3 now.


I re-ran my queries using enable seqscan=off.

Now the first query, without ts_rank, uses the GIN index:

set enable_seqscan=off;
explain analyze select id
from posts_100
where to_tsquery('english', 'crare') @@ document_vector
limit 50;

Limit  (cost=42290.12..42306.31 rows=50 width=4) (actual
time=16.259..16.412 rows=50 loops=1)
  ->  Bitmap Heap Scan on posts_100  (cost=42290.12..57877.02
rows=48152 width=4) (actual time=16.256..16.344 rows=50 loops=1)
    Recheck Cond: ('''crare'''::tsquery @@ document_vector)
    ->  Bitmap Index Scan on index_posts_documents_100
(cost=0.00..42278.08 rows=48152 width=0) (actual time=13.265..13.265
rows=49951 loops=1)
  Index Cond: ('''crare'''::tsquery @@ document_vector)
Total runtime: 16.484 ms

But the second query, the one that uses ts_rank, is still very slow...
Any idea why? Is ts_rank efficient enough to find the best 50 matches
among 50 000 documents?

set enable_seqscan=off;
explain analyze select id
from posts_100
where to_tsquery('english', 'crare') @@ document_vector
order by ts_rank_cd(document_vector, to_tsquery('english', 'crare'), 32) desc
limit 50;

Limit  (cost=59596.98..59597.10 rows=50 width=22) (actual
time=296212.052..296212.257 rows=50 loops=1)
  ->  Sort  (cost=59596.98..59717.36 rows=48152 width=22) (actual
time=296186.928..296187.007 rows=50 loops=1)
    Sort Key: (ts_rank_cd(document_vector, '''crare'''::tsquery, 32))
    Sort Method:  top-N heapsort  Memory: 27kB
    ->  Bitmap Heap Scan on posts_100
(cost=42290.12..57997.40 rows=48152 width=22) (actual
time=70.861..296059.515 rows=49951 loops=1)
  Recheck Cond: ('''crare'''::tsquery @@ document_vector)
  ->  Bitmap Index Scan on index_posts_documents_100
(cost=0.00..42278.08 rows=48152 width=0) (actual time=24.922..24.922
rows=49951 loops=1)
   Index Cond: ('''crare'''::tsquery @@ document_vector)
Total runtime: 296220.493 ms


By the way, is ts_rank supposed to use a GIN index when it's
available?


no, I see no benefit :)


Ok. But what is the solution to improve ts_rank execution time? Am I
doing something wrong?

Thanks for your help,

Nicolas



Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: o...@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] ts_rank seems very slow (140 ranked documents / second on my machine)

2011-07-12 Thread Oleg Bartunov

I don't see your query uses index :)

On Tue, 12 Jul 2011, Nicolas Grilly wrote:


Hello,

I'm testing PostgreSQL full-text search on a table containing
1.000.000 documents. Document average length is 5.700 chars.
Performance is good and very similar to what I can get with Xapian if
I don't use ts_rank. But response time collapses if I use ts_rank to
select the 50 best matching documents.

This is the table and index definition:

create table posts_100 (
 id serial primary key,
 document_vector tsvector
);
create index index_posts_documents_100 ON posts_100 USING
gin(document_vector);

This is the query without ts_rank (the word 'crare' matches 5 % of documents):

select id
from posts_100
where to_tsquery('english', 'crare') @@ document_vector
limit 50

Limit  (cost=0.00..27.93 rows=50 width=4) (actual time=0.303..12.559
rows=50 loops=1)
 Output: id
  ->  Seq Scan on posts_100  (cost=0.00..27472.51 rows=49184
width=4) (actual time=0.299..12.451 rows=50 loops=1)
   Output: id
   Filter: ('''crare'''::tsquery @@ document_vector)
Total runtime: 12.642 ms

Now, this is the query using ts_rank:

select id
from posts_100
where to_tsquery('english', 'crare') @@ document_vector
order by ts_rank_cd(document_vector, to_tsquery('english', 'crare'), 32) desc
limit 50

Limit  (cost=29229.33..29229.45 rows=50 width=22) (actual
time=355516.233..355516.339 rows=50 loops=1)
 Output: id
  ->  Sort  (cost=29229.33..29352.29 rows=49184 width=22) (actual
time=355516.230..355516.268 rows=50 loops=1)
    Output: id
    Sort Key: (ts_rank_cd(document_vector, '''crare'''::tsquery, 32))
    Sort Method:  top-N heapsort  Memory: 27kB
    ->  Seq Scan on posts_100  (cost=0.00..27595.47 rows=49184
width=22) (actual time=0.251..355389.367 rows=49951 loops=1)
 Output: id
 Filter: ('''crare'''::tsquery @@ document_vector)
Total runtime: 355535.063 ms

The ranking is very slow: 140 ranked documents / second on my machine!

I'm afraid this is because ts_rank needs to read document_vector, and
because that column is stored in TOAST table, it triggers a random
access for each matching row. Am I correct? Is it the expected
behavior? Is there a way to reduce the execution time?

I use PostgreSQL 8.4 with shared_buffers = 256 MB, work_mem = 256 MB.

Thanks for your help and advice.




Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: o...@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] simple update query too long

2011-05-13 Thread Oleg Bartunov

On Fri, 13 May 2011, F T wrote:


Thanks for your ideas.

I have rerun my tests and I agree with Merlin, PostgreSQL is not well suited
to handling wide updates.

Summary :
The table contains 2 millions rows.

Test 1 :
UPDATE grille SET inter=0; - it took 10 hours

Test 2 :
I remove the spatial Gist index, and the constraints : I just keep the
primary key.
UPDATE grille SET inter=0; - it tooks 6 hours.

This is better but it is still not acceptable.

And if I run CREATE TABLE test AS SELECT * FROM grille, it only takes 11
seconds, incredible...


I'm not surprised; sequential read is way faster than random.
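
That asymmetry is why rebuilding the table can beat a table-wide UPDATE; a hypothetical sketch (column names pk and geom are assumed, and indexes/constraints must be recreated afterwards):

BEGIN;
CREATE TABLE grille_new AS
  SELECT pk, geom, 0 AS inter FROM grille;  -- assumed columns
ALTER TABLE grille RENAME TO grille_old;
ALTER TABLE grille_new RENAME TO grille;
COMMIT;
-- then recreate the primary key, the GiST index and any constraints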




Fabrice





2011/5/9 Merlin Moncure mmonc...@gmail.com


On Mon, May 9, 2011 at 10:29 AM,  t...@fuzzy.cz wrote:

On 05/09/2011 04:39 PM, F T wrote:

Hi list

I use PostgreSQL 8.4.4. (with Postgis 1.4)

I have a simple update query that takes hours to run.
The table is rather big (2 million records) but it takes more than 5
hours
to run!!

The query is just :
*UPDATE grille SET inter = 0*




So any ideas why it is so long???



You've got three indexes, so you have the update on the table *and* the
three indexes. Moreover, one of your indexes is a GiST with some PostGIS
geometry. It usually takes quite some (long) time to update such an index.


That only holds if the index needs to be updated. He's updating a column
that is not indexed, so with a bit of luck the HOT might kick in. In that
case the table would not bloat, the indexes would not need to be updated
(and would not bloat) etc.

The question is whether HOT may work in this particular case.


HOT unfortunately does not provide a whole lot of benefit for this
case. HOT likes brief, small transactions, so that the in-page cleanup work
can be done as early as possible.  The nature of postgres is such that
you want to do everything you can to avoid table-wide updates (up to
and including building a new table instead).

merlin





Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: o...@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] full text search to_tsquery performance with ispell dictionary

2011-05-11 Thread Oleg Bartunov

On Wed, 11 May 2011, Stanislav Raskin wrote:





Yes, loading a large dictionary is known to be a fairly expensive
operation.  There's been discussions about how to make it cheaper, but
nothing's been done yet.

   regards, tom lane


Hi Tom,

thanks for the quick response. Bad news for me ;(
We develop ajax-driven web apps, which sort of rely on quick calls to data
services. Each call to a service opens a new connection. This makes the
search service, if using fts and ispell, about 100 times slower than a
dumb ILIKE-implementation.

Is there any hack or compromise to achieve good performance without
losing fts ability?
I am thinking, for example, of a way to permanently keep a loaded
dictionary in memory instead of loading it for every connection. As I
wrote in response to Pavel Stehule's post, connection pooling is not
really an option.
Our front-end is strictly PHP, so I was thinking about using a single
persistent connection
(http://de.php.net/manual/en/function.pg-pconnect.php) for all calls. Is
there some sort of major disadvantage in this approach from the database
point of view?

Kind regards

--

Stanislav Raskin






Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: o...@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] GIN index not used

2011-04-28 Thread Oleg Bartunov

It should be better in 9.1
http://archives.postgresql.org/message-id/4c2ddc9b.1060...@sigaev.ru

Oleg
On Wed, 27 Apr 2011, Mark wrote:


I have a problem with a GIN index. Queries over it take a lot of time. Some
information:

I've got a table with tsvector- textvector:
CREATE TABLE mediawiki.pagecontent
(
 old_id integer NOT NULL DEFAULT
nextval('mediawiki.text_old_id_seq'::regclass),
 old_text text,
 old_flags text,
 textvector tsvector,
 CONSTRAINT pagecontent_pkey PRIMARY KEY (old_id)
)
The table has about 311 000 rows.
I've created GIN index over textvector:
CREATE INDEX gin_index2 ON mediawiki.pagecontent USING gin (textvector);

At first all was ok, but in the last days I found out that the queries take a lot of
time (10 sec and more). When I run EXPLAIN ANALYZE on the query
SELECT old_id FROM mediawiki.pagecontent WHERE (textvector @@
(to_tsquery('den')))
the result is this:
Bitmap Heap Scan on pagecontent  (cost=8677.26..26663.01 rows=9037 width=4)
(actual time=8.258..8131.677 rows=9093 loops=1)
  Recheck Cond: (textvector @@ to_tsquery('den'::text))
   ->  Bitmap Index Scan on gin_index2  (cost=0.00..8675.00 rows=9037
width=0) (actual time=6.002..6.002 rows=9093 loops=1)
Index Cond: (textvector @@ to_tsquery('den'::text))
Total runtime: 8150.949 ms

It seems that the GIN index was not used. At first I thought that there was
some mess in the db, because I had run some experiments (deleting and importing
the data and the GIN index), so I used vacuum; afterwards I tried pg_dump and
pg_restore, but it did not help.
Could you please point me in the right direction - where could the
problem be?
Thanks a lot
Mark

--
View this message in context: 
http://postgresql.1045698.n5.nabble.com/GIN-index-not-used-tp4344826p4344826.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.




Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: o...@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] fulltext search and hunspell

2011-02-08 Thread Oleg Bartunov

Jens,

have you tried german compound dictionary from 
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/


Oleg
On Tue, 8 Feb 2011, Jens Sauer wrote:


Hey,

thanks for your answer.

First I checked the links in the tsearch_data directory
de_de.affix, and de_de.dict are symlinks to the corresponding files in
/var/cache/postgresql/dicts/
Then I recreated them by using pg_updatedicts.

This is an extract of the de_de.affix file:

# this is the affix file of the de_DE Hunspell dictionary
# derived from the igerman98 dictionary
#
# Version: 20091006 (build 20100127)
#
# Copyright (C) 1998-2009 Bjoern Jacke bjo...@j3e.de
#
# License: GPLv2, GPLv3 or OASIS distribution license agreement
# There should be a copy of both of this licenses included
# with every distribution of this dictionary. Modified
# versions using the GPL may only include the GPL

SET ISO8859-1
TRY esijanrtolcdugmphbyfvkwqxz??ESIJANRTOLCDUGMPHBYFVKWQXZ-.

PFX U Y 1
PFX U   0 un   .

PFX V Y 1
PFX V   0 ver  .

SFX F Y 35
[...]

I cannot find compoundwords controlled z there, so I manually added it.

[...]
# versions using the GPL may only include the GPL

compoundwords  controlled z

SET ISO8859-1
TRY esijanrtolcdugmphbyfvkwqxz??ESIJANRTOLCDUGMPHBYFVKWQXZ-.
[...]

Then I restarted PostgreSQL.

Now I get an error:
SELECT * FROM ts_debug('Schokoladenfabrik');
FEHLER:  falsches Affixdateiformat für Flag
CONTEXT:  Zeile 18 in Konfigurationsdatei
»/usr/share/postgresql/8.4/tsearch_data/de_de.affix«: »PFX U Y 1
«
SQL-Funktion »ts_debug« Anweisung 1
SQL-Funktion »ts_debug« Anweisung 1

Which means:
ERROR: wrong Affixfileformat for flag
CONTEXT: Line 18 in Configuration ...

If I add
COMPOUNDFLAG Z
ONLYINCOMPOUND L

instead of compoundwords  controlled z

I didn't get an error:

SELECT * FROM ts_debug('Schokoladenfabrik');
  alias   |   description   |   token   |
dictionaries  | dictionary  |  lexemes
---+-+---+---+-+---
asciiword | Word, all ASCII | Schokoladenfabrik |
{german_hunspell,german_stem} | german_stem | {schokoladenfabr}
(1 row)

But it seems that the hunspell dictionary is not working for compound words.

Maybe pg_updatedicts has a bug and generates affix files in the wrong format?

Jens

2011/2/7 Oleg Bartunov o...@sai.msu.su:

Jens,

could you check the affix file for
compoundwords  controlled z

also, can you provide a link to the dictionary files, so we can check if they
are supported, since we have only rudimentary support of hunspell.
btw, it'd be nice to have output from ts_debug() to make sure the dictionaries
are actually used.

Oleg




Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: o...@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] fulltext search and hunspell

2011-02-07 Thread Oleg Bartunov

Jens,

could you check the affix file for
compoundwords  controlled z

also, can you provide a link to the dictionary files, so we can check if they
are supported, since we have only rudimentary support of hunspell.
btw, it'd be nice to have output from ts_debug() to make sure the dictionaries
are actually used.

Oleg

On Mon, 7 Feb 2011, Jens Sauer wrote:


Hey,

I want to use hunspell as a dictionary for the full text search by

* using PostgresSQL 8.4.7
* installing hunspell-de-de, hunspell-de-med
* creating a dictionary:

CREATE TEXT SEARCH DICTIONARY german_hunspell (
   TEMPLATE = ispell,
   DictFile = de_de,
   AffFile = de_de,
   StopWords = german
);

* changing the config

ALTER TEXT SEARCH CONFIGURATION german
   ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
 word, hword, hword_part
   WITH german_hunspell, german_stem;

* now testing the lexizer:

SELECT ts_lexize('german_hunspell', 'Schokaladenfarik');
ts_lexize
---

(1 Zeile)

Shouldn't it be something like this:
SELECT ts_lexize('norwegian_ispell', 'sjokoladefabrikk');
  {sjokoladefabrikk,sjokolade,fabrikk}
(from the 8.4 documentation of PostgreSQL)


The dict and affix files in the tsearch_data directory were
automatically generated by pg_updatedicts.

Is this a problem of the compound word splitting functionality? Should
I use ispell instead of hunspell?

Thanks




Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: o...@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Full Text Index Scanning

2011-01-29 Thread Oleg Bartunov

What version of Pg do you run? Try the latest version.

Oleg

On Sat, 29 Jan 2011, Matt Warner wrote:


Reverse isn't a built-in Postgres function, so I found one and installed it.
However, attempting to use it in creating an index gets me the message
ERROR:  functions in index expression must be marked IMMUTABLE, even
though the function declaration already has the immutable argument.

Is there a specific version of the reverse function you're using? Or am I
just missing something obvious? This is Postgres 9, BTW.

Thanks,

Matt

On Sat, Jan 29, 2011 at 6:46 AM, Matt Warner m...@warnertechnology.comwrote:


Thanks Oleg. I'm going to have to experiment with this so that I understand
it better.

Matt


On Fri, Jan 28, 2011 at 1:12 PM, Oleg Bartunov o...@sai.msu.su wrote:


Matt, I'd try to use prefix search on original string concatenated with
reverse string:

Just tried on some spare table

knn=# \d spot_toulouse
   Table public.spot_toulouse
  Column|   Type| Modifiers
-+---+---
 clean_name  | character varying |


1. create index knn=# create index clean_name_tlz_idx on spot_toulouse
using gin(to_tsvector('french', clean_name || ' ' || reverse(clean_name)));
2.
select clean_name from spot_toulouse where to_tsvector('french',
clean_name|| ' ' ||  reverse(clean_name) ) @@ to_tsquery('french','the:* |
et:*');

The select looks cumbersome, but you can always write wrapper functions. The
only drawback I see for now is that the ranking function will be a bit confused,
since the coordinates of the original and reversed words will not be the same, but
again, it's possible to obtain the tsvector by a custom function which is aware
of the reversing.

Good luck and let me know if this helps you.

Oleg


On Fri, 28 Jan 2011, Matt Warner wrote:

 I'm in the process of migrating a project from Oracle to Postgres and

have
run into a feature question. I know that Postgres has a full-text search
feature, but it does not allow scanning the index (as opposed to the
data).
Specifically, in Oracle you can do select * from table where
contains(colname,'%part_of_word%') > 1. While this isn't terribly
efficient,
it's much faster than full-scanning the raw data and is relatively quick.

It doesn't seem that Postgres works this way. Attempting to do this
returns
no rows: select * from table where to_tsvector(colname) @@
to_tsquery('%part_of_word%')

The reason I want to do this is that the partial word search does not
involve dictionary words (it's scanning names).

Is this something Postgres can do? Or is there a different way to do scan
the index?

TIA,

Matt



   Regards,
   Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: o...@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83








Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: o...@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Full Text Index Scanning

2011-01-28 Thread Oleg Bartunov

Matt, I'd try to use prefix search on original string concatenated with reverse 
string:

Just tried on some spare table

knn=# \d spot_toulouse
Table public.spot_toulouse
   Column|   Type| Modifiers 
-+---+---

 clean_name  | character varying |


1. create index 
knn=# create index clean_name_tlz_idx on spot_toulouse using gin(to_tsvector('french', clean_name || ' ' || reverse(clean_name)));

2.
select clean_name from spot_toulouse where to_tsvector('french', clean_name|| ' 
' ||  reverse(clean_name) ) @@ to_tsquery('french','the:* | et:*');

The select looks cumbersome, but you can always write wrapper functions.
The only drawback I see for now is that the ranking function will be a bit confused,
since the coordinates of the original and reversed words will not be the same,
but again, it's possible to obtain the tsvector by a custom function which
is aware of the reversing.
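
A hypothetical wrapper along those lines (it assumes a reverse() function exists and is IMMUTABLE; a built-in one appeared in 9.1):

CREATE FUNCTION tsv_mirrored(regconfig, text) RETURNS tsvector AS $$
  -- index the original text followed by its reversed form
  SELECT to_tsvector($1, $2 || ' ' || reverse($2))
$$ LANGUAGE sql IMMUTABLE;

CREATE INDEX clean_name_mirror_idx ON spot_toulouse
  USING gin (tsv_mirrored('french', clean_name));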


Good luck and let me know if this helps you.

Oleg

On Fri, 28 Jan 2011, Matt Warner wrote:


I'm in the process of migrating a project from Oracle to Postgres and have
run into a feature question. I know that Postgres has a full-text search
feature, but it does not allow scanning the index (as opposed to the data).
Specifically, in Oracle you can do select * from table where
contains(colname,'%part_of_word%') > 1. While this isn't terribly efficient,
it's much faster than full-scanning the raw data and is relatively quick.

It doesn't seem that Postgres works this way. Attempting to do this returns
no rows: select * from table where to_tsvector(colname) @@
to_tsquery('%part_of_word%')

The reason I want to do this is that the partial word search does not
involve dictionary words (it's scanning names).

Is this something Postgres can do? Or is there a different way to do scan
the index?

TIA,

Matt



Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: o...@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] FTS phrase searches

2010-12-20 Thread Oleg Bartunov

On Sun, 19 Dec 2010, Glenn Maynard wrote:


2010/12/19 Oleg Bartunov o...@sai.msu.su:

You might be interested in http://www.sai.msu.su/~megera/wiki/2009-08-12


Thanks, that looks pretty much like what I had in mind.  Hopefully
that'll get merged for 9.0+1; phrases are a major part of all text
searches.


Several companies are interested in phrase search, but actually we got no
support for this, so we postponed it.



Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: o...@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] FTS phrase searches

2010-12-19 Thread Oleg Bartunov

You might be interested in http://www.sai.msu.su/~megera/wiki/2009-08-12

Oleg
On Sun, 19 Dec 2010, Glenn Maynard wrote:


I guess no response means it's not possible.  I ended up doing a
manual substring match for quoted strings, but that's a poor hack.
Maybe I'll take a poke at implementing something like
tsvector_contains_phrase; it seems like a natural extension of what's
in there now.


On Mon, Nov 1, 2010 at 4:35 PM, Glenn Maynard gl...@zewt.org wrote:

How are adjacent word searches handled with FTS?  tsquery doesn't do
this, so I assume this has to be done as a separate filter step, eg.:

  # large house sales
  SELECT * FROM data WHERE fts @@ to_tsquery('large & house & sales')
AND tsvector_contains_phrase(fts, to_tsvector('large house'));

to do an indexed search for 'large & house & sales' and then to narrow
the results to where 'large house' actually appears as a phrase (eg.
adjacent positions at the same weight).  I can't find any function to
do that, though.  (Presumably, it would return true if all of the
words in the second tsvector exist in the first, with the same
positions relative to each other.)

tsvector @> tsvector seems logical, but isn't supported.

This isn't as simple as using LIKE, since that'll ignore stemming,
tokenization rules, etc.  If the language rules allow this to match
larger house or large-house, then a phrase restriction should,
too.  It's also painful when the FTS column is an aggregate of several
other columns (eg. title and body), since a LIKE match needs to know
that and check all of them separately.

Any hints?  This is pretty important to even simpler search systems.





Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: o...@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] FTS GIN Index Question

2010-09-29 Thread Oleg Bartunov

Christian,

On Wed, 29 Sep 2010, Christian Ramseyer wrote:


Hi List

I have a largish partitioned table, it has ~60 million records in each of 12 
partitions. It appears that a Full Text Index could speed up some user 
queries a lot.


A quick test with an additional tsvector column revealed that this would take 
up around 35 GB of space for this column and then maybe 5 more for the gin 
index on it. As this is a lot of space (~ 480 GB), I'm a bit tempted to use a 
gin index without the separate tsvector column. However, the doc says that 
this will be slower.


Do you have a problem with disk space? Searching an index is usually a very fast
operation; only a small part of the index is read.  Did you check the time to read
the index?



Does anyone have an idea of how much slower we're talking here? The index
definition would be a concatenation of two setweights(), i.e.:


... using gin(
 (setweight(to_tsvector('config',coalesce(col1,'')), 'A') ||
  setweight(to_tsvector('config',coalesce(col2,'')), 'B')))

Also, general recommendations regarding full text search configurations of 
that size are very welcome.


Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: o...@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Visualize GiST Index

2010-09-27 Thread Oleg Bartunov

Get gevel from cvs, address is on http://www.sai.msu.su/~megera/wiki/Gevel

btw, have you seen http://www.sai.msu.su/~megera/wiki/Rtree_Index ?

Oleg

On Mon, 27 Sep 2010, paolo wrote:



Hi all,
we are making some experiments with postgresql and postgis. We need to
visualize R-trees and are trying to use GiST and gevel. During the
installation phase of gevel we had the following output:


sed 's,MODULE_PATHNAME,$libdir/gevel,g' gevel.sql.in > gevel.sql
gcc -no-cpp-precomp -O2 -Wall -Wmissing-prototypes -Wpointer-arith
-Wdeclaration-after-statement -Wendif-labels -fno-strict-aliasing -fwrapv
-I. -I../../src/include   -c -o gevel.o gevel.c
gevel.c: In function 'gist_dumptree':
gevel.c:99: warning: format '%d' expects type 'int', but argument 10 has
type 'Size'
gevel.c: In function 'gist_tree':
gevel.c:134: error: too many arguments to function
'stringToQualifiedNameList'
gevel.c: In function 'gist_stat':
gevel.c:225: error: too many arguments to function
'stringToQualifiedNameList'
gevel.c: In function 'setup_firstcall':
gevel.c:325: error: too many arguments to function
'stringToQualifiedNameList'
make: *** [gevel.o] Error 1


We are running Postgresql 8.4.4, can you please help us?

Paolo  Waqas



Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: o...@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Visualize GiST Index

2010-09-26 Thread Oleg Bartunov
We never expected gevel would be used by users :-)
We'll fix it.


Oleg
On Thu, 23 Sep 2010, Tom Lane wrote:


Andrew Hunter ahun...@ucalgary.ca writes:

Here is the content of the gevel Makefile



subdir = contrib/gevel
top_builddir = ../..
include $(top_builddir)/src/Makefile.global



MODULES = gevel
DATA_built = gevel.sql
DOCS = README.gevel
REGRESS = gevel



include $(top_srcdir)/contrib/contrib-global.mk


Oh ... well, there's your problem: it's not even *trying* to use PGXS.

Replace it with this:


MODULES = gevel
DATA_built = gevel.sql
DOCS = README.gevel
REGRESS = gevel

ifdef USE_PGXS
PG_CONFIG = pg_config
PGXS := $(shell $(PG_CONFIG) --pgxs)
include $(PGXS)
else
subdir = contrib/gevel
top_builddir = ../..
include $(top_builddir)/src/Makefile.global
include $(top_srcdir)/contrib/contrib-global.mk
endif


... and file a bug with the gevel authors, whoever they are,
telling them to fix that upstream.

regards, tom lane




Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: o...@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] GiN indexes

2010-08-31 Thread Oleg Bartunov

On Tue, 31 Aug 2010, Stavroula Gasparotto wrote:


Currently, only the B-tree, GiST and GIN index types support
multicolumn indexes.



What does this mean exactly if I'm trying to create a multicolumn GIN
index? Does this mean the index can contain one or more tsvector type
fields only, or can I combine a tsvector type field with other type fields
in the index, such as a timestamp column?


It does both.
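
A sketch of the mixed-type case (names are assumed; indexing a scalar column with GIN needs the btree_gin contrib module):

CREATE EXTENSION btree_gin;  -- 9.1+ syntax; earlier versions install the contrib SQL script
CREATE INDEX docs_tsv_ts_idx ON docs USING gin (tsv, created_at);
SELECT * FROM docs
 WHERE tsv @@ to_tsquery('english', 'report')
   AND created_at = '2010-08-31';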

Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: o...@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Differences between to_tsquery() and plainto_tsquery() when it comes to applying an index?

2010-08-25 Thread Oleg Bartunov

Denis,

we need examples of your explain analyze. I don't want to waste my time
reading theoretical reasoning :)

btw, be sure you use the same search configuration as in the create index, or the
index will not be used at all.
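
For reference, a sketch of the matching pair: the expression index stores to_tsvector('english', col), so only a query spelled the same way can use it:

-- can use the index (same expression, same config as the CREATE INDEX):
SELECT pk FROM item
 WHERE to_tsvector('english', searchable_text)
       @@ plainto_tsquery('english', 'search phrase');
-- whereas "searchable_text @@ ..." casts the text with the default config
-- and does not match the index expression.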

Oleg

On Wed, 25 Aug 2010, Denis Papathanasiou wrote:


As a follow-up to my question from this past Saturday
(http://archives.postgresql.org/pgsql-general/2010-08/msg00624.php), I
experimented with adding two types of indices to the same text column:

CREATE INDEX item_eng_searchable_text_idx ON item USING
gin(to_tsvector('english', searchable_text));

and

CREATE INDEX item_searchable_text_idx ON item (searchable_text);

Running my queries under explain, I noticed that queries of the form:

select pk from item where searchable_text @@ plainto_tsquery('search
phrase');

Actually run *slower* with the item_eng_searchable_text_idx index applied.

But when I switch the query to use to_tsquery() instead, e.g. something
like this:

select pk from item where searchable_text @@ to_tsquery('phrase');

The performance is better.

Is this because the gin/to_tsvector() index works differently for
to_tsquery() compared to plainto_tsquery()?

If so, how can I create an effective index for queries that will use
plainto_tsquery() ?

Note that I need the plainto_tsquery() form b/c my search phrases will
correspond to exact blocks of text, and therefore they will contain
spaces and punctuation, etc.





Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: o...@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Differences between to_tsquery() and plainto_tsquery() when it comes to applying an index?

2010-08-25 Thread Oleg Bartunov

Try this select

alerts=> explain analyze select item_pk from node where
 tag='primaryIssuer.entityType' and val @@ plainto_tsquery('english','Limited
 Partnership');

Read 12.2.2. Creating Indexes at 
http://www.postgresql.org/docs/8.4/static/textsearch-tables.html

CREATE INDEX pgweb_idx ON pgweb USING gin(to_tsvector('english', body));

Notice that the 2-argument version of to_tsvector is used. Only text search 
functions that specify a configuration name can be used in expression indexes 
(Section 11.7). This is because the index contents must be unaffected by 
default_text_search_config. If they were affected, the index contents might be 
inconsistent because different entries could contain tsvectors that were 
created with different text search configurations, and there would be no way to 
guess which was which. It would be impossible to dump and restore such an index 
correctly.



Oleg

On Wed, 25 Aug 2010, Denis Papathanasiou wrote:




we need examples of your explain analyze. I don't want to waste my time
reading theoretical reasoning :)


Here's an actual 'explain analyze' example:

alerts=> CREATE INDEX node_val_tsv_idx ON node USING 
gin(to_tsvector('english', val));

CREATE INDEX
alerts=> explain analyze select item_pk from node where 
tag='primaryIssuer.entityType' and val @@ plainto_tsquery('Limited 
Partnership');
QUERY PLAN 
-
Bitmap Heap Scan on node  (cost=204.26..5792.92 rows=4 width=16) (actual 
time=2.952..131.868 rows=953 loops=1)

  Recheck Cond: (tag = 'primaryIssuer.entityType'::text)
  Filter: (val @@ plainto_tsquery('Limited Partnership'::text))
   ->  Bitmap Index Scan on node_tag_idx  (cost=0.00..204.26 rows=3712
width=0) (actual time=1.628..1.628 rows=3631 loops=1)

Index Cond: (tag = 'primaryIssuer.entityType'::text)
Total runtime: 133.345 ms
(6 rows)

alerts=> DROP INDEX node_val_tsv_idx;
DROP INDEX
alerts=> explain analyze select item_pk from node where 
tag='primaryIssuer.entityType' and val @@ plainto_tsquery('Limited 
Partnership');
QUERY PLAN 
-
Bitmap Heap Scan on node  (cost=204.26..5792.92 rows=4 width=16) (actual 
time=2.938..93.239 rows=953 loops=1)

  Recheck Cond: (tag = 'primaryIssuer.entityType'::text)
  Filter: (val @@ plainto_tsquery('Limited Partnership'::text))
   ->  Bitmap Index Scan on node_tag_idx  (cost=0.00..204.26 rows=3712
width=0) (actual time=1.614..1.614 rows=3631 loops=1)

Index Cond: (tag = 'primaryIssuer.entityType'::text)
Total runtime: 94.696 ms
(6 rows)

The table this is run against is defined like this:

CREATE TABLE node (
   pk uuid primary key,
   item_pk uuid not null references item (pk),
   tag text not null,
   val text
);

In addition to the gin/ts_vector index on node.val shown above, there are two 
other explicit indices on this table:


CREATE INDEX node_tag_idx ON node (tag);
CREATE INDEX node_val_idx ON node (val);

The reason for the node_val_idx index is that there will be cases where the 
query phrase is known exactly, so the where clause in the select statement 
will be just val = 'Limited Partnership'.



btw, be sure you use the same search configuration as in the create index, or the
index will not be used at all.


Is this indeed the problem here?

The explain output references val @@ plainto_tsquery() but as a filter, 
whereas the tag portion of the statement mentions node_tag_idx as the index 
it used.




Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: o...@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] FTS wildcard and custom ispell dictionary problem

2010-08-19 Thread Oleg Bartunov

On Thu, 19 Aug 2010, darklow wrote:


Hello,

I am using PostgreSQL 8.4 full text search in following way:

Custom FTS configuration called dc2 with these dictionaries in
following order for asciihword token: latvian_ispell, english_stem,
russian_stem


russian_stem will never be called! (english_stem accepts every token, so any
dictionary listed after it is never consulted.) Sorry, I have no time right now, will
be back later.




The Latvian ispell dictionary contains words with different endings but
the same meaning (latvian language specifics, plural words, etc).
The problem starts when using wildcard :* to_tsquery syntax.

For example, if I look for the word kriev I automatically add the
wildcard using the syntax:  to_tsquery('dc2', 'kriev:*');

By searching kriev:* FTS finds the word krievs in the latvian_ispell
dictionary, which is totally ok.

SELECT * from ts_debug('dc2', 'kriev:*');
 alias   |   description   | token | dictionaries
 |   dictionary   | lexemes
---+-+---+--++--
asciiword | Word, all ASCII | kriev |
{latvian_ispell,english_stem,russian_stem} | latvian_ispell | {krievs}
blank | Space symbols   | :*| {}

If I understand correctly, the database now uses not kriev:* but krievs:* for
the following queries.

And here is the problem: the data also contains the word Krievija, and in
this case the search doesn't find it, because now it looks for Krievs:*
and not Kriev:* anymore.

Is there any solution anyone could suggest to get results by both
criteria - kriev:* (the starting query) and krievs:* (found in the ispell
dict)?

The only idea I had is to somehow combine two tsqueries -
to_tsquery('dc2', 'kriev:*') and to_tsquery('english', 'kriev:*') - so
the search looks for both kriev:* and krievs:*, but anyway I didn't
figure out any syntax I could use :(

Thanks




Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: o...@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] FTS wildcard and custom ispell dictionary problem

2010-08-19 Thread Oleg Bartunov

On Thu, 19 Aug 2010, darklow wrote:


Hello,

I am using PostgreSQL 8.4 full text search in following way:

Custom FTS configuration called dc2 with these dictionaries in
following order for asciihword token: latvian_ispell, english_stem,
russian_stem

The Latvian ispell dictionary contains words with different endings but
the same meaning (latvian language specifics, plural words, etc).
The problem starts when using wildcard :* to_tsquery syntax.

For example, if I look for the word kriev I automatically add the
wildcard using the syntax:  to_tsquery('dc2', 'kriev:*');

By searching kriev:* FTS finds the word krievs in the latvian_ispell
dictionary, which is totally ok.

SELECT * from ts_debug('dc2', 'kriev:*');
 alias   |   description   | token | dictionaries
 |   dictionary   | lexemes
---+-+---+--++--
asciiword | Word, all ASCII | kriev |
{latvian_ispell,english_stem,russian_stem} | latvian_ispell | {krievs}
blank | Space symbols   | :*| {}

If I understand correctly, the database now uses not kriev:* but krievs:* for
the following queries.

And here is the problem: the data also contains the word Krievija, and in
this case the search doesn't find it, because now it looks for Krievs:*
and not Kriev:* anymore.

Is there any solution anyone could suggest to get results by both
criteria - kriev:* (the starting query) and krievs:* (found in the ispell
dict)?

The only idea I had is to somehow combine two tsqueries -
to_tsquery('dc2', 'kriev:*') and to_tsquery('english', 'kriev:*') - so
the search looks for both kriev:* and krievs:*, but anyway I didn't
figure out any syntax I could use :(


select to_tsquery('dc2', 'kriev:*') || to_tsquery('english', 'kriev:*');



Thanks




Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: o...@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Need help with full text index configuration

2010-07-29 Thread Oleg Bartunov

Brian,

you have two options:
1. Use your own parser (just modify default)
2. Use replace function, like
postgres=# select to_tsvector( replace('qw/er/ty','/',' '));
      to_tsvector
-----------------------
 'er':2 'qw':1 'ty':3
(1 row)


Oleg


On Wed, 28 Jul 2010, Brian Hirt wrote:


I have some data that can be searched, and it looks like the parser is making 
some assumptions about the data that aren't true in our case and I'm trying to 
figure out how to exclude a token type.   I haven't been able to find the 
answer to my question so far, so I thought I would ask here.

The data I have are english words, and sometimes there are words separated by a 
/ without spaces.   The parser finds these things and tokenizes them as files.  
 I'm sure in some situations that's the right assumption, but based on my data, 
I know there will never be a file name in the column.

For example instead of the parser recognizing three asciiword it recognizes one 
asciiword and one file.   I'd like a way to have the / just get parsed as blank.

db=# select * from ts_debug('english','maybe five/six');
  alias   |description|  token   |  dictionaries  |  dictionary  |  
lexemes
---+---+--++--+
asciiword | Word, all ASCII   | maybe| {english_stem} | english_stem | 
{mayb}
blank | Space symbols |  | {} |  |
file  | File or path name | five/six | {simple}   | simple   | 
{five/six}
(3 rows)

I thought that maybe I could create a new configuration and drop the file 
mapping, but that doesn't seem to work either.

db=# CREATE TEXT SEARCH CONFIGURATION public.testd ( COPY = pg_catalog.english 
);
CREATE TEXT SEARCH CONFIGURATION
db=# ALTER TEXT SEARCH CONFIGURATION testd DROP MAPPING FOR file;
ALTER TEXT SEARCH CONFIGURATION
db=# SELECT * FROM ts_debug('testd','mabye five/six');
  alias   |description|  token   |  dictionaries  |  dictionary  | 
lexemes
---+---+--++--+-
asciiword | Word, all ASCII   | mabye| {english_stem} | english_stem | 
{maby}
blank | Space symbols |  | {} |  |
file  | File or path name | five/six | {} |  |
(3 rows)


Is there any way to do this?

Thanks for the help in advance.  I'm running 8.4.4



Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: o...@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Incorrect FTS result with GIN index

2010-07-28 Thread Oleg Bartunov

Tom,

you can download dump http://mira.sai.msu.su/~megera/tmp/search_tab.dump

Oleg
On Tue, 27 Jul 2010, Tom Lane wrote:


Oleg Bartunov o...@sai.msu.su writes:

I recommend posting your problem to the -hackers mailing list. I have no idea
what the problem could be.


I wonder whether the problem is not windows versus non windows but
original database versus copies.  If it is a GIN bug it seems quite
possible that it would depend on the order of insertion of the index
entries, which a simple dump-and-reload probably wouldn't duplicate.

If you were working from a dump it'd be easy to try creating the index
before populating the table to see if the bug can be reproduced then,
but there's no certainty that would provoke the bug.

The rest of us have not seen the dump data, so we have no hope of
doing anything with this report anyway.

regards, tom lane




Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: o...@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Incorrect FTS result with GIN index

2010-07-24 Thread Oleg Bartunov

Artur,

you could get many more problems in the future. The full text search problem may be a
signature of a more general problem with your postgres setup, so I'd recommend
finding the source of the problem.


Oleg
On Tue, 20 Jul 2010, Artur Dabrowski wrote:



Oleg,

thanks for your help.

I sent a post to pg-hackers list:
http://old.nabble.com/Query-results-differ-depending-on-operating-system-%28using-GIN%29-ts29213082.html

As to compiling pg... I will not do this, since I do not really feel
comfortable doing it and cannot dedicate too much time to this problem.

Artur



Oleg Bartunov wrote:


Artur,

I recommend posting your problem to the -hackers mailing list. I have no idea
what the problem could be.

My machine is:
uname -a
Linux mira 2.6.33-020633-generic #020633 SMP Thu Feb 25 10:10:03 UTC 2010
x86_64 GNU/Linux

PostgreSQL 8.4.4 on x86_64-unknown-linux-gnu, compiled by GCC gcc (Ubuntu
4.4.1-4ubuntu9) 4.4.1, 64-bit

As a last resort, I recommend compiling pg yourself and seeing if the
problem persists.

Oleg


On Tue, 20 Jul 2010, Artur Dabrowski wrote:



I tested the same backup on our CentOS 5.4 virtual machine (running on a Xen
server) and the results are really weird (118 rows, compared to 116 on Win XP
and the expected 123):

Aggregate  (cost=104.00..104.01 rows=1 width=0) (actual time=120.373..120.374 rows=1 loops=1)
  ->  Bitmap Heap Scan on search_tab  (cost=5.35..103.93 rows=25 width=0) (actual time=59.418..120.137 rows=118 loops=1)
        Recheck Cond: ((to_tsvector('german'::regconfig, keywords) @@ '''ee'':*'::tsquery) AND (to_tsvector('german'::regconfig, keywords) @@ '''dd'':*'::tsquery))
        ->  Bitmap Index Scan on idx_keywords_ger  (cost=0.00..5.34 rows=25 width=0) (actual time=59.229..59.229 rows=495 loops=1)
              Index Cond: ((to_tsvector('german'::regconfig, keywords) @@ '''ee'':*'::tsquery) AND (to_tsvector('german'::regconfig, keywords) @@ '''dd'':*'::tsquery))
Total runtime: 120.670 ms
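
The query itself is not shown, but reconstructed from the plan's index conditions it is roughly this (a sketch; the count(*) is inferred from the Aggregate node):

SELECT count(*)
  FROM search_tab
 WHERE to_tsvector('german', keywords) @@ to_tsquery('german', 'ee:*')
   AND to_tsvector('german', keywords) @@ to_tsquery('german', 'dd:*');

A quick way to separate index corruption from stemmer or locale differences is to force a non-index plan, e.g. SET enable_bitmapscan = off; if rerunning the count then returns a different number, the GIN index rather than the text search configuration is the suspect.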

And here are the configuration details:

PostgreSQL:
postgresql84-server-8.4.4-1.el5_5.1

# uname -r
2.6.18-164.15.1.el5xen

# cat /etc/redhat-release
CentOS release 5.4 (Final)

# cat /proc/cpuinfo
processor   : 0
vendor_id   : GenuineIntel
cpu family  : 6
model   : 15
model name  : Intel(R) Xeon(R) CPU5140  @ 2.33GHz
stepping: 6
cpu MHz : 2333.416
cache size  : 4096 KB
physical id : 0
siblings: 1
core id : 0
cpu cores   : 1
fpu : yes
fpu_exception   : yes
cpuid level : 10
wp  : yes
flags   : fpu de tsc msr pae cx8 apic sep cmov pat clflush acpi
mmx
fxsr sse sse2 ss ht syscall lm constant_tsc pni cx16 lahf_lm
bogomips: 5835.83
clflush size: 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:







Oleg Bartunov wrote:


Artur,

I don't know, but could you try a Linux machine?

Oleg






Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: o...@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: o...@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83



Re: [GENERAL] Clarification of the simple dictionary

2010-07-22 Thread Oleg Bartunov

Don't guess, but read the docs:
http://www.postgresql.org/docs/8.4/interactive/textsearch-dictionaries.html#TEXTSEARCH-SIMPLE-DICTIONARY

12.6.2. Simple Dictionary

The simple dictionary template operates by converting the input token to lower 
case and checking it against a file of stop words. If it is found in the file 
then an empty array is returned, causing the token to be discarded. If not, the 
lower-cased form of the word is returned as the normalized lexeme. 
Alternatively, the dictionary can be configured to report non-stop-words as 
unrecognized, allowing them to be passed on to the next dictionary in the list.

d=# \dFd+ simple
                              List of text search dictionaries
   Schema   |  Name  |     Template      | Init options |                        Description
------------+--------+-------------------+--------------+------------------------------------------------------------
 pg_catalog | simple | pg_catalog.simple |              | simple dictionary: just lower case and check for stopword

By default it has no Init options, so it doesn't check for stopwords.
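
This is easy to verify directly:

d=# SELECT ts_lexize('simple', 'The');
 ts_lexize
-----------
 {the}
(1 row)

d=# SELECT to_tsvector('simple', 'The THE the');
 to_tsvector
-------------
 'the':1,2,3
(1 row)

A stopword would come back from ts_lexize() as an empty array {}; here the token is merely lower-cased. Note that folding repeated words into one lexeme with several positions is done by to_tsvector(), not by the dictionary.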

On Thu, 22 Jul 2010, Andreas Joseph Krogh wrote:


On 07/22/2010 06:27 PM, John Gage wrote:
The easiest way to look at this is to give the simple dictionary a document 
with to_tsvector() and see if stopwords pop out.


In my experience they do. The simple dictionary just breaks the document down into words separated by spaces and the like; it doesn't analyze further.


That's my experience too; I just want to make sure it doesn't actually have any stopwords that I've missed. Trying many phrases and checking for stopwords doesn't really prove anything.


Can anybody confirm the simple dict. only lowercases the words and 
uniques them?





Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: o...@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83



Re: [GENERAL] Clarification of the simple dictionary

2010-07-22 Thread Oleg Bartunov

Andreas,

I'd create my own copy of the dictionary, to be independent of system changes.
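
A minimal sketch of such a copy (my_simple and my_config are made-up names):

CREATE TEXT SEARCH DICTIONARY my_simple (
    TEMPLATE = pg_catalog.simple
);

ALTER TEXT SEARCH CONFIGURATION my_config
    ALTER MAPPING FOR asciiword, word WITH my_simple;

With no init options the copy lower-cases and does nothing else, exactly like today's built-in simple, and it keeps that behaviour even if the system defaults ever change.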

Oleg
On Thu, 22 Jul 2010, Andreas Joseph Krogh wrote:


On 07/22/2010 07:44 PM, Oleg Bartunov wrote:

Don't guess, but read the docs:
http://www.postgresql.org/docs/8.4/interactive/textsearch-dictionaries.html#TEXTSEARCH-SIMPLE-DICTIONARY 


12.6.2. Simple Dictionary

The simple dictionary template operates by converting the input token to 
lower case and checking it against a file of stop words. If it is found in 
the file then an empty array is returned, causing the token to be 
discarded. If not, the lower-cased form of the word is returned as the 
normalized lexeme. Alternatively, the dictionary can be configured to 
report non-stop-words as unrecognized, allowing them to be passed on to the 
next dictionary in the list.


d=# \dFd+ simple
                              List of text search dictionaries
   Schema   |  Name  |     Template      | Init options |                        Description
------------+--------+-------------------+--------------+------------------------------------------------------------
 pg_catalog | simple | pg_catalog.simple |              | simple dictionary: just lower case and check for stopword


By default it has no Init options, so it doesn't check for stopwords.


Guess what - I *have* read the docs, which say ...and checking it against a 
file of stop words. What was unclear to me was whether or not it is configured 
with a stopwords file by default; I understand from your reply that it is not. 
Very good, fits my needs like a glove :-) It might be worth updating the docs 
to make this clearer?


So - can we rely on simple to remain this way forever (no Init options) or 
is it better to make a copy of it with the same properties as today?


It seems simple + the unaccent dict. available in 9.0 saves my day, thanks 
Mr. Bartunov.
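
For reference, a sketch of that combination, assuming the unaccent contrib module is installed and reusing the made-up names from above (unaccent is a filtering dictionary, so it strips accents and hands the token on to the next dictionary in the chain):

ALTER TEXT SEARCH CONFIGURATION my_config
    ALTER MAPPING FOR asciiword, word, hword, hword_part
    WITH unaccent, my_simple;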





Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: o...@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

  1   2   3   4   5   >