Re: [GENERAL] DISTINCT vs GROUP BY - was Re: is (not) distinct from

2017-03-03 Thread George Neuner

Hi David,

On Sat, 4 Mar 2017 02:32:48 +1300, David Rowley
 wrote:

>On 3 March 2017 at 18:26, George Neuner  wrote:
>> I know most people here don't pay much - or any - attention to
>> SQLServer, however there was an interesting article recently regarding
>> significant performance differences between DISTINCT and GROUP BY as
>> used to remove duplicates.
>>
>> https://sqlperformance.com/2017/01/t-sql-queries/surprises-assumptions-group-by-distinct
>>
>>
>> Now I'm wondering if something similar might be lurking in Postgresql?
>
>Yes things lurk there in PostgreSQL too. But to be honest I find the
>examples in the URL you included a bit strange. There's almost
>certainly going to be a table called "orders" that you'd use for the
>outer part of the query. In that case the orderid would already be
>unique.  To do the same in PostgreSQL you'd just use: select orderid,
>string_agg(description,'|') from orderitems group by orderid; assuming
>all orders had at least one line, you'd get the same result.


The author mentioned at the beginning that the simple queries:

  SELECT DISTINCT Description FROM Sales.OrderLines;
  SELECT Description FROM Sales.OrderLines GROUP BY Description;

wouldn't display the subject behavior.

Of course, analyzing the far more complex queries is much harder.  That
raises the question: what actually is going on there?


But I don't use SQL Server ... my interest is in how PostgreSQL deals
with a similar situation.


>Assuming that parts.partcode is the PRIMARY KEY of parts, this query
>is legal in PostgreSQL. In some other databases, and I believe SQL
>Server might be one of them, you would have been forced to include
>part.description in the GROUP BY clause. Since PostgreSQL 9.6, if
>you'd have done the same with that, internally the database would
>ignore the parts.description in the GROUP BY clause, as it's smart
>enough to know that including parts.description in the clause is not
>going to change anything as the description is always the same for
>each parts.partcode, and no two records can share the same partcode.
>
>There's no such optimisation when it comes to DISTINCT. In PostgreSQL
>as of today DISTINCT is a bit naive, and will just uniquify the
>results on each column in the select clause.  Although quite possibly
>the same optimisation could apply to DISTINCT too, just nobody has
>thought to add it yet.
>
>In short, the main difference is going to be the fewer columns you're
>using to identify the groups the better. If you included all of the
>columns in the GROUP BY clause as you put in the select list with the
>DISTINCT query then in most cases the performance would be the same. I
>believe the only exception to this is in regards to parallel query, as
>currently only GROUP BYs may be parallelised, not DISTINCT.
>
>Historically with older, now unsupported, versions of PostgreSQL (pre
>8.4) you may have also preferred to use GROUP BY over DISTINCT as
>GROUP BY could be implemented internally by sorting or hashing the
>results, whereas DISTINCT used to only be implemented by sorting the
>results. Although this has long since ceased to be the case.


I often have occasion to use multiple mapping relations: e.g., 
  A{1}->B{N}
  C{1}->B{N}
together in a query where C is provided and I need to find the
corresponding A(s).  Frequently these queries result in the same A
being found multiple times.

Although the mapping tuples are small [usually just a pair of keys],
the number of rows in the mapping tables may be very large, and a
given query may need to join/work its way through several such
mappings.

Typically in such situations, I divide the query using CTEs and (try
to) minimize the volume of data at each step by filtering duplicates
from any results that might include them.

I have always used DISTINCT to filter duplication, reserving GROUP BY
for aggregations (counting, etc.).  But if I understand correctly, you
are saying that GROUP BY should be preferred even for the simpler use.
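
For concreteness, a minimal sketch of that pattern against hypothetical
mapping tables a_b(a_id, b_id) and c_b(c_id, b_id), deduplicating both
ways:

  -- dedup with DISTINCT
  with candidates as (
    select distinct ab.a_id
    from c_b cb
    join a_b ab on ab.b_id = cb.b_id
    where cb.c_id = 42
  )
  select a_id from candidates;

  -- same query, dedup with GROUP BY
  with candidates as (
    select ab.a_id
    from c_b cb
    join a_b ab on ab.b_id = cb.b_id
    where cb.c_id = 42
    group by ab.a_id
  )
  select a_id from candidates;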


George





Re: [GENERAL] Querying JSON Lists

2017-03-03 Thread Adrian Klaver

On 03/03/2017 11:50 AM, Sven R. Kunze wrote:

On 03.03.2017 16:05, Adrian Klaver wrote:

https://www.postgresql.org/docs/9.6/static/functions-json.html

As to why it works on JSON arrays:

Table 9-43. Additional jsonb Operators
"
? text Does the string exist as a top-level key within the
JSON value?
"

So to be picky, it does not call out JSON object; it says JSON value.
And right above the table:

" For a full description of jsonb containment and existence semantics,
see Section 8.14.3. Section 8.14.4 describes how these operators can
be used to effectively index jsonb data."

As to how that behavior was decided on I have no idea, it just is.


I think it would even be possible to add the integer-variant of the ?
operator.

Something I learned just now: integers cannot be object keys in JSON.
On the flip side, they can be array elements. So, I can see a certain
logic in this, given the ambiguity of integers.


Python differs here from PostgreSQL:


>>> json.dumps({4: '34'})
'{"4": "34"}'


# select '{4:4}'::jsonb;

ERROR:  invalid input syntax for type json
LINE 1: select '{4:4}'::jsonb;
   ^
DETAIL:  Expected string or "}", but found "4".
CONTEXT:  JSON data, line 1: {4...


Python wraps it up, PostgreSQL fails loudly. Not that PostgreSQL is


With the caveat:

https://docs.python.org/3/library/json.html#py-to-json-table
"
Note

Keys in key/value pairs of JSON are always of the type str. When a 
dictionary is converted into JSON, all the keys of the dictionary are 
coerced to strings. As a result of this, if a dictionary is converted 
into JSON and then back into a dictionary, the dictionary may not equal 
the original one. That is, loads(dumps(x)) != x if x has non-string keys.

"

I know because it's bitten me.

I use Python and I get a lot done with it, but it has its 
inconsistencies also:


In [11]: d = {1: 'one', 2: 'two'}

In [12]: dict(**d)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-12-...> in <module>()
----> 1 dict(**d)

TypeError: keyword arguments must be strings

In [13]: d = {'1': 'one', '2': 'two'}

In [14]: dict(**d)
Out[14]: {'1': 'one', '2': 'two'}





confused by Python, but it's weird when the ? operator works on both keys
and arrays with text but works only on arrays with integers. So, I guess
no support for integers for now might have seemed like a good idea.


Though there looks to be some implicit casting going on:

test=> select '["12","34","45"]'::jsonb @> '"12"'::text;
ERROR:  operator does not exist: jsonb @> text
LINE 1: select '["12","34","45"]'::jsonb @> '"12"'::text;

to get '"12"' to be '"12"'::jsonb.

As to why, I don't know.


This makes sense to me at least, as we test structural json containment.
So, testing json to be contained by other json requires it to be json. :)

The confusing fact is that one can omit the array brackets in the case of
a single primitive value. Don't get me wrong, I don't complain, as it's a
usability feature. However, I didn't expect it to be there in the first
place, and adding some brackets wouldn't hurt IMO. I'd rather consider
brackets a readability aid, signaling "this is json".


It can be there if you want to maintain readability in your code:

test=> select '["12","34","45"]'::jsonb @> '["12"]';
 ?column?
--
 t

or you can use an explicit cast:

test=> select '["12","34","45"]'::jsonb @> '"12"'::jsonb;
 ?column?
--
 t





Sven



--
Adrian Klaver
adrian.kla...@aklaver.com




Re: [GENERAL] Querying JSON Lists

2017-03-03 Thread Sven R. Kunze

On 03.03.2017 16:05, Adrian Klaver wrote:

https://www.postgresql.org/docs/9.6/static/functions-json.html

As to why it works on JSON arrays:

Table 9-43. Additional jsonb Operators
"
? text Does the string exist as a top-level key within the 
JSON value?

"

So to be picky, it does not call out JSON object; it says JSON value.
And right above the table:


" For a full description of jsonb containment and existence semantics, 
see Section 8.14.3. Section 8.14.4 describes how these operators can 
be used to effectively index jsonb data."


As to how that behavior was decided on I have no idea, it just is.


I think it would even be possible to add the integer-variant of the ? 
operator.


Something I learned just now: integers cannot be object keys in JSON.
On the flip side, they can be array elements. So, I can see a certain
logic in this, given the ambiguity of integers.



Python differs here from PostgreSQL:

>>> json.dumps({4: '34'})
'{"4": "34"}'

# select '{4:4}'::jsonb;
ERROR:  invalid input syntax for type json
LINE 1: select '{4:4}'::jsonb;
   ^
DETAIL:  Expected string or "}", but found "4".
CONTEXT:  JSON data, line 1: {4...


Python wraps it up, PostgreSQL fails loudly. Not that PostgreSQL is
confused by Python, but it's weird when the ? operator works on both keys
and arrays with text but works only on arrays with integers. So, I guess
no support for integers for now might have seemed like a good idea.



Though there looks to be some implicit casting going on:

test=> select '["12","34","45"]'::jsonb @> '"12"'::text;

ERROR:  operator does not exist: jsonb @> text

LINE 1: select '["12","34","45"]'::jsonb @> '"12"'::text;

to get '"12"' to be '"12"'::jsonb.

As to why, I don't know.


This makes sense to me at least, as we test structural json containment. 
So, testing json to be contained by other json requires it to be json. :)


The confusing fact is that one can omit the array brackets in the case of
a single primitive value. Don't get me wrong, I don't complain, as it's a
usability feature. However, I didn't expect it to be there in the first
place, and adding some brackets wouldn't hurt IMO. I'd rather consider
brackets a readability aid, signaling "this is json".



Sven


Re: [GENERAL] odd optimizer result, index condition "is not null" on column defined as "not null"

2017-03-03 Thread Martin F

On 03/03/2017 17:33, Tom Lane wrote:

Martin F  writes:

The select with filter choose an IMHO better plan

Index Only Scan using tbl_foo_date on public.tbl_foo

But the bigger picture here, which would become more obvious if you were
working with a non-toy amount of data, is that you're asking the planner
to choose between two bad options.

I agree "better" was the wrong term. "different"

And yes they are both bad.

And in fact after some more research, I think I found 
https://wiki.postgresql.org/wiki/Index-only_scans#What_types_of_queries_may_be_satisfied_by_an_index-only_scan.3F
which explains why the aggregate-with-filter is potentially much worse 
(as it accesses more rows)


Let's see if my following assumptions are (closer to being) correct:

So the real missing feature here is the opposite of what I expected.
  select min(id) filter(where created_at >= '2017-01-15') from tbl_foo
is NOT rewritten to
  select id from tbl_foo where created_at >= '2017-01-15' and id is
  not null order by id limit 1

That is, the filter is currently not transformed into a WHERE clause.


On the other hand, looking at the explain of
 explain analyze verbose select min(id) filter(where  created_at >= 
'2017-01-15') from tbl_foo;

QUERY PLAN
---
 Aggregate  (cost=13.28..13.29 rows=1 width=16) (actual 
time=0.799..0.804 rows=1 loops=1)
   Output: min(id) FILTER (WHERE (created_at >= '2017-01-15 
00:00:00'::timestamp without time zone))
   ->  Index Only Scan using tbl_foo_id_date on public.tbl_foo 
(cost=0.14..13.00 rows=57 width=16) (actual time=0.024..0.437 rows=57 
loops=1)

 Output: created_at, id
 Heap Fetches: 57
 Planning time: 0.080 ms
 Execution time: 0.901 ms

57 heap fetches, so one for every row.

It seems that min() does a heap fetch for every row, even if the value
for that row is bigger than the current aggregated value. That is, the
heap fetch happens even if the value's visibility does not matter / the
value will be discarded anyway. (Of course that is because the function
cannot affect the scanner's decision whether a row is required or not.)


Are my above observations and conclusions correct, or am I missing
something crucial (apart from the fact that I am only looking at a tiny
fraction of reality)?




If you are concerned about the performance of this specific query shape,
what you actually want is an index on (id, created_at).  That allows
stopping at the first index entry satisfying the created_at condition,
knowing that it must have the min id value that does so.

Thanks indeed; taking into account the true nature of "index only", the
above is a good idea.
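
For the record, a sketch of the suggested index (presumably the
tbl_foo_id_date index used in the plan above):

  create index tbl_foo_id_date on tbl_foo using btree (id, created_at);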


regards
Martin




Re: [GENERAL] odd optimizer result, index condition "is not null" on column defined as "not null"

2017-03-03 Thread Martin F

On 03/03/2017 17:33, Tom Lane wrote:

Martin F  writes:

Index Cond: (tbl_foo.id IS NOT NULL)
only "id" is the pk, and declared "not null".
So why this index condition?

You're right that we could observe that the NOT NULL is implied by a table
constraint and drop it, but it seems unlikely to be worth the planner
cycles to do so; the condition doesn't cost much at runtime.  (Also,
plans that depend on table constraints for correctness have added
bookkeeping costs from tracking such dependency.)

Thanks for the explanation.

I looked at more examples and found yet another case.
The planner can choose an index where the index has the same condition
as (part of) the query conditions.


I added the 2 indexes
create index tbl_foo_ID_1 on tbl_foo using btree (id) where (id <>1);
create index tbl_foo_ID_null on tbl_foo using btree (id) where (id is 
not null);


and used the 2 queries (already transformed, so the first does not have 
the "not null")
explain analyze verbose select id from tbl_foo where created_at >= 
'2017-01-15' and id <>1 order by id limit 1;
explain analyze verbose select id from tbl_foo where created_at >= 
'2017-01-15' and id is not null order by id limit 1;


both choose the index with the matching condition
... Index Scan using tbl_foo_id_1
... Index Scan using tbl_foo_id_null

The "<> 1" condition does not appear in the plan (as it is covered by 
the chosen index)

But the "is not null condition is kept, why?

Yes, I understand it probably makes little difference in the end, but I
think it is somewhat curious.


This also happens if I change id to:
  id   bigserial
(that is, make it nullable, yet the index remains filtered to non-null
values only)
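
A sketch of that variant, assuming an ALTER is acceptable instead of
recreating the table:

  -- make id nullable; the partial index tbl_foo_ID_null still
  -- covers only the non-null values
  alter table tbl_foo alter column id drop not null;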





explain analyze verbose select id from tbl_foo where created_at >= 
'2017-01-15' and id <>1 order by id limit 1;

 QUERY PLAN
-
 Limit  (cost=0.14..0.45 rows=1 width=8) (actual time=0.039..0.044 
rows=1 loops=1)

   Output: id
   ->  Index Scan using tbl_foo_id_1 on public.tbl_foo 
(cost=0.14..13.26 rows=42 width=8) (actual time=0.026..0.026 rows=1 loops=1)

 Output: id
 Filter: (tbl_foo.created_at >= '2017-01-15 
00:00:00'::timestamp without time zone)

 Rows Removed by Filter: 13
 Planning time: 0.162 ms
 Execution time: 0.087 ms
(8 rows)

explain analyze verbose select id from tbl_foo where created_at >= 
'2017-01-15' and id is not null order by id limit 1;

   QUERY PLAN

 Limit  (cost=0.14..0.45 rows=1 width=8) (actual time=0.042..0.047 
rows=1 loops=1)

   Output: id
   ->  Index Scan using tbl_foo_id_null on public.tbl_foo 
(cost=0.14..13.28 rows=43 width=8) (actual time=0.029..0.029 rows=1 loops=1)

 Output: id
 Index Cond: (tbl_foo.id IS NOT NULL)
 Filter: (tbl_foo.created_at >= '2017-01-15 
00:00:00'::timestamp without time zone)

 Rows Removed by Filter: 14
 Planning time: 0.129 ms
 Execution time: 0.096 ms









Re: [GENERAL] Understanding pg_last_xlog_receive_location

2017-03-03 Thread Zach Walton
Thanks Michael, that was indeed the issue. We have a very complex wrapper
application that walks the server through multiple state transitions, and
it turned out that in the state I was running the query from, streaming
replication wasn't configured.
On Wed, Mar 1, 2017 at 4:36 PM Michael Paquier 
wrote:

> On Thu, Mar 2, 2017 at 5:53 AM, Zach Walton  wrote:
> > I was able to test 9.4.11 and am seeing the same behavior:
> >
> > postgres=# SELECT pg_is_in_recovery(), pg_last_xlog_receive_location(),
> > pg_last_xlog_replay_location();
> >  pg_is_in_recovery | pg_last_xlog_receive_location |
> > pg_last_xlog_replay_location
> >
> ---+---+--
> >  t |   | 0/3000198
>
> Okay, you said that you are using here streaming replication, but the
> standby you are performing this query on seems just to be a hot
> standby recovering WAL from a WAL archive, not via streaming. I would
> bet that there is no WAL receiver running.
> pg_last_xlog_receive_location() gets the last WAL position received
> from a streaming node, something that is set to NULL if there is no
> streaming happening, while pg_last_xlog_replay_location() is set by
> the startup process when replaying WAL records.
>
> Again I see no bugs here, you should check if a WAL receiver is
> running on this standby server.
> --
> Michael
>
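
For anyone else hitting this: on 9.6 and later the WAL receiver can also
be checked from SQL (a sketch; this view does not exist on the 9.4 server
discussed above, where checking the process list is the way to go):

  -- an empty result means no WAL receiver, i.e. no streaming
  select pid, status, received_lsn from pg_stat_wal_receiver;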




Re: [GENERAL] odd optimizer result, index condition "is not null" on column defined as "not null"

2017-03-03 Thread Tom Lane
Martin F  writes:
>Index Cond: (tbl_foo.id IS NOT NULL)
> only "id" is the pk, and declared "not null".
> So why this index condition?

The IS NOT NULL condition is generated as part of transforming a "min(x)"
query into an indexscan, on the basis that "select min(x) from ..." is
equivalent to "select x from ... where x is not null order by x limit 1".
Without the NOT NULL check, that's an incorrect transformation.
You're right that we could observe that the NOT NULL is implied by a table
constraint and drop it, but it seems unlikely to be worth the planner
cycles to do so; the condition doesn't cost much at runtime.  (Also,
plans that depend on table constraints for correctness have added
bookkeeping costs from tracking such dependency.)
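
Spelled out against the example table, the transformation is (a sketch):

  -- what the planner effectively rewrites
  select min(id) from tbl_foo;
  -- into (hence the IS NOT NULL index condition):
  select id from tbl_foo where id is not null order by id limit 1;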

> The select with filter choose an IMHO better plan
>> Index Only Scan using tbl_foo_date on public.tbl_foo

[ shrug... ]  Can't get too excited about that.  The first rule of working
with the Postgres planner is that planning results on toy tables do not
scale to large tables; too many of the cost factors are nonlinear.
But the bigger picture here, which would become more obvious if you were
working with a non-toy amount of data, is that you're asking the planner
to choose between two bad options.  Basically it can either scan the data
in id order (using the pkey index) and stop when it hits the first row
satisfying the created_at condition (which might be arbitrarily far in);
or it can scan all the data satisfying the created_at condition (possibly
using the other index to optimize that) and remember the smallest id seen
while doing so.  Your query with the aggregate FILTER condition is a
further-pessimized version of the second approach, because it has to scan
the *whole* index; a per-aggregate FILTER doesn't get applied as an index
condition, but only while executing that aggregate.

If you are concerned about the performance of this specific query shape,
what you actually want is an index on (id, created_at).  That allows
stopping at the first index entry satisfying the created_at condition,
knowing that it must have the min id value that does so.

On a toy table the performance of any of these variants is going to be so
close that it's not certain which one the planner will pick (and it will
hardly matter anyway).  On a large table the correctly-chosen index will
make an enormous difference.

regards, tom lane




Re: [GENERAL] odd optimizer result, index condition "is not null" on column defined as "not null"

2017-03-03 Thread Martin F

Hi,
following up my own post:

I noted that I included the "set enable_seqscan=off; ". But the results 
I mentioned are from before this statement.


I also compared some more statements

explain analyze verbose select min(id) from tbl_foo where created_at >= 
'2017-01-15'  ;
explain analyze verbose select id  from tbl_foo where  created_at >= 
'2017-01-15'  order by id limit 1;

Those 2 are the same, but the 2nd skips the "not null" index condition.


explain analyze verbose select min(id) filter(where  created_at >= 
'2017-01-15') from tbl_foo;
explain analyze verbose select min(id) filter(where  created_at >= 
'2017-01-15') from tbl_foo where  created_at >= '2017-01-15';


They should also be considered the same, as the "where" only removes
rows that are skipped by the "filter" anyway.

It seems the filter changes the plan to the other index.
But adding the where reduces the number of rows scanned on this index.


This has nothing to do with the original question of the "is not null"
condition on the "not null" field.
But it seems that, if "created_at" is only in the where part, the
optimizer does not consider using "created_at" from the index (and doing
an index only scan).
If "created_at" is in the select part, then the optimizer considers the
"index only scan" (and even uses it for the "where" part).


To check this I tried
explain analyze verbose select min(created_at), min(id) filter(where 
created_at >= '2017-01-15') from tbl_foo;

and it gives an index only as well.

Out of interest, anyone with 9.6.2, does it yield the same results?



On 03/03/2017 16:41, Martin F wrote:


CREATE TABLE if not exists tbl_foo(
  id  bigserial NOT NULL primary key,
  created_at  timestamp without time zone NOT NULL
);
create index tbl_foo_date on tbl_foo using btree (created_at,id);

insert into tbl_foo (created_at) values ('2017-01-01'),
('2017-01-02'), ('2017-01-03'), ('2017-01-04'), ('2017-01-05'),
('2017-01-06'), ('2017-01-07'), ('2017-01-08'), ('2017-01-09'),
('2017-01-10'), ('2017-01-11'), ('2017-01-12'), ('2017-01-13'),
('2017-01-14'), ('2017-01-15'), ('2017-01-16'), ('2017-01-17'),
('2017-01-18'), ('2017-01-19'), ('2017-01-20'), ('2017-01-21'),
('2017-01-22'), ('2017-01-23'), ('2017-01-24'), ('2017-01-25'),
('2017-01-26'), ('2017-01-27'), ('2017-01-28'), ('2017-01-29'),
('2017-02-02'), ('2017-02-02'), ('2017-02-03'), ('2017-02-04'),
('2017-02-05'), ('2017-02-06'), ('2017-02-07'), ('2017-02-08'),
('2017-02-09'), ('2017-02-10'), ('2017-02-11'), ('2017-02-12'),
('2017-02-13'), ('2017-02-14'), ('2017-02-15'), ('2017-02-16'),
('2017-02-17'), ('2017-02-18'), ('2017-02-19'), ('2017-02-20'),
('2017-02-21'), ('2017-02-22'), ('2017-02-23'), ('2017-02-24'),
('2017-02-25'), ('2017-02-26'), ('2017-02-27'), ('2017-02-28');
analyze tbl_foo;

explain analyze verbose select min(id) from tbl_foo where created_at
>= '2017-01-15';
explain analyze verbose select min(id) filter(where  created_at >=
'2017-01-15') from tbl_foo;

set enable_seqscan=off;
explain analyze verbose select min(id) from tbl_foo where created_at
>= '2017-01-15';
explain analyze verbose select min(id) filter(where  created_at >=
'2017-01-15') from tbl_foo;

drop TABLE  tbl_foo;









[GENERAL] odd optimizer result, index condition "is not null" on column defined as "not null"

2017-03-03 Thread Martin F

Hi.

I am new, and not sure which mailing list this should go to, so I start
with the general list. (Please advise if I should send this to a more
specific list.)
This is tested with PostgreSQL 9.5.5. (Maybe someone can confirm if it
is the same with later versions, saving me the work of upgrading right
now, thanks.)


See the SQL below. The select without "filter" produces:
 Result  (cost=0.45..0.46 rows=1 width=0) (actual time=0.229..0.234 
rows=1 loops=1)

   Output: $0
   InitPlan 1 (returns $0)
 ->  Limit  (cost=0.14..0.45 rows=1 width=8) (actual 
time=0.161..0.166 rows=1 loops=1)

   Output: tbl_foo.id
   ->  Index Scan using tbl_foo_pkey on public.tbl_foo 
(cost=0.14..13.28 rows=43 width=8) (actual time=0.045..0.045 rows=1 
loops=1)

 Output: tbl_foo.id
 Index Cond: (tbl_foo.id IS NOT NULL)
 Filter: (tbl_foo.created_at >= '2017-01-15 
00:00:00'::timestamp without time zone)

 Rows Removed by Filter: 14
 Planning time: 1.792 ms
 Execution time: 0.273 ms



  Index Cond: (tbl_foo.id IS NOT NULL)
only "id" is the pk, and declared "not null".
So why this index condition?

The select with filter chooses an IMHO better plan

Index Only Scan using tbl_foo_date on public.tbl_foo


Should the first optimizer result be considered a bug? Should it be 
reported somewhere?



CREATE TABLE if not exists tbl_foo(
  id  bigserial NOT NULL primary key,
  created_at  timestamp without time zone NOT NULL
);
create index tbl_foo_date on tbl_foo using btree (created_at,id);

insert into tbl_foo (created_at) values ('2017-01-01'), ('2017-01-02'), 
('2017-01-03'), ('2017-01-04'), ('2017-01-05'), ('2017-01-06'), 
('2017-01-07'), ('2017-01-08'), ('2017-01-09'), ('2017-01-10'), 
('2017-01-11'), ('2017-01-12'), ('2017-01-13'), ('2017-01-14'), 
('2017-01-15'), ('2017-01-16'), ('2017-01-17'), ('2017-01-18'), 
('2017-01-19'), ('2017-01-20'), ('2017-01-21'), ('2017-01-22'), 
('2017-01-23'), ('2017-01-24'), ('2017-01-25'), ('2017-01-26'), 
('2017-01-27'), ('2017-01-28'), ('2017-01-29'), ('2017-02-02'), 
('2017-02-02'), ('2017-02-03'), ('2017-02-04'), ('2017-02-05'), 
('2017-02-06'), ('2017-02-07'), ('2017-02-08'), ('2017-02-09'), 
('2017-02-10'), ('2017-02-11'), ('2017-02-12'), ('2017-02-13'), 
('2017-02-14'), ('2017-02-15'), ('2017-02-16'), ('2017-02-17'), 
('2017-02-18'), ('2017-02-19'), ('2017-02-20'), ('2017-02-21'), 
('2017-02-22'), ('2017-02-23'), ('2017-02-24'), ('2017-02-25'), 
('2017-02-26'), ('2017-02-27'), ('2017-02-28');

analyze tbl_foo;

explain analyze verbose select min(id) from tbl_foo where created_at >= 
'2017-01-15';
explain analyze verbose select min(id) filter(where  created_at >= 
'2017-01-15') from tbl_foo;


set enable_seqscan=off;
explain analyze verbose select min(id) from tbl_foo where created_at >= 
'2017-01-15';
explain analyze verbose select min(id) filter(where  created_at >= 
'2017-01-15') from tbl_foo;


drop TABLE  tbl_foo;





Re: [GENERAL] PortalSuspended

2017-03-03 Thread Tom Lane
Rui Pacheco  writes:
> Is there a way to force the backend to send a PortalSuspended message to
> the front-end?

In your Execute message, request fewer rows than you know the portal will
produce.

regards, tom lane




Re: [GENERAL] Querying JSON Lists

2017-03-03 Thread Adrian Klaver

On 03/02/2017 01:09 PM, Sven R. Kunze wrote:



On 28.02.2017 17:33, Adrian Klaver wrote:

On 02/26/2017 03:26 AM, Sven R. Kunze wrote:

Hello everyone,

playing around with jsonb and coming from this SO question
http://stackoverflow.com/questions/19925641/check-if-a-postgres-json-array-contains-a-string

I wonder why PostgreSQL behaves differently for text and integers on the
? and @> operators.


Let's have a look at 4 different but similar queries:

-- A) ? + text
select '{"food": ["12","34","45"]}'::jsonb->'food' ? '12';
 ?column?
--
 t

-- B) ? + integer
select '{"food": [12,34,45]}'::jsonb->'food' ? 12;
ERROR:  operator does not exist: jsonb ? integer
LINE 1: select '{"food": [12,34,45]}'::jsonb->'food' ? 12;
 ^
HINT:  No operator matches the given name and argument type(s). You
might need to add explicit type casts.


https://www.postgresql.org/docs/9.6/static/datatype-json.html#JSON-CONTAINMENT


"jsonb also has an existence operator, which is a variation on the
theme of containment: it tests whether a string (given as a text
value) appears as an object key or array element at the top level of
the jsonb value. These examples return true except as noted

-- String exists as array element:
SELECT '["foo", "bar", "baz"]'::jsonb ? 'bar';
"



-- C) @> + text
select '{"food": ["12","34","45"]}'::jsonb->'food' @> '["12"]',
'{"food": ["12","34","45"]}'::jsonb->'food' @> '"12"', '{"food":
["12","34","45"]}'::jsonb->'food' @> '12';
 ?column? | ?column? | ?column?
--+--+--
 t| t| f

-- D) @> + integer
select '{"food": [12,34,45]}'::jsonb->'food' @> '[12]', '{"food":
[12,34,45]}'::jsonb->'food' @> '12';--, '{"food":
[12,34,45]}'::jsonb->'food' @> 12;
 ?column? | ?column?
--+--
 t| t


Now my questions:

1) Why does A) work? Docs tells us that ? works for keys, not values.
2) Why does B) not work although A) works?
3) Why do the variants without the brackets on the right side of @> work
in C) and D)? Is there json data where their results differ from the
ones with the brackets?
4) What is the recommended way of testing inclusion in json lists?


I have not worked through your examples, but I suspect the answers
lie here:

https://www.postgresql.org/docs/9.6/static/datatype-json.html#JSON-CONTAINMENT


8.14.3. jsonb Containment and Existence


More details yes, but not really an explanation on the 'why'. Especially
not on 2)


https://www.postgresql.org/docs/9.6/static/datatype-json.html#JSON-CONTAINMENT

"-- Object containing pairs of keys and values
-- Note that object keys must always be quoted strings
SELECT '{"bar": "baz", "balance": 7.77, "active": false}'::json;"

Which I believe comes from:

https://tools.ietf.org/html/rfc7159

"4.  Objects

   An object structure is represented as a pair of curly brackets
   surrounding zero or more name/value pairs (or members).  A name is a
   string.  A single colon comes after each name, separating the name
   from the value.  A single comma separates a value from a following
   name.  The names within an object SHOULD be unique.

"

https://www.postgresql.org/docs/9.6/static/functions-json.html

As to why it works on JSON arrays:

Table 9-43. Additional jsonb Operators
"
?   textDoes the string exist as a top-level key within the JSON value?
"

So to be picky, it does not call out JSON object; it says JSON value. And
right above the table:


" For a full description of jsonb containment and existence semantics, 
see Section 8.14.3. Section 8.14.4 describes how these operators can be 
used to effectively index jsonb data."


As to how that behavior was decided on I have no idea, it just is.



 and 3). These feel like holes in the implementation.

As to this:

test=> select '["12","34","45"]'::jsonb @> '"12"';
 ?column?
--
 t

"
As a special exception to the general principle that the structures must 
match, an array may contain a primitive value:


-- This array contains the primitive string value:
SELECT '["foo", "bar"]'::jsonb @> '"bar"'::jsonb;

-- This exception is not reciprocal -- non-containment is reported here:
SELECT '"bar"'::jsonb @> '["bar"]'::jsonb;  -- yields false
"

Though there looks to be some implicit casting going on:

test=> select '["12","34","45"]'::jsonb @> '"12"'::text;
ERROR:  operator does not exist: jsonb @> text
LINE 1: select '["12","34","45"]'::jsonb @> '"12"'::text;

to get '"12"' to be '"12"'::jsonb.

As to why, I don't know.




Sven




--
Adrian Klaver
adrian.kla...@aklaver.com




Re: [GENERAL] Autoanalyze oddity

2017-03-03 Thread Adrian Klaver

On 03/03/2017 12:33 AM, Peter J. Holzer wrote:

This is with PostgreSQL 9.5.6 on Debian Linux.

I noticed that according to pg_stat_user_tables autoanalyze has never
run on a lot of tables. Here is one example:

wdsah=> select * from pg_stat_user_tables where schemaname='public' and 
relname='facttable_wds_indexstats';
─[ RECORD 1 ]───┬─
relid   │ 112723
schemaname  │ public
relname │ facttable_wds_indexstats
seq_scan│ 569
seq_tup_read│ 474779212
idx_scan│ 59184
idx_tup_fetch   │ 59184
n_tup_ins   │ 47128
n_tup_upd   │ 0
n_tup_del   │ 0
n_tup_hot_upd   │ 0
n_live_tup  │ 47128
n_dead_tup  │ 0
n_mod_since_analyze │ 47128
last_vacuum │ (∅)
last_autovacuum │ (∅)
last_analyze│ (∅)
last_autoanalyze│ (∅)
vacuum_count│ 0
autovacuum_count│ 0
analyze_count   │ 0
autoanalyze_count   │ 0

wdsah=> select count(*) from facttable_wds_indexstats;
 count
────────
 857992
(1 row)

So, n_live_tup is way off, and n_tup_ins and n_mod_since_analyze also
seem to be wrong. Looks like this hasn't been updated in a year or so.
But track_counts is on:

wdsah=> show track_counts;
 track_counts
──
 on
(1 row)


What are your settings for autovacuum?:

https://www.postgresql.org/docs/9.5/static/runtime-config-autovacuum.html

Have the storage parameters for the table been altered?:

https://www.postgresql.org/docs/9.5/static/sql-createtable.html#SQL-CREATETABLE-STORAGE-PARAMETERS



And even if it wasn't, shouldn't the autovacuum daemon notice that
n_mod_since_analyze is greater than n_live_tup *
autovacuum_analyze_scale_factor and run an autoanalyze?


That value is added to autovacuum_analyze_threshold:

autovacuum_analyze_scale_factor (floating point)

Specifies a fraction of the table size to add to 
autovacuum_analyze_threshold when deciding whether to trigger an 
ANALYZE. The default is 0.1 (10% of table size). This parameter can only 
be set in the postgresql.conf file or on the server command line; but 
the setting can be overridden for individual tables by changing table 
storage parameters.
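
With the defaults (autovacuum_analyze_threshold = 50, scale factor 0.1)
and the stale reltuples value above, the trigger point would be roughly
(a sketch of the arithmetic):

  -- analyze threshold = base threshold + scale factor * reltuples
  select 50 + 0.1 * 47128;  -- about 4763, far below n_mod_since_analyze

so, unless the settings were changed, an autoanalyze would indeed be
expected here.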




But the really weird thing is that pg_stats seems to be reasonably
current: I see entries in most_common_vals which were only inserted in
January. Is it possible that autoanalyze runs without updating
pg_stat_user_tables?

hp





--
Adrian Klaver
adrian.kla...@aklaver.com




Re: [GENERAL] Full Text Search combined with Fuzzy

2017-03-03 Thread Artur Zakirov

On 03.03.2017 16:17, Nicolas Paris wrote:

Nice! I do have version 9.6.

Could this kind of index handle more than 20M large texts? The
recheck condition looks resource-consuming.


You are right. I think pg_trgm will not be good for such large texts,
unfortunately.




Is the full text index + phrase search + synonym dictionary the only
other alternative to deal with typo-phrase mining?


I suppose there are no other options now. Though prefix search may
help you [1].
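
A minimal sketch of prefix matching with tsquery (the ':*' suffix marks
a prefix):

  select to_tsvector('english', 'hello world')
         @@ to_tsquery('english', 'hel:*');  -- t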




Is there any possibility of adding typo tolerance to the full text
search roadmap in the future?


As far as I know, there are no plans in the near future to add similarity
full text search.


1. 
https://www.postgresql.org/docs/current/static/textsearch-dictionaries.html


--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company




Re: [GENERAL] DISTINCT vs GROUP BY - was Re: is (not) distinct from

2017-03-03 Thread David Rowley
On 3 March 2017 at 18:26, George Neuner  wrote:
> I know most people here don't pay much - or any - attention to
> SQLServer, however there was an interesting article recently regarding
> significant performance differences between DISTINCT and GROUP BY as
> used to remove duplicates.
>
> https://sqlperformance.com/2017/01/t-sql-queries/surprises-assumptions-group-by-distinct
>
>
> Now I'm wondering if something similar might be lurking in Postgresql?

Yes things lurk there in PostgreSQL too. But to be honest I find the
examples in the URL you included a bit strange. There's almost
certainly going to be a table called "orders" that you'd use for the
outer part of the query. In that case the orderid would already be
unique.  To do the same in PostgreSQL you'd just use: select orderid,
string_agg(description,'|') from orderitems group by orderid; assuming
all orders had at least one line, you'd get the same result.

In more general terms, PostgreSQL will allow you to GROUP BY and select
non-aggregated columns which are functionally dependent on the GROUP
BY clause, for example:

SELECT parts.partcode,parts.description,sum(sales.quantity) from sales
inner join parts on sales.partcode = parts.partcode GROUP BY
parts.partcode;

Assuming that parts.partcode is the PRIMARY KEY of parts, this query
is legal in PostgreSQL. In some other databases, and I believe SQL
Server might be one of them, you would have been forced to include
part.description in the GROUP BY clause. Since PostgreSQL 9.6, if
you'd have done the same with that, internally the database would
ignore the parts.description in the GROUP BY clause, as it's smart
enough to know that including parts.description in the clause is not
going to change anything as the description is always the same for
each parts.partcode, and no two records can share the same partcode.

There's no such optimisation when it comes to DISTINCT. In PostgreSQL
as of today DISTINCT is a bit naive, and will just uniquify the
results on each column in the select clause.  Although quite possibly
the same optimisation could apply to DISTINCT too, just nobody has
thought to add it yet.

In short, the main difference is going to be the fewer columns you're
using to identify the groups the better. If you included all of the
columns in the GROUP BY clause as you put in the select list with the
DISTINCT query then in most cases the performance would be the same. I
believe the only exception to this is in regards to parallel query, as
currently only GROUP BYs may be parallelised, not DISTINCT.

Historically with older, now unsupported, versions of PostgreSQL (pre
8.4) you may have also preferred to use GROUP BY over DISTINCT as
GROUP BY could be implemented internally by sorting or hashing the
results, whereas DISTINCT used to only be implemented by sorting the
results. Although this has long since ceased to be the case.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [GENERAL] Full Text Search combined with Fuzzy

2017-03-03 Thread Nicolas Paris
On 3 March 2017 at 14:08, Artur Zakirov wrote:
> On 03.03.2017 15:49, Nicolas Paris wrote:
> >
> >Hi Oleg,
> >
> >Thanks. I thought pgtrgm was not able to index my long texts because of
> >limitation of 8191 bytes per index row for btree.
> >
> >Then I found out it is possible to use pgtrgm over a GIN/GIST index.
> >My final use case is phrase mining in texts.
> >
> >I want my application to return texts that contain approximately the user
> >entry:
> >
> >Eg: user search "Hello Word"
> >a text containing "blah blah blah hello world blah blah blah" would be
> >returned.
> >
> >Test:
> >postgres=# CREATE table test_trgm (texts text);
> >CREATE TABLE
> >postgres=# CREATE INDEX ON test_trgm USING GIN(texts gin_trgm_ops);
> >CREATE INDEX
> >postgres=# SET enable_seqscan = OFF;
> >SET
> >postgres=# insert into test_trgm VALUES ('blah blah blah hello world blah 
> >blah blah');
> >INSERT 0 1
> >postgres=# insert into test_trgm VALUES ('blah blah blah hello word blah 
> >blah blah');
> >INSERT 0 1
> >postgres=# SELECT texts, similarity(texts, 'hello word') FROM test_trgm 
> >WHERE texts % 'hello word';
> >   texts   | similarity
> >---+
> > blah blah blah hello world blah blah blah |   0.473684
> > blah blah blah hello word blah blah blah  | 0.6875
> >(2 rows)
> >
> >postgres=# EXPLAIN SELECT texts, similarity(texts, 'hello word') FROM 
> >test_trgm WHERE texts % 'hello word';
> >QUERY PLAN
> >---
> > Bitmap Heap Scan on test_trgm  (cost=52.01..56.03 rows=1 width=32)
> >   Recheck Cond: (texts % 'hello word'::text)
> >   ->  Bitmap Index Scan on test_trgm_texts_idx  (cost=0.00..52.01 rows=1 
> > width=0)
> > Index Cond: (texts % 'hello word'::text)
> >(4 rows)
> >
> >Conclusion: If I set 0.4 as my threshold, would this methodology meet
> >my requirements?
> >
> >Thanks for the help !
> >
> 
> Hello,
> 
> If you use PostgreSQL 9.6, then word_similarity() can help you [1]. For
> example:
> 
> postgres=# SELECT texts, word_similarity('hello word', texts) FROM test_trgm
> WHERE 'hello word' <% texts;
>texts   | word_similarity
> ---+-
>  blah blah blah hello world blah blah blah |0.818182
>  blah blah blah hello word blah blah blah  |   1
> (2 rows)
> 
> 1. https://www.postgresql.org/docs/9.6/static/pgtrgm.html
> 

Nice! I do have version 9.6.

Could this kind of index handle more than 20M large texts? The
recheck condition looks resource-consuming.

Is the full text index + phrase search + synonym dictionary the only
other alternative to deal with typo-phrase mining?

Is there any possibility of adding typo tolerance to the full text
search roadmap in the future?

Thanks,

> -- 
> Artur Zakirov
> Postgres Professional: http://www.postgrespro.com
> Russian Postgres Company




Re: [GENERAL] Full Text Search combined with Fuzzy

2017-03-03 Thread Artur Zakirov

On 03.03.2017 15:49, Nicolas Paris wrote:


Hi Oleg,

Thanks. I thought pgtrgm was not able to index my long texts because of
the limitation of 8191 bytes per index row for btree.

Then I found out it is possible to use pgtrgm over a GIN/GIST index.
My final use case is phrase mining in texts.

I want my application to return texts that contain approximately the user
entry:

Eg: user search "Hello Word"
a text containing "blah blah blah hello world blah blah blah" would be
returned.

Test:
postgres=# CREATE table test_trgm (texts text);
CREATE TABLE
postgres=# CREATE INDEX ON test_trgm USING GIN(texts gin_trgm_ops);
CREATE INDEX
postgres=# SET enable_seqscan = OFF;
SET
postgres=# insert into test_trgm VALUES ('blah blah blah hello world blah blah 
blah');
INSERT 0 1
postgres=# insert into test_trgm VALUES ('blah blah blah hello word blah blah 
blah');
INSERT 0 1
postgres=# SELECT texts, similarity(texts, 'hello word') FROM test_trgm WHERE 
texts % 'hello word';
   texts   | similarity
---+
 blah blah blah hello world blah blah blah |   0.473684
 blah blah blah hello word blah blah blah  | 0.6875
(2 rows)

postgres=# EXPLAIN SELECT texts, similarity(texts, 'hello word') FROM test_trgm 
WHERE texts % 'hello word';
QUERY PLAN
---
 Bitmap Heap Scan on test_trgm  (cost=52.01..56.03 rows=1 width=32)
   Recheck Cond: (texts % 'hello word'::text)
   ->  Bitmap Index Scan on test_trgm_texts_idx  (cost=0.00..52.01 rows=1 
width=0)
 Index Cond: (texts % 'hello word'::text)
(4 rows)

Conclusion: If I set 0.4 as my threshold, would this methodology meet
my requirements?

Thanks for the help !



Hello,

If you use PostgreSQL 9.6, then word_similarity() can help you [1]. For 
example:


postgres=# SELECT texts, word_similarity('hello word', texts) FROM 
test_trgm WHERE 'hello word' <% texts;

   texts   | word_similarity
---+-
 blah blah blah hello world blah blah blah |0.818182
 blah blah blah hello word blah blah blah  |   1
(2 rows)

1. https://www.postgresql.org/docs/9.6/static/pgtrgm.html
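
The <% operator uses its own threshold GUC, which can be lowered to
taste (a sketch; 0.6 is the 9.6 default):

  SET pg_trgm.word_similarity_threshold = 0.4;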

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company




Re: [GENERAL] ERROR: functions in index expression must be marked IMMUTABLE

2017-03-03 Thread Geoff Winkless
On 3 March 2017 at 12:17, Sven R. Kunze  wrote:

> On 03.03.2017 11:43, Geoff Winkless wrote:
>
> One alternative would be to make to_date accept all language variants of
> months simultaneously. A quick search of google suggests that there aren't
> any overlaps between languages (ie where one language uses "Foo" for March
> and another uses "Foo" for May), although you would have to do some more
> intense research to be sure. As far as I can see there's no other reason
> why to_date would need to be marked as stable/volatile, is there?
>
>
> it seems there are overlapping short months:
>

Trust the Balkan states to find something to disagree over :)

Oh well, that scrubs that idea then.

Geoff


Re: [GENERAL] Full Text Search combined with Fuzzy

2017-03-03 Thread Nicolas Paris
On 27 February 2017 at 10:32, Oleg Bartunov wrote:
> 
> 
> On Sun, Feb 26, 2017 at 3:52 PM, Nicolas Paris  wrote:
> 
> Hello,
> 
> AFAIK there is no built-in way to combine full text search and fuzzy
> matching
> (https://www.postgresql.org/docs/current/static/fuzzystrmatch.html).
> For example, phrase searching with typos in it.
> 
> First, I don't know if PostgreSQL competitors (Lucene-based...) are able
> to do so.
> 
> 
> Usually, https://www.postgresql.org/docs/current/static/pgtrgm.html is used 
> for
> this.
> 

Hi Oleg,

Thanks. I thought pgtrgm was not able to index my long texts because of
the limitation of 8191 bytes per index row for btree.

Then I found out it is possible to use pgtrgm over a GIN/GIST index.
My final use case is phrase mining in texts.

I want my application to return texts that contain approximately the user
entry:

Eg: user search "Hello Word"
a text containing "blah blah blah hello world blah blah blah" would be
returned.

Test:
postgres=# CREATE table test_trgm (texts text);
CREATE TABLE
postgres=# CREATE INDEX ON test_trgm USING GIN(texts gin_trgm_ops);
CREATE INDEX
postgres=# SET enable_seqscan = OFF;
SET
postgres=# insert into test_trgm VALUES ('blah blah blah hello world blah blah 
blah');
INSERT 0 1
postgres=# insert into test_trgm VALUES ('blah blah blah hello word blah blah 
blah');
INSERT 0 1
postgres=# SELECT texts, similarity(texts, 'hello word') FROM test_trgm WHERE 
texts % 'hello word';
   texts   | similarity 
---+
 blah blah blah hello world blah blah blah |   0.473684
 blah blah blah hello word blah blah blah  | 0.6875
(2 rows)

postgres=# EXPLAIN SELECT texts, similarity(texts, 'hello word') FROM test_trgm 
WHERE texts % 'hello word';
QUERY PLAN  
   
---
 Bitmap Heap Scan on test_trgm  (cost=52.01..56.03 rows=1 width=32)
   Recheck Cond: (texts % 'hello word'::text)
   ->  Bitmap Index Scan on test_trgm_texts_idx  (cost=0.00..52.01 rows=1 
width=0)
 Index Cond: (texts % 'hello word'::text)
(4 rows)

Conclusion: If I set 0.4 as my threshold, would this methodology meet
my requirements?

Thanks for the help !


> 
> Second, is such a feature on the road map?
> 
> Third, I wonder if it is a good idea to use the PostgreSQL synonyms
> feature for such a purpose. (https://www.postgresql.org/docs/current/static/
> textsearch-dictionaries.html)
> I mean, building up a synonyms dictionnary containing tipos. By eg:
> 
> postgres        pgsql
> postgresql      pgsql
> postgrez        pgsql
> postgre         pgsql
> gogle           googl
> gooogle         googl
> 
> There are multiple ways to build such a dictionary. But my question is about
> the implementation of dictionaries in PostgreSQL: is PostgreSQL
> supposed to take advantage of billion-entry dictionaries?
> 
> 
> a dictionary is just a program, so it's up to the developer how to write an
> efficient program to deal with billions of entries. As for the synonym
> dictionary specifically, it's not intended to work with a lot of entries.
> BTW, have a look at the contrib/dict_xsyn dictionary, which is more
> flexible than synonym.
> 
> 
> Thanks in advance for your answers,
>
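
For reference, a minimal sketch of a dict_xsyn setup (assuming the
extension is installed, and with 'my_rules' a hypothetical rules file in
the tsearch_data directory):

  CREATE EXTENSION dict_xsyn;
  -- the extension ships a sample 'xsyn' dictionary to configure
  ALTER TEXT SEARCH DICTIONARY xsyn (RULES='my_rules', KEEPORIG=false);
  SELECT ts_lexize('xsyn', 'word');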




Re: [GENERAL] ERROR: functions in index expression must be marked IMMUTABLE

2017-03-03 Thread Sven R. Kunze

On 03.03.2017 11:43, Geoff Winkless wrote:
One alternative would be to make to_date accept all language variants
of months simultaneously. A quick search of google suggests that there
aren't any overlaps between languages (ie where one language uses
"Foo" for March and another uses "Foo" for May), although you would
have to do some more intense research to be sure. As far as I can see
there's no other reason why to_date would need to be marked as
stable/volatile, is there?


As a side-note, it seems there are overlapping short months:


Using:
- this webpage: http://web.library.yale.edu/cataloging/months.htm
- this JS (Firefox): JSON.stringify([for (x of
  document.querySelectorAll('table.grid tr td')) x.textContent])
- this Python:

import pprint
from collections import defaultdict

def chunks(l, n):
    """Yield successive n-sized chunks from l."""
    for i in range(0, len(l), n):
        yield l[i:i + n]

a = [...]  # the month-name table pasted from the JS output above (elided)
a = list(chunks(a, 13))

dd = defaultdict(list)
for l in a:
    for i, m in enumerate(l):
        if i == 0:
            continue  # skip the language-name column
        dd[m].append((i, l[0]))

pprint.pprint(dict(dd))
{'\nmart\n': [(3, 'Serbian')],
 'Ag.': [(8, 'Indonesian')],
 'Agustos': [(8, 'Turkish')],
 'Apr': [(4, 'Latin')],
 'Apr.': [(4, 'English'),
  (4, 'German'),
  (4, 'Greek, Modern'),
  (4, 'Indonesian'),
  (4, 'Malaysian'),
  (4, 'Romanian')],
 'Aralik': [(12, 'Turkish')],
 'Aug.': [(8, 'English'),
  (8, 'German'),
  (8, 'Greek, Modern'),
  (8, 'Latin'),
  (8, 'Romanian')],
 'Awst': [(8, 'Welsh')],
 'Chwef.': [(2, 'Welsh')],
 'Dec.': [(12, 'English'), (12, 'Latin'), (12, 'Romanian')],
 'Dek.': [(12, 'Greek, Modern')],
 'Des.': [(12, 'Indonesian')],
 'Dez.': [(12, 'German')],
 'Dis.': [(12, 'Malaysian')],
 'Ebr.': [(4, 'Welsh')],
 'Ekim': [(10, 'Turkish')],
 'Eylul': [(9, 'Turkish')],
 'Feb.': [(2, 'English'), (2, 'German'), (2, 'Malaysian'), (2, 
'Romanian')],

 'Febr.': [(2, 'Latin')],
 'Gorff.': [(7, 'Welsh')],
 'Haziran': [(6, 'Turkish')],
 'Hyd.': [(10, 'Welsh')],
 'Ian.': [(1, 'Greek, Modern'), (1, 'Latin'), (1, 'Romanian')],
 'Ion.': [(1, 'Welsh')],
 'Ioul.': [(7, 'Greek, Modern')],
 'Ioun.': [(6, 'Greek, Modern')],
 'Iul.': [(7, 'Latin')],
 'Iulie': [(7, 'Romanian')],
 'Iun.': [(6, 'Latin')],
 'Iunie': [(6, 'Romanian')],
 'Jan.': [(1, 'English'), (1, 'Malaysian')],
 'Jan./Djan.\xc2\xa0': [(1, 'Indonesian')],
 'Jan./J\xc3\xa4n.': [(1, 'German')],
 'Julai': [(7, 'Malaysian')],
 'Juli': [(7, 'German')],
 'Juli/Djuli': [(7, 'Indonesian')],
 'July': [(7, 'English')],
 'Jun': [(6, 'Malaysian')],
 'June': [(6, 'English')],
 'Juni': [(6, 'German')],
 'Juni/Djuni': [(6, 'Indonesian')],
 'Kasim': [(11, 'Turkish')],
 'Mac': [(3, 'Malaysian')],
 'Mai': [(5, 'German'), (5, 'Romanian'), (5, 'Welsh')],
 'Mai.': [(5, 'Latin')],
 'Maios': [(5, 'Greek, Modern')],
 'Mar.': [(3, 'English'), (3, 'Romanian')],
 'Mart.': [(3, 'Greek, Modern'), (3, 'Latin')],
 'Maw.': [(3, 'Welsh')],
 'May': [(5, 'English')],
 'Mayis': [(5, 'Turkish')],
 'Medi': [(9, 'Welsh')],
 'Meh.': [(6, 'Welsh')],
 'Mei': [(5, 'Malaysian')],
 'Mei/Mai': [(5, 'Indonesian')],
 'Mrt.': [(3, 'Indonesian')],
 'M\xc3\xa4rz': [(3, 'German')],
 'Nisan': [(4, 'Turkish')],
 'Noem.': [(11, 'Greek, Modern')],
 'Noiem.': [(11, 'Romanian')],
 'Nop.': [(11, 'Indonesian')],
 'Nov.': [(11, 'English'), (11, 'German'), (11, 'Latin'), (11, 
'Malaysian')],

 'Ocak': [(1, 'Turkish')],
 'Oct.': [(10, 'English'), (10, 'Latin'), (10, 'Romanian')],
 'Og': [(8, 'Malaysian')],
 'Okt.': [(10, 'German'),
  (10, 'Greek, Modern'),
  (10, 'Indonesian'),
  (10, 'Malaysian')],
 'Peb.': [(2, 'Indonesian')],
 'Phevr.': [(2, 'Greek, Modern')],
 'Rhag.': [(12, 'Welsh')],
 'Saus.': [(1, 'Lithuanian')],
 'Sept.': [(9, 'English'),
   (9, 'German'),
   (9, 'Greek, Modern'),
   (9, 'Indonesian'),
   (9, 'Latin'),
   (9, 'Malaysian'),
   (9, 'Romanian')],
 'Subat': [(2, 'Turkish')],
 'Tach.': [(11, 'Welsh')],
 'Temmuz': [(7, 'Turkish')],
 'abr.': [(4, 'Spanish')],
 'abril': [(4, 'Portuguese')],
 'ag.': [(8, 'Italian')],
 'agosto': [(8, 'Portuguese'), (8, 'Spanish')],
 'ao\xc3\xbbt': [(8, 'French')],
 'apr.': [(4, 'Dutch'),
  (4, 'Estonian'),
  (4, 'Italian'),
  (4, 'Latvian'),
  (4, 'Russian')],
 'apr./mali traven': [(4, 'Slovenian')],
 'april': [(4, 'Bosnian'),
   (4, 'Bulgarian'),
   (4, 'Danish'),
   (4, 'Norwegian'),
   (4, 'Serbian'),
   (4, 'Swedish')],
 'aug.': [(8, 'Bosnian'),
  (8, 'Danish'),
  (8, 'Dutch'),
  (8, 'Estonian'),
  (8, 'Hungarian'),
  (8, 'Latvian'),
  (8, 'Norwegian'),
  (8, 'Swedish')],
 'avg.': [(8, 'Bulgarian'), (8, 'Russian'), (8, 'Serbian')],
 'avg./veliki\xc2\xa0srpan': [(8, 'Slovenian')],
 'avril': [(4, 'French')],
 'bal.': [(4, 'Lithuanian')],
 'ber.': [(3, 'Ukranian')],
 'birz': [(6, 'Lithuanian')],
 

[GENERAL] PortalSuspended

2017-03-03 Thread Rui Pacheco
Hello,

Is there a way to force the backend to send a PortalSuspended message to the
front-end? Perhaps by emulating a particular load, or a specific sequence of
SQL commands sent from the front-end?



Re: [GENERAL] ERROR: functions in index expression must be marked IMMUTABLE

2017-03-03 Thread Sven R. Kunze

On 03.03.2017 11:43, Geoff Winkless wrote:
>One alternative would be to make to_date accept all language variants
>of months simultaneously. A quick search of google suggests that there
>aren't any overlaps between languages (ie where one language uses
>"Foo" for March and another uses "Foo" for May), although you would
>have to do some more intense research to be sure. As far as I can see
>there's no other reason why to_date would need to be marked as
>stable/volatile, is there?


I don't think so. It could be viable.

>On the down side I imagine it would involve some
>potentially-prohibitively-large lookup tables; it would also end up
>with a technical incompatibility in that what ANSI SQL would reject as
>not-a-date might be parsed as a date.


There is another issue: languages change (admittedly very slowly), but I
wouldn't want PostgreSQL to become incompatible with future generations.

Your performance argument weighs heavier, though.


>I'm not in a position to judge if either of those would be acceptable.


Do you think I should post to pgsql-hackers?


Sven


Re: [GENERAL] ERROR: functions in index expression must be marked IMMUTABLE

2017-03-03 Thread Geoff Winkless
On 1 March 2017 at 14:23, Sven R. Kunze  wrote:

> I don't consider rolling a UDF the best alternative, especially after
> having looked through many solution proposals on the Web which just take a
> mutable expression and wrap it up in an immutable function.
>

One alternative would be to make to_date accept all language variants of
months simultaneously. A quick search of google suggests that there aren't
any overlaps between languages (ie where one language uses "Foo" for March
and another uses "Foo" for May), although you would have to do some more
intense research to be sure. As far as I can see there's no other reason
why to_date would need to be marked as stable/volatile, is there?

On the down side I imagine it would involve some
potentially-prohibitively-large lookup tables; it would also end up with a
technical incompatibility in that what ANSI SQL would reject as not-a-date
might be parsed as a date. I'm not in a position to judge if either of
those would be acceptable.
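
To make the trade-off concrete: a minimal sketch on a hypothetical table
events(datestr text). The wrapper is exactly the pattern Sven criticises
above -- IMMUTABLE is a hand-made promise that the database cannot verify:

-- fails, because to_date() is only marked STABLE
create index on events (to_date(datestr, 'DD Mon YYYY'));
-- ERROR:  functions in index expression must be marked IMMUTABLE

-- the workaround seen all over the web: wrap it and promise by hand
create function to_date_en(text) returns date
language sql immutable as
$$ select to_date($1, 'DD Mon YYYY') $$;

create index on events (to_date_en(datestr));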

Geoff


Re: [GENERAL] DISTINCT vs GROUP BY - was Re: is (not) distinct from

2017-03-03 Thread Sven R. Kunze

On 03.03.2017 06:26, George Neuner wrote:
>I know most people here don't pay much - or any - attention to
>SQLServer, however there was an interesting article recently regarding
>significant performance differences between DISTINCT and GROUP BY as
>used to remove duplicates.
>
>https://sqlperformance.com/2017/01/t-sql-queries/surprises-assumptions-group-by-distinct


On a similar note, this is also an interesting read about the topic of 
distinct vs group by:


https://blogs.oracle.com/developer/entry/counting_with_oracle_is_faster

What's interesting is the performance difference between integers and strings
in PostgreSQL, which doesn't exist for Oracle.


I also tried rewriting "select distinct" as "group by" in PostgreSQL. It
didn't help; it was even worse (see the appendix).



>I'll get around to doing some testing soon.  For now, I am just asking
>if anyone has ever run into something like this?


Yes, my team did. We use Django on a daily basis to generate SQL queries. In
the case of model-spanning queries, a lot of joining and duplication is
involved. DISTINCT is the generally accepted way to remedy the situation, but
it's really as Tom said: DISTINCT is a band-aid here. UNIONs and subselects
would be better, I guess; see the sketch below.
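
A minimal sketch of that idea, on hypothetical tables parent and child
(child.shipped is an invented flag):

-- band-aid: the join fans out, DISTINCT mops up afterwards
select distinct p.id, p.name
from parent p
join child c on c.parent_id = p.id
where c.shipped;

-- subselect: the restriction never multiplies the parent rows
select p.id, p.name
from parent p
where exists (select 1 from child c
              where c.parent_id = p.id and c.shipped);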



Sven


** Appendix **


# \d docs
                          Table "public.docs"
 Column |  Type   | Modifiers
--------+---------+----------------------------------------------------
 id     | integer | not null default nextval('docs_id_seq'::regclass)
 meta   | jsonb   |
Indexes:
    "docs_pkey" PRIMARY KEY, btree (id)


# explain analyze select count(distinct meta->>'blood_group') from docs;

                                QUERY PLAN
--------------------------------------------------------------------------
 Aggregate  (cost=760497.00..760497.01 rows=1 width=449) (actual time=37631.727..37631.727 rows=1 loops=1)
   ->  Seq Scan on docs  (cost=0.00..710497.00 rows=1000 width=449) (actual time=0.500..3999.417 rows=1000 loops=1)
 Planning time: 0.211 ms
 Execution time: 37631.829 ms
(4 rows)


# explain analyze select count(*) from (select meta->>'blood_group' from docs group by meta->>'blood_group') as x;

                                QUERY PLAN
--------------------------------------------------------------------------
 Aggregate  (cost=4441923.83..4441923.84 rows=1 width=0) (actual time=41189.472..41189.472 rows=1 loops=1)
   ->  Group  (cost=4241923.83..4316923.83 rows=1000 width=449) (actual time=31303.690..41189.455 rows=8 loops=1)
         Group Key: ((docs.meta ->> 'blood_group'::text))
         ->  Sort  (cost=4241923.83..4266923.83 rows=1000 width=449) (actual time=31303.686..40475.227 rows=1000 loops=1)
               Sort Key: ((docs.meta ->> 'blood_group'::text))
               Sort Method: external merge  Disk: 129328kB
               ->  Seq Scan on docs  (cost=0.00..735497.00 rows=1000 width=449) (actual time=0.349..6433.691 rows=1000 loops=1)
 Planning time: 2.189 ms
 Execution time: 41203.669 ms
(9 rows)



Re: [GENERAL] PG on SSD

2017-03-03 Thread Achilleas Mantzios

On 03/03/2017 01:31, Scott Marlowe wrote:
>On Thu, Mar 2, 2017 at 12:42 PM, scott ribe  wrote:
>>Is it reasonable to run PG on a mirrored pair of something like the Intel
>>SSD DC 3610 series? (For example:
>>http://ark.intel.com/products/82935/Intel-SSD-DC-S3610-Series-480GB-2_5in-SATA-6Gbs-20nm-MLC)
>>I'd *hope* that anything Intel classifies as a "Data Center SSD" would be
>>reasonably reliable, have actually-working power loss protection etc, but
>>is that the case?
>
>From the spec sheet they certainly seem to be safe against power loss.
>I'd still test by pulling the power cables while running benchmarks to
>be sure.
>
>I've used the other Intel enterprise class ssds with good results on
>the power plug pull tests.




+ Intel not only markets this as a "Data Center SSD"; it also seems to sit at
the high end of the Data Center SSD range.

--
Achilleas Mantzios
IT DEV Lead
IT DEPT
Dynacom Tankers Mgmt





Re: [GENERAL] Querying JSON Lists

2017-03-03 Thread Sven R. Kunze



On 28.02.2017 17:33, Adrian Klaver wrote:
>On 02/26/2017 03:26 AM, Sven R. Kunze wrote:
>>Hello everyone,
>>
>>playing around with jsonb and coming from this SO question
>>http://stackoverflow.com/questions/19925641/check-if-a-postgres-json-array-contains-a-string
>>I wonder why PostgreSQL behaves differently for text and integers on the
>>? and @> operators.
>>
>>Let's have a look at 4 different but similar queries:
>>
>>-- A) ? + text
>>select '{"food": ["12","34","45"]}'::jsonb->'food' ? '12';
>> ?column?
>>----------
>> t
>>
>>-- B) ? + integer
>>select '{"food": [12,34,45]}'::jsonb->'food' ? 12;
>>ERROR:  operator does not exist: jsonb ? integer
>>LINE 1: select '{"food": [12,34,45]}'::jsonb->'food' ? 12;
>>                                                     ^
>>HINT:  No operator matches the given name and argument type(s). You
>>might need to add explicit type casts.


>https://www.postgresql.org/docs/9.6/static/datatype-json.html#JSON-CONTAINMENT
>
>"jsonb also has an existence operator, which is a variation on the
>theme of containment: it tests whether a string (given as a text
>value) appears as an object key or array element at the top level of
>the jsonb value. These examples return true except as noted
>
>-- String exists as array element:
>SELECT '["foo", "bar", "baz"]'::jsonb ? 'bar';
>"



>>-- C) @> + text
>>select '{"food": ["12","34","45"]}'::jsonb->'food' @> '["12"]',
>>       '{"food": ["12","34","45"]}'::jsonb->'food' @> '"12"',
>>       '{"food": ["12","34","45"]}'::jsonb->'food' @> '12';
>> ?column? | ?column? | ?column?
>>----------+----------+----------
>> t        | t        | f
>>
>>-- D) @> + integer
>>select '{"food": [12,34,45]}'::jsonb->'food' @> '[12]',
>>       '{"food": [12,34,45]}'::jsonb->'food' @> '12';
>>       --, '{"food": [12,34,45]}'::jsonb->'food' @> 12;
>> ?column? | ?column?
>>----------+----------
>> t        | t
>>
>>Now my questions:
>>
>>1) Why does A) work? Docs tell us that ? works for keys, not values.
>>2) Why does B) not work although A) works?
>>3) Why do the variants without the brackets on the right side of @> work
>>in C) and D)? Is there json data where their results differ from the
>>ones with the brackets?
>>4) What is the recommended way of testing inclusion in json lists?


>I have not worked through your examples, but I suspect the answers
>lie here:
>
>https://www.postgresql.org/docs/9.6/static/datatype-json.html#JSON-CONTAINMENT
>
>8.14.3. jsonb Containment and Existence


More details, yes, but not really an explanation of the 'why', especially
not of 2) and 3). These feel like holes in the implementation.
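
For what it's worth, a sketch of the one form that seems to answer question
4) uniformly: wrap the candidate in a one-element array and use @>. The
existence operator is only defined as jsonb ? text and only matches string
elements, which is why B) errors out:

select '{"food": ["12","34","45"]}'::jsonb->'food' @> '["12"]';  -- t
select '{"food": [12,34,45]}'::jsonb->'food' @> '[12]';          -- t

-- ? never matches a number element, even one spelled as a string:
select '{"food": [12,34,45]}'::jsonb->'food' ? '12';             -- f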


Sven




[GENERAL] Autoanalyze oddity

2017-03-03 Thread Peter J. Holzer
This is with PostgreSQL 9.5.6 on Debian Linux.

I noticed that according to pg_stat_user_tables autoanalyze has never
run on a lot of tables. Here is one example:

wdsah=> select * from pg_stat_user_tables where schemaname='public' and 
relname='facttable_wds_indexstats';
─[ RECORD 1 ]───┬─
relid   │ 112723
schemaname  │ public
relname │ facttable_wds_indexstats
seq_scan│ 569
seq_tup_read│ 474779212
idx_scan│ 59184
idx_tup_fetch   │ 59184
n_tup_ins   │ 47128
n_tup_upd   │ 0
n_tup_del   │ 0
n_tup_hot_upd   │ 0
n_live_tup  │ 47128
n_dead_tup  │ 0
n_mod_since_analyze │ 47128
last_vacuum │ (∅)
last_autovacuum │ (∅)
last_analyze│ (∅)
last_autoanalyze│ (∅)
vacuum_count│ 0
autovacuum_count│ 0
analyze_count   │ 0
autoanalyze_count   │ 0

wdsah=> select count(*) from facttable_wds_indexstats;
 count  

 857992
(1 row)

So, n_live_tup is way off, and n_tup_ins and n_mod_since_analyze also
seem to be wrong. Looks like this hasn't been updated in a year or so.
But track_counts is on:

wdsah=> show track_counts;
 track_counts 
──
 on
(1 row)

And even if it wasn't, shouldn't the autovacuum daemon notice that
n_mod_since_analyze is greater than n_live_tup *
autovacuum_analyze_scale_factor and run an autoanalyze?
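
For reference, autovacuum's ANALYZE trigger can be approximated in SQL; a
sketch assuming the default settings (autovacuum_analyze_threshold = 50,
autovacuum_analyze_scale_factor = 0.1). Note the daemon compares against
pg_class.reltuples, not n_live_tup:

select s.schemaname, s.relname, s.n_mod_since_analyze,
       50 + 0.1 * c.reltuples as analyze_threshold
from pg_stat_user_tables s
join pg_class c on c.oid = s.relid
where s.n_mod_since_analyze > 50 + 0.1 * c.reltuples;

Any table this query returns should be picked up by the next autovacuum
round.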

But the really weird thing is that pg_stats seems to be reasonably
current: I see entries in most_common_vals which were only inserted in
January. Is it possible that autoanalyze runs without updating
pg_stat_user_tables?

hp


-- 
   _  | Peter J. Holzer| A coding theorist is someone who doesn't
|_|_) || think Alice is crazy.
| |   | h...@hjp.at | -- John Gordon
__/   | http://www.hjp.at/ |http://downlode.org/Etext/alicebob.html

