Re: [GENERAL] referential integrity between elements of an array and another table?

2017-04-19 Thread Tom Lane
Henry M  writes:
> No hesitation. I am just trying to understand the options. This blog seemed
> to indicate the functionality existed but it looks like it was never
> completed (
> https://blog.2ndquadrant.com/postgresql-9-3-development-array-element-foreign-keys/
> ).

Yeah, there was a patch like that submitted, but we couldn't come to
consensus on how the feature should behave.  [ digs in archives... ]
See these threads:

https://www.postgresql.org/message-id/flat/4EB3DF02.4020604%402ndQuadrant.it
https://www.postgresql.org/message-id/flat/1343842863.5162.4.camel%40greygoo.devise-it.lan

> I see in this document
> https://www.postgresql.org/docs/9.6/static/sql-createtrigger.html that
> postgresql uses triggers to implement foreign keys so I am probably
> just missing the syntactic sugar for arrays. I will try to use a trigger.

It's not just a matter of syntactic sugar, there are some pretty basic
questions about what the semantics ought to be.  We also ran up against
implementation/performance problems that we couldn't find passable
solutions to.  Now, a lot of those problems might not apply to a
single-use application that doesn't have to worry about questions like
cross-type comparison semantics; but that's why you don't see such a
feature in current Postgres.

regards, tom lane


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] full text search on hstore or json with materialized view?

2017-04-19 Thread George Neuner
On Thu, 20 Apr 2017 00:55:48 -0400, George Neuner
 wrote:

Doh!

>  SELECT count(distinct s.id)
>FROM  samples_lg_txt AS s
>JOIN  keys AS k ON k.id = s.key
>WHERE (k.name = 'key1' AND s.tsv @@ to_tsquery('value1')
>   OR (k.name = 'key2' AND s.tsv @@ to_tsquery('value2')
>
>  :
>
>  SELECT count(distinct id)
>FROM  samples_lg_txt
>WHERE (key = 'key1' AND tsv @@ to_tsquery('value1')
>   OR (key = 'key2' AND tsv @@ to_tsquery('value2')


All the WHERE clauses need closing parentheses.

Time for bed,
George



-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] full text search on hstore or json with materialized view?

2017-04-19 Thread George Neuner
On Wed, 19 Apr 2017 16:28:13 -0700, Rj Ewing 
wrote:

>okay, messing around a bit more with the secondary k,v table it seems like
>this could be a good solution..
>
>I created a keys table to hold the 63 key values, then I dropped and
>recreated the secondary table, using a FK referencing the keys table. I'm
>not really sure why, but a basic full text query on 44 million row is
>taking aproxx. 20ms.

That pretty much confirms your statistics were bad ... using the FK
table or not wouldn't make any difference to the planner.

But if you are getting 20ms on 44M rows, then one or more of the
following must be true:
 - your text values must be very short
 - your FTS queries must be very simple
 - you aren't reading the results


For comparison: 

I have an application that does FTS on a table of NAICS descriptions
indexed using tsvectors with an average length of 4.8 tokens each.  It
does a 3-part All/Any/None term search.

On my 24-core 2.4GHz server, a single threaded query with the whole
table and index in memory takes ~1 ms to search 20K rows using a
realistic tsquery:  e.g., 

  SELECT code,description
FROM naics
WHERE ts_index @@ to_tsquery('packaged & software & !(wholesale)')

[getting the data out of Postgresql takes longer than the search]


GIN indexes don't exactly scale linearly, and tsquery is, in general,
much more dependent on the lengths of the tsvectors than on the
complexity of the match, but with 44M rows of similarly distributed
data, a similarly realistic query would be expected to take well over
1 second.


My example is genuine but too small to bother parallelizing [mentioned
in a previous message].  Since you are *testing* with 1M records (that
create 44M k:v shards), I am assuming you will need to deal with much
more than that in deployment.  And if you think you need FTS, then you
must be expecting more than simple word matches [as below], else you
might do something simpler like

  SELECT ...
WHERE val ILIKE 
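
If you did go that simpler route at this scale, a pg_trgm index is what
would make the ILIKE filter indexable.  A sketch only, using the
samples_lg_txt layout quoted below (the index name is made up):

  CREATE EXTENSION IF NOT EXISTS pg_trgm;
  CREATE INDEX samples_lg_txt_val_trgm_idx
    ON samples_lg_txt USING gin (val gin_trgm_ops);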



>my table structure is:
>
> Table "public.samples_lg_txt"
> Column |   Type   | Modifiers
>+--+---
> id | integer  |
> key| integer  |
> val| text |
> tsv| tsvector |
>Indexes:
>"idx_tsv_samples_lg_text" gin (tsv)
>Foreign-key constraints:
>"samples_lg_txt_id_fkey" FOREIGN KEY (id) REFERENCES samples_lg(id) 
>ON DELETE CASCADE
>"samples_lg_txt_key_fkey" FOREIGN KEY (key) REFERENCES keys(id)
>
>
>how would I write an AND query that filtered on 2 separate keys from the
>samples_lg_txt table?
>
>something like:
>
>SELECT COUNT(*) FROM samples WHERE id IN ( SELECT DISTINCT(s.id) FROM
>samples_lg_txt s JOIN keys k ON s.key = k.id WHERE (name = 'key1' AND tsv
>@@ to_tsquery('value1')) AND (name = 'key2' AND tsv @@
>to_tsquery('value2')) );

You're overthinking it

  SELECT count(distinct s.id)
FROM  samples_lg_txt AS s
JOIN  keys AS k ON k.id = s.key
    WHERE (k.name = 'key1' AND s.tsv @@ to_tsquery('value1')
       OR (k.name = 'key2' AND s.tsv @@ to_tsquery('value2')


There's actually no need to join if you can use the key name instead
of an integer id.  You can FK on strings, so you can still maintain an
identity table of keys.  E.g.,

> id | integer   |
> key| varchar(32) | FK keys(name) ...
> val| text  |
> tsv| tsvector  |


Then the query could be just

  SELECT count(distinct id)
FROM  samples_lg_txt
    WHERE (key = 'key1' AND tsv @@ to_tsquery('value1')
       OR (key = 'key2' AND tsv @@ to_tsquery('value2')


Just a reminder [it's late here 8-)]: FK columns contain values - not
weird references to the foreign tables.  The constraint just enforces
that any value inserted/updated into the FK column matches an existing
value in the relevant foreign table.
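
In DDL terms, that string-keyed layout would be roughly the following
(a sketch only; names are taken from your earlier message):

  CREATE TABLE keys (
      name  varchar(32) PRIMARY KEY
  );

  CREATE TABLE samples_lg_txt (
      id   integer REFERENCES samples_lg(id) ON DELETE CASCADE,
      key  varchar(32) REFERENCES keys(name),
      val  text,
      tsv  tsvector
  );

  CREATE INDEX idx_tsv_samples_lg_text ON samples_lg_txt USING gin (tsv);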


George



-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


[GENERAL] cluster on brin indexes?

2017-04-19 Thread Samuel Williams
I see this, but no follow up:

https://www.postgresql.org/message-id/CAEepm%3D2LUCLZ2J4cwPv5DisHqD9BE_AXnqHGqf0Tj-cvtiqVcQ%40mail.gmail.com

So, is it possible or not?


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] full text search on hstore or json with materialized view?

2017-04-19 Thread Jeff Janes
Please don't top-post, thanks.

On Tue, Apr 18, 2017 at 3:18 PM, Rj Ewing  wrote:

> On Tue, Apr 18, 2017 at 3:00 PM, Bruce Momjian  wrote:
>
>>
>> Full text search of JSON and JSONB data is coming in Postgres 10, which
>> is to be released in September of this year:
>>
>> https://www.depesz.com/2017/04/04/waiting-for-postgresql-10-
>> full-text-search-support-for-json-and-jsonb/
>
>

> A step in the right direction for me, however it doesn't appear to support
> per field full text searching.
> It is exciting though!
>


Your best bet might be to ignore the per-field searching in the initial
(indexed) pass of the query to get everything that has all the search
terms, regardless of which field they occur in.  And then re-check whether
each of the found values was found in the appropriate field in a later pass.

Something like

select * from sample where
 to_tsvector(json_thing->>:key1) @@ :value1
  and to_tsvector(json_thing->>:key2) @@ :value2
  and to_tsvector('english',json_thing) @@ (:value1 && :value2)
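
The last (whole-document) condition is the one you would back with an
index; a sketch, assuming a PG 10 json/jsonb column named json_thing (the
index name is made up):

  CREATE INDEX sample_json_fts_idx
    ON sample
    USING gin (to_tsvector('english', json_thing));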

From the initial email:

> An idea that has come up is to use a materialized view or secondary table
with triggers, where we would have 3 columns (id, key, value).

How would this be different from the "triple store" you are abandoning?

Cheers,

Jeff


Re: [GENERAL] Large data and slow queries

2017-04-19 Thread Samuel Williams
Okay, so after changing longitude/latitude to float4, and
re-organizing the table a bit, I got the query down from about 8
minutes to 40 seconds.

The details are in the gist comments:
https://gist.github.com/ioquatix/bddda36d9e4ffaceb7a62d7b62259121

Now, just need to get performance another 2 orders of magnitude
better. Can we make the index a bit more coarse grained, perhaps
convert long/lat to integers or something, use a hilbert index, use a
postgis index... ideas?
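
If we went the PostGIS route, I guess it would look roughly like this
(a rough sketch only; the UPDATE would obviously be expensive at this row
count):

  CREATE EXTENSION IF NOT EXISTS postgis;

  ALTER TABLE user_event ADD COLUMN geom geometry(Point, 4326);

  UPDATE user_event
     SET geom = ST_SetSRID(ST_MakePoint(longitude, latitude), 4326);

  CREATE INDEX index_user_event_geom ON user_event USING gist (geom);

  -- bounding-box filters then become something like:
  --   WHERE geom && ST_MakeEnvelope(174.63, -37.03, 175.08, -36.67, 4326)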


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] full text search on hstore or json with materialized view?

2017-04-19 Thread George Neuner


Please don't top post.
https://en.wikipedia.org/wiki/Posting_style#Placement_of_replies
https://en.wikipedia.org/wiki/Posting_style#Choosing_the_proper_posting_style



>> On Tue, 18 Apr 2017 14:38:15 -0700, Rj Ewing 
>> wrote:
>>
>> > :
>> >An idea that has come up is to use a materialized view or secondary table
>> >with triggers, where we would have 3 columns (id, key, value).
>> >
>> >I think this would allow us to store a tsvector and gin index. Giving us
>> >the ability to use fulltext search on k:v pairs, then join the original
>> >data on the id field to return the entire record.
>> > :


On Wed, 19 Apr 2017 11:57:26 -0700, Rj Ewing 
wrote:

>I did some testing using a secondary table with the key, value column.
>However I don't think this will provide the performance that we need.
>Queries we taking 60+ seconds just for a count.

SELECT count(*) or filtered?
 
Either way, your statistics may be way off.  Did you vacuum analyze
the table after the inserts (and the index creation if it was done
separately)?


>With 1 million rows in the primary table, this resulted in 44 million rows
>in the secondary k,v table for full text searching. The same query is es
>takes ~50 ms on my local machine with 1/10th the ram allocated to es then
>was allocated to psql.
>
>I'm gonna test using trigrams indexes on approx 10 json fields, and see if
>that gives us what we are looking for.
>
>any thought on getting sub 1 sec queries on a table with 44 million rows?
>
>RJ

Based on your description of the data [at top], I rather doubt
trigrams will be an improvement over tsvector.  And they're more
cumbersome to use if you don't need better similarity matching than
what tsvector offers [which itself is somewhat adjustable via
dictionaries].

Without more detail re: your hardware, Postgresql version, what
indexes are/will be available, the types of queries you want to run,
etc., it's very hard to give really meaningful suggestions.


The kind of query you have alluded to is pretty easily parallelized:
it can be spread over multiple sessions with result aggregation done
on the client side.

Or, if you have 9.6, you might try using backend parallelism:
https://www.postgresql.org/docs/9.6/static/parallel-query.html
[I've not used this, but some people have done it successfully.]
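
The knobs for that are roughly the following (illustrative values only):

  # postgresql.conf (9.6)
  max_worker_processes = 8              # change requires a restart
  max_parallel_workers_per_gather = 4   # 0 disables parallel query

  -- or per session:
  SET max_parallel_workers_per_gather = 4;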


If you can restrict the FTS query to certain keys:

  SELECT id FROM mytable
WHERE tsquery( ... ) @@ to_tsvector(v)
AND k IN ( ... )
GROUP BY id

  [note: according to David Rowley, GROUP BY may be parallelized
   whereas DISTINCT currently cannot be.]

then given an index on 'k' it may be much faster than just the FTS
query alone.  Subject to key variability, it also may be improved by
table partitioning to reduce the search space.

If the FTS query is key restricted, you can parallelize either on the
client or on the server.  If the FTS query is not key restricted, you
pretty much are limited to server side (and 9.6 or later).


And I'm out of suggestions for now.

Parallel query is your best bet for maximum performance, but unless
you have enough RAM to hold the entire table and its indexes, and all
the query workspaces, then I doubt you will be able to get anywhere
near your optimistic execution target for FTS on 40+ million rows.


YMMV,
George



-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] referential integrity between elements of an array and another table?

2017-04-19 Thread Ken Tanzer
For me, foreign keys for arrays would be incredibly helpful.  For simple
checkbox-type options, it's _much_ simpler to just put them in an array,
and I work with a UI that can handle the arrays well.  I do hope this makes
it into Postgresql someday!  In the meantime, I've mostly sacrificed the
referential integrity in favor of the convenience.  Just my two cents.

Cheers,
Ken


-- 
AGENCY Software
A Free Software data system
By and for non-profits
http://agency-software.org/
https://agency-software.org/demo/client
ken.tan...@agency-software.org
(253) 245-3801

Subscribe to the mailing list to learn more about AGENCY or follow the
discussion.


Re: [GENERAL] referential integrity between elements of an array and another table?

2017-04-19 Thread Henry M
No hesitation. I am just trying to understand the options. This blog seemed
to indicate the functionality existed but it looks like it was never
completed (
https://blog.2ndquadrant.com/postgresql-9-3-development-array-element-foreign-keys/
).

I see in this document
https://www.postgresql.org/docs/9.6/static/sql-createtrigger.html that
postgresql uses triggers to implement foreign keys so I am probably
just missing the syntactic sugar for arrays. I will try to use a trigger.
Thank you.


On Wed, Apr 19, 2017 at 12:24 PM Rob Sargent  wrote:

>
>
> On 04/19/2017 01:13 PM, Henry M wrote:
>
> I was just reading this question on reddit (the text duplicated below). I
> was wondering if there is an approach for handling array foreign key
> references. I am interested in the answer since I started using array
> fields as well. Thank you.
>
>
> - below is the message from the reddit poster:
>
>
> https://www.reddit.com/r/PostgreSQL/comments/66a74t/question_referential_integrity_between_elements/
>
> First off, thanks for taking the time. I don't see any "weekly help
> threads" or anything so I hope it's okay I made a post.
>
> *The Problem*
>
> I'm extending our CRM database in order to manage the ISO documentation of
> the company for which I work. We need, as part of ISO, to keep track of
> project development meetings and attendees.
>
> Currently, I have a "meetings" table with an attendee list as an array of
> integers, each element corresponding to the id field in the "person" table.
>
> How do I enforce referential integrity between elements of this array and
> the id field in "person"? All I could find was a proposed change for 9.2 or
> 9.3 that got shot down...
>
> Please help. I don't want to do a join table.
>
> Unless you have thousands of attendees per meeting and everyone in
> multiple meetings per day I would encourage a join table.  What's your
> hesitation for doing so?  Note I'm an array user myself, so not against
> arrays per se, but there should be a justification for denormalizing
> (depending your definition thereof).
>


Re: [GENERAL] full text search on hstore or json with materialized view?

2017-04-19 Thread Rj Ewing
okay, messing around a bit more with the secondary k,v table it seems like
this could be a good solution..

I created a keys table to hold the 63 key values, then I dropped and
recreated the secondary table, using a FK referencing the keys table. I'm
not really sure why, but a basic full text query on 44 million rows is
taking approx. 20 ms.

my table structure is:

 Table "public.samples_lg_txt"
 Column |   Type   | Modifiers
+--+---
 id | integer  |
 key| integer  |
 val| text |
 tsv| tsvector |
Indexes:
"idx_tsv_samples_lg_text" gin (tsv)
Foreign-key constraints:
"samples_lg_txt_id_fkey" FOREIGN KEY (id) REFERENCES samples_lg(id) ON
DELETE CASCADE
"samples_lg_txt_key_fkey" FOREIGN KEY (key) REFERENCES keys(id)


how would I write an AND query that filtered on 2 separate keys from the
samples_lg_txt table?

something like:

SELECT COUNT(*) FROM samples WHERE id IN ( SELECT DISTINCT(s.id) FROM
samples_lg_txt s JOIN keys k ON s.key = k.id WHERE (name = 'key1' AND tsv
@@ to_tsquery('value1')) AND (name = 'key2' AND tsv @@
to_tsquery('value2')) );

On Wed, Apr 19, 2017 at 11:57 AM, Rj Ewing  wrote:

> I did some testing using a secondary table with the key, value column.
> However I don't think this will provide the performance that we need.
> Queries we taking 60+ seconds just for a count.
>
> With 1 million rows in the primary table, this resulted in 44 million rows
> in the secondary k,v table for full text searching. The same query is es
> takes ~50 ms on my local machine with 1/10th the ram allocated to es then
> was allocated to psql.
>
> I'm gonna test using trigrams indexes on approx 10 json fields, and see if
> that gives us what we are looking for.
>
> any thought on getting sub 1 sec queries on a table with 44 million rows?
>
> RJ
>
> On Tue, Apr 18, 2017 at 10:35 PM, George Neuner 
> wrote:
>
>> On Tue, 18 Apr 2017 14:38:15 -0700, Rj Ewing 
>> wrote:
>>
>> >I am evaluating postgres for as a datastore for our webapp. We are moving
>> >away from a triple store db due to performance issues.
>> >
>> >Our data model consists of sets of user defined attributes. Approx 10% of
>> >the attributes tend to be 100% filled with 50% of the attributes having
>> >approx 25% filled. This is fairly sparse data, and it seems that jsonb or
>> >hstore will be best for us.
>> >
>> >Unfortunately, from my understanding, postres doesn't support fulltext
>> >search across hstore or jsonb key:values or even the entire document.
>> While
>> >this is not a deal breaker, this would be a great feature to have. We
>> have
>> >been experimenting w/ elasticsearch a bit, and particularly enjoy this
>> >feature, however we don't really want to involve the complexity and
>> >overhead of adding elasticsearch in front of our datasource right now.
>>
>> hstore and JSON values all really are just formatted text with a
>> custom column type.  You can create tsvectors from the values if you
>> cast them to text.
>>
>> Note that a tsvector can only work on a /flat/ key:value structure: it
>> won't understand nesting, and even with a flat store it won't
>> understand the difference between keys/tags and the associated values.
>>
>> E.g., you will be able to see that a value contains both "foo" and
>> "bar", but to distinguish 'foo:bar' from 'bar:foo' or 'foo:q,bar:z'
>> you either must check the token positions (from the tsvector) or *try*
>> to extract the key(s) you are interested in and check the associated
>> value(s).
>>
>> This might work ok if you search only for keys in a "document" ... but
>> trying to search values, I think would be far too complicated.
>>
>> It might help if you stored a 2D array instead of a flat structure,
>> but even that would be fairly complicated to work with.
>>
>>
>>
>> >An idea that has come up is to use a materialized view or secondary table
>> >with triggers, where we would have 3 columns (id, key, value).
>> >
>> >I think this would allow us to store a tsvector and gin index. Giving us
>> >the ability to use fulltext search on k:v pairs, then join the original
>> >data on the id field to return the entire record.
>>
>> This is a much better idea because it separates the key from the
>> value, and unlike the full "document" case [above], you will know that
>> the FTS index is covering only the values.
>>
>> If you need to preserve key order to reconstruct records, you will
>> need an additional column to maintain that ordering.
>>
>>
>> >is anyone currently doing this? Is there a better alternative? Any
>> >performance issues that immediately jump out ( I realize the writes will
>> >take longer)?
>> >
>> >the nature of our data is "relatively" static with bulk uploads (100 -
>> 1000
>> >records). So we can sacrifice some write performance.
>> >
>> >RJ
>>
>> Having to "reconstruct" records will make reads take longer as well,
>> but I think separating the keys and values is the best way to do it.

Re: [GENERAL] Unable to upload backups

2017-04-19 Thread JP Jacoupy
Hi,

Seems to me your role "ronb" doesn't have the rights to create schemas in your 
database.

Please refer to the GRANT command in the documentation. 
https://www.postgresql.org/docs/9.0/static/sql-grant.html
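
Schema creation is governed by the CREATE privilege on the database, so 
something along these lines (run as a superuser or the database owner) is 
probably what's missing:

GRANT CREATE ON DATABASE foldertest TO ronb;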

Sent from [ProtonMail](https://protonmail.ch), encrypted email based in 
Switzerland.

 Original Message 
Subject: [GENERAL] Unable to upload backups
Local Time: 19 avril 2017 12:56 PM
UTC Time: 19 avril 2017 10:56
From: ronb...@walla.co.il
To: pgsql-general@postgresql.org

Hi,
I'm using PostgreSQL 9.3.2
I'm running the command:

psql -h testserver -U ronb -f backup.sql -q -d foldertest 2>error.txt 
>output.txt

This should generate my database in foldertest

However this doesn't work. It's unable to create schemas

in the error.txt i see "permission denied for database foldertest".

I know this is not an access permission issue because there is a public schema, 
which is built-in, and it does create the tables/data in there.

It just can't create new schemas.

The interesting thing is that if I do:

psql -h testserver -U postgres -f backup.sql -q -d foldertest 2>error.txt 
>output.txt

Everything works. It creates all schemas and generates the database correctly.

I don't see any difference in pg_hba.conf between the postgres and ronb users.

What can be the problem?

Re: [GENERAL] UDP buffer drops / statistics collector

2017-04-19 Thread Alban Hertroys

> On 19 Apr 2017, at 20:36, Tim Kane  wrote:
> 
> Well, this is frustrating..
> The buffer drops are still occurring - so I thought it worth trying use a 
> ramdisk and set stats_temp_directory accordingly.
> 
> I've reloaded the instance, and can see that the stats directory is now being 
> populated in the new location.  Except - there is one last file 
> (pgss_query_texts.stat) that continues to be updated in the old pg_stat_tmp 
> path..  Is that supposed to happen?
> 
> 
> Fairly similar to this guy (but not quite the same).
> https://www.postgresql.org/message-id/d6e71befad7beb4fbcd8ae74fadb1265011bb40fc...@win-8-eml-ex1.eml.local
> 
> I can see the packets arriving and being consumed by the collector..  and, 
> the collector is indeed updating in the new stats_temp_directory.. just not 
> for that one file.
> 
> 
> It also failed to resolve the buffer drops.. At this point, I'm not sure I 
> expected it to.  They tend to occur semi-regularly (every 8-13 minutes) but I 
> can't correlate them with any kind of activity (and if I'm honest, it's 
> possibly starting to drive me a little bit mad).

This rings a bell for me. I recently had a similar issue in an MMO (Windows) 
where every 15 minutes I would get a number of consecutive freezes in-game. You 
could set your alarm by it, so regular.

That suddenly went away after I rearranged my home-network (for unrelated 
reasons), which incidentally moved several connections from the switch the 
game-system was connected to to another switch. I never pinpointed it to UDP, 
but then again, TCP would correct for the lost transfers (probably at the cost 
of UDP traffic).

Perhaps you have a switch somewhere that's overburdened?

Alban Hertroys
--
If you can't see the forest for the trees,
cut the trees and you'll find there is no forest.



-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] referential integrity between elements of an array and another table?

2017-04-19 Thread Adrian Klaver

On 04/19/2017 12:13 PM, Henry M wrote:

I was just reading this question on reddit (the text duplicated below).
I was wondering if there is an approach for handling array foreign key
references. I am interested in the answer since I started using array
fields as well. Thank you.


The only thing I can think of is a faux FK: an INSERT/UPDATE trigger that 
verifies that the ids in the array are in the person table.
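
A minimal sketch of what I mean, using the meetings/person tables from the 
post below (the attendees column name is my guess):

CREATE OR REPLACE FUNCTION check_attendees() RETURNS trigger AS $$
BEGIN
    -- fail if any id in the array has no matching row in person
    IF EXISTS (
        SELECT 1
        FROM unnest(NEW.attendees) AS a(person_id)
        LEFT JOIN person p ON p.id = a.person_id
        WHERE p.id IS NULL
    ) THEN
        RAISE EXCEPTION 'attendee id not found in person';
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER meetings_attendees_fk
    BEFORE INSERT OR UPDATE ON meetings
    FOR EACH ROW EXECUTE PROCEDURE check_attendees();

Note it only guards inserts/updates on meetings; deletes from person would 
still need their own trigger.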





- below is the message from the reddit poster:

https://www.reddit.com/r/PostgreSQL/comments/66a74t/question_referential_integrity_between_elements/

First off, thanks for taking the time. I don't see any "weekly help
threads" or anything so I hope it's okay I made a post.

/The Problem/

I'm extending our CRM database in order to manage the ISO documentation
of the company for which I work. We need, as part of ISO, to keep track
of project development meetings and attendees.

Currently, I have a "meetings" table with an attendee list as an array
of integers, each element corresponding to the id field in the "person"
table.

How do I enforce referential integrity between elements of this array
and the id field in "person"? All I could find was a proposed change for
9.2 or 9.3 that got shot down...

Please help. I don't want to do a join table.




--
Adrian Klaver
adrian.kla...@aklaver.com


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] referential integrity between elements of an array and another table?

2017-04-19 Thread Rob Sargent



On 04/19/2017 01:13 PM, Henry M wrote:
I was just reading this question on reddit (the text duplicated 
below). I was wondering if there is an approach for handling array 
foreign key references. I am interested in the answer since I started 
using array fields as well. Thank you.



- below is the message from the reddit poster:

https://www.reddit.com/r/PostgreSQL/comments/66a74t/question_referential_integrity_between_elements/ 



First off, thanks for taking the time. I don't see any "weekly help 
threads" or anything so I hope it's okay I made a post.


/The Problem/

I'm extending our CRM database in order to manage the ISO 
documentation of the company for which I work. We need, as part of 
ISO, to keep track of project development meetings and attendees.


Currently, I have a "meetings" table with an attendee list as an array 
of integers, each element corresponding to the id field in the 
"person" table.


How do I enforce referential integrity between elements of this array 
and the id field in "person"? All I could find was a proposed change 
for 9.2 or 9.3 that got shot down...


Please help. I don't want to do a join table.

Unless you have thousands of attendees per meeting and everyone in 
multiple meetings per day I would encourage a join table.  What's your 
hesitation for doing so?  Note I'm an array user myself, so not against 
arrays per se, but there should be a justification for denormalizing 
(depending on your definition thereof).
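
For the record, the join table amounts to something like this (a sketch; 
assumes meetings has an integer id):

CREATE TABLE meeting_attendee (
    meeting_id  integer NOT NULL REFERENCES meetings(id),
    person_id   integer NOT NULL REFERENCES person(id),
    PRIMARY KEY (meeting_id, person_id)
);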


[GENERAL] referential integrity between elements of an array and another table?

2017-04-19 Thread Henry M
I was just reading this question on reddit (the text duplicated below). I
was wondering if there is an approach for handling array foreign key
references. I am interested in the answer since I started using array
fields as well. Thank you.


- below is the message from the reddit poster:

https://www.reddit.com/r/PostgreSQL/comments/66a74t/question_referential_integrity_between_elements/

First off, thanks for taking the time. I don't see any "weekly help
threads" or anything so I hope it's okay I made a post.

*The Problem*

I'm extending our CRM database in order to manage the ISO documentation of
the company for which I work. We need, as part of ISO, to keep track of
project development meetings and attendees.

Currently, I have a "meetings" table with an attendee list as an array of
integers, each element corresponding to the id field in the "person" table.

How do I enforce referential integrity between elements of this array and
the id field in "person"? All I could find was a proposed change for 9.2 or
9.3 that got shot down...

Please help. I don't want to do a join table.


Re: [GENERAL] full text search on hstore or json with materialized view?

2017-04-19 Thread Rj Ewing
I did some testing using a secondary table with the key, value column.
However I don't think this will provide the performance that we need.
Queries were taking 60+ seconds just for a count.

With 1 million rows in the primary table, this resulted in 44 million rows
in the secondary k,v table for full text searching. The same query in es
takes ~50 ms on my local machine with 1/10th the RAM allocated to es than
was allocated to psql.

I'm gonna test using trigrams indexes on approx 10 json fields, and see if
that gives us what we are looking for.

any thought on getting sub 1 sec queries on a table with 44 million rows?

RJ

On Tue, Apr 18, 2017 at 10:35 PM, George Neuner 
wrote:

> On Tue, 18 Apr 2017 14:38:15 -0700, Rj Ewing 
> wrote:
>
> >I am evaluating postgres for as a datastore for our webapp. We are moving
> >away from a triple store db due to performance issues.
> >
> >Our data model consists of sets of user defined attributes. Approx 10% of
> >the attributes tend to be 100% filled with 50% of the attributes having
> >approx 25% filled. This is fairly sparse data, and it seems that jsonb or
> >hstore will be best for us.
> >
> >Unfortunately, from my understanding, postres doesn't support fulltext
> >search across hstore or jsonb key:values or even the entire document.
> While
> >this is not a deal breaker, this would be a great feature to have. We have
> >been experimenting w/ elasticsearch a bit, and particularly enjoy this
> >feature, however we don't really want to involve the complexity and
> >overhead of adding elasticsearch in front of our datasource right now.
>
> hstore and JSON values all really are just formatted text with a
> custom column type.  You can create tsvectors from the values if you
> cast them to text.
>
> Note that a tsvector can only work on a /flat/ key:value structure: it
> won't understand nesting, and it and even with a flat store it won't
> understand the difference between keys/tags and the associated values.
>
> E.g., you will be able to see that a value contains both "foo" and
> "bar", but to distinguish 'foo:bar' from 'bar:foo' or 'foo:q,bar:z'
> you either must check the token positions (from the tsvector) or *try*
> to extract the key(s) you are interested in and check the associated
> value(s).
>
> This might work ok if you search only for keys in a "document" ... but
> trying to search values, I think would be far too complicated.
>
> It might help if you stored a 2D array instead of a flat structure,
> but even that would be fairly complicated to work with.
>
>
>
> >An idea that has come up is to use a materialized view or secondary table
> >with triggers, where we would have 3 columns (id, key, value).
> >
> >I think this would allow us to store a tsvector and gin index. Giving us
> >the ability to use fulltext search on k:v pairs, then join the original
> >data on the id field to return the entire record.
>
> This is a much better idea because it separates the key from the
> value, and unlike the full "document" case [above], you will know that
> the FTS index is covering only the values.
>
> If you need to preserve key order to reconstruct records, you will
> need an additional column to maintain that ordering.
>
>
> >is anyone currently doing this? Is there a better alternative? Any
> >performance issues that immediately jump out ( I realize the writes will
> >take longer)?
> >
> >the nature of our data is "relatively" static with bulk uploads (100 -
> 1000
> >records). So we can sacrifice some write performance.
> >
> >RJ
>
> Having to "reconstruct" records will make reads take longer as well,
> but I think separating the keys and values is the best way to do it.
>
>
> YMMV,
> George
>
>
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general
>


Re: [GENERAL] streaming replication and archive_status

2017-04-19 Thread Luciano Mittmann
2017-04-18 22:46 GMT-03:00 Jeff Janes :

> On Tue, Apr 18, 2017 at 5:20 AM, Luciano Mittmann 
> wrote:
>
>>
>>
>> Hi Jeff,
>>
>> Does each file in pg_xlog/archive_status/ have a corresponding file
>> one directory up?
>>
>> no corresponding file on pg_xlog directory. That is the question.. for
>> some reason or some parameter that I do not know, the files are
>> considered consumed but are not erased later.
>>
>
> I can see how a well-timed crash could leave behind a few .done files, but
> not 75 thousand of them.
>
> Are they still accumulating, or was it only an historical accumulation?
>
> Also, is this on Windows?
>
> Cheers,
>
> Jeff
>

Hey Jeff,

they are still accumulating, 78k today.
Running on SLES 12.1

regards,


Re: [GENERAL] UDP buffer drops / statistics collector

2017-04-19 Thread Tim Kane
Well, this is frustrating..
The buffer drops are still occurring - so I thought it worth trying use a
ramdisk and set *stats_temp_directory* accordingly.

I've reloaded the instance, and can see that the stats directory is now
being populated in the new location.  *Except* - there is one last file (
pgss_query_texts.stat) that continues to be updated in the *old* pg_stat_tmp
path..  Is that supposed to happen?
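
For reference, the ramdisk setup amounts to something like this (mount
point and size are arbitrary):

  mkdir -p /var/run/pg_stats_tmp
  mount -t tmpfs -o size=64M tmpfs /var/run/pg_stats_tmp
  chown postgres:postgres /var/run/pg_stats_tmp

  # postgresql.conf, picked up on reload
  stats_temp_directory = '/var/run/pg_stats_tmp'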


Fairly similar to this guy (but not quite the same).
https://www.postgresql.org/message-id/d6e71befad7beb4fbcd8ae74fadb1265011bb40fc...@win-8-eml-ex1.eml.local

I can see the packets arriving and being consumed by the collector..  and,
the collector is indeed updating in the new stats_temp_directory.. just not
for that one file.


It also failed to resolve the buffer drops.. At this point, I'm not sure I
expected it to.  They tend to occur semi-regularly (every 8-13 minutes) but
I can't correlate them with any kind of activity (and if I'm honest, it's
possibly starting to drive me a little bit mad).




On Tue, Apr 18, 2017 at 2:53 PM Tim Kane  wrote:

> Okay, so I've run an strace on the collector process during a buffer drop
> event.
> I can see evidence of a recvfrom loop pulling in a *maximum* of 142kb.
>
> While I've had already increased rmem_max, it would appear this is not
> being observed by the kernel.
> rmem_default is set to 124kb, which would explain the above read maxing
> out just slightly beyond this (presuming a ring buffer filling up behind
> the read).
>
> I'm going to try increasing rmem_default and see if it has any positive
> effect.. (and then investigate why the kernel doesn't want to consider
> rmem_max)..
>
>
>
>
>
> On Tue, Apr 18, 2017 at 8:05 AM Tim Kane  wrote:
>
>> Hi all,
>>
>> I'm seeing sporadic (but frequent) UDP buffer drops on a host that so far
>> I've not been able to resolve.
>>
>> The drops are originating from postgres processes, and from what I know -
>> the only UDP traffic generated by postgres should be consumed by the
>> statistics collector - but for whatever reason, it's failing to read the
>> packets quickly enough.
>>
>> Interestingly, I'm seeing these drops occur even when the system is
>> idle..  but every 15 minutes or so (not consistently enough to isolate any
>> particular activity) we'll see in the order of ~90 packets dropped at a
>> time.
>>
>> I'm running 9.6.2, but the issue was previously occurring on 9.2.4 (on
>> the same hardware)
>>
>>
>> If it's relevant..  there are two instances of postgres running (and
>> consequently, 2 instances of the stats collector process) though 1 of those
>> instances is most definitely idle for most of the day.
>>
>> In an effort to try to resolve the problem, I've increased (x2) the UDP
>> recv buffer sizes on the host - but it seems to have had no effect.
>>
>> cat /proc/sys/net/core/rmem_max
>> 1677216
>>
>> The following parameters are configured
>>
>> track_activities on
>> track_counts on
>> track_functions  none
>> track_io_timing  off
>>
>>
>> There are approximately 80-100 connections at any given time.
>>
>> It seems that the issue started a few weeks ago, around the time of a
>> reboot on the given host... but it's difficult to know what (if anything)
>> has changed, or why :-/
>>
>>
>> Incidentally... the documentation doesn't seem to have any mention of UDP
>> whatsoever.  I'm going to use this as an opportunity to dive into the
>> source - but perhaps it's worth improving the documentation around this?
>>
>> My next step is to try disabling track_activities and track_counts to
>> see if they improve matters any, but I wouldn't expect these to generate
>> enough data to flood the UDP buffers :-/
>>
>> Any ideas?
>>
>>
>>
>>


Re: [GENERAL] Recover corrupted data

2017-04-19 Thread John R Pierce

On 4/19/2017 9:24 AM, Alexandre wrote:

2) We dont use RAID.


so just a direct attached single disk drive?  is it perchance a 
'desktop' type disk?  those often have dodgy write buffering and lie 
about writes (saying they are complete when they are just in a volatile 
ram buffer).   sometimes you can turn this behavior off, depending on 
the brand and model drive, but do note it will slow your system down a 
fair bit.



--
john r pierce, recycling bits in santa cruz



--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Recover corrupted data

2017-04-19 Thread George Neuner
On Wed, 19 Apr 2017 13:25:41 -0300, Alexandre 
wrote:

>  :
>But there is no solution for [file corruption]?


The only solutions are to guard against it: make frequent backups and
make use of safety mechanisms in Postgresql and in the OS.

Postgresql logs (WAL) intended changes to the database before it makes
them.  NTFS *can* do similar change logging for files - but its
logging may or may not be turned on by default.


If you are using NTFS on a hard disk, then for maximum crash
resistance make sure that both journaling (usn) and self-healing
(repair) are turned on.

If the hard disk was formatted by a (relatively) recent version of
Windows, then it is likely that journaling is on already.  But best to
check because prior to Vista the default was OFF, and a number of
internet "tweak" sites advise to turn off journaling deliberately to
enhance write performance.  Disabling journaling is [maybe] acceptable
for a personal workstation, but not for a server.


If you are using SSD, then OS journaling will be off by default.  If
the SSD is battery backed, then journaling *probably* is not necessary
and you can choose whether to trade enhanced crash resistance against
increased SSD wear and (slightly) reduced write performance.


See:

fsutil usn ...
https://technet.microsoft.com/en-us/library/cc788042(v=ws.11).aspx

fsutil repair ...
https://technet.microsoft.com/en-us/library/ff621565(v=ws.11).aspx


George



-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Why so long?

2017-04-19 Thread David Rowley
On 20 April 2017 at 03:24, Steve Clark  wrote:
> pmacct=# explain select min(id) from netflow;
>QUERY PLAN
> -
>  Result  (cost=1.13..1.14 rows=1 width=0)
>InitPlan 1 (returns $0)
>  ->  Limit  (cost=0.71..1.13 rows=1 width=8)
>->  Index Only Scan using netflow_pkey on netflow
> (cost=0.71..3799108784.10 rows=9123246080 width=8)
>  Index Cond: (id IS NOT NULL)
> (5 rows)
>
> pmacct=# \timing
> Timing is on.
> pmacct=# select max(id) from netflow;
>  max
> -
>  17547256873
> (1 row)
>
> Time: 0.626 ms
>
>
>
> pmacct=# select min(id) from netflow;
> ^CCancel request sent
> ERROR:  canceling statement due to user request
> Time: 339114.334 ms

It may help the diagnosis if you run an EXPLAIN (ANALYZE, BUFFERS)
version of this query to completion.

The index pages could simply be cold and coming from disk on a very
much I/O starved system.

More recently added pages are more likely to be cached.

You may also want to consider running the EXPLAIN (ANALYZE, BUFFERS)
after having SET track_io_timing = on;
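
i.e. something like:

  SET track_io_timing = on;
  EXPLAIN (ANALYZE, BUFFERS) SELECT min(id) FROM netflow;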

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Recover corrupted data

2017-04-19 Thread Alexandre
It appears to be just one table; I'm trying to back up without that table.

But is there no solution for this kind of error?

On Wed, Apr 19, 2017 at 1:11 PM, Moreno Andreo 
wrote:

> On 19/04/2017 17:49, Vick Khera wrote:
>
> 1) restore from backup
> 2) fix whatever configuration you made to let windows (or your hardware)
> destroy your data on crash. is there some RAID cache that is not backed up
> by a battery?
>
>
> IMHO If there's no hurry, it'd be better to start with point 2, because if
> your filesystem (or hardware) is somehow badly broken, it will happen
> again...
> If you have more than one database in your cluster, you can find what's
> the database that's been corrupted and restore just that one, instead of
> the whole cluster.
> Cheers
> Moreno
>
>
>
> On Wed, Apr 19, 2017 at 10:18 AM, Alexandre  wrote:
>
>> Hello,
>>
>> The computer had a unexpected shutdown, it is a Windows machine.
>> Now some data appears to be corrupted, I am receiving exceptions like
>> this:
>>
>> ERROR: could not read block 0 in file "base/16393/16485": read only
>> 0 of 8192 bytes
>>
>> There is some way to correct this?
>>
>
>
>


Re: [GENERAL] Recover corrupted data

2017-04-19 Thread Alexandre
1) We have a backup but it's from last month; I will try to back up the
data without the table that raises the exception.
2) We dont use RAID.

Thank you

On Wed, Apr 19, 2017 at 12:49 PM, Vick Khera  wrote:

> 1) restore from backup
> 2) fix whatever configuration you made to let windows (or your hardware)
> destroy your data on crash. is there some RAID cache that is not backed up
> by a battery?
>
>
> On Wed, Apr 19, 2017 at 10:18 AM, Alexandre  wrote:
>
>> Hello,
>>
>> The computer had a unexpected shutdown, it is a Windows machine.
>> Now some data appears to be corrupted, I am receiving exceptions like
>> this:
>>
>> ERROR: could not read block 0 in file "base/16393/16485": read only
>> 0 of 8192 bytes
>>
>> There is some way to correct this?
>>
>
>


Re: [GENERAL] Why so long?

2017-04-19 Thread Steve Clark
On 04/19/2017 11:57 AM, Jeff Janes wrote:
> On Wed, Apr 19, 2017 at 8:24 AM, Steve Clark  > wrote:
>
> Hello,
>
> I am confused. I have a table that has an incrementing primary key id.
>
> When I select max(id) from table is returns almost instantly but
> when I select min(id) from table it takes longer than I want to wait.
>
> Shouldn't postgresql be able to quickly find the minimum id value in the 
> index?
>
>
> Not if the low end of the index is stuffed full of obsolete entries, which 
> haven't been cleaned up because it is not being vacuumed often enough.
>
> Do you have autovacuum on?  Have you manually vacuumed the table recently?
>
> Cheers,
>
> Jeff
Hi Jeff,

Autovacuum is turned on.

 schemaname | relname | last_vacuum |        last_autovacuum        | vacuum_count | autovacuum_count
------------+---------+-------------+-------------------------------+--------------+------------------
 public     | netflow |             | 2017-04-11 01:18:53.261221-04 |            0 |                1


It is a large table.
select pg_size_pretty(pg_relation_size('netflow'));
 pg_size_pretty

 1267 GB

select pg_size_pretty(pg_total_relation_size('netflow_pkey'));
 pg_size_pretty

 287 GB

Regards,
Steve


Re: [GENERAL] Recover corrupted data

2017-04-19 Thread Moreno Andreo

  
  
On 19/04/2017 17:49, Vick Khera wrote:
> 1) restore from backup
> 2) fix whatever configuration you made to let windows (or your hardware)
> destroy your data on crash. is there some RAID cache that is not backed up
> by a battery?

IMHO If there's no hurry, it'd be better to start with point 2, because if
your filesystem (or hardware) is somehow badly broken, it will happen
again...
If you have more than one database in your cluster, you can find what's
the database that's been corrupted and restore just that one, instead of
the whole cluster.
Cheers
Moreno

> On Wed, Apr 19, 2017 at 10:18 AM, Alexandre  wrote:
>
>> Hello,
>>
>> The computer had a unexpected shutdown, it is a Windows machine.
>> Now some data appears to be corrupted, I am receiving exceptions like
>> this:
>>
>> ERROR: could not read block 0 in file "base/16393/16485": read only
>> 0 of 8192 bytes
>>
>> There is some way to correct this?


Re: [GENERAL] Why so long?

2017-04-19 Thread Jeff Janes
On Wed, Apr 19, 2017 at 8:24 AM, Steve Clark 
wrote:

> Hello,
>
> I am confused. I have a table that has an incrementing primary key id.
>
> When I select max(id) from table is returns almost instantly but
> when I select min(id) from table it takes longer than I want to wait.
>
> Shouldn't postgresql be able to quickly find the minimum id value in the
> index?
>

Not if the low end of the index is stuffed full of obsolete entries, which
haven't been cleaned up because it is not being vacuumed often enough.

Do you have autovacuum on?  Have you manually vacuumed the table recently?
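
Something like this would tell you, and clean up the low end of the index
if it turns out the table hasn't been vacuumed (a sketch, using the netflow
table from the rest of the thread):

  SELECT relname, last_vacuum, last_autovacuum, n_dead_tup
    FROM pg_stat_user_tables
   WHERE relname = 'netflow';

  VACUUM VERBOSE netflow;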

Cheers,

Jeff


Re: [GENERAL] Recover corrupted data

2017-04-19 Thread Vick Khera
1) restore from backup
2) fix whatever configuration you made to let windows (or your hardware)
destroy your data on crash. is there some RAID cache that is not backed up
by a battery?


On Wed, Apr 19, 2017 at 10:18 AM, Alexandre  wrote:

> Hello,
>
> The computer had a unexpected shutdown, it is a Windows machine.
> Now some data appears to be corrupted, I am receiving exceptions like this:
>
> ERROR: could not read block 0 in file "base/16393/16485": read only
> 0 of 8192 bytes
>
> There is some way to correct this?
>


Re: [GENERAL] Why so long?

2017-04-19 Thread Steve Clark
Should add: this is version 9.4.10 of PostgreSQL.

On 04/19/2017 11:24 AM, Steve Clark wrote:
> Hello,
>
> I am confused. I have a table that has an incrementing primary key id.
>
> When I select max(id) from table is returns almost instantly but
> when I select min(id) from table it takes longer than I want to wait.
>
> Shouldn't postgresql be able to quickly find the minimum id value in the 
> index?
>
>
> pmacct=# explain select max(id) from netflow;
> QUERY PLAN
>
> --
>  Result  (cost=1.13..1.14 rows=1 width=0)
>InitPlan 1 (returns $0)
>  ->  Limit  (cost=0.71..1.13 rows=1 width=8)
>->  Index Only Scan Backward using netflow_pkey on netflow  
> (cost=0.71..3799108784.10 rows=9123246080 width=8)
>  Index Cond: (id IS NOT NULL)
> (5 rows)
>
> pmacct=# explain select min(id) from netflow;
>QUERY PLAN 
>   
> -
>  Result  (cost=1.13..1.14 rows=1 width=0)
>InitPlan 1 (returns $0)
>  ->  Limit  (cost=0.71..1.13 rows=1 width=8)
>->  Index Only Scan using netflow_pkey on netflow  
> (cost=0.71..3799108784.10 rows=9123246080 width=8)
>  Index Cond: (id IS NOT NULL)
> (5 rows)
>
> pmacct=# \timing
> Timing is on.
> pmacct=# select max(id) from netflow;
>  max
> -
>  17547256873
> (1 row)
>
> Time: 0.626 ms
>
>
>
> pmacct=# select min(id) from netflow;
> ^CCancel request sent
> ERROR:  canceling statement due to user request
> Time: 339114.334 ms
>
>
>Table "public.netflow"
>  Column |Type |  
> Modifiers 
> +-+-
>  id | bigint  | not null default 
> nextval('netflow_id_seq'::regclass)
>  agent_id   | bigint  |
>  bytes  | bigint  |
>  stamp_inserted | timestamp without time zone | not null default '0001-01-01 
> 00:00:00'::timestamp without time zone
>  stamp_updated  | timestamp without time zone |
>  packets| integer | default 0
>  port_src   | integer | default 0
>  port_dst   | integer | default 0
>  ip_proto   | smallint| default 0
>  tos| smallint| default 0
>  ip_src | inet| not null default 
> '0.0.0.0'::inet
>  ip_dst | inet| not null default 
> '0.0.0.0'::inet
> Indexes:
> "netflow_pkey" PRIMARY KEY, btree (id)
> "netflow_ts_key" btree (stamp_inserted)
> "netflow_tsu_idx" btree (stamp_updated)
> Triggers:
> netflow_import AFTER INSERT OR UPDATE ON netflow FOR EACH STATEMENT 
> EXECUTE PROCEDURE netflow_update()
>
>
>
> -- 
>


-- 
Stephen Clark
*NetWolves Managed Services, LLC.*
Director of Technology
Phone: 813-579-3200
Fax: 813-882-0209
Email: steve.cl...@netwolves.com
http://www.netwolves.com


[GENERAL] Why so long?

2017-04-19 Thread Steve Clark
Hello,

I am confused. I have a table that has an incrementing primary key id.

When I select max(id) from table is returns almost instantly but
when I select min(id) from table it takes longer than I want to wait.

Shouldn't postgresql be able to quickly find the minimum id value in the index?


pmacct=# explain select max(id) from netflow;
                                                        QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
 Result  (cost=1.13..1.14 rows=1 width=0)
   InitPlan 1 (returns $0)
     ->  Limit  (cost=0.71..1.13 rows=1 width=8)
           ->  Index Only Scan Backward using netflow_pkey on netflow  (cost=0.71..3799108784.10 rows=9123246080 width=8)
                 Index Cond: (id IS NOT NULL)
(5 rows)

pmacct=# explain select min(id) from netflow;
                                                    QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
 Result  (cost=1.13..1.14 rows=1 width=0)
   InitPlan 1 (returns $0)
     ->  Limit  (cost=0.71..1.13 rows=1 width=8)
           ->  Index Only Scan using netflow_pkey on netflow  (cost=0.71..3799108784.10 rows=9123246080 width=8)
                 Index Cond: (id IS NOT NULL)
(5 rows)

pmacct=# \timing
Timing is on.
pmacct=# select max(id) from netflow;
 max
-
 17547256873
(1 row)

Time: 0.626 ms



pmacct=# select min(id) from netflow;
^CCancel request sent
ERROR:  canceling statement due to user request
Time: 339114.334 ms


   Table "public.netflow"
     Column     |            Type             |                              Modifiers
----------------+-----------------------------+----------------------------------------------------------------------
 id             | bigint                      | not null default nextval('netflow_id_seq'::regclass)
 agent_id       | bigint                      |
 bytes          | bigint                      |
 stamp_inserted | timestamp without time zone | not null default '0001-01-01 00:00:00'::timestamp without time zone
 stamp_updated  | timestamp without time zone |
 packets        | integer                     | default 0
 port_src       | integer                     | default 0
 port_dst       | integer                     | default 0
 ip_proto       | smallint                    | default 0
 tos            | smallint                    | default 0
 ip_src         | inet                        | not null default '0.0.0.0'::inet
 ip_dst         | inet                        | not null default '0.0.0.0'::inet
Indexes:
    "netflow_pkey" PRIMARY KEY, btree (id)
    "netflow_ts_key" btree (stamp_inserted)
    "netflow_tsu_idx" btree (stamp_updated)
Triggers:
    netflow_import AFTER INSERT OR UPDATE ON netflow FOR EACH STATEMENT EXECUTE
    PROCEDURE netflow_update()



-- 



Re: [GENERAL] Large data and slow queries

2017-04-19 Thread Alban Hertroys

> On 19 Apr 2017, at 12:58, Samuel Williams  
> wrote:
> 
> It's interesting you talk about using multiple indexes. In
> MySQL/MariaDB and derivatives, I've never seen the query planner
> consider using multiple indexes. So, it's possible that Postgres may
> use multiple indexes if it saves time? Or do you mean, doing something
> like manually joining the data and leveraging the different indexes
> explicitly?

PG is capable of doing bitmap heap scans to combine results from multiple 
indices, among other things.
Whether that will actually improve performance in this case I don't know, but 
it's worth a try I think.
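
A sketch of the "multiple indices" idea for this particular query (index 
names arbitrary); with both in place the planner can combine them with a 
BitmapAnd:

CREATE INDEX user_event_created_at_idx ON user_event (created_at);
CREATE INDEX user_event_lat_lon_idx ON user_event (latitude, longitude);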

> The correlation between user_id and location... well, it's somewhat
> temporally related.

So users are constantly moving around but happen to be at the same locations at 
regular intervals?

In my experience, people don't usually move around much, so you should 
certainly be able to pinpoint them mostly to a specific area, right? (Hence my 
suggestions for a country column or partitioning in squares)


> On 19 April 2017 at 22:50, Alban Hertroys  wrote:
>> 
>>> On 19 Apr 2017, at 6:01, Samuel Williams  
>>> wrote:
>>> 
>>> Hi.
>>> 
>>> We have 400,000,000 records in a table (soon to be 800,000,000), here
>>> is the schema (\d+)
>>> 
>>> https://gist.github.com/ioquatix/bddda36d9e4ffaceb7a62d7b62259121
>>> 
>>> We want the following kinds of query to be fast:
>>> 
>>> SELECT DISTINCT "user_event"."user_id" FROM "user_event" WHERE
>>> "user_event"."what" IN ('poll', 'location_change',
>>> 'midnight_location') AND ("user_event"."created_at" >= '2016-04-19
>>> 01:23:55') AND (latitude > -37.03079375089291 AND latitude <
>>> -36.67086424910709 AND longitude > 174.6307139779924 AND longitude <
>>> 175.0805140220076);
>>> 
>>> We have a btree index and it appears to be working. However, it's
>>> still pretty slow.
>>> 
>>> EXPLAIN ANALYZE gives the following:
>>> https://gist.github.com/ioquatix/bddda36d9e4ffaceb7a62d7b62259121#gistcomment-2065314
>>> 
>>> I'm thinking that I need to do the following to help:
>>> 
>>> CLUSTER user_event ON index_user_event_for_visits_3 followed by
>>> analyze... Our data is mostly time series but sometimes we get some
>>> dumps with historical records.
>> 
>> It seems to me that a large part of the problem is that the server has to 
>> scan all index entries from that date and within those location bounds to 
>> find that the distinct user id's in that set are about 114,000 out of 1.7M 
>> rows matching the selection-criteria. If it could stop at the first location 
>> for each user, it would have to scan less than a 10th of the index entries 
>> that it has to scan now...
>> 
>> How high is the correlation between user id and location? That can probably 
>> be leveraged…
>> Perhaps you'd get better performance if you'd use multiple indices instead 
>> of stuffing everything into a single purpose-specific one? I would suggest 
>> one on (user_id, latitude, longitude) or (latitude, longitude, user_id) and 
>> one on (created_at, user_id), or perhaps (created_at, latitude, longitude). 
>> That approach could also reduce the number of indices you have on that 
>> table, as well as their sizes, making it all fit into cache a little easier. 
>> Then again, additional operations will be required to combine them.
>> 
>> For a different approach; It may be possible to enrich your data with 
>> something that is easy to index and query, with a high correlation to 
>> (latitude, longitude). That could also be used to partition over. Country 
>> seems a good candidate here, unless all your data-points are in New Zealand 
>> like the above?
>> Then again, some countries are a lot larger, with a higher population, than 
>> others. And populations can be highly concentrated (Delhi, Moscow to name a 
>> few).
>> Another option is to divide the location space up into squares of a fixed 
>> size, with a partition for each square. About 80% of those squares are 
>> unpopulated though, being at sea.
>> 
>> Created_at is a very good candidate for partitioning too, especially if you 
>> don't intend to keep data older than a certain age. Truncating or dropping a 
>> partition that you no longer need is quick and easy.
>> 
>> With data-sets this large, I'd think you would want to partition on multiple 
>> dimensions, creating a matrix of partitions under a single master table. I 
>> don't think PG has a simple way of doing that (yet) though; perhaps it's 
>> possible by abusing multiple levels of inheritance, but that sounds like a 
>> bad idea.
>> 
>> And of course, create your partitions sufficiently course to prevent 
>> overburdening the system tables, which would slow down the query planner.
>> 
>> Hopefully there's something useful in my ramblings!
>> 
>> Alban Hertroys
>> --
>> If you can't see the forest for the trees,
>> cut the trees and you'll find there is no forest.
>> 

Re: [GENERAL] Large data and slow queries

2017-04-19 Thread Samuel Williams
Ah right, yeah, it's insert only. So, it's never been vacuumed.

On 20 April 2017 at 01:25, Stephen Frost  wrote:
> Greetings,
>
> * Samuel Williams (space.ship.travel...@gmail.com) wrote:
>> Thanks for all the suggestions Stephen.
>>
>> > That explain analyze shows a whole ton of heap fetches.  When was the
>> last time a VACUUM was run on this table, to build the visibility map?
>> Without the visibility map being current, an Index-Only Scan, as is
>> happening here, can really suck.
>>
>> I'm using the default 9.6 config, I thought that auto-vacuum was on by 
>> default?
>
> Sure, but that doesn't mean the table gets vacuumed.  In particular,
> actual vacuums are only kicked off when the number of *updated* or
> *deleted* tuples passes the autovacuum threshold.  If no updates or
> deletes are done on the table (iow, it's essentially an insert-only or
> insert-mostly table), the autovacuum daemon nevers runs a vacuum on it
> (though it'll run analyze's).
>
> https://www.postgresql.org/docs/current/static/runtime-config-autovacuum.html#GUC-AUTOVACUUM-VACUUM-THRESHOLD
>
> Check pg_stat_all_tables to see when/if the table has actually been
> vacuumed.
>
> Thanks!
>
> Stephen


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] tuple statistics update

2017-04-19 Thread Tom Lane
Tom DalPozzo  writes:
> 2017-04-18 22:06 GMT+02:00 Tom Lane :
>> If your session is just sitting, that's not surprising.  I think stats
>> updates are only transmitted to the collector at transaction end (and
>> even then, only if it's been at least N msec since the last transmission
>> from the current session).

> Hi, my transaction ended, that I waited 1 min and queried the stat.
> I had to wait for session end in order to see the updeted stats.
> Did you mean "...to the collector at SESSION end"?

No, I meant what I said.  No amount of just-sitting will cause a session
to do something: it either transmits stats at transaction end or it
doesn't because it thinks it's too soon since its last update.  If you
want to force out the stats, you could wait a second or so and then
do a dummy transaction (even "select 1" will do, in testing).

BTW, a quick look at the code says the threshold is 500 msec, ie,
stats are sent only if it's been at least half a second since the
session's last report.
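
A minimal illustration of that sequence (the table name is hypothetical):

    -- in the inserting session: any trivial transaction flushes pending stats
    SELECT 1;

    -- in the monitoring session, shortly afterwards:
    SELECT n_tup_ins, n_live_tup
      FROM pg_stat_user_tables
     WHERE relname = 'mytable';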

regards, tom lane


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Unable to upload backups

2017-04-19 Thread Adrian Klaver

On 04/19/2017 07:16 AM, Ron Ben wrote:

Here :)


Thanks.

See my previous response. Basically we need more information before this 
can be solved.



I think I may have found the problem.

The role defined as:

CREATE ROLE "ronb" LOGIN
  NOSUPERUSER INHERIT NOCREATEDB NOCREATEROLE NOREPLICATION;
GRANT users TO "ronb";
GRANT users2 TO "ronb";

users is a group role:

CREATE ROLE users
  SUPERUSER INHERIT CREATEDB CREATEROLE NOREPLICATION;

 users2 is a group role:
CREATE ROLE users2
  NOSUPERUSER INHERIT NOCREATEDB NOCREATEROLE NOREPLICATION;
GRANT reports TO users2 ;


I think PostgreSQL doesn't know how to handle these conflicting commands.
What does PostgreSQL do when such a conflict appears? Does it take the
last known GRANT command?

Sadly, when there is more than one role it's impossible to know which
role came first. PostgreSQL shows them alphabetically rather than by date,
so in case of overlapping instructions it's impossible to know which one
was first.


On Apr 19, 2017 17:01, Adrian Klaver wrote:

On 04/19/2017 06:49 AM, Ron Ben wrote:

Is it possible to get your email program to left justify text on
sending? I can figure out the right justified text, it just
takes me longer.

> I think I may have found the problem.
>
> The role defined as:
>
> CREATE ROLE "ronb" LOGIN
> NOSUPERUSER INHERIT NOCREATEDB NOCREATEROLE NOREPLICATION;
> GRANT users TO "ronb";
> GRANT users2 TO "ronb";
>
> users is a group role:
>
> CREATE ROLE users
> SUPERUSER INHERIT CREATEDB CREATEROLE NOREPLICATION;
>
> users2 is a group role:
> CREATE ROLE users2
> NOSUPERUSER INHERIT NOCREATEDB NOCREATEROLE NOREPLICATION;
> GRANT reports TO users2 ;

That may or may not be the problem. See:

https://www.postgresql.org/docs/9.6/static/sql-createrole.html

"The INHERIT attribute governs inheritance of grantable
privileges (that
is, access privileges for database objects and role
memberships). It
does not apply to the special role attributes set by CREATE ROLE
and
ALTER ROLE. For example, being a member of a role with CREATEDB
privilege does not immediately grant the ability to create
databases,
even if INHERIT is set; it would be necessary to become that
role via
SET ROLE before creating a database."


What you show above is part of the answer. The other parts are the
actual privileges on the objects. Also the command that created
the dump
file that you are trying to restore. Permissions/privileges
issues can
be complex and solving them requires a complete set of information.

>
>
> I think PostgreSQL doesn't know how to handle this conflicted
commands.
> What PostgreSQL does when such conflic appears? does it take
the last
> known command of grant?
>
> Sadly, when there are more than one role it's impossible to
know which
> role was first. PostgreSQL shows them alphabeticly rather than
by date
> so in case of overlaping instructions its impossible to know
which one
> was first.
>
>
On Apr 19, 2017 16:40, Adrian Klaver wrote:
>
> On 04/19/2017 03:56 AM, Ron Ben wrote:
> > Hi,
> > I'm using PostgreSQL 9.3.2
> > I'm running the command:
> >
> >
> > psql -h testserver -U ronb -f backup.sql -q -d foldertest
> 2>error.txt
> >>output.txt
>
> What was the command that created backup.sql?
>
> >
> > This should generate my database in foldertest
> >
> > However this doesn't work. It's unable to create schemas
> >
> > in the error.txt i see "permission denied for database
> foldertest".
>
> What user is the foldertest owner?
>
> In psql l will tell you this.
>
> >
> > I know this is not an access permission issue because there is
> a public
> > schema which is buildin and it does create the tables/data in
> there.
>
> Because the public schema is by default open to all:
>
> https://www.postgresql.org/docs/9.6/static/ddl-schemas.html
>
> "A user can also be allowed to create objects in someone else's
> schema.
> To allow that, the CREATE privilege on the schema needs to be
> granted.
> Note that by default, everyone has CREATE and USAGE privileges
> on the
> schema public. This allows all users that are able to connect to
> a given
> database to create objects in its public schema. ... "
>
>
> >
> > 

[GENERAL] Recover corrupted data

2017-04-19 Thread Alexandre
Hello,

The computer had a unexpected shutdown, it is a Windows machine.
Now some data appears to be corrupted, I am receiving exceptions like this:

ERROR: could not read block 0 in file "base/16393/16485": read only
0 of 8192 bytes

There is some way to correct this?


Re: [GENERAL] Unable to upload backups

2017-04-19 Thread Ron Ben


Here :)
I think I may have found the problem.
 
The role defined as:
 
CREATE ROLE "ronb" LOGIN  NOSUPERUSER INHERIT NOCREATEDB NOCREATEROLE NOREPLICATION;GRANT users TO "ronb";
GRANT users2 TO "ronb";
 
users is a group role:
 
CREATE ROLE users
  SUPERUSER INHERIT CREATEDB CREATEROLE NOREPLICATION;
 
 users2 is a group role:
CREATE ROLE users2
  NOSUPERUSER INHERIT NOCREATEDB NOCREATEROLE NOREPLICATION;
GRANT reports TO users2 ;
 
 
I think PostgreSQL doesn't know how to handle these conflicting commands.
What does PostgreSQL do when such a conflict appears? Does it take the last
known GRANT command?

Sadly, when there is more than one role it's impossible to know which role
came first. PostgreSQL shows them alphabetically rather than by date, so in
case of overlapping instructions it's impossible to know which one was first.

On Apr 19, 2017 17:01, Adrian Klaver wrote:
> [Adrian Klaver's reply, quoted in full; it is reproduced as the next message
> in this thread]

Re: [GENERAL] Unable to upload backups

2017-04-19 Thread Adrian Klaver

On 04/19/2017 06:49 AM, Ron Ben wrote:

Is it possible to get your email program to left justify text on 
sending? I can figure out the right justified text, it just takes me longer.



I think I may have found the problem.

The role defined as:

CREATE ROLE "ronb" LOGIN
  NOSUPERUSER INHERIT NOCREATEDB NOCREATEROLE NOREPLICATION;
GRANT users TO "ronb";
GRANT users2 TO "ronb";

users is a group role:

CREATE ROLE users
  SUPERUSER INHERIT CREATEDB CREATEROLE NOREPLICATION;

 users2 is a group role:
CREATE ROLE users2
  NOSUPERUSER INHERIT NOCREATEDB NOCREATEROLE NOREPLICATION;
GRANT reports TO users2 ;


That may or may not be the problem. See:

https://www.postgresql.org/docs/9.6/static/sql-createrole.html

"The INHERIT attribute governs inheritance of grantable privileges (that 
is, access privileges for database objects and role memberships). It 
does not apply to the special role attributes set by CREATE ROLE and 
ALTER ROLE. For example, being a member of a role with CREATEDB 
privilege does not immediately grant the ability to create databases, 
even if INHERIT is set; it would be necessary to become that role via 
SET ROLE before creating a database."
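
Put concretely, for the roles shown above that doc excerpt means something
like the following (a sketch only; the schema name is made up):

    -- role attributes such as SUPERUSER/CREATEDB are not inherited,
    -- so switch into the group role that has them first
    SET ROLE users;
    CREATE SCHEMA myschema;
    RESET ROLE;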



What you show above is part of the answer. The other parts are the 
actual privileges on the objects. Also the command that created the dump 
file that you are trying to restore. Permissions/privileges issues can 
be complex and solving them requires a complete set of information.





I think PostgreSQL doesn't know how to handle these conflicting commands.
What does PostgreSQL do when such a conflict appears? Does it take the
last known GRANT command?

Sadly, when there is more than one role it's impossible to know which
role came first. PostgreSQL shows them alphabetically rather than by date,
so in case of overlapping instructions it's impossible to know which one
was first.


On Apr 19, 2017 16:40, Adrian Klaver wrote:

On 04/19/2017 03:56 AM, Ron Ben wrote:
> Hi,
> I'm using PostgreSQL 9.3.2
> I'm running the command:
>
>
> psql -h testserver -U ronb -f backup.sql -q -d foldertest
2>error.txt
>>output.txt

What was the command that created backup.sql?

>
> This should generate my database in foldertest
>
> However this doesn't work. It's unable to create schemas
>
> in the error.txt i see "permission denied for database
foldertest".

What user is the foldertest owner?

In psql l will tell you this.

>
> I know this is not an access permission issue because there is
a public
> schema which is buildin and it does create the tables/data in
there.

Because the public schema is by default open to all:

https://www.postgresql.org/docs/9.6/static/ddl-schemas.html

"A user can also be allowed to create objects in someone else's
schema.
To allow that, the CREATE privilege on the schema needs to be
granted.
Note that by default, everyone has CREATE and USAGE privileges
on the
schema public. This allows all users that are able to connect to
a given
database to create objects in its public schema. ... "


>
> It just cant create new schemas.

In psql do dn+, that will show schema owners and who else has
privileges.

For what the different privileges are and how they are
represented in
the above output see:

https://www.postgresql.org/docs/9.6/static/sql-grant.html

>
>
>
> The intresting thing is that if I do:
>
> psql -h testserver -U postgres -f backup.sql -q -d foldertest
> 2>error.txt >output.txt
>
>
>
> Everything works. It create all schemas and generate the
database correctly.

Because the postgres user is a superuser and can do anything.

>
> I don't see any diffrent in the hba.conf between postgres and
ronb users.

That is not the issue. pg_hba determines who can connect, what
you are
seeing is the Postgres privilege system determining what a user
can do
once they are connected. If it had been a pg_hba rejection you
would
have seen something like:

aklaver@tito:~> psql -d production -U guest -h localhost
psql: FATAL: no pg_hba.conf entry for host "::1", user "guest",
database "production", SSL on
FATAL: no pg_hba.conf entry for host "::1", user "guest", database
"production", SSL off


To get an overview of what users there are in your database
cluster in
psql do du


>
> What can be the problem?
>


--
Adrian Klaver
adrian.kla...@aklaver.com


--
Sent via pgsql-general mailing list 

Re: [GENERAL] tuple statistics update

2017-04-19 Thread Adrian Klaver

On 04/19/2017 12:28 AM, Tom DalPozzo wrote:

2017-04-18 21:42 GMT+02:00 Adrian Klaver >:

On 04/17/2017 09:18 AM, Tom DalPozzo wrote:

Hi, I'm using libpq to insert tuples in my table and keep looking at
statistics through psql instead.
I noticed that sometimes n_tuple_ins is not updated even after 1 min
that my transaction committed.
My libpq connection is kept alive. If I close the connection
then the
stats get updated.
I know that stats are not instantaneous, but I thought that after a
while that a transaction is committed it would be updated.


Any of this apply?:

https://www.postgresql.org/docs/9.6/static/monitoring-stats.html


"Another important point is that when a server process is asked to
display any of these statistics, it first fetches the most recent
report emitted by the collector process and then continues to use
this snapshot for all statistical views and functions until the end
of its current transaction. So the statistics will show static
information as long as you continue the current transaction.
Similarly, information about the current queries of all sessions is
collected when any such information is first requested within a
transaction, and the same information will be displayed throughout
the transaction. This is a feature, not a bug, because it allows you
to perform several queries on the statistics and correlate the
results without worrying that the numbers are changing underneath
you. But if you want to see new results with each query, be sure to
do the queries outside any transaction block. Alternatively, you can
invoke pg_stat_clear_snapshot(), which will discard the current
transaction's statistics snapshot (if any). The next use of
statistical information will cause a new snapshot to be fetched."

Regards
Pupillo


I read it; it seems to say that N milliseconds after my transaction ends,
the stats should be current. I also tried pg_stat_clear_snapshot() with no
success.


You have two sessions in play, one that is inserting rows, the other in 
psql looking at the stats. It is not clear to me which session you are 
referring to in the above. So maybe an outline of what you are doing. 
Something like:


Session 1 Monitor stats table(?) using command(?)

Session2 Insert rows. The INSERT query

etc

Regards
Pupillo







--
Adrian Klaver
adrian.kla...@aklaver.com 





--
Adrian Klaver
adrian.kla...@aklaver.com


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Unable to upload backups

2017-04-19 Thread Ron Ben


I think I may have found the problem.
 
The role defined as:
 
CREATE ROLE "ronb" LOGIN  NOSUPERUSER INHERIT NOCREATEDB NOCREATEROLE NOREPLICATION;GRANT users TO "ronb";
GRANT users2 TO "ronb";
 
users is a group role:
 
CREATE ROLE users
  SUPERUSER INHERIT CREATEDB CREATEROLE NOREPLICATION;
 
 users2 is a group role:
CREATE ROLE users2
  NOSUPERUSER INHERIT NOCREATEDB NOCREATEROLE NOREPLICATION;
GRANT reports TO users2 ;
 
 
I think PostgreSQL doesn't know how to handle these conflicting commands.
What does PostgreSQL do when such a conflict appears? Does it take the last
known GRANT command?

Sadly, when there is more than one role it's impossible to know which role
came first. PostgreSQL shows them alphabetically rather than by date, so in
case of overlapping instructions it's impossible to know which one was first.

On Apr 19, 2017 16:40, Adrian Klaver wrote:
> [Adrian Klaver's reply, quoted in full; it is reproduced as the next message
> in this thread]

Re: [GENERAL] Unable to upload backups

2017-04-19 Thread Adrian Klaver

On 04/19/2017 03:56 AM, Ron Ben wrote:

Hi,
I'm using PostgreSQL 9.3.2
I'm running the command:


psql -h testserver -U ronb -f backup.sql  -q -d foldertest  2>error.txt

output.txt


What was the command that created backup.sql?



This should generate my database in foldertest

However this doesn't work. It's unable to create schemas

in the error.txt i see "permission denied for database foldertest".


What user is the foldertest owner?

In psql \l will tell you this.



I know this is not an access permission issue because there is a public
schema which is buildin and it does create the tables/data in there.


Because the public schema is by default open to all:

https://www.postgresql.org/docs/9.6/static/ddl-schemas.html

"A user can also be allowed to create objects in someone else's schema. 
To allow that, the CREATE privilege on the schema needs to be granted. 
Note that by default, everyone has CREATE and USAGE privileges on the 
schema public. This allows all users that are able to connect to a given 
database to create objects in its public schema.  ... "
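
For reference, the two kinds of grant that excerpt and the error message
point at look like this (a sketch; whether either applies here depends on
the ownership and dump details discussed in this thread):

    -- creating objects in an existing schema requires CREATE on that schema
    GRANT CREATE ON SCHEMA someschema TO ronb;     -- "someschema" is hypothetical
    -- creating new schemas requires CREATE on the database itself
    GRANT CREATE ON DATABASE foldertest TO ronb;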





It just cant create new schemas.


In psql do \dn+, that will show schema owners and who else has privileges.

For what the different privileges are and how they are represented in 
the above output see:


https://www.postgresql.org/docs/9.6/static/sql-grant.html





The intresting thing is that if I do:

psql -h testserver -U postgres -f backup.sql  -q -d foldertest
2>error.txt >output.txt



Everything works. It create all schemas and generate the database correctly.


Because the postgres user is a superuser and can do anything.



I don't see any diffrent in the hba.conf between postgres and ronb users.


That is not the issue. pg_hba determines who can connect, what you are 
seeing is the Postgres privilege system determining what a user can do 
once they are connected. If it had been a pg_hba rejection you would 
have seen something like:


aklaver@tito:~> psql -d production -U guest -h localhost
psql: FATAL:  no pg_hba.conf entry for host "::1", user "guest", 
database "production", SSL on
FATAL:  no pg_hba.conf entry for host "::1", user "guest", database 
"production", SSL off



To get an overview of what users there are in your database cluster in 
psql do \du





What can be the problem?




--
Adrian Klaver
adrian.kla...@aklaver.com


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Large data and slow queries

2017-04-19 Thread Stephen Frost
Greetings,

* Samuel Williams (space.ship.travel...@gmail.com) wrote:
> Thanks for all the suggestions Stephen.
> 
> > That explain analyze shows a whole ton of heap fetches.  When was the
> last time a VACUUM was run on this table, to build the visibility map?
> Without the visibility map being current, an Index-Only Scan, as is
> happening here, can really suck.
> 
> I'm using the default 9.6 config, I thought that auto-vacuum was on by 
> default?

Sure, but that doesn't mean the table gets vacuumed.  In particular,
actual vacuums are only kicked off when the number of *updated* or
*deleted* tuples passes the autovacuum threshold.  If no updates or
deletes are done on the table (iow, it's essentially an insert-only or
insert-mostly table), the autovacuum daemon never runs a vacuum on it
(though it'll still run ANALYZEs).

https://www.postgresql.org/docs/current/static/runtime-config-autovacuum.html#GUC-AUTOVACUUM-VACUUM-THRESHOLD

Check pg_stat_all_tables to see when/if the table has actually been
vacuumed.
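
For example (a sketch; the table name is taken from this thread):

    SELECT relname, last_vacuum, last_autovacuum, last_analyze, last_autoanalyze
      FROM pg_stat_all_tables
     WHERE relname = 'user_event';

    -- for an insert-mostly table, an occasional manual VACUUM keeps the
    -- visibility map current:
    VACUUM user_event;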

Thanks!

Stephen




Re: [GENERAL] Large data and slow queries

2017-04-19 Thread Samuel Williams
Thanks for all the suggestions Stephen.

> That explain analyze shows a whole ton of heap fetches.  When was the
last time a VACUUM was run on this table, to build the visibility map?
Without the visibility map being current, an Index-Only Scan, as is
happening here, can really suck.

I'm using the default 9.6 config, I thought that auto-vacuum was on by default?



On 20 April 2017 at 00:48, Stephen Frost  wrote:
> Greetings,
>
> * Samuel Williams (space.ship.travel...@gmail.com) wrote:
>> We want the following kinds of query to be fast:
>
> "kinds of query" isn't helpful, you should be reviewing exactly the
> queries you care about because statistics and your exact data set and
> what the exact query you're running is will all have an impact.
>
>> SELECT DISTINCT "user_event"."user_id" FROM "user_event" WHERE
>> "user_event"."what" IN ('poll', 'location_change',
>> 'midnight_location') AND ("user_event"."created_at" >= '2016-04-19
>> 01:23:55') AND (latitude > -37.03079375089291 AND latitude <
>> -36.67086424910709 AND longitude > 174.6307139779924 AND longitude <
>> 175.0805140220076);
>
> Ugh.  You should really look at and consider PostGIS whenever you're
> working with geospatial data.
>
>> We have a btree index and it appears to be working. However, it's
>> still pretty slow.
>
> You're finding some 17M records and then reducing that with DISTINCT to
> only 114k.  Hard to say if it's faster to just brute-force your way
> through that with a HashAgg (as your explain analyze shows), or if a
> loose index scan would work better (check out how to do one in PG here:
> https://wiki.postgresql.org/wiki/Loose_indexscan).
>
>> EXPLAIN ANALYZE gives the following:
>> https://gist.github.com/ioquatix/bddda36d9e4ffaceb7a62d7b62259121#gistcomment-2065314
>
> That explain analyze shows a whole ton of heap fetches.  When was the
> last time a VACUUM was run on this table, to build the visibility map?
> Without the visibility map being current, an Index-Only Scan, as is
> happening here, can really suck.
>
>> CLUSTER user_event ON index_user_event_for_visits_3 followed by
>> analyze... Our data is mostly time series but sometimes we get some
>> dumps with historical records.
>>
>> Perhaps add a BRIN index on created_at
>
> BRIN would be helpful if you wanted a smaller index.  That doesn't seem
> to be the case here.
>
>> I'm wondering if... we can use an index to cache, all user_ids seen on
>> a given day. If we change our index to be more discrete, e.g.
>> created_at::date, would this help? The set union of user_ids for 365
>> days should be pretty fast?
>
> Materialized views are often useful, particularly when the results are
> (relatively) slow moving.
>
>> I'm open to any ideas or suggestions, ideally we can keep
>> optimisations within the database, rather than adding a layer of
>> caching on top.
>
> If you need to go the materialized view route, I'd definitely recommend
> doing that in the database rather than outside it.
>
> Thanks!
>
> Stephen


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Large data and slow queries

2017-04-19 Thread Stephen Frost
Greetings,

* Samuel Williams (space.ship.travel...@gmail.com) wrote:
> We want the following kinds of query to be fast:

"kinds of query" isn't helpful, you should be reviewing exactly the
queries you care about because statistics and your exact data set and
what the exact query you're running is will all have an impact.

> SELECT DISTINCT "user_event"."user_id" FROM "user_event" WHERE
> "user_event"."what" IN ('poll', 'location_change',
> 'midnight_location') AND ("user_event"."created_at" >= '2016-04-19
> 01:23:55') AND (latitude > -37.03079375089291 AND latitude <
> -36.67086424910709 AND longitude > 174.6307139779924 AND longitude <
> 175.0805140220076);

Ugh.  You should really look at and consider PostGIS whenever you're
working with geospatial data.

> We have a btree index and it appears to be working. However, it's
> still pretty slow.

You're finding some 17M records and then reducing that with DISTINCT to
only 114k.  Hard to say if it's faster to just brute-force your way
through that with a HashAgg (as your explain analyze shows), or if a
loose index scan would work better (check out how to do one in PG here:
https://wiki.postgresql.org/wiki/Loose_indexscan).
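
For reference, a minimal sketch of the recursive-CTE technique from that wiki
page, applied to user_id only (the WHERE clauses from the original query are
left out for brevity, so this is not a drop-in replacement):

    WITH RECURSIVE t AS (
        (SELECT user_id FROM user_event ORDER BY user_id LIMIT 1)
        UNION ALL
        SELECT (SELECT u.user_id
                  FROM user_event u
                 WHERE u.user_id > t.user_id
                 ORDER BY u.user_id
                 LIMIT 1)
          FROM t
         WHERE t.user_id IS NOT NULL
    )
    SELECT user_id FROM t WHERE user_id IS NOT NULL;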

> EXPLAIN ANALYZE gives the following:
> https://gist.github.com/ioquatix/bddda36d9e4ffaceb7a62d7b62259121#gistcomment-2065314

That explain analyze shows a whole ton of heap fetches.  When was the
last time a VACUUM was run on this table, to build the visibility map?
Without the visibility map being current, an Index-Only Scan, as is
happening here, can really suck.

> CLUSTER user_event ON index_user_event_for_visits_3 followed by
> analyze... Our data is mostly time series but sometimes we get some
> dumps with historical records.
> 
> Perhaps add a BRIN index on created_at

BRIN would be helpful if you wanted a smaller index.  That doesn't seem
to be the case here.
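
(For reference, such an index would be as simple as the following; it mainly
pays off when index size matters more than lookup speed.)

    CREATE INDEX user_event_created_at_brin
        ON user_event USING brin (created_at);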

> I'm wondering if... we can use an index to cache, all user_ids seen on
> a given day. If we change our index to be more discrete, e.g.
> created_at::date, would this help? The set union of user_ids for 365
> days should be pretty fast?

Materialized views are often useful, particularly when the results are
(relatively) slow moving.

> I'm open to any ideas or suggestions, ideally we can keep
> optimisations within the database, rather than adding a layer of
> caching on top.

If you need to go the materialized view route, I'd definitely recommend
doing that in the database rather than outside it.
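
A rough sketch of what such a materialized view could look like for this
workload (names are made up, and rolling created_at up to a date changes the
time resolution of the filter):

    CREATE MATERIALIZED VIEW daily_user_positions AS
    SELECT DISTINCT created_at::date AS day, user_id, latitude, longitude
      FROM user_event
     WHERE what IN ('poll', 'location_change', 'midnight_location');

    CREATE INDEX ON daily_user_positions (day, latitude, longitude);

    -- refresh periodically, e.g. from cron:
    REFRESH MATERIALIZED VIEW daily_user_positions;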

Thanks!

Stephen




Re: [GENERAL] potential extension of psql's \df+ ?

2017-04-19 Thread Pavel Stehule
2017-04-19 12:14 GMT+02:00 Karsten Hilbert :

> On Wed, Apr 19, 2017 at 12:00:04PM +0200, Pavel Stehule wrote:
>
> > > Hence I wonder whether an approach along these lines:
> > >
> > > select
> > > row_number() over ()
> > > || src_line
> ...
> > > ) as func_src;
> > >
> > > would be a worthwhile change to the query "\df+" uses ?
> > >
> > >
> > do you know \sf+ fce command?
>
> Sorry for the noise :-)
>

no problem :)

Pavel


>
> Thanks,
> Karsten
> --
> GPG key ID E4071346 @ eu.pool.sks-keyservers.net
> E167 67FD A291 2BEA 73BD  4537 78B9 A9F9 E407 1346
>
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general
>


Re: [GENERAL] Large data and slow queries

2017-04-19 Thread Martijn Tonies (Upscene Productions)

Hello Vinny, Samuel,



Perhaps I'm missing something, but I'd be interested in the reasoning
behind this.

For column 'what', it seems you have no index on all values, only
indices with specific values for 'what'.

How does this speed up the search? Will PostgreSQL use those indices,
instead of using a generic index on 'what' and optionally other
columns?




That's a "partial index", it only contains records that meet the 
requirements of the index definition.


https://www.postgresql.org/docs/9.5/static/indexes-partial.html

Basically; if you create an index on records where 'name = kees' then if 
your query contains "where name=kees"
the planner can just load that index and know that the records in that 
index will not contain

any other names, saving the need to filter for 'name=kees'


Martijn that is a good question. It's because we are only concerned
with a subset of events for this index and this particular query. The
query planner can recognise this and use the index correctly. By doing
this, we reduce the size of the index significantly. In the best case,
where we only wanted a few things, the index was reduced from 21GB to
8MB.



Thanks for the answers. I've seen such indices before, but only used them,
for example, on an UPPER(...) of all values; I never thought of using them
to filter out specific values in order to make the index smaller.


With regards,

Martijn Tonies
Upscene Productions
http://www.upscene.com

Database Workbench - developer tool for Oracle, MS SQL Server, PostgreSQL,
SQL Anywhere, MySQL, InterBase, NexusDB and Firebird. 




--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Large data and slow queries

2017-04-19 Thread vinny

On 2017-04-19 13:25, Martijn Tonies (Upscene Productions) wrote:

Samuel, others,

Perhaps I'm missing something, but I'd be interested in the reasoning
behind this.

For column 'what', it seems you have no index on all values, only
indices with specific values for 'what'.

How does this speed up the search? Will PostgreSQL use those indices,
instead of using a generic index on 'what' and optionally other
columns?


With regards,

Martijn Tonies
Upscene Productions
http://www.upscene.com



That's a "partial index", it only contains records that meet the 
requirements of the index definition.


https://www.postgresql.org/docs/9.5/static/indexes-partial.html

Basically: if you create an index on records where 'name = kees', then if
your query contains "where name=kees" the planner can just load that index
and know that the records in that index will not contain any other names,
saving the need to filter for 'name=kees'.
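
Applied to the table being discussed in this thread, such a partial index
might look like this (the column list is an assumption based on the query
shown earlier):

    CREATE INDEX index_user_event_visits_partial
        ON user_event (created_at, latitude, longitude, user_id)
        WHERE what IN ('poll', 'location_change', 'midnight_location');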


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Large data and slow queries

2017-04-19 Thread Samuel Williams
Martijn that is a good question. It's because we are only concerned
with a subset of events for this index and this particular query. The
query planner can recognise this and use the index correctly. By doing
this, we reduce the size of the index significantly. In the best case,
where we only wanted a few things, the index was reduced from 21GB to
8MB.


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Large data and slow queries

2017-04-19 Thread Martijn Tonies (Upscene Productions)

Samuel, others,

Perhaps I'm missing something, but I'd be interested in the reasoning behind 
this.


For column 'what', it seems you have no index on all values, only indices 
with specific values for 'what'.


How does this speed up the search? Will PostgreSQL use those indices, 
instead of using a generic index on 'what' and optionally other columns?



With regards,

Martijn Tonies
Upscene Productions
http://www.upscene.com


-Original Message- 
From: Samuel Williams

Sent: Wednesday, April 19, 2017 6:01 AM
To: pgsql-general
Subject: [GENERAL] Large data and slow queries

Hi.

We have 400,000,000 records in a table (soon to be 800,000,000), here
is the schema (\d+)

https://gist.github.com/ioquatix/bddda36d9e4ffaceb7a62d7b62259121

We want the following kinds of query to be fast:

SELECT DISTINCT "user_event"."user_id" FROM "user_event" WHERE
"user_event"."what" IN ('poll', 'location_change',
'midnight_location') AND ("user_event"."created_at" >= '2016-04-19
01:23:55') AND (latitude > -37.03079375089291 AND latitude <
-36.67086424910709 AND longitude > 174.6307139779924 AND longitude <
175.0805140220076);

We have a btree index and it appears to be working. However, it's
still pretty slow.

EXPLAIN ANALYZE gives the following:
https://gist.github.com/ioquatix/bddda36d9e4ffaceb7a62d7b62259121#gistcomment-2065314

I'm thinking that I need to do the following to help:

CLUSTER user_event ON index_user_event_for_visits_3 followed by
analyze... Our data is mostly time series but sometimes we get some
dumps with historical records.

Perhaps add a BRIN index on created_at

I'm wondering if... we can use an index to cache, all user_ids seen on
a given day. If we change our index to be more discrete, e.g.
created_at::date, would this help? The set union of user_ids for 365
days should be pretty fast?

I'm open to any ideas or suggestions, ideally we can keep
optimisations within the database, rather than adding a layer of
caching on top.

Kind regards,
Samuel


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general 




--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Large data and slow queries

2017-04-19 Thread vinny

On 2017-04-19 09:48, John R Pierce wrote:

On 4/19/2017 12:31 AM, vinny wrote:
Given the number of records, my first thought was either partitioning 
or partial-indexes.

The fewer rows are in the index, the quicker it will be to check,
and it's not a lot of work to create separate indexes for lat/long 
ranges or dates.


that only works if the planner can figure out which partitions to use
in advance, otherwise it ends up having to scan all the partitions.


--
john r pierce, recycling bits in santa cruz



True, but then again, the proposed queries are quite straight-forward
so I don't expect that to be a problem, really.

Worth a test, if only to see if it helps.


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Large data and slow queries

2017-04-19 Thread Samuel Williams
Thanks Alban, I appreciate your ideas and thoughts.

I'm a little reluctant to go down the partitioning route as I think
we'll probably end up sharding out horizontally using citus data in
the near future and naive postgres partitioning may hamper that
effort.

It's interesting you talk about using multiple indexes. In
MySQL/MariaDB and derivatives, I've never seen the query planner
consider using multiple indexes. So, it's possible that Postgres may
use multiple indexes if it saves time? Or do you mean, doing something
like manually joining the data and leveraging the different indexes
explicitly?

The correlation between user_id and location... well, it's somewhat
temporally related.



On 19 April 2017 at 22:50, Alban Hertroys  wrote:
>
>> On 19 Apr 2017, at 6:01, Samuel Williams  
>> wrote:
>>
>> Hi.
>>
>> We have 400,000,000 records in a table (soon to be 800,000,000), here
>> is the schema (\d+)
>>
>> https://gist.github.com/ioquatix/bddda36d9e4ffaceb7a62d7b62259121
>>
>> We want the following kinds of query to be fast:
>>
>> SELECT DISTINCT "user_event"."user_id" FROM "user_event" WHERE
>> "user_event"."what" IN ('poll', 'location_change',
>> 'midnight_location') AND ("user_event"."created_at" >= '2016-04-19
>> 01:23:55') AND (latitude > -37.03079375089291 AND latitude <
>> -36.67086424910709 AND longitude > 174.6307139779924 AND longitude <
>> 175.0805140220076);
>>
>> We have a btree index and it appears to be working. However, it's
>> still pretty slow.
>>
>> EXPLAIN ANALYZE gives the following:
>> https://gist.github.com/ioquatix/bddda36d9e4ffaceb7a62d7b62259121#gistcomment-2065314
>>
>> I'm thinking that I need to do the following to help:
>>
>> CLUSTER user_event ON index_user_event_for_visits_3 followed by
>> analyze... Our data is mostly time series but sometimes we get some
>> dumps with historical records.
>
> It seems to me that a large part of the problem is that the server has to 
> scan all index entries from that date and within those location bounds to 
> find that the distinct user id's in that set are about 114,000 out of 1.7M 
> rows matching the selection-criteria. If it could stop at the first location 
> for each user, it would have to scan less than a 10th of the index entries 
> that it has to scan now...
>
> How high is the correlation between user id and location? That can probably 
> be leveraged…
> Perhaps you'd get better performance if you'd use multiple indices instead of 
> stuffing everything into a single purpose-specific one? I would suggest one 
> on (user_id, latitude, longitude) or (latitude, longitude, user_id) and one 
> on (created_at, user_id), or perhaps (created_at, latitude, longitude). That 
> approach could also reduce the number of indices you have on that table, as 
> well as their sizes, making it all fit into cache a little easier. Then 
> again, additional operations will be required to combine them.
>
> For a different approach; It may be possible to enrich your data with 
> something that is easy to index and query, with a high correlation to 
> (latitude, longitude). That could also be used to partition over. Country 
> seems a good candidate here, unless all your data-points are in New Zealand 
> like the above?
> Then again, some countries are a lot larger, with a higher population, than 
> others. And populations can be highly concentrated (Delhi, Moscow to name a 
> few).
> Another option is to divide the location space up into squares of a fixed 
> size, with a partition for each square. About 80% of those squares are 
> unpopulated though, being at sea.
>
> Created_at is a very good candidate for partitioning too, especially if you 
> don't intend to keep data older than a certain age. Truncating or dropping a 
> partition that you no longer need is quick and easy.
>
> With data-sets this large, I'd think you would want to partition on multiple 
> dimensions, creating a matrix of partitions under a single master table. I 
> don't think PG has a simple way of doing that (yet) though; perhaps it's 
> possible by abusing multiple levels of inheritance, but that sounds like a 
> bad idea.
>
> And of course, create your partitions sufficiently course to prevent 
> overburdening the system tables, which would slow down the query planner.
>
> Hopefully there's something useful in my ramblings!
>
> Alban Hertroys
> --
> If you can't see the forest for the trees,
> cut the trees and you'll find there is no forest.
>


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


[GENERAL] Unable to upload backups

2017-04-19 Thread Ron Ben
Hi,
I'm using PostgreSQL 9.3.2
I'm running the command:
 

psql -h testserver -U ronb -f backup.sql  -q -d foldertest  2>error.txt >output.txt
This should generate my database in foldertest
However, this doesn't work: it's unable to create schemas.
In error.txt I see "permission denied for database foldertest".
I know this is not an access permission issue, because there is a public schema which is built in, and it does create the tables/data in there.
It just can't create new schemas.
 
The interesting thing is that if I do:
psql -h testserver -U postgres -f backup.sql  -q -d foldertest  2>error.txt >output.txt
 
Everything works. It creates all schemas and generates the database correctly.
I don't see any difference in the hba.conf between the postgres and ronb users.
What can be the problem?


Re: [GENERAL] Large data and slow queries

2017-04-19 Thread Alban Hertroys

> On 19 Apr 2017, at 6:01, Samuel Williams  
> wrote:
> 
> Hi.
> 
> We have 400,000,000 records in a table (soon to be 800,000,000), here
> is the schema (\d+)
> 
> https://gist.github.com/ioquatix/bddda36d9e4ffaceb7a62d7b62259121
> 
> We want the following kinds of query to be fast:
> 
> SELECT DISTINCT "user_event"."user_id" FROM "user_event" WHERE
> "user_event"."what" IN ('poll', 'location_change',
> 'midnight_location') AND ("user_event"."created_at" >= '2016-04-19
> 01:23:55') AND (latitude > -37.03079375089291 AND latitude <
> -36.67086424910709 AND longitude > 174.6307139779924 AND longitude <
> 175.0805140220076);
> 
> We have a btree index and it appears to be working. However, it's
> still pretty slow.
> 
> EXPLAIN ANALYZE gives the following:
> https://gist.github.com/ioquatix/bddda36d9e4ffaceb7a62d7b62259121#gistcomment-2065314
> 
> I'm thinking that I need to do the following to help:
> 
> CLUSTER user_event ON index_user_event_for_visits_3 followed by
> analyze... Our data is mostly time series but sometimes we get some
> dumps with historical records.

It seems to me that a large part of the problem is that the server has to scan 
all index entries from that date and within those location bounds to find that 
the distinct user id's in that set are about 114,000 out of 1.7M rows matching 
the selection-criteria. If it could stop at the first location for each user, 
it would have to scan less than a 10th of the index entries that it has to scan 
now...

How high is the correlation between user id and location? That can probably be 
leveraged…
Perhaps you'd get better performance if you'd use multiple indices instead of 
stuffing everything into a single purpose-specific one? I would suggest one on 
(user_id, latitude, longitude) or (latitude, longitude, user_id) and one on 
(created_at, user_id), or perhaps (created_at, latitude, longitude). That 
approach could also reduce the number of indices you have on that table, as 
well as their sizes, making it all fit into cache a little easier. Then again, 
additional operations will be required to combine them.
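
In SQL, those suggestions would look roughly like the following (sketches
only):

    CREATE INDEX ON user_event (user_id, latitude, longitude);
    CREATE INDEX ON user_event (created_at, user_id);

PostgreSQL can combine independent indexes like these at query time via
bitmap index scans, at the cost of the extra combining step mentioned above.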

For a different approach; It may be possible to enrich your data with something 
that is easy to index and query, with a high correlation to (latitude, 
longitude). That could also be used to partition over. Country seems a good 
candidate here, unless all your data-points are in New Zealand like the above?
Then again, some countries are a lot larger, with a higher population, than 
others. And populations can be highly concentrated (Delhi, Moscow to name a 
few).
Another option is to divide the location space up into squares of a fixed size, 
with a partition for each square. About 80% of those squares are unpopulated 
though, being at sea.

Created_at is a very good candidate for partitioning too, especially if you 
don't intend to keep data older than a certain age. Truncating or dropping a 
partition that you no longer need is quick and easy.
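
On 9.6 that means inheritance-based partitioning; a minimal sketch for a
single monthly partition (names and ranges are made up, and routing new
inserts into the right child still needs a trigger or application logic):

    CREATE TABLE user_event_2016_04 (
        CHECK (created_at >= DATE '2016-04-01' AND created_at < DATE '2016-05-01')
    ) INHERITS (user_event);

    -- retiring old data is then just:
    DROP TABLE user_event_2016_04;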

With data-sets this large, I'd think you would want to partition on multiple 
dimensions, creating a matrix of partitions under a single master table. I 
don't think PG has a simple way of doing that (yet) though; perhaps it's 
possible by abusing multiple levels of inheritance, but that sounds like a bad 
idea.

And of course, create your partitions sufficiently course to prevent 
overburdening the system tables, which would slow down the query planner.

Hopefully there's something useful in my ramblings!

Alban Hertroys
--
If you can't see the forest for the trees,
cut the trees and you'll find there is no forest.



-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] potential extension of psql's \df+ ?

2017-04-19 Thread Karsten Hilbert
On Wed, Apr 19, 2017 at 12:00:04PM +0200, Pavel Stehule wrote:

> > Hence I wonder whether an approach along these lines:
> >
> > select
> > row_number() over ()
> > || src_line
...
> > ) as func_src;
> >
> > would be a worthwhile change to the query "\df+" uses ?
> >
> >
> do you know \sf+ fce command?

Sorry for the noise :-)

Thanks,
Karsten
-- 
GPG key ID E4071346 @ eu.pool.sks-keyservers.net
E167 67FD A291 2BEA 73BD  4537 78B9 A9F9 E407 1346


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Large data and slow queries

2017-04-19 Thread Samuel Williams
> Did that 50% performance gain come from just the datatype, or that fact that 
> the index became smaller?

How would one measure this?
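
One crude way to separate the two effects is to compare the on-disk size of
the index before and after the type change, and to watch buffer counts with
EXPLAIN (ANALYZE, BUFFERS). For example (index name taken from earlier in
the thread):

    SELECT pg_size_pretty(pg_relation_size('index_user_event_for_visits_3'));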

On 19 April 2017 at 19:48, John R Pierce  wrote:
> On 4/19/2017 12:31 AM, vinny wrote:
>>
>> Given the number of records, my first thought was either partitioning or
>> partial-indexes.
>> The fewer rows are in the index, the quicker it will be to check,
>> and it's not a lot of work to create separate indexes for lat/long ranges
>> or dates.
>
>
> that only works if the planner can figure out which partitions to use in
> advance, otherwise it ends up having to scan all the partitions.
>
>
>
> --
> john r pierce, recycling bits in santa cruz
>
>
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] potential extension of psql's \df+ ?

2017-04-19 Thread Pavel Stehule
Hi

2017-04-19 11:55 GMT+02:00 Karsten Hilbert :

> Hi all !
>
> Every so often, when working with functions, errors get
> reported with context information similar to this:
>
> Context: PL/pgSQL function "test_function" line 5 at SQL statement
>
> Often, the function source is kept under version control (or
> in a file annotated, commented, etc in certain ways) such
> that the file line number does not correspond to the function
> source line number.
>
> In such cases, a typical approach would be to go and do
>
> \df+ test_function
>
> which nicely produces function metadata and source as
> contained in the database. For non-trivial functions it is,
> however, slightly cumbersome to count the line numbers
> onscreen (let alone line break settings).
>
> Hence I wonder whether an approach along these lines:
>
> select
> row_number() over ()
> || src_line
> from (
> select
> unnest(string_to_array(prosrc, E'\n')) as src_line
> from
> pg_proc
> where
> proname = 'FUNC_NAME'
> ) as func_src;
>
> would be a worthwhile change to the query "\df+" uses ?
>
>
do you know the \sf+ command?

Regards

Pavel



> (Resorting to "\ef" won't align temp file lines with function
> source line numbers either, because pgsql boilerplate is
> needed for function definition etc.)
>
> Thanks for considering,
> Karsten
> --
> GPG key ID E4071346 @ eu.pool.sks-keyservers.net
> E167 67FD A291 2BEA 73BD  4537 78B9 A9F9 E407 1346
>
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general
>


[GENERAL] potential extension of psql's \df+ ?

2017-04-19 Thread Karsten Hilbert
Hi all !

Every so often, when working with functions, errors get
reported with context information similar to this:

Context: PL/pgSQL function "test_function" line 5 at SQL statement

Often, the function source is kept under version control (or
in a file annotated, commented, etc in certain ways) such
that the file line number does not correspond to the function
source line number.

In such cases, a typical approach would be to go and do

\df+ test_function

which nicely produces function metadata and source as
contained in the database. For non-trivial functions it is,
however, slightly cumbersome to count the line numbers
onscreen (let alone line break settings).

Hence I wonder whether an approach along these lines:

select
row_number() over ()
|| src_line
from (
select
unnest(string_to_array(prosrc, E'\n')) as src_line
from
pg_proc
where
proname = 'FUNC_NAME'
) as func_src;

would be a worthwhile change to the query "\df+" uses ?

(Resorting to "\ef" won't align temp file lines with function
source line numbers either, because pgsql boilerplate is
needed for function definition etc.)

Thanks for considering,
Karsten
-- 
GPG key ID E4071346 @ eu.pool.sks-keyservers.net
E167 67FD A291 2BEA 73BD  4537 78B9 A9F9 E407 1346


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] # of connections and architecture design

2017-04-19 Thread Moreno Andreo

  
  
On 18/04/2017 18:51, Jeff Janes wrote:

> On Tue, Apr 18, 2017 at 2:42 AM, Moreno Andreo wrote:
>> Hi all,
>> As many of you have read last Friday (and many have tried to help, too,
>> and I still thank you very much), I had a bad service outage.
>> I was pointed to reducing the number of maximum connections by using a
>> pooler, and that's what I'm building in a test lab, but I'm wondering if
>> there's something I can do with my overall architecture design.
>> At the moment we host one database per customer (about 400 now) and
>> every customer has two points of access to the data:
>> - directly to the database, via rubyrep, to replicate the database he
>>   has on his own machine
>> - via WCF self-hosted web services, to read other customers' data
>> Every customer can access (and replicate) his database from a number of
>> different positions (max 3).
>> Customers are organized in groups (max 10 per group), and there is the
>> chance that someone accesses someone else's data via WCF.
>> For example, take a group of 5: everyone running rubyrep with only one
>> position enabled, and getting data from the others' databases. If I'm
>> not wrong, that's 5 connections (rubyrep) plus 5*4 (everyone connecting
>> to everyone else's database) for WCF, so 25 connections. Now imagine a
>> group of 10.
>> Last Friday I was told that 350 connections is quite a big number and
>> things can begin to slow down. OK. When something slows down I'm used to
>> searching for and finding the bottleneck (CPU, RAM, I/O, etc.). If
>> everything was running apparently fine (CPU < 10%, RAM used < 20%, I/O
>> rarely over 20%), how can I say there's a bottleneck that's slowing
>> things down? Am I missing something?
>> Another thing: on a customer server (with a similar, smaller
>> architecture) I _do_ have a connection leak problem that's under
>> investigation, but when things begin to slow down I simply run
>> pg_terminate_backend on all connections with an age > 10 min and
>> everything goes back to normal. On my server, last Friday, it did not
>> help, so I thought that was not the main problem.
>> I've got no problem splitting this architecture across as many servers
>> as I need, but I think I need some tips on how to design this, in order
>> to avoid major issues in the near future (ask for details if needed).
>>
>> The current PostgreSQL 9.5.6 server is an 8-core VM with 52 GB RAM and
>> Debian 8. The WCF server is Windows 2012 R2, 4-core, 16 GB RAM.
>>
>> While facing the issue, neither of them showed any kind of overload and
>> their logs were clean.
>>
>> I'm a bit scared it can happen again.
>
> The logs being clean doesn't help much if your log settings are set to be
> too terse.
>
> Is log_lock_waits on?

It's off

> log_checkpoints?

off

> track_io_timing (doesn't show up in the logs, you have to query database
> views)?

off (never referred to)

> Is log_min_duration_statement set to a reasonable value?

default value (it's commented out)

> log_autovacuum_min_duration?

default value (commented)

> Are you using pg_stat_statement (also doesn't show up in the logs, you
> have to query it),

I'm using pg_stat_statements to keep track of the connections, their
origin and whether there's some lock (waiting = true)... obviously, at any
given moment I have only a maximum of 2 or 3 with state = 'active'; all
the others are 'idle'.

> and perhaps auto_explain?

Never used it. (I'll check the docs.)

Thanks,
Moreno.

> Cheers,
>
> Jeff
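For anyone wanting to flip those on, a minimal sketch (assumes superuser
access and 9.4+ so ALTER SYSTEM is available; the thresholds shown are
illustrative, not recommendations):

  -- all of these take effect with a reload, no restart needed
  ALTER SYSTEM SET log_lock_waits = on;
  ALTER SYSTEM SET log_checkpoints = on;
  ALTER SYSTEM SET track_io_timing = on;
  ALTER SYSTEM SET log_min_duration_statement = '250ms';
  ALTER SYSTEM SET log_autovacuum_min_duration = '1s';
  SELECT pg_reload_conf();

  -- auto_explain must be preloaded, which does require a restart;
  -- keep any entries already in shared_preload_libraries (here,
  -- pg_stat_statements, which is apparently already in use)
  ALTER SYSTEM SET shared_preload_libraries = 'pg_stat_statements, auto_explain';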
  

Re: [GENERAL] Large data and slow queries

2017-04-19 Thread John R Pierce

On 4/19/2017 12:31 AM, vinny wrote:
> Given the number of records, my first thought was either partitioning
> or partial-indexes.
>
> The fewer rows are in the index, the quicker it will be to check,
> and it's not a lot of work to create separate indexes for lat/long
> ranges or dates.

That only works if the planner can figure out which partitions to use in
advance; otherwise it ends up having to scan all the partitions.
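For reference, a sketch of what that means with 9.5/9.6-style inheritance
partitioning (hypothetical table and column names); the CHECK constraints
are what let the planner exclude children in advance:

  CREATE TABLE events (
      id         bigint,
      created_at timestamptz,
      latitude   float4,
      longitude  float4
  );

  CREATE TABLE events_2017_04 (
      CHECK (created_at >= '2017-04-01' AND created_at < '2017-05-01')
  ) INHERITS (events);

  CREATE INDEX events_2017_04_created_at_idx
      ON events_2017_04 (created_at);

  -- 'partition' is the default; shown only for completeness
  SET constraint_exclusion = partition;

  -- prunable: the constants can be checked against the CHECK constraint
  -- at plan time, so only events_2017_04 is scanned
  EXPLAIN SELECT count(*)
  FROM events
  WHERE created_at >= '2017-04-10' AND created_at < '2017-04-11';

  -- not prunable: now() is stable, not immutable, so the planner cannot
  -- prove which children qualify and every child gets scanned
  EXPLAIN SELECT count(*)
  FROM events
  WHERE created_at >= now() - interval '1 day';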



--
john r pierce, recycling bits in santa cruz



--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] tuple statistics update

2017-04-19 Thread Tom DalPozzo
2017-04-18 22:06 GMT+02:00 Tom Lane :

> Tom DalPozzo  writes:
> > Hi, I'm using libpq to insert tuples in my table and keep looking at
> > statistics through psql instead.
> > I noticed that sometimes n_tuple_ins is not updated even after 1 min that
> > my transaction committed.
>
> If your session is just sitting, that's not surprising.  I think stats
> updates are only transmitted to the collector at transaction end (and
> even then, only if it's been at least N msec since the last transmission
> from the current session).
>
> regards, tom lane
>

Hi, my transaction had ended; then I waited 1 min and queried the stats.
I had to wait for the session to end in order to see the updated stats.
Did you mean "...to the collector at SESSION end"?

Regards
Pupillo


Re: [GENERAL] Large data and slow queries

2017-04-19 Thread vinny

On 2017-04-19 07:04, Samuel Williams wrote:

> Thanks John. Yes, you are absolutely right, you want the index to be
> bottom-heavy so you can cull as much as possible at the top. I'm familiar
> with that from once implementing a brute-force sudoku solver; it has the
> same principle.
>
> I've been working on this all afternoon. By reducing the longitude and
> latitude columns to float4, in my test cases, I found about a 50%
> improvement in performance. It may also use less space. So part of the
> problem was my choice of data type. We've computed that float4 has a
> worst-case precision of about 1.6 m, which we are okay with for analytics
> data.
>
> Another option we may consider is using a (signed) integer - e.g.
> longitude = 180*(v/2^31) and latitude = 180*(v/2^31) - as this has a
> uniform error across all points, but it's a bit more cumbersome to
> handle. Is there a rational datatype in postgres which works like this?
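Not that I know of in core, but the scaling itself is plain arithmetic; a
throwaway sketch of the 180*(v/2^31) encoding quoted above (no table
needed, the longitude value is just an example):

  -- encode a longitude into int4 and decode it back; resolution is
  -- 180/2^31, roughly 8.4e-8 degrees; exactly +180.0 would overflow int4
  WITH sample AS (
      SELECT round(174.6307139779924 * 2147483648.0 / 180.0)::int4 AS v
  )
  SELECT v,
         v * 180.0 / 2147483648.0 AS decoded_degrees
  FROM sample;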



> On 19 April 2017 at 16:42, John R Pierce wrote:
>
>> On 4/18/2017 9:01 PM, Samuel Williams wrote:
>>
>>> We want the following kinds of query to be fast:
>>>
>>> SELECT ... AND (latitude > -37.03079375089291 AND latitude <
>>> -36.67086424910709 AND longitude > 174.6307139779924 AND longitude <
>>> 175.0805140220076);
>>
>> I wonder if GIST would work better if you use the native POINT type, and
>> compared it like
>>
>> mypoint <@ BOX
>> '((174.6307139779924,-37.03079375089291),(175.0805140220076,-36.67086424910709))'
>>
>> with a gist index on mypoint...
>>
>> but, it all hinges on which clauses in your query are most selective,
>> that's where you want an index.
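A sketch of that approach with hypothetical names (an "events" table with
float latitude/longitude columns); point stores (x, y), so longitude goes
first to match the box literal above:

  -- add and populate a point column, then index it with GiST
  ALTER TABLE events ADD COLUMN location point;
  UPDATE events SET location = point(longitude, latitude);
  CREATE INDEX events_location_gist ON events USING gist (location);

  -- bounding-box containment; the GiST index can serve this directly
  SELECT count(*)
  FROM events
  WHERE location <@ box '((174.6307139779924,-37.03079375089291),(175.0805140220076,-36.67086424910709))';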

>> --
>> john r pierce, recycling bits in santa cruz
>>
>> --
>> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
>> To make changes to your subscription:
>> http://www.postgresql.org/mailpref/pgsql-general


Did that 50% performance gain come from just the datatype, or the fact
that the index became smaller?


Given the number of records, my first thought was either partitioning or 
partial-indexes.

The fewer rows are in the index, the quicker it will be to check,
and it's not a lot of work to create separate indexes for lat/long 
ranges or dates.
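A minimal sketch of the partial-index idea (same hypothetical names as in
the sketch above; the planner only uses a partial index when the query's
WHERE clause implies the index predicate):

  -- one partial index per date range; each stays small because it only
  -- covers rows matching its predicate
  CREATE INDEX events_2017q2_lat_lng_idx
      ON events (latitude, longitude)
      WHERE created_at >= '2017-04-01' AND created_at < '2017-07-01';

  -- eligible: the literal range implies the index predicate
  SELECT count(*)
  FROM events
  WHERE created_at >= '2017-04-01' AND created_at < '2017-05-01'
    AND latitude  > -37.03 AND latitude  < -36.67
    AND longitude > 174.63 AND longitude < 175.08;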




--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] tuple statistics update

2017-04-19 Thread Tom DalPozzo
2017-04-18 21:42 GMT+02:00 Adrian Klaver :

> On 04/17/2017 09:18 AM, Tom DalPozzo wrote:
>
>> Hi, I'm using libpq to insert tuples in my table and keep looking at
>> statistics through psql instead.
>> I noticed that sometimes n_tuple_ins is not updated even after 1 min
>> that my transaction committed.
>> My libpq connection is kept alive. If I close the connection then the
>> stats get updated.
>> I know that stats are not instantaneous, but I thought that after a
>> while that a transaction is committed it would be updated.
>>
>
> Any of this apply?:
>
> https://www.postgresql.org/docs/9.6/static/monitoring-stats.html
>
> "Another important point is that when a server process is asked to display
> any of these statistics, it first fetches the most recent report emitted by
> the collector process and then continues to use this snapshot for all
> statistical views and functions until the end of its current transaction.
> So the statistics will show static information as long as you continue the
> current transaction. Similarly, information about the current queries of
> all sessions is collected when any such information is first requested
> within a transaction, and the same information will be displayed throughout
> the transaction. This is a feature, not a bug, because it allows you to
> perform several queries on the statistics and correlate the results without
> worrying that the numbers are changing underneath you. But if you want to
> see new results with each query, be sure to do the queries outside any
> transaction block. Alternatively, you can invoke pg_stat_clear_snapshot(),
> which will discard the current transaction's statistics snapshot (if any).
> The next use of statistical information will cause a new snapshot to be
> fetched."
>
> Regards
>> Pupillo
>>
>>
I read it; it seems to say that N milliseconds after my transaction ends,
the stats should be current. I also tried pg_stat_clear_snapshot() with no
success.
Regards
Pupillo
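For completeness, a minimal sketch (hypothetical table name 'my_table') of
watching the counters from a separate session, outside any transaction
block, so each query fetches a fresh report from the collector:

  -- run outside an explicit transaction block so the view is not frozen
  -- at the snapshot taken by an earlier statement in the same transaction
  SELECT n_tup_ins, n_live_tup, last_autoanalyze
  FROM   pg_stat_user_tables
  WHERE  relname = 'my_table';

  -- inside an open transaction the numbers are frozen at first access;
  -- this discards the cached snapshot so the next read fetches a new one
  SELECT pg_stat_clear_snapshot();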



>
>>
>>
>
> --
> Adrian Klaver
> adrian.kla...@aklaver.com
>