[GENERAL] How to monitor locks (max_pred_locks_per_transaction)?

2015-01-06 Thread Andrey Lizenko
Hi,
I've got

 2014-12-28 14:33:23 GMT 553582643 24969 SELECT 53200 63/8298433
 54a00a84.6189 1 %ERROR:  out of shared memory
  2014-12-28 14:33:23 GMT 553582643 24969 SELECT 53200 63/8298433
 54a00a84.6189 2 %HINT:  You might need to increase
 max_pred_locks_per_transaction.


Is there any way to predict such an OOM situation (i.e. to adjust
max_pred_locks_per_transaction before some transaction fails)?
Since we run a lot of transactions at the SERIALIZABLE isolation level,
should it be some count of pg_locks entries with mode = AccessExclusiveLock
or something like that?

The documentation says that it is an average value, and that a single
transaction can lock more objects:

 objects (e.g., tables); hence, no more than this many distinct objects can
 be locked at any one time. This parameter controls the average number of
 object locks allocated for each transaction; individual transactions can
 lock more objects as long as the locks of all transactions fit in the lock
 table.


How can I get the number of 'distinct objects' mentioned here?
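(The closest I have found so far is counting distinct lock targets, e.g.:

  SELECT count(DISTINCT (locktype, database, relation, page, tuple))
  FROM pg_locks
  WHERE mode = 'SIReadLock';

but I am not sure that is what the documentation means by 'distinct objects'.)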

WBR, Andrey Lizenko


Re: [GENERAL] Advice for using integer arrays?

2015-01-06 Thread Jeff Janes
On Tue, Jan 6, 2015 at 9:09 AM, Michael Heaney mhea...@jcvi.org wrote:

 I'm fairly new to Postgres, and have a design issue for which an array of
 integers might be a good solution.  But I'd like to hear from the experts
 before proceeding down this path.

 Essentially, I'm trying to model the relationship between a group of
 biological samples and their genes. Each sample (there are ~10K of them
 now, with more coming) will have about 30,000 genes. Conversely, a
 particular gene may be present in almost all samples.

 So I've created the following table to handle the many-to-many
 relationship:

  create table sample_gene (id serial, sample_id int, gene_id int);


What is the value of having this table at all?  It doesn't seem to contain
anything informative, like an allele identifier, a resequence, or a copy
number variation.  If you are just trying to record the fact that a gene
was present in that sample, perhaps it would be better to instead record
the genes that have been deleted, rather than the ones that have not been
deleted?  That would probably be a much smaller list.
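
Something like this hypothetical sketch would record only the exceptions:

  create table sample_gene_deleted (
      sample_id int not null,
      gene_id   int not null,
      primary key (sample_id, gene_id)
  );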



  create table sample_gene_array (id serial, sample_id int, gene_id int []
 );

 So now the table data looks like this:

 sample_id|gene_id []
 ---
 1 |  [1:30475]
 2 |  [1:29973]
 etc.


I'm not familiar with the square bracket and colon as a syntax for
expressing int arrays.  Are you taking liberties with the psql output, or
using a different client program?  Does that represent the range from 1 to
30475, or the two values 1 and 30475?
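
(For reference, stock psql prints the two readings quite differently:

  SELECT '{1,30475}'::int[];      -- the two values: {1,30475}
  SELECT '[1,30475]'::int4range;  -- the range, canonicalized: [1,30476)

)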

Cheers,

Jeff


Re: [GENERAL] Correct/optimal DML query for application session management ?

2015-01-06 Thread Andy Colson

On 1/6/2015 1:56 PM, Tim Smith wrote:

Hi Andy,

Yeah, the table scan was what worried me.

As for no indexes ?  I just didn't put the create index statements
in my post ... ;-)

Tim

On 6 January 2015 at 18:35, Andy Colson a...@squeakycode.net wrote:

On 1/6/2015 12:02 PM, Tim Smith wrote:


Hi,

I'm probably being incredibly stupid and missing something incredibly
simple, but I've got a bit of query-writer's block here!

create table app_sessions(
session_id char(64) unique not null,
user_id char(32) unique not null,
session_start bigint not null,
session_lastactive bigint not null
);


The rules are :
Enforced session timeout after 86400 seconds (1 day)
Last active less than 1 hour ago.

My idea to clean out stale sessions :
delete from app_sessions where extract(epoch from
now()) - session_start > 86400 or session_lastactive <= extract(epoch from
now()) - 3600;

But of course that's going to be a nasty query, so that's why I think
I'm missing something and need a fresh pair of eyes (or a whole
mailing list's worth of eyes !!).

Thanks guys !




I don't see any other way.  Why do you think it'll be so nasty?  Cuz it'll
table scan?  You have no indexes so it doesn't matter what you write, it'll
have to scan all rows.  How many rows do you expect to have?  500?  1000?
Table scan will be fine.

If you wanted to make it more readable .. but work the same, you could use
timestamptz instead of bigint, and then write:

where current_timestamp - '1 day'::interval > session_start

-Andy






Oh, more important: if it's only gonna be 1000 rows or less, then I'd say 
drop the indexes and let it table scan.  I bet it'll be faster.


-Andy




Re: [GENERAL] Correct/optimal DML query for application session management ?

2015-01-06 Thread Alban Hertroys

 On 06 Jan 2015, at 19:02, Tim Smith randomdev4+postg...@gmail.com wrote:
 create table app_sessions(
 session_id char(64) unique not null,
 user_id char(32) unique not null,
 session_start bigint not null,
 session_lastactive bigint not null
 );

Just an observation: Are you sure that you don’t want to allow your users to 
have multiple sessions? You made user_id unique across the entire table, which 
means there can only ever be one session per user. Unless you delete ‘expired’ 
sessions ASAP, users won’t be able to start a new session after their 
original session has expired.
That may be desirable for your application and thus intended, but perhaps it 
was not?
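
If multiple concurrent sessions per user are intended, a sketch (untested)
without that restriction:

  create table app_sessions(
      session_id char(64) primary key,
      user_id char(32) not null,
      session_start bigint not null,
      session_lastactive bigint not null
  );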

Alban Hertroys
--
If you can't see the forest for the trees,
cut the trees and you'll find there is no forest.





Re: [GENERAL] Advice for using integer arrays?

2015-01-06 Thread Adrian Klaver

On 01/06/2015 01:18 PM, Michael Heaney wrote:

On 1/6/2015 2:19 PM, Jeff Janes wrote:

On Tue, Jan 6, 2015 at 9:09 AM, Michael Heaney mhea...@jcvi.org
mailto:mhea...@jcvi.org wrote:

I'm fairly new to Postgres, and have a design issue for which an
array of integers might be a good solution.  But I'd like to hear
from the experts before proceeding down this path.

Essentially, I'm trying to model the relationship between a group
of biological samples and their genes. Each sample (there are ~10K
of them now, with more coming) will have about 30,000 genes.
Conversely, a particular gene may be present in almost all samples.

So I've created the following table to handle the many-to-many
relationship:

 create table sample_gene (id serial, sample_id int, gene_id int);


What is the value of having this table at all?  It doesn't seem to
contain anything informative, like an allele identifier, a resequence,
or a copy number variation.  If you are just trying to record the fact
that a gene was present in that sample, perhaps it would be better to
instead record the genes that have been deleted, rather than the ones that
have not been deleted?  That would probably be a much smaller list.



I suppose there could be a gene table which would contain data about
each gene_id.
But I'm an IT guy, not a biologist, and my sample_gene table doesn't actually
exist.


Alright, now I am confused. In your original post you say sample_gene 
does exist, and you joined it against genes (I assume a gene table), which 
is when you encountered slow performance. Did I miss something? :)


Per a previous suggestion, it might be good to show the queries you are 
using or plan to use. The issue may be in the query, not the layout. Also, 
running the query with EXPLAIN ANALYZE would be helpful:


http://www.postgresql.org/docs/9.3/interactive/sql-explain.html

 I'm more concerned with how to deal with many-to-many relationships

when each parent could have tens of thousands of children. Collapsing all
the children into an array for each parent looked intriguing - but maybe it's
not a good idea.  I just don't know, so I thought I'd ask you guys.




 create table sample_gene_array (id serial, sample_id int, gene_id
int [] );

So now the table data looks like this:

sample_id|gene_id []
---
1 |  [1:30475]
2 |  [1:29973]
etc.


I'm not familiar with the square bracket and colon as a syntax for
expressing int arrays.  Are you taking liberties with the psql output,
or using a different client program?  Does that represent the range
from 1 to 30475, or the two values 1 and 30475?


Yes, it's shorthand for the full range of values from 1 to 30475. Wasn't sure
how to represent it,


Use Python? Looks like 1-based slice notation.

and almost went with (1,2,3...30475).  Apologies...



Michael Heaney
JCVI






--
Adrian Klaver
adrian.kla...@aklaver.com




Re: [GENERAL] Advice for using integer arrays?

2015-01-06 Thread Michael Heaney

On 1/6/2015 2:19 PM, Jeff Janes wrote:
On Tue, Jan 6, 2015 at 9:09 AM, Michael Heaney mhea...@jcvi.org 
mailto:mhea...@jcvi.org wrote:


I'm fairly new to Postgres, and have a design issue for which an
array of integers might be a good solution.  But I'd like to hear
from the experts before proceeding down this path.

Essentially, I'm trying to model the relationship between a group
of biological samples and their genes. Each sample (there are ~10K
of them now, with more coming) will have about 30,000 genes.
Conversely, a particular gene may be present in almost all samples.

So I've created the following table to handle the many-to-many
relationship:

 create table sample_gene (id serial, sample_id int, gene_id int);


What is the value of having this table at all?  It doesn't seem to 
contain anything informative, like an allele identifier, a resequence, 
or a copy number variation.  If you are just trying to record the fact 
that a gene was present in that sample, perhaps it would be better to 
instead record the genes that have been deleted, rather than the ones that 
have not been deleted?  That would probably be a much smaller list.




I suppose there could be a gene table which would contain data about 
each gene_id.
But I'm an IT guy, not a biologist, and my sample_gene table doesn't actually
exist.  I'm more concerned with how to deal with many-to-many relationships
when each parent could have tens of thousands of children. Collapsing all
the children into an array for each parent looked intriguing - but maybe it's
not a good idea.  I just don't know, so I thought I'd ask you guys.




 create table sample_gene_array (id serial, sample_id int, gene_id
int [] );

So now the table data looks like this:

sample_id|gene_id []
---
1 |  [1:30475]
2 |  [1:29973]
etc.


I'm not familiar with the square bracket and colon as a syntax for 
expressing int arrays.  Are you taking liberties with the psql output, 
or using a different client program?  Does that represent the range 
from 1 to 30475, or the two values 1 and 30475?


Yes, it's shorthand for the full range of values from 1 to 30475. Wasn't sure
how to represent it, and almost went with (1,2,3...30475).  Apologies...


Michael Heaney
JCVI





Re: [GENERAL] Advice for using integer arrays?

2015-01-06 Thread Paul Jungwirth
Hi Michael,

I can't comment on the domain-specific stuff, but I recently used numeric
arrays for a project and it worked well. In my case we had one million
simulation results (floats) per scenario, so rather than reading one
million separate rows to compute a histogram, we stored everything in one
row per scenario. Ideally one million floats is 8 megabytes, which is big
but still shouldn't require more than 100ms to read from disk and feed into
a simple computation. Here are some functions I wrote to make it easier and
faster to compute stats from numeric arrays:

https://github.com/pjungwir/aggs_for_arrays/

One caveat is that you will lose a lot of benefits by leaving the
relational model, e.g. being able to filter your integers by other
criteria. In our case all we had were floats, but imagine if you had
metadata attached to each one like the time the simulation was run, who ran
it, etc. Then you'd want to stick with something richer than just an array
of numbers.

But in that case maybe parallel arrays are acceptable. It would be a bit
like a column-store inside of Postgres. :-) I've been meaning to add a
function to that Github repo to filter an array given a same-size array of
booleans, so you can do filtering like in R or Pandas, but I haven't found
a nice way in Postgres to express e.g. `filter_array(simulation_results,
simulation_run_times > now() - interval '1 day')`.
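
In the meantime, a plain-SQL approximation of that filter is possible with
the multi-argument unnest (new in 9.4). A sketch, assuming a hypothetical
scenarios table holding the parallel simulation_results and
simulation_run_times arrays:

  SELECT array_agg(r)
  FROM scenarios s,
       unnest(s.simulation_results, s.simulation_run_times) AS t(r, ts)
  WHERE s.scenario_id = 1
    AND ts > now() - interval '1 day';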

Good luck!
Paul



On Tue, Jan 6, 2015 at 1:18 PM, Michael Heaney mhea...@jcvi.org wrote:

  On 1/6/2015 2:19 PM, Jeff Janes wrote:

  On Tue, Jan 6, 2015 at 9:09 AM, Michael Heaney mhea...@jcvi.org wrote:

 I'm fairly new to Postgres, and have a design issue for which an array of
 integers might be a good solution.  But I'd like to hear from the experts
 before proceeding down this path.

 Essentially, I'm trying to model the relationship between a group of
 biological samples and their genes. Each sample (there are ~10K of them
 now, with more coming) will have about 30,000 genes. Conversely, a
 particular gene may be present in almost all samples.

 So I've created the following table to handle the many-to-many
 relationship:

  create table sample_gene (id serial, sample_id int, gene_id int);


  What is the value of having this table at all?  It doesn't seem to
 contain anything informative, like an allele identifier, a resequence, or a
 copy number variation.  If you are just trying to record the fact that a
 gene was present in that sample, perhaps it would be better to instead
 record the genes that have been deleted, rather than the ones that have not
 been deleted?  That would probably be a much smaller list.


 I suppose there could be a gene table which would contain data about each
 gene_id.
 But I'm an IT guy, not a biologist, and my sample_gene table doesn't actually
 exist.  I'm more concerned with how to deal with many-to-many relationships
 when each parent could have tens of thousands of children.  Collapsing all
 the children into an array for each parent looked intriguing - but maybe it's
 not a good idea.  I just don't know, so I thought I'd ask you guys.





  create table sample_gene_array (id serial, sample_id int, gene_id int []
 );

 So now the table data looks like this:

 sample_id|gene_id []
 ---
 1 |  [1:30475]
 2 |  [1:29973]
 etc.


  I'm not familiar with the square bracket and colon as a syntax for
 expressing int arrays.  Are you taking liberties with the psql output, or
 using a different client program?  Does that represent the range from 1 to
 30475, or the two values 1 and 30475?


 Yes, it's shorthand for the full range of values from 1 to 30475.  Wasn't
 sure how
 to represent it, and almost went with (1,2,3...30475).  Apologies...


 Michael Heaney
 JCVI








-- 
_
Pulchritudo splendor veritatis.


Re: [GENERAL] Correct/optimal DML query for application session management ?

2015-01-06 Thread Tim Smith
Hi Andy,

Yeah, the table scan was what worried me.

As for no indexes ?  I just didn't put the create index statements
in my post ... ;-)

Tim

On 6 January 2015 at 18:35, Andy Colson a...@squeakycode.net wrote:
 On 1/6/2015 12:02 PM, Tim Smith wrote:

 Hi,

 I'm probably being incredibly stupid and missing something incredibly
 simple, but I've got a bit of query-writer's block here!

 create table app_sessions(
 session_id char(64) unique not null,
 user_id char(32) unique not null,
 session_start bigint not null,
 session_lastactive bigint not null
 );


 The rules are :
 Enforced session timeout after 86400 seconds (1 day)
 Last active less than 1 hour ago.

 My idea to clean out stale sessions :
 delete from app_sessions where extract(epoch from
 now()) - session_start > 86400 or session_lastactive <= extract(epoch from
 now()) - 3600;

 But of course that's going to be a nasty query, so that's why I think
 I'm missing something and need a fresh pair of eyes (or a whole
 mailing list's worth of eyes !!).

 Thanks guys !



 I don't see any other way.  Why do you think it'll be so nasty?  Cuz it'll
 table scan?  You have no indexes so it doesn't matter what you write, it'll
 have to scan all rows.  How many rows do you expect to have?  500?  1000?
 Table scan will be fine.

 If you wanted to make it more readable .. but work the same, you could use
 timestamptz instead of bigint, and then write:

 where current_timestamp - '1 day'::interval > session_start

 -Andy





Re: [GENERAL] Correct/optimal DML query for application session management ?

2015-01-06 Thread Andy Colson

On 1/6/2015 12:02 PM, Tim Smith wrote:


Hi,

I'm probably being incredibly stupid and missing something incredibly
simple, but I've got a bit of query-writer's block here!

create table app_sessions(
session_id char(64) unique not null,
user_id char(32) unique not null,
session_start bigint not null,
session_lastactive bigint not null
);


The rules are :
Enforced session timeout after 86400 seconds (1 day)
Last active less than 1 hour ago.

My idea to clean out stale sessions :
delete from app_sessions where extract(epoch from
now()) - session_start > 86400 or session_lastactive <= extract(epoch from
now()) - 3600;

But of course that's going to be a nasty query, so that's why I think
I'm missing something and need a fresh pair of eyes (or a whole
mailing list's worth of eyes !!).

Thanks guys !




I don't see any other way.  Why do you think it'll be so nasty?  Cuz it'll
table scan?  You have no indexes so it doesn't matter what you write, it'll
have to scan all rows.  How many rows do you expect to have?  500?  1000?
Table scan will be fine.

If you wanted to make it more readable .. but work the same, you could use
timestamptz instead of bigint, and then write:

where current_timestamp - '1 day'::interval > session_start

-Andy



On 1/6/2015 1:56 PM, Tim Smith wrote:
 Hi Andy,

 Yeah, the table scan was what worried me.

 As for no indexes ?  I just didn't put the create index statements
 in my post ... ;-)

 Tim


Please don't top post.

With and without indexes is a completely different answer.

See the performance section at:

https://wiki.postgresql.org/wiki/Guide_to_reporting_problems

-Andy





Re: [GENERAL] How to monitor locks (max_pred_locks_per_transaction)?

2015-01-06 Thread Kevin Grittner
Andrey Lizenko lizenk...@gmail.com wrote:

 2014-12-28 14:33:23 GMT 553582643 24969 SELECT 53200 63/8298433 
 54a00a84.6189 1 %ERROR:  out of shared memory
 2014-12-28 14:33:23 GMT 553582643 24969 SELECT 53200 63/8298433 
 54a00a84.6189 2 %HINT:  You might need to increase 
 max_pred_locks_per_transaction.

 Is there any way to predict such OOM situation (to adjust
 max_pred_locks_per_transaction before some transaction fails)?
 As far as we have a lot of transaction in SERIALIZABLE isolation
 level, should it be some counts of pg_locks with mode =
 AccessExclusiveLock or something like that?

WHERE mode = 'SIReadLock'

 how can I get number of 'distinct objects' mentioned here?

The total 'SIReadLock' count must be less than
max_pred_locks_per_transaction * max_connections.
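
A rough early-warning check (an untested sketch) is to compare the current
count against that ceiling:

  SELECT count(*) AS sireadlock_count,
         current_setting('max_pred_locks_per_transaction')::int
           * current_setting('max_connections')::int AS approx_capacity
  FROM pg_locks
  WHERE mode = 'SIReadLock';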

The default is small so that minimal space is reserved for those
not using serializable transactions.  Many people have found that
they need to set it to 10 to 20 times the default values.

Due to the heuristics of how multiple fine-grained locks are
combined into coarser-grained locks it might sometimes be necessary
(if you have a lot of page locks in a lot of tables) to raise
max_connections beyond what you need for actual connections.  I
have not actually seen this yet, but it could happen.  If it does,
please share details of the workload and your settings, so that we
can look at possible adjustments to the lock granularity promotion
logic or the memory allocation techniques.

Please note the suggestions on performance in the documentation of
serializable transactions.  In particular, if a transaction will
not be modifying data, setting it to READ ONLY can help a lot.  Not
only will it help performance, but it will tend to reduce the
number of predicate locks needed.
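
For example (the table name here is just a placeholder):

  BEGIN ISOLATION LEVEL SERIALIZABLE READ ONLY;
  SELECT count(*) FROM my_table;
  COMMIT;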

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




[GENERAL] Inconsistent bgworker behaviour

2015-01-06 Thread Beena Emerson
Hello,

I have been working on a module which launches background workers for a
list of databases provided by a configuration parameter (say m_databases).
This configuration parameter can be edited and reloaded.
It has a launcher which manages all the workers launched by the module.
The worker.bgw_notify_pid of each worker is set to the launcher pid.

The number of background workers that can be launched is restricted by
max_worker_processes.

Consider the following scenario:
max_worker_processes = 3
m_databases='db1, db2'

The server is started.

The m_databases is updated to
m_databases='db3, db2'

$ pg_ctl reload

The expected behavior is that the db1 worker should be terminated and a db3
worker should be launched. However, I found that this behavior is not
consistent. In a few runs, when the databases parameter is edited and
reloaded, the new worker is launched before the old ones are terminated,
causing an error.

I have used the following code on the launcher to ensure that the old
unnecessary background workers are terminated before the new background
workers are launched for newly added databases.
for (i = 0; i < workers_launched; i++)
{
    if (!is_inlist(new_dblist, workers[i]->dbname))
    {
        /*
         * Ensure that the background worker is terminated before
         * registering new workers, to avoid crossing the limit of
         * max_worker_processes.
         */
        ResetLatch(&MyProc->procLatch);
        TerminateBackgroundWorker(workers[i]->handle);
        WaitLatch(&MyProc->procLatch, WL_LATCH_SET, 0);
    }
}
.
.
. (launch new workers)
.

The Latch is set when the SIGUSR1 signal is received. IIUC, the
launcher gets SIGUSR1 when a bgworker process has exited. No new
worker is launched or terminated in between, yet the code does not work as
expected in all runs.

Any help will be appreciated.

Thank you,

Beena


Re: [GENERAL] Does anyone user pg-pool II on real production ? Please help me.

2015-01-06 Thread Sameer Kumar
On 6 Jan 2015 03:02, tuanhoanganh hatua...@gmail.com wrote:

 Hello everybody

 Does anyone use pg-pool II in real production?

Yes we have a customer using it in load balancing mode and another one
using it with Watchdog for high availability.

 How many slaves do you have? And what is the size of the database?


In both cases, 1. But I believe pgpool can support more databases. Size of
the db should not matter, but we have 50GB of database being used with pgpool.

 I need to configure my old retail system to support ~1500 users with 1000GB
over 4 years. I want to try pgpool-II but haven't found a real system using it.


I guess in addition to using pgpool you should also think about using
pgbouncer as a connection pooling agent (I would probably install it on the
application server and configure it to connect to pgpool).

 My system use direct SQL SELECT query and a lot pg procedure.

If there is a way you can specify different connection strings for these
two types of queries, then you can make the most out of pgbouncer. Do you do
any explicit transactions in these procedures?

 Can pgpool-II support load balance from SELECT my_procedure() ...

Yes, you can specify those in the white list (provided they are read-only
procedures).
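
In pgpool.conf that is something along these lines (a sketch; the function
names are placeholders, and the exact parameter name depends on your
pgpool-II version):

  white_function_list = 'my_procedure,other_readonly_proc'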


 Please help me

 Thanks in advance. Sorry for my English.


[GENERAL] BDR Error restarted

2015-01-06 Thread deans
Hi Guys.

First of all, BDR is cool, should have tried it earlier.

Environment: CentOS 6.5, PostgreSQL 9.4.0 with BDR from yum repository

Done the PostgreSQL 9.4 with BDR setup successfully by following the User
Guide and Admin Doc, but got an issue when I tried to restart the postgresql
service on any of the cluster nodes: it always loses items in the relation
pg_stat_replication after the service restart, i.e. it loses replication
connections. Normally we have 26 items (13 databases and 3 servers in the
cluster, so 13x2 items in pg_stat_replication on each node; yes, lots of dbs
on one cluster), but after the restart we get some random count of items in
the relation, like 20, or 16, or 4, but not 26 as expected, i.e. not all the
replication connections are running well. The logs show that replication will
wait until the node has caught up, but it never catches up again.

BDR Settings (replaced the real db names here):
1. on node 01, the replication src one:

2. on node 02, mostly like 03


Error logs:

Details on the pg_stat_replication relation mentioned above:






Thanks for any help/suggestion on this!
-Dean







Re: [GENERAL] postgresql versus riak for a global exchange

2015-01-06 Thread Raymond Cote
I’m familiar with both PostgreSQL and Riak (1.4, not 2.0). 
I know that Riak 2.0 now offers strong consistency. Have not yet seen what that 
does to performance. 
Big plusses for PostgreSQL:
   - you can do both relational and NOSQL tasks (the Binary JSON in the latest 
PostgreSQL; see the jsonb sketch below).
   - well-tested consistency, ACID, etc.
   - lots of adapters and support. 
   - big community

Big plusses for Riak:
  - multi-master replication
  - multi-data center replication
  - easy to scale up
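
A quick illustration of that Binary JSON point above (9.4 jsonb; the table 
and values here are made up):

  CREATE TABLE docs (id serial, body jsonb);
  INSERT INTO docs (body) VALUES ('{"asset": "gold", "qty": 3}');
  SELECT body->>'qty' FROM docs WHERE body @> '{"asset": "gold"}';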

We use PostgreSQL in combination with Riak for data storage (we have a 
tokenization service). 
We're currently using the EnterpriseDB multi-master PostgreSQL replication and 
are quite happy with it. 
The replication runs periodically, not streaming, so there is at least a 1 
second delay for replication to occur. 
Riak replicates quicker — but then you don’t have the strong relational 
structure on top. 

As mentioned earlier, ‘exchange…trade…asset’ is a bit vague. 
In addition to just storing things, you’ll need to keep track of all sorts of 
log-in and contact info — perhaps not ideal for Riak. 
Probably best to consider precisely what traits your planned application has 
and then look to match against the database storage. 
May even end up with a mix of the two just as we have. 

Your decision may also depend on which development language/framework you chose 
for the implementation. 

—Ray


 On Jan 5, 2015, at 11:37 AM, xu xiut xiut...@gmail.com wrote:
 
 Hello, I am looking at creating a toy project which may turn into an actual 
 business if I'm lucky; the idea is generally just an exchange for people to 
 trade some type of asset.
 
 I'm looking at using either PostgreSQL or Riak, and I'm wondering if there 
 are opinions and suggestions that someone would be willing to share with me 
 when evaluating databases.
 
 This is the first time I've actually considered something besides PostgreSQL. 
 Riak 2.0 now offers strong consistency and I really respect the community and 
 the work that has gone into the project. It seems like it would be easy to 
 replicate across multiple data centers.
 
 Thanks for letting me ask this here!





Re: [GENERAL] Advice for using integer arrays?

2015-01-06 Thread Arthur Silva
On Jan 6, 2015 3:12 PM, Michael Heaney mhea...@jcvi.org wrote:

 I'm fairly new to Postgres, and have a design issue for which an array of
integers might be a good solution.  But I'd like to hear from the experts
before proceeding down this path.

 Essentially, I'm trying to model the relationship between a group of
biological samples and their genes. Each sample (there are ~10K of them
now, with more coming) will have about 30,000 genes. Conversely, a
particular gene may be present in almost all samples.

 So I've created the following table to handle the many-to-many
relationship:

  create table sample_gene (id serial, sample_id int, gene_id int);

 which looks like this when populated:

 sample_id|gene_id
 ---
 1 |  1
 1 |  2
 ...
 1 |30475
 2 | 1
 2 | 2
 ...
 2 |29973
 3 |  1
 etc.

 The table now contains hundreds of millions of rows (with many, many more
to come).  Join performance between samples and genes is quite slow, even
with indexes on sample_id and gene_id.

 So it occurred to me: why not eliminate all the duplicate sample_id
values by storing the gene_id's in an array, like so:

  create table sample_gene_array (id serial, sample_id int, gene_id int []
);

 So now the table data looks like this:

 sample_id|gene_id []
 ---
 1 |  [1:30475]
 2 |  [1:29973]
 etc.

 The new table is significantly smaller, and performance (using ANY[] ) is
quite good.  Nevertheless, I'm uneasy.  I come from a Sybase ASE
background, and so have no experience with arrays as datatypes. Is it okay
to store 30K+ gene values in an array in the linking table (or maybe even
in the sample table itself, thus eliminating the linking table)?  Should I
unnest the gene_id's first, before using them to join to the gene table?

 TIA for any guidance you can provide.  Again, I'm a Postgres neophyte -
but I'm in awe of the power and flexibility of this database, and wish that
I'd started using it sooner.

 --
 Michael Heaney
 JCVI






Please provide sample queries so we can understand how you query the data.


Re: [GENERAL] Correct/optimal DML query for application session management ?

2015-01-06 Thread Andy Colson

On 1/6/2015 12:02 PM, Tim Smith wrote:

Hi,

I'm probably being incredibly stupid and missing something incredibly
simple, but I've got a bit of query-writer's block here!

create table app_sessions(
session_id char(64) unique not null,
user_id char(32) unique not null,
session_start bigint not null,
session_lastactive bigint not null
);


The rules are :
Enforced session timeout after 86400 seconds (1 day)
Last active less than 1 hour ago.

My idea to clean out stale sessions :
delete from app_sessions where extract(epoch from
now()) - session_start > 86400 or session_lastactive <= extract(epoch from
now()) - 3600;

But of course that's going to be a nasty query, so that's why I think
I'm missing something and need a fresh pair of eyes (or a whole
mailing list's worth of eyes !!).

Thanks guys !




I don't see any other way.  Why do you think it'll be so nasty?  Cuz 
it'll table scan?  You have no indexes so it doesn't matter what you 
write, it'll have to scan all rows.  How many rows do you expect to 
have?  500?  1000?  Table scan will be fine.


If you wanted to make it more readable .. but work the same, you could 
use timestamptz instead of bigint, and then write:


where current_timestamp - '1 day'::interval > session_start
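
So the whole cleanup, with those two columns as timestamptz, would be
something like (untested):

  delete from app_sessions
  where session_start < current_timestamp - interval '1 day'
     or session_lastactive < current_timestamp - interval '1 hour';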

-Andy





Re: [GENERAL] postgresql versus riak for a global exchange

2015-01-06 Thread Jonathan Vanasco

A very popular design I see is often this:

- PostgreSQL for account, inventory, and transactional data; and all writes
- NoSQL (Redis, Riak, Mongo, etc.) for read-only indexes of the postgres data 
(almost like a read-through cache) and assembled documents


On Jan 5, 2015, at 5:46 PM, Raymond Cote wrote:

 I’m familiar with both PostgreSQL and Riak (1.4, not 2.0). 
 I know that Riak 2.0 now offers strong consistency. Have not yet seen what 
 that does to performance. 
 Big plusses for PostgreSQL:
   - you can do both relational and NOSQL tasks (the Binary JSON in the latest 
 PostgreSQL).
   - well-tested consistency, ACID, etc.
   - lots of adapters and support. 
   - big community
 
 Big plusses for Riak:
  - multi-master replication
  - multi-data center replication
  - easy to scale up
 
 We use PostgreSQL in combination with Riak for data storage (we have a 
 tokenization service). 
 We're currently using the EnterpriseDB multi-master PostgreSQL replication 
 and are quite happy with it. 
 The replication runs periodically, not streaming, so there is at least a 1 
 second delay for replication to occur. 
 Riak replicates quicker — but then you don’t have the strong relational 
 structure on top. 
 
 As mentioned earlier, ‘exchange…trade…asset’ is a bit vague. 
 In addition to just storing things, you’ll need to keep track of all sorts of 
 log-in and contact info — perhaps not ideal for Riak. 
 Probably best to consider precisely what traits your planned application has 
 and then look to match against the database storage. 
 May even end up with a mix of the two just as we have. 
 
 Your decision may also depend on which development language/framework you 
 chose for the implementation. 





Re: [GENERAL] Advice for using integer arrays?

2015-01-06 Thread Rob Sargent

On 01/06/2015 10:09 AM, Michael Heaney wrote:
I'm fairly new to Postgres, and have a design issue for which an array 
of integers might be a good solution.  But I'd like to hear from the 
experts before proceeding down this path.


Essentially, I'm trying to model the relationship between a group of 
biological samples and their genes. Each sample (there are ~10K of 
them now, with more coming) will have about 30,000 genes. Conversely, 
a particular gene may be present in almost all samples.


So I've created the following table to handle the many-to-many 
relationship:


 create table sample_gene (id serial, sample_id int, gene_id int);

which looks like this when populated:

sample_id|gene_id
---
1 |  1
1 |  2
...
1 |30475
2 | 1
2 | 2
...
2 |29973
3 |  1
etc.

The table now contains hundreds of millions of rows (with many, many 
more to come).  Join performance between samples and genes is quite 
slow, even with indexes on sample_id and gene_id.


So it occurred to me: why not eliminate all the duplicate sample_id 
values by storing the gene_id's in an array, like so:


 create table sample_gene_array (id serial, sample_id int, gene_id int 
[] );


So now the table data looks like this:

sample_id|gene_id []
---
1 |  [1:30475]
2 |  [1:29973]
etc.

The new table is significantly smaller, and performance (using ANY[] ) 
is quite good.  Nevertheless, I'm uneasy.  I come from a Sybase ASE 
background, and so have no experience with arrays as datatypes. Is it 
okay to store 30K+ gene values in an array in the linking table (or 
maybe even in the sample table itself, thus eliminating the linking 
table)?  Should I unnest the gene_id's first, before using them to 
join to the gene table?


TIA for any guidance you can provide.  Again, I'm a Postgres neophyte 
- but I'm in awe of the power and flexibility of this database, and 
wish that I'd started using it sooner.


--
Michael Heaney
JCVI





Think genotype calls (sample, marker, call) and this just explodes in 
terms of number of rows, but I'm trying to not let that bother me.  
Partition by chromosome, or chromosome arm (even if the partitioning is just 
separate tables), and the magnitude becomes manageable. At least as 
manageable as multi-gigabyte gvcf files per sample from GATK.
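
For example, 9.4-era inheritance partitioning (a sketch; it assumes a
chromosome column gets added to sample_gene):

  create table sample_gene_chr1 (
      check (chromosome = 1)
  ) inherits (sample_gene);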


But I'm a little confused: all samples for a given species will have the 
same genes (roughly).  Are you storing gene variants (sequence or 
otherwise) per sample?




[GENERAL] Advice for using integer arrays?

2015-01-06 Thread Michael Heaney
I'm fairly new to Postgres, and have a design issue for which an array 
of integers might be a good solution.  But I'd like to hear from the 
experts before proceeding down this path.


Essentially, I'm trying to model the relationship between a group of 
biological samples and their genes. Each sample (there are ~10K of them 
now, with more coming) will have about 30,000 genes. Conversely, a 
particular gene may be present in almost all samples.


So I've created the following table to handle the many-to-many relationship:

 create table sample_gene (id serial, sample_id int, gene_id int);

which looks like this when populated:

sample_id|gene_id
---
1 |  1
1 |  2
...
1 |30475
2 | 1
2 | 2
...
2 |29973
3 |  1
etc.

The table now contains hundreds of millions of rows (with many, many 
more to come).  Join performance between samples and genes is quite 
slow, even with indexes on sample_id and gene_id.


So it occurred to me: why not eliminate all the duplicate sample_id 
values by storing the gene_id's in an array, like so:


 create table sample_gene_array (id serial, sample_id int, gene_id int 
[] );


So now the table data looks like this:

sample_id|gene_id []
---
1 |  [1:30475]
2 |  [1:29973]
etc.

The new table is significantly smaller, and performance (using ANY[] ) 
is quite good.  Nevertheless, I'm uneasy.  I come from a Sybase ASE 
background, and so have no experience with arrays as datatypes. Is it 
okay to store 30K+ gene values in an array in the linking table (or 
maybe even in the sample table itself, thus eliminating the linking 
table)?  Should I unnest the gene_id's first, before using them to join 
to the gene table?
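
(I.e., something like these two variants, if I have the syntax right:

  -- membership test against one sample's array:
  select g.*
  from sample_gene_array sga
  join gene g on g.gene_id = any (sga.gene_id)
  where sga.sample_id = 1;

  -- unnest first, then join:
  select g.*
  from sample_gene_array sga
  cross join lateral unnest(sga.gene_id) as u(gene_id)
  join gene g on g.gene_id = u.gene_id
  where sga.sample_id = 1;

assuming a gene table keyed on gene_id.)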


TIA for any guidance you can provide.  Again, I'm a Postgres neophyte - 
but I'm in awe of the power and flexibility of this database, and wish 
that I'd started using it sooner.


--
Michael Heaney
JCVI







[GENERAL] Correct/optimal DML query for application session management ?

2015-01-06 Thread Tim Smith
Hi,

I'm probably being incredibly stupid and missing something incredibly
simple, but I've got a bit of query-writer's block here!

create table app_sessions(
session_id char(64) unique not null,
user_id char(32) unique not null,
session_start bigint not null,
session_lastactive bigint not null
);


The rules are :
Enforced session timeout after 86400 seconds (1 day)
Last active less than 1 hour ago.

My idea to clean out stale sessions :
delete from app_sessions where extract(epoch from
now()) - session_start > 86400 or session_lastactive <= extract(epoch from
now()) - 3600;

But of course that's going to be a nasty query, so that's why I think
I'm missing something and need a fresh pair of eyes (or a whole
mailing list's worth of eyes !!).

Thanks guys !

