[GENERAL] How to monitor locks (max_pred_locks_per_transaction)?
Hi,

I've got:

  2014-12-28 14:33:23 GMT 553582643 24969 SELECT 53200 63/8298433 54a00a84.6189 1 %ERROR:  out of shared memory
  2014-12-28 14:33:23 GMT 553582643 24969 SELECT 53200 63/8298433 54a00a84.6189 2 %HINT:  You might need to increase max_pred_locks_per_transaction.

Is there any way to predict such an out-of-memory situation (so that
max_pred_locks_per_transaction can be adjusted before some transaction
fails)?

Since we have a lot of transactions at the SERIALIZABLE isolation level,
should it be some count of pg_locks rows with mode = 'AccessExclusiveLock',
or something like that?

The documentation says that this is an average value, and that a single
transaction can lock more objects:

> ... objects (e.g., tables); hence, no more than this many distinct
> objects can be locked at any one time.  This parameter controls the
> average number of object locks allocated for each transaction; individual
> transactions can lock more objects as long as the locks of all
> transactions fit in the lock table.

How can I get the number of 'distinct objects' mentioned here?

WBR, Andrey Lizenko
Re: [GENERAL] Advice for using integer arrays?
On Tue, Jan 6, 2015 at 9:09 AM, Michael Heaney <mhea...@jcvi.org> wrote:

> I'm fairly new to Postgres, and have a design issue for which an array of
> integers might be a good solution.  But I'd like to hear from the experts
> before proceeding down this path.
>
> Essentially, I'm trying to model the relationship between a group of
> biological samples and their genes.  Each sample (there are ~10K of them
> now, with more coming) will have about 30,000 genes.  Conversely, a
> particular gene may be present in almost all samples.  So I've created
> the following table to handle the many-to-many relationship:
>
>   create table sample_gene (id serial, sample_id int, gene_id int);

What is the value of having this table at all?  It doesn't seem to contain
anything informative, like an allele identifier, a resequence, or a copy
number variation.

If you are just trying to record the fact that a gene was present in that
sample, perhaps it would be better to instead record the genes that have
been deleted, rather than the ones that have not been deleted?  That would
probably be a much smaller list.

>   create table sample_gene_array (id serial, sample_id int, gene_id int[]);
>
> So now the table data looks like this:
>
>   sample_id | gene_id[]
>   ---------------------
>   1         | [1:30475]
>   2         | [1:29973]
>   etc.

I'm not familiar with the square bracket and colon as a syntax for
expressing int arrays.  Are you taking liberties with the psql output, or
using a different client program?  Does that represent the range from 1 to
30475, or the two values 1 and 30475?

Cheers,

Jeff
Re: [GENERAL] Correct/optimal DML query for application session management ?
On 1/6/2015 1:56 PM, Tim Smith wrote:
> Hi Andy,
>
> Yeah, the table scan was what worried me.  As for no indexes?  I just
> didn't put the create index statements in my post ... ;-)
>
> Tim
>
> On 6 January 2015 at 18:35, Andy Colson <a...@squeakycode.net> wrote:
>> On 1/6/2015 12:02 PM, Tim Smith wrote:
>>> Hi,
>>>
>>> I'm probably being incredibly stupid and missing something incredibly
>>> simple, but I've got a bit of query-writer's block here!
>>>
>>>   create table app_sessions(
>>>     session_id char(64) unique not null,
>>>     user_id char(32) unique not null,
>>>     session_start bigint not null,
>>>     session_lastactive bigint not null
>>>   );
>>>
>>> The rules are:
>>> - Enforced session timeout after 86400 seconds (1 day)
>>> - Last active less than 1 hour ago.
>>>
>>> My idea to clean out stale sessions:
>>>
>>>   delete from app_sessions
>>>   where extract(epoch from now()) - session_start > 86400
>>>      or session_lastactive <= extract(epoch from now()) - 3600;
>>>
>>> But of course that's going to be a nasty query, so that's why I think
>>> I'm missing something and need a fresh pair of eyes (or a whole
>>> mailing list's worth of eyes!!).
>>>
>>> Thanks guys!
>>
>> I don't see any other way.  Why do you think it'll be so nasty?  Cuz
>> it'll table scan?  You have no indexes, so it doesn't matter what you
>> write, it'll have to scan all rows.
>>
>> How many rows do you expect to have?  500?  1000?  Table scan will be
>> fine.
>>
>> If you wanted to make it more readable, but work the same, you could
>> use timestamptz instead of bigint, and then write:
>>
>>   where current_timestamp - '1 day'::interval > session_start
>>
>> -Andy

Oh, more important: if it's only gonna be 1000 rows or less, then I'd say
drop the indexes and let it table scan.  I bet it'll be faster.

-Andy
Re: [GENERAL] Correct/optimal DML query for application session management ?
On 06 Jan 2015, at 19:02, Tim Smith <randomdev4+postg...@gmail.com> wrote:
>   create table app_sessions(
>     session_id char(64) unique not null,
>     user_id char(32) unique not null,
>     session_start bigint not null,
>     session_lastactive bigint not null
>   );

Just an observation: are you sure that you don't want to allow your users
to have multiple sessions?

You made user_id unique across the entire table, which means there can only
ever be one session per user.  Unless you delete 'expired' sessions ASAP,
users won't be able to start a new session again after their original
session expired.  That may be desirable for your application and thus
intended, but perhaps it was not?

Alban Hertroys
--
If you can't see the forest for the trees,
cut the trees and you'll find there is no forest.
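[For reference, a minimal sketch of the relaxed schema Alban is hinting at,
keeping the char types from the question; the plain index on user_id is an
assumption, not something from the thread:]

  create table app_sessions(
      session_id char(64) primary key,
      user_id    char(32) not null,   -- no longer unique: one row per session
      session_start      bigint not null,
      session_lastactive bigint not null
  );
  -- still fast to look up all sessions belonging to one user
  create index on app_sessions (user_id);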
Re: [GENERAL] Advice for using integer arrays?
On 01/06/2015 01:18 PM, Michael Heaney wrote:
> On 1/6/2015 2:19 PM, Jeff Janes wrote:
>> On Tue, Jan 6, 2015 at 9:09 AM, Michael Heaney <mhea...@jcvi.org> wrote:
>>> I'm fairly new to Postgres, and have a design issue for which an array
>>> of integers might be a good solution.  But I'd like to hear from the
>>> experts before proceeding down this path.
>>>
>>> Essentially, I'm trying to model the relationship between a group of
>>> biological samples and their genes.  Each sample (there are ~10K of
>>> them now, with more coming) will have about 30,000 genes.  Conversely,
>>> a particular gene may be present in almost all samples.  So I've
>>> created the following table to handle the many-to-many relationship:
>>>
>>>   create table sample_gene (id serial, sample_id int, gene_id int);
>>
>> What is the value of having this table at all?  It doesn't seem to
>> contain anything informative, like an allele identifier, a resequence,
>> or a copy number variation.
>>
>> If you are just trying to record the fact that a gene was present in
>> that sample, perhaps it would be better to instead record the genes
>> that have been deleted, rather than the ones that have not been
>> deleted?  That would probably be a much smaller list.
>
> I suppose there could be a gene table which would contain data about each
> gene_id.  But I'm an IT guy, not a biologist, and my sample_gene table
> doesn't actually exist.

Alright, now I am confused.  In your original post you say sample_gene does
exist and you joined it against genes (I assume a gene table), which is
when you encountered slow performance.  Did I miss something :)?

Per a previous suggestion, it might be good to show the queries you are
using or plan to use.  The issue may be in the query, not the layout.
Also, running the query with EXPLAIN ANALYZE would be helpful:

http://www.postgresql.org/docs/9.3/interactive/sql-explain.html

> I'm more concerned with how to deal with many-to-many relationships when
> each parent could have tens of thousands of children.  Collapsing all the
> children into an array for each parent looked intriguing - but maybe it's
> not a good idea.  I just don't know, so I thought I'd ask you guys.
>
>>>   create table sample_gene_array (id serial, sample_id int, gene_id int[]);
>>>
>>> So now the table data looks like this:
>>>
>>>   sample_id | gene_id[]
>>>   ---------------------
>>>   1         | [1:30475]
>>>   2         | [1:29973]
>>>   etc.
>>
>> I'm not familiar with the square bracket and colon as a syntax for
>> expressing int arrays.  Are you taking liberties with the psql output,
>> or using a different client program?  Does that represent the range
>> from 1 to 30475, or the two values 1 and 30475?
>
> Yes, it's shorthand for the full range of values from 1 to 30475.  Wasn't
> sure how to represent it,

Use Python?  Looks like 1-based slice notation.

> and almost went with (1,2,3...30475).  Apologies...
>
> Michael Heaney
> JCVI

--
Adrian Klaver
adrian.kla...@aklaver.com
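[As an illustration of what Adrian is asking for, a hypothetical join of
the kind the thread describes, run under EXPLAIN ANALYZE; table and column
names are assumed from the original post:]

  explain analyze
  select g.*
  from sample_gene sg
  join gene g on g.gene_id = sg.gene_id
  where sg.sample_id = 1;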
Re: [GENERAL] Advice for using integer arrays?
On 1/6/2015 2:19 PM, Jeff Janes wrote:
> On Tue, Jan 6, 2015 at 9:09 AM, Michael Heaney <mhea...@jcvi.org> wrote:
>> I'm fairly new to Postgres, and have a design issue for which an array
>> of integers might be a good solution.  But I'd like to hear from the
>> experts before proceeding down this path.
>>
>> Essentially, I'm trying to model the relationship between a group of
>> biological samples and their genes.  Each sample (there are ~10K of them
>> now, with more coming) will have about 30,000 genes.  Conversely, a
>> particular gene may be present in almost all samples.  So I've created
>> the following table to handle the many-to-many relationship:
>>
>>   create table sample_gene (id serial, sample_id int, gene_id int);
>
> What is the value of having this table at all?  It doesn't seem to
> contain anything informative, like an allele identifier, a resequence, or
> a copy number variation.
>
> If you are just trying to record the fact that a gene was present in that
> sample, perhaps it would be better to instead record the genes that have
> been deleted, rather than the ones that have not been deleted?  That
> would probably be a much smaller list.

I suppose there could be a gene table which would contain data about each
gene_id.  But I'm an IT guy, not a biologist, and my sample_gene table
doesn't actually exist.  I'm more concerned with how to deal with
many-to-many relationships when each parent could have tens of thousands of
children.  Collapsing all the children into an array for each parent looked
intriguing - but maybe it's not a good idea.  I just don't know, so I
thought I'd ask you guys.

>>   create table sample_gene_array (id serial, sample_id int, gene_id int[]);
>>
>> So now the table data looks like this:
>>
>>   sample_id | gene_id[]
>>   ---------------------
>>   1         | [1:30475]
>>   2         | [1:29973]
>>   etc.
>
> I'm not familiar with the square bracket and colon as a syntax for
> expressing int arrays.  Are you taking liberties with the psql output, or
> using a different client program?  Does that represent the range from 1
> to 30475, or the two values 1 and 30475?

Yes, it's shorthand for the full range of values from 1 to 30475.  Wasn't
sure how to represent it, and almost went with (1,2,3...30475).
Apologies...

Michael Heaney
JCVI
Re: [GENERAL] Advice for using integer arrays?
Hi Michael,

I can't comment on the domain-specific stuff, but I recently used numeric
arrays for a project and it worked well.  In my case we had one million
simulation results (floats) per scenario, so rather than reading one
million separate rows to compute a histogram, we stored everything in one
row per scenario.  Ideally one million floats is 8 megabytes, which is big
but still shouldn't require more than 100ms to read from disk and feed into
a simple computation.

Here are some functions I wrote to make it easier/faster to compute stats
from numeric arrays:

https://github.com/pjungwir/aggs_for_arrays/

One caveat is that you will lose a lot of benefits by leaving the
relational model, e.g. being able to filter your integers by other
criteria.  In our case all we had were floats, but imagine if you had
metadata attached to each one, like the time the simulation was run, who
ran it, etc.  Then you'd want to stick with something richer than just an
array of numbers.  But in that case maybe parallel arrays are acceptable.
It would be a bit like a column store inside of Postgres. :-)

I've been meaning to add a function to that GitHub repo to filter an array
given a same-size array of booleans, so you can do filtering like in R or
Pandas, but I haven't found a nice way in Postgres to express e.g.
`filter_array(simulation_results, simulation_run_times > now() - interval
'1 day')`.

Good luck!

Paul

On Tue, Jan 6, 2015 at 1:18 PM, Michael Heaney <mhea...@jcvi.org> wrote:
> On 1/6/2015 2:19 PM, Jeff Janes wrote:
>> On Tue, Jan 6, 2015 at 9:09 AM, Michael Heaney <mhea...@jcvi.org> wrote:
>>> I'm fairly new to Postgres, and have a design issue for which an array
>>> of integers might be a good solution.  But I'd like to hear from the
>>> experts before proceeding down this path.
>>>
>>> Essentially, I'm trying to model the relationship between a group of
>>> biological samples and their genes.  Each sample (there are ~10K of
>>> them now, with more coming) will have about 30,000 genes.  Conversely,
>>> a particular gene may be present in almost all samples.  So I've
>>> created the following table to handle the many-to-many relationship:
>>>
>>>   create table sample_gene (id serial, sample_id int, gene_id int);
>>
>> What is the value of having this table at all?  It doesn't seem to
>> contain anything informative, like an allele identifier, a resequence,
>> or a copy number variation.
>>
>> If you are just trying to record the fact that a gene was present in
>> that sample, perhaps it would be better to instead record the genes
>> that have been deleted, rather than the ones that have not been
>> deleted?  That would probably be a much smaller list.
>
> I suppose there could be a gene table which would contain data about each
> gene_id.  But I'm an IT guy, not a biologist, and my sample_gene table
> doesn't actually exist.  I'm more concerned with how to deal with
> many-to-many relationships when each parent could have tens of thousands
> of children.  Collapsing all the children into an array for each parent
> looked intriguing - but maybe it's not a good idea.  I just don't know,
> so I thought I'd ask you guys.
>
>>>   create table sample_gene_array (id serial, sample_id int, gene_id int[]);
>>>
>>> So now the table data looks like this:
>>>
>>>   sample_id | gene_id[]
>>>   ---------------------
>>>   1         | [1:30475]
>>>   2         | [1:29973]
>>>   etc.
>>
>> I'm not familiar with the square bracket and colon as a syntax for
>> expressing int arrays.  Are you taking liberties with the psql output,
>> or using a different client program?  Does that represent the range
>> from 1 to 30475, or the two values 1 and 30475?
>
> Yes, it's shorthand for the full range of values from 1 to 30475.  Wasn't
> sure how to represent it, and almost went with (1,2,3...30475).
> Apologies...
>
> Michael Heaney
> JCVI

--
Pulchritudo splendor veritatis.
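[As an illustration of the array-per-row computation Paul describes, a
sketch with a hypothetical scenario_results table; plain unnest is used
here, where the aggs_for_arrays functions linked above would replace that
step with a single C-level pass:]

  create table scenario_results (
      scenario_id int primary key,
      results     float8[]    -- ~1 million simulation outputs per scenario
  );

  -- 20 equal-width histogram buckets over [0, 1), computed by unnesting
  select width_bucket(v, 0.0, 1.0, 20) as bucket, count(*)
  from scenario_results, unnest(results) as v
  where scenario_id = 42
  group by bucket
  order by bucket;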
Re: [GENERAL] Correct/optimal DML query for application session management ?
Hi Andy,

Yeah, the table scan was what worried me.  As for no indexes?  I just
didn't put the create index statements in my post ... ;-)

Tim

On 6 January 2015 at 18:35, Andy Colson <a...@squeakycode.net> wrote:
> On 1/6/2015 12:02 PM, Tim Smith wrote:
>> Hi,
>>
>> I'm probably being incredibly stupid and missing something incredibly
>> simple, but I've got a bit of query-writer's block here!
>>
>>   create table app_sessions(
>>     session_id char(64) unique not null,
>>     user_id char(32) unique not null,
>>     session_start bigint not null,
>>     session_lastactive bigint not null
>>   );
>>
>> The rules are:
>> - Enforced session timeout after 86400 seconds (1 day)
>> - Last active less than 1 hour ago.
>>
>> My idea to clean out stale sessions:
>>
>>   delete from app_sessions
>>   where extract(epoch from now()) - session_start > 86400
>>      or session_lastactive <= extract(epoch from now()) - 3600;
>>
>> But of course that's going to be a nasty query, so that's why I think
>> I'm missing something and need a fresh pair of eyes (or a whole mailing
>> list's worth of eyes!!).
>>
>> Thanks guys!
>
> I don't see any other way.  Why do you think it'll be so nasty?  Cuz
> it'll table scan?  You have no indexes, so it doesn't matter what you
> write, it'll have to scan all rows.
>
> How many rows do you expect to have?  500?  1000?  Table scan will be
> fine.
>
> If you wanted to make it more readable, but work the same, you could use
> timestamptz instead of bigint, and then write:
>
>   where current_timestamp - '1 day'::interval > session_start
>
> -Andy
Re: [GENERAL] Correct/optimal DML query for application session management ?
> On 1/6/2015 12:02 PM, Tim Smith wrote:
>> Hi,
>>
>> I'm probably being incredibly stupid and missing something incredibly
>> simple, but I've got a bit of query-writer's block here!
>>
>>   create table app_sessions(
>>     session_id char(64) unique not null,
>>     user_id char(32) unique not null,
>>     session_start bigint not null,
>>     session_lastactive bigint not null
>>   );
>>
>> The rules are:
>> - Enforced session timeout after 86400 seconds (1 day)
>> - Last active less than 1 hour ago.
>>
>> My idea to clean out stale sessions:
>>
>>   delete from app_sessions
>>   where extract(epoch from now()) - session_start > 86400
>>      or session_lastactive <= extract(epoch from now()) - 3600;
>>
>> But of course that's going to be a nasty query, so that's why I think
>> I'm missing something and need a fresh pair of eyes (or a whole mailing
>> list's worth of eyes!!).
>>
>> Thanks guys!
>
> I don't see any other way.  Why do you think it'll be so nasty?  Cuz
> it'll table scan?  You have no indexes, so it doesn't matter what you
> write, it'll have to scan all rows.
>
> How many rows do you expect to have?  500?  1000?  Table scan will be
> fine.
>
> If you wanted to make it more readable, but work the same, you could use
> timestamptz instead of bigint, and then write:
>
>   where current_timestamp - '1 day'::interval > session_start
>
> -Andy

On 1/6/2015 1:56 PM, Tim Smith wrote:
> Hi Andy,
>
> Yeah, the table scan was what worried me.  As for no indexes?  I just
> didn't put the create index statements in my post ... ;-)
>
> Tim

Please don't top post.

With and without indexes is a completely different answer.  See the
performance section at:

https://wiki.postgresql.org/wiki/Guide_to_reporting_problems

-Andy
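[Putting Andy's timestamptz suggestion together with Tim's two rules, the
cleanup would look roughly like this; whether to index both columns is an
assumption, not something settled in the thread:]

  create table app_sessions(
      session_id char(64) unique not null,
      user_id char(32) unique not null,
      session_start      timestamptz not null,
      session_lastactive timestamptz not null
  );
  create index on app_sessions (session_start);
  create index on app_sessions (session_lastactive);

  delete from app_sessions
  where session_start      < current_timestamp - interval '1 day'
     or session_lastactive < current_timestamp - interval '1 hour';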
Re: [GENERAL] How to monitor locks (max_pred_locks_per_transaction)?
Andrey Lizenko <lizenk...@gmail.com> wrote:

>   2014-12-28 14:33:23 GMT 553582643 24969 SELECT 53200 63/8298433 54a00a84.6189 1 %ERROR:  out of shared memory
>   2014-12-28 14:33:23 GMT 553582643 24969 SELECT 53200 63/8298433 54a00a84.6189 2 %HINT:  You might need to increase max_pred_locks_per_transaction.
>
> Is there any way to predict such OOM situation (to adjust
> max_pred_locks_per_transaction before some transaction fails)?
> As far as we have a lot of transaction in SERIALIZABLE isolation level,
> should it be some counts of pg_locks with mode = AccessExclusiveLock or
> something like that?

WHERE mode = 'SIReadLock'

> how can I get number of 'distinct objects' mentioned here?

The total 'SIReadLock' count must be less than
max_pred_locks_per_transaction * max_connections.  The default is small so
that minimal space is reserved for those not using serializable
transactions.  Many people have found that they need to set it to 10 to 20
times the default value.

Due to the heuristics of how multiple fine-grained locks are combined into
coarser-grained locks, it might sometimes be necessary (if you have a lot
of page locks in a lot of tables) to raise max_connections beyond what you
need for actual connections.  I have not actually seen this yet, but it
could happen.  If it does, please share details of the workload and your
settings, so that we can look at possible adjustments to the lock
granularity promotion logic or the memory allocation techniques.

Please note the suggestions on performance in the documentation of
serializable transactions.  In particular, if a transaction will not be
modifying data, setting it to READ ONLY can help a lot.  Not only will it
help performance, but it will tend to reduce the number of predicate locks
needed.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
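[Putting that together, a monitoring sketch: it assumes the capacity is
roughly max_pred_locks_per_transaction * max_connections slots, per the
answer above, so the ratio of the two output columns is what to alert on:]

  select count(*) as siread_locks,
         current_setting('max_pred_locks_per_transaction')::int
         * current_setting('max_connections')::int as approx_capacity
  from pg_locks
  where mode = 'SIReadLock';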
[GENERAL] Inconsistent bgworker behaviour
Hello,

I have been working on a module which launches background workers for a
list of databases provided by a configuration parameter (say m_databases).
This configuration parameter can be edited and reloaded.  The module has a
launcher which manages all the workers it launches; the worker.bgw_notify_pid
of each worker is set to the launcher's pid.  The number of background
workers that can be launched is restricted by max_worker_processes.

Consider the following scenario:

  max_worker_processes = 3
  m_databases = 'db1, db2'

The server is started.  Then m_databases is updated to 'db3, db2' and:

  $ pg_ctl reload

The expected behavior is that the db1 worker should be terminated and a db3
worker should be launched.  However, I found that this behavior is not
consistent.  In a few runs, when the databases parameter is edited and
reloaded, the new worker is launched before the old ones are terminated,
causing an error.

I have used the following code in the launcher to ensure that the old,
unnecessary background workers are terminated before new background workers
are launched for the newly added databases:

  for (i = 0; i < workers_launched; i++)
  {
      if (!is_inlist(new_dblist, workers[i]->dbname))
      {
          /*
           * Ensure that the background worker is terminated before
           * registering new workers, to avoid crossing the limit of
           * max_worker_processes.
           */
          ResetLatch(&MyProc->procLatch);
          TerminateBackgroundWorker(workers[i]->handle);
          WaitLatch(&MyProc->procLatch, WL_LATCH_SET, 0);
      }
  }
  ...
  (launch new workers)

The latch is set when the SIGUSR1 signal is received.  IIUC, the launcher
gets SIGUSR1 when a bgworker process has exited.  No new worker is launched
or terminated in between, yet the code does not work as expected in all
runs.

Any help will be appreciated.

Thank you,

Beena
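[One possible hardening, sketched here as an untested assumption rather
than a confirmed fix: the latch can be set by SIGUSR1 events other than the
worker's exit, so instead of waiting for a single latch wake-up, poll the
worker handle with GetBackgroundWorkerPid (9.4 API) until it actually
reports BGWH_STOPPED:]

  TerminateBackgroundWorker(workers[i]->handle);
  for (;;)
  {
      pid_t pid;
      BgwHandleStatus status = GetBackgroundWorkerPid(workers[i]->handle, &pid);

      /* BGWH_STOPPED is reported once the postmaster has seen the exit */
      if (status == BGWH_STOPPED)
          break;

      /* sleep until the latch is set or 100ms passes, then re-check */
      WaitLatch(&MyProc->procLatch, WL_LATCH_SET | WL_TIMEOUT, 100L);
      ResetLatch(&MyProc->procLatch);
  }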
Re: [GENERAL] Does anyone use pg-pool II on real production? Please help me.
On 6 Jan 2015 03:02, tuanhoanganh <hatua...@gmail.com> wrote:
> Hello everybody.  Does anyone use pg-pool II in real production?

Yes, we have a customer using it in load-balancing mode and another one
using it with Watchdog for high availability.

> How many slaves do you have?  And how large is the database?

In both cases, 1.  But I believe pgpool can support more databases.  The
size of the db should not matter, but we have a 50GB database being used
with pgpool.

> I need to config my old retail system to support ~1500 users with 1000GB
> over 4 years.  I want to try pgpool-II but haven't found a real system
> that uses it.

I guess in addition to using pgpool you should also think about using
pgbouncer as a connection-pooling agent (I would probably install it on the
application server and configure it to connect to pgpool).

> My system uses direct SQL SELECT queries and a lot of pg procedures.

If there is a way you can specify different connection strings for these
two types of queries, then you can make the most out of pgbouncer.  Do you
do any explicit transactions in these procedures?

> Can pgpool-II support load balancing for SELECT my_procedure() ...?

Yes, you can specify those on a white list (provided they are read-only
procedures).

> Please help me.  Thanks in advance.  Sorry for my English.
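[To illustrate the white-list approach, roughly the relevant pgpool.conf
settings; parameter names are as in pgpool-II 3.x, and my_procedure stands
in for the read-only functions from the question:]

  load_balance_mode = on
  # functions listed here are assumed not to write, so SELECTs calling
  # them remain candidates for load balancing to the standbys
  white_function_list = 'my_procedure'
  # functions listed here are assumed to write and force the primary
  black_function_list = 'nextval,setval'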
[GENERAL] BDR Error restarted
Hi Guys,

First of all, BDR is cool; I should have tried it earlier.

Environment: CentOS 6.5, PostgreSQL 9.4.0 with BDR from the yum repository.

I completed the PostgreSQL 9.4 with BDR setup successfully by following the
User Guide and Admin Doc, but hit an issue when restarting the postgresql
service on any of the cluster nodes: rows always go missing from
pg_stat_replication after the service restart, meaning replication
connections are lost.  Normally we have 26 rows (13 databases and 3 servers
in the cluster, so 13x2 rows in pg_stat_replication on each node; yes, lots
of dbs in one cluster).  But after the restart we get some random number of
rows, like 20, or 16, or 4, but not the expected 26; i.e. not all the
replication connections are running.  The logs show replication will "wait
until the node has caught up", but the nodes never catch up again.

BDR settings (real db names replaced):
1. on node 01, the replication source:
2. on node 02, mostly like 03:

Error logs:

Detail on the pgsql relation mentioned above:

Thanks for any help/suggestion on this!

-Dean
Re: [GENERAL] postgresql versus riak for a global exchange
I'm familiar with both PostgreSQL and Riak (1.4, not 2.0).  I know that
Riak 2.0 now offers strong consistency; I have not yet seen what that does
to performance.

Big plusses for PostgreSQL:
- you can do both relational and NoSQL tasks (the binary JSON in the latest
  PostgreSQL)
- well-tested consistency, ACID, etc.
- lots of adapters and support
- big community

Big plusses for Riak:
- multi-master replication
- multi-data-center replication
- easy to scale up

We use PostgreSQL in combination with Riak for data storage (we have a
tokenization service).  We're currently using the EnterpriseDB multi-master
PostgreSQL replication and are quite happy with it.  The replication runs
periodically, not streaming, so there is at least a 1 second delay for
replication to occur.  Riak replicates quicker -- but then you don't have
the strong relational structure on top.

As mentioned earlier, 'exchange...trade...asset' is a bit vague.  In
addition to just storing things, you'll need to keep track of all sorts of
log-in and contact info -- perhaps not ideal for Riak.  Probably best to
consider precisely what traits your planned application has and then look
to match against the database storage.  You may even end up with a mix of
the two, just as we have.  Your decision may also depend on which
development language/framework you choose for the implementation.

--Ray

On Jan 5, 2015, at 11:37 AM, xu xiut <xiut...@gmail.com> wrote:
> Hello, I am looking at creating a toy project which may turn into an
> actual business if I'm lucky; the idea is generally just an exchange for
> people to trade some type of asset.  I'm looking at using either
> PostgreSQL or Riak, and I'm wondering if there are opinions and
> suggestions that someone would be willing to share with me when
> evaluating databases.  This is the first time I've actually considered
> something besides PostgreSQL.  Riak 2.0 now offers strong consistency,
> and I really respect the community and the work that has gone into the
> project.  It seems like it would be easy to replicate across multiple
> data centers.  Thanks for letting me ask this here!
Re: [GENERAL] Advice for using integer arrays?
On Jan 6, 2015 3:12 PM, Michael Heaney <mhea...@jcvi.org> wrote:
> I'm fairly new to Postgres, and have a design issue for which an array of
> integers might be a good solution.  But I'd like to hear from the experts
> before proceeding down this path.
>
> Essentially, I'm trying to model the relationship between a group of
> biological samples and their genes.  Each sample (there are ~10K of them
> now, with more coming) will have about 30,000 genes.  Conversely, a
> particular gene may be present in almost all samples.  So I've created
> the following table to handle the many-to-many relationship:
>
>   create table sample_gene (id serial, sample_id int, gene_id int);
>
> which looks like this when populated:
>
>   sample_id | gene_id
>   -------------------
>   1         |     1
>   1         |     2
>   ...
>   1         | 30475
>   2         |     1
>   2         |     2
>   ...
>   2         | 29973
>   3         |     1
>   etc.
>
> The table now contains hundreds of millions of rows (with many, many more
> to come).  Join performance between samples and genes is quite slow, even
> with indexes on sample_id and gene_id.  So it occurred to me: why not
> eliminate all the duplicate sample_id values by storing the gene_id's in
> an array, like so:
>
>   create table sample_gene_array (id serial, sample_id int, gene_id int[]);
>
> So now the table data looks like this:
>
>   sample_id | gene_id[]
>   ---------------------
>   1         | [1:30475]
>   2         | [1:29973]
>   etc.
>
> The new table is significantly smaller, and performance (using ANY) is
> quite good.  Nevertheless, I'm uneasy.  I come from a Sybase ASE
> background, and so have no experience with arrays as datatypes.  Is it
> okay to store 30K+ gene values in an array in the linking table (or maybe
> even in the sample table itself, thus eliminating the linking table)?
> Should I unnest the gene_id's first, before using them to join to the
> gene table?
>
> TIA for any guidance you can provide.  Again, I'm a Postgres neophyte -
> but I'm in awe of the power and flexibility of this database, and wish
> that I'd started using it sooner.
>
> Michael Heaney
> JCVI

Please provide sample queries so we can understand how you query the data.
Re: [GENERAL] Correct/optimal DML query for application session management ?
On 1/6/2015 12:02 PM, Tim Smith wrote:
> Hi,
>
> I'm probably being incredibly stupid and missing something incredibly
> simple, but I've got a bit of query-writer's block here!
>
>   create table app_sessions(
>     session_id char(64) unique not null,
>     user_id char(32) unique not null,
>     session_start bigint not null,
>     session_lastactive bigint not null
>   );
>
> The rules are:
> - Enforced session timeout after 86400 seconds (1 day)
> - Last active less than 1 hour ago.
>
> My idea to clean out stale sessions:
>
>   delete from app_sessions
>   where extract(epoch from now()) - session_start > 86400
>      or session_lastactive <= extract(epoch from now()) - 3600;
>
> But of course that's going to be a nasty query, so that's why I think I'm
> missing something and need a fresh pair of eyes (or a whole mailing
> list's worth of eyes!!).
>
> Thanks guys!

I don't see any other way.  Why do you think it'll be so nasty?  Cuz it'll
table scan?  You have no indexes, so it doesn't matter what you write,
it'll have to scan all rows.

How many rows do you expect to have?  500?  1000?  Table scan will be fine.

If you wanted to make it more readable, but work the same, you could use
timestamptz instead of bigint, and then write:

  where current_timestamp - '1 day'::interval > session_start

-Andy
Re: [GENERAL] postgresql versus riak for a global exchange
A very popular design I see often is this:

- PostgreSQL for account, inventory, and transactional data, and all writes
- NoSQL (Redis, Riak, Mongo, etc.) for read-only indexes over the Postgres
  data (almost like a read-through cache) and assembled documents

On Jan 5, 2015, at 5:46 PM, Raymond Cote wrote:
> I'm familiar with both PostgreSQL and Riak (1.4, not 2.0).  I know that
> Riak 2.0 now offers strong consistency; I have not yet seen what that
> does to performance.
>
> Big plusses for PostgreSQL:
> - you can do both relational and NoSQL tasks (the binary JSON in the
>   latest PostgreSQL)
> - well-tested consistency, ACID, etc.
> - lots of adapters and support
> - big community
>
> Big plusses for Riak:
> - multi-master replication
> - multi-data-center replication
> - easy to scale up
>
> We use PostgreSQL in combination with Riak for data storage (we have a
> tokenization service).  We're currently using the EnterpriseDB
> multi-master PostgreSQL replication and are quite happy with it.  The
> replication runs periodically, not streaming, so there is at least a 1
> second delay for replication to occur.  Riak replicates quicker -- but
> then you don't have the strong relational structure on top.
>
> As mentioned earlier, 'exchange...trade...asset' is a bit vague.  In
> addition to just storing things, you'll need to keep track of all sorts
> of log-in and contact info -- perhaps not ideal for Riak.  Probably best
> to consider precisely what traits your planned application has and then
> look to match against the database storage.  You may even end up with a
> mix of the two, just as we have.  Your decision may also depend on which
> development language/framework you choose for the implementation.
Re: [GENERAL] Advice for using integer arrays?
On 01/06/2015 10:09 AM, Michael Heaney wrote:
> I'm fairly new to Postgres, and have a design issue for which an array of
> integers might be a good solution.  But I'd like to hear from the experts
> before proceeding down this path.
>
> Essentially, I'm trying to model the relationship between a group of
> biological samples and their genes.  Each sample (there are ~10K of them
> now, with more coming) will have about 30,000 genes.  Conversely, a
> particular gene may be present in almost all samples.  So I've created
> the following table to handle the many-to-many relationship:
>
>   create table sample_gene (id serial, sample_id int, gene_id int);
>
> The table now contains hundreds of millions of rows (with many, many more
> to come).  Join performance between samples and genes is quite slow, even
> with indexes on sample_id and gene_id.  So it occurred to me: why not
> eliminate all the duplicate sample_id values by storing the gene_id's in
> an array, like so:
>
>   create table sample_gene_array (id serial, sample_id int, gene_id int[]);
>
> The new table is significantly smaller, and performance (using ANY) is
> quite good.  Nevertheless, I'm uneasy.  Is it okay to store 30K+ gene
> values in an array in the linking table (or maybe even in the sample
> table itself, thus eliminating the linking table)?  Should I unnest the
> gene_id's first, before using them to join to the gene table?

Think genotype calls (sample, marker, call) and this just explodes in terms
of number of rows, but I'm trying to not let that bother me.  Partition by
chromosome, or chromosome arm (even if the partitions are just separate
tables), and the magnitude becomes manageable -- at least as manageable as
the multi-gigabyte gvcf files per sample from GATK.

But I'm a little confused: all samples for a given species will have the
same genes (roughly).  Are you storing gene variants (sequence or
otherwise) per sample?
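[A sketch of that partition-by-chromosome idea using 9.x table inheritance
(declarative partitioning does not exist yet in 9.4); all names here are
invented for illustration:]

  create table genotype_call (
      sample_id  int  not null,
      marker_id  int  not null,
      chromosome text not null,
      call       text not null
  );

  -- one child per chromosome; the CHECK lets the planner skip the others
  create table genotype_call_chr1 (
      check (chromosome = '1')
  ) inherits (genotype_call);

  create index on genotype_call_chr1 (sample_id, marker_id);
  -- ...and so on for the remaining chromosomes, with a trigger or
  -- application logic routing inserts to the right child table.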
[GENERAL] Advice for using integer arrays?
I'm fairly new to Postgres, and have a design issue for which an array of
integers might be a good solution.  But I'd like to hear from the experts
before proceeding down this path.

Essentially, I'm trying to model the relationship between a group of
biological samples and their genes.  Each sample (there are ~10K of them
now, with more coming) will have about 30,000 genes.  Conversely, a
particular gene may be present in almost all samples.  So I've created the
following table to handle the many-to-many relationship:

  create table sample_gene (id serial, sample_id int, gene_id int);

which looks like this when populated:

  sample_id | gene_id
  -------------------
  1         |     1
  1         |     2
  ...
  1         | 30475
  2         |     1
  2         |     2
  ...
  2         | 29973
  3         |     1
  etc.

The table now contains hundreds of millions of rows (with many, many more
to come).  Join performance between samples and genes is quite slow, even
with indexes on sample_id and gene_id.  So it occurred to me: why not
eliminate all the duplicate sample_id values by storing the gene_id's in an
array, like so:

  create table sample_gene_array (id serial, sample_id int, gene_id int[]);

So now the table data looks like this:

  sample_id | gene_id[]
  ---------------------
  1         | [1:30475]
  2         | [1:29973]
  etc.

The new table is significantly smaller, and performance (using ANY) is
quite good.  Nevertheless, I'm uneasy.  I come from a Sybase ASE
background, and so have no experience with arrays as datatypes.  Is it okay
to store 30K+ gene values in an array in the linking table (or maybe even
in the sample table itself, thus eliminating the linking table)?  Should I
unnest the gene_id's first, before using them to join to the gene table?

TIA for any guidance you can provide.  Again, I'm a Postgres neophyte - but
I'm in awe of the power and flexibility of this database, and wish that I'd
started using it sooner.

--
Michael Heaney
JCVI
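[To make that last question concrete, a sketch of the two access patterns;
the gene table with a gene_id column is assumed from the post, and 12345 is
an arbitrary gene_id:]

  -- membership test against the array column
  select sample_id
  from sample_gene_array
  where 12345 = any(gene_id);

  -- unnesting first, then joining to the gene table
  select g.*
  from sample_gene_array s
  cross join lateral unnest(s.gene_id) as u(gene_id)
  join gene g on g.gene_id = u.gene_id
  where s.sample_id = 1;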
[GENERAL] Correct/optimal DML query for application session management ?
Hi,

I'm probably being incredibly stupid and missing something incredibly
simple, but I've got a bit of query-writer's block here!

  create table app_sessions(
    session_id char(64) unique not null,
    user_id char(32) unique not null,
    session_start bigint not null,
    session_lastactive bigint not null
  );

The rules are:
- Enforced session timeout after 86400 seconds (1 day)
- Last active less than 1 hour ago.

My idea to clean out stale sessions:

  delete from app_sessions
  where extract(epoch from now()) - session_start > 86400
     or session_lastactive <= extract(epoch from now()) - 3600;

But of course that's going to be a nasty query, so that's why I think I'm
missing something and need a fresh pair of eyes (or a whole mailing list's
worth of eyes!!).

Thanks guys!