Re: [HACKERS] writable CTEs
On 29/12/10 03:35, Peter Eisentraut wrote:
> On tis, 2010-12-28 at 00:19 +, Peter Geoghegan wrote:
>> It's worth noting that officially (i.e. in the docs), we don't even call CTEs CTEs at any point. We call them WITH queries. I think that that's a mistake because we call them CTEs everywhere else.
>
> I think "WITH query" or "WITH clause" is more understandable than CTE, which to me is a term that has no relationship with anything else.

Peter's comment certainly resonates with me. When I first heard about this CTE business I had to go to the web to discover that they were components of the WITH clause - which I was familiar with from my DB2 days...

regards

Mark
Re: [HACKERS] pg_primary_conninfo
On Tue, Dec 28, 2010 at 17:43, Guillaume Lelarge guilla...@lelarge.info wrote:
> Le 28/12/2010 17:36, Tom Lane wrote:
>> Guillaume Lelarge guilla...@lelarge.info writes:
>>> Le 28/12/2010 16:34, Tom Lane wrote:
>>>> 1. It'll have to be restricted to superusers, therefore ordinary users on the slave can't actually make use of it.
>>> pgAdmin's users usually connect as superusers. It would be a function for DBAs, of course. I don't see why normal users would be interested in it, really.
>>>> 2. It's not what you want, since you don't want to connect as the replication user. Therefore, you'd have to start by parsing out the parts you do need. Expecting every client to include conninfo parsing logic doesn't seem cool to me.
>> I can see the point of, say, a primary_host_address() function returning inet, which would be way better on both those dimensions than the current proposal. But I'm not sure what else would be needed.
> Yeah, it would be better that way. I'm actually interested in Magnus's patch because, during the 9.0 development phase, I had in mind to parse the primary_conninfo till I found I could not get this value with SHOW or current_setting(). But, actually, what I really need is host and port. This way, I could connect to the master node with the same user and password that was used on the slave node.

I agree it might well be more useful to have it split up for us. We'd need the host name (though it would have to be text and not inet, since we'd need the unix socket path for a local connection) and port. And username. But certainly not password, and probably none of the other parameters.

-- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_primary_conninfo
On Tue, Dec 28, 2010 at 18:12, Robert Haas robertmh...@gmail.com wrote:
> On Dec 28, 2010, at 10:34 AM, Tom Lane t...@sss.pgh.pa.us wrote:
>> I'm still wondering what's the actual use-case for exposing this inside SQL. Those with a legitimate need-to-know can look at the slave server's config files, no?
> SQL access is frequently more convenient, though.

Yes. Reading it in the files does not scale with $LOTS of servers, be they slaves or masters or both. You can't assume that people have direct filesystem access to the server (or at least its data directory) - particularly when the organisation is large enough that you have different teams running the db's and the OS's, not to mention when you have some on-call group who verifies the things in the middle of the night... Unless you mean reading them with pg_read_file() and then parsing it manually, but that just requires everybody to re-invent the wheel we already have in the parser.

> Although maybe now that we've made recovery.conf use the GUC lexer we oughta continue in that vein and expose those parameters as PGC_INTERNAL GUCs rather than inventing a new function for it...

That's definitely another option that I wouldn't object to if people prefer that way.

-- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
Re: [HACKERS] pg_primary_conninfo
On 29.12.2010 10:36, Magnus Hagander wrote:
> On Tue, Dec 28, 2010 at 18:12, Robert Haas robertmh...@gmail.com wrote:
>> Although maybe now that we've made recovery.conf use the GUC lexer we oughta continue in that vein and expose those parameters as PGC_INTERNAL GUCs rather than inventing a new function for it...
> That's definitely another option that I wouldn't object to if people prefer that way.

I recall from previous discussions that we have a consensus that we should unite recovery.conf and postgresql.conf, so that they're all GUCs and you can put all the settings in postgresql.conf. Let's do that. http://archives.postgresql.org/pgsql-hackers/2010-10/msg00033.php

-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] and it's not a bunny rabbit, either
On 29.12.2010 06:54, Robert Haas wrote:
> With the patch:
>
> rhaas=# cluster v;
> ERROR: views do not support CLUSTER

"do not support" sounds like a missing feature, rather than a nonsensical command. How about something like "CLUSTER cannot be used on views"?

The patch changes a bunch of heap_openrv() calls to relation_openrv(). Perhaps it would be better to make the error message something like "\"%s\" is not a table", and keep the callers unchanged. It's not particularly useful to repeat the command in the error message; the user should know what command he issued. Even if it's buried deep in a PL/pgSQL function or something, it should be clear from the context lines.

-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] writable CTEs
On 2010-12-29 09:16, Mark Kirkwood wrote:
> On 29/12/10 03:35, Peter Eisentraut wrote:
>> On tis, 2010-12-28 at 00:19 +, Peter Geoghegan wrote:
>>> It's worth noting that officially (i.e. in the docs), we don't even call CTEs CTEs at any point. We call them WITH queries. I think that that's a mistake because we call them CTEs everywhere else.
>> I think "WITH query" or "WITH clause" is more understandable than CTE, which to me is a term that has no relationship with anything else.
> Peter's comment certainly resonates with me. When I first heard about this CTE business I had to go to the web to discover that they were components of the WITH clause - which I was familiar with from my DB2 days...

For me it was the converse... I first heard of Common Table Expressions from SQL Server users, at roughly the same time that CTEs were introduced in 8.4. When I decided to use them, it took me a while to figure out that the docs referred to them as WITH queries. ISTM we're already past the choice to have a single name. IMHO it would be best if the documentation had a reference / index part in which both WITH queries and Common Table Expressions (CTE) are listed. Also, the terms CTE and CTEScan appear in EXPLAIN output; it would be nice to have a meaningful hit when looking for the term on the documentation page, instead of 'Your search for *cte* returned no hits.'

regards, Yeb Havinga
Re: [HACKERS] Streaming replication as a separate permissions
On Tue, Dec 28, 2010 at 13:05, Magnus Hagander mag...@hagander.net wrote:
> On Mon, Dec 27, 2010 at 22:53, Magnus Hagander mag...@hagander.net wrote:
>> On Mon, Dec 27, 2010 at 22:42, Tom Lane t...@sss.pgh.pa.us wrote:
>>> Magnus Hagander mag...@hagander.net writes:
>>>> Updated patch, still pending docs, but otherwise updated: allow start/stop backup, make sure only superuser can turn on/off the flag, include in system views, show properly in psql.
>>> I'd suggest avoiding creating the static cache variable AuthenticatedUserIsReplicationRole. This can't possibly be sufficiently interesting from a performance point of view to justify the risks associated with stale cache values. Just look up the pg_authid syscache entry when needed, ie, treat it more like rolcreaterole than rolsuper.
>> Sure, I actually had it that way first. But doing it this way was less code. But I realize I should've revisited that decision when I made the change to pg_start_backup and pg_stop_backup - before that the checks would only happen during a very short window of time at the start of the connection, but now it can happen later..
>>> BTW, you forgot pg_dumpall support.
>> Gah. I knew that, but somehow dropped it from my TODO. Thanks for the reminder!
> Ok, here's an updated patch that does both these and includes documentation and regression test changes. With that, I think we're good to go.

I've applied this version (with some minor typo-fixes).

-- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
[HACKERS] pg_streamrecv for 9.1?
Would people be interested in putting pg_streamrecv (http://github.com/mhagander/pg_streamrecv) in bin/ or contrib/ for 9.1? I think it would make sense to do so.

It could/should then also become the default tool for doing base-backup-over-libpq, assuming me or Heikki (or somebody else) finishes off the patch for that before 9.1. We need a tool of some kind for that if we add the functionality, after all...

What do people think - is there interest in that, or is it better off being an outside tool?

-- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
Re: [HACKERS] and it's not a bunny rabbit, either
On Wed, Dec 29, 2010 at 4:09 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote:
> On 29.12.2010 06:54, Robert Haas wrote:
>> With the patch:
>>
>> rhaas=# cluster v;
>> ERROR: views do not support CLUSTER
> "do not support" sounds like a missing feature, rather than a nonsensical command. How about something like "CLUSTER cannot be used on views"

I'm fine with flipping the ordering around. I think I like it marginally better this way, but you and Tom both seem to prefer the opposite ordering, ergo so be it (barring a sudden influx of contrary votes).

> The patch changes a bunch of heap_openrv() calls to relation_openrv(). Perhaps it would be better to make the error message something like "\"%s\" is not a table", and keep the callers unchanged. It's not particularly useful to repeat the command in the error message; the user should know what command he issued. Even if it's buried deep in a PL/pgSQL function or something, it should be clear from the context lines.

Did you read the whole thread?

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] pg_streamrecv for 9.1?
On Wed, Dec 29, 2010 at 5:47 AM, Magnus Hagander mag...@hagander.net wrote:
> Would people be interested in putting pg_streamrecv (http://github.com/mhagander/pg_streamrecv) in bin/ or contrib/ for 9.1? I think it would make sense to do so. It could/should then also become the default tool for doing base-backup-over-libpq, assuming me or Heikki (or somebody else) finishes off the patch for that before 9.1. We need a tool for that of some kind if we add the functionality, after all... What do people think - is there interest in that, or is it better off being an outside tool?

+1 for including it. If it's reasonably mature, +1 for bin rather than contrib.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] writable CTEs
On Wed, Dec 29, 2010 at 4:19 AM, Yeb Havinga yebhavi...@gmail.com wrote:
> Also, the terms CTE and CTEScan appear in EXPLAIN output; it would be nice to have a meaningful hit when looking for the term on the documentation page, instead of 'Your search for cte returned no hits.'

This is an excellent point.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Fwd: new patch of MERGE (merge_204) a question about duplicated ctid
Erik Rijkers wrote:
> I get some whitespace-warnings, followed by error:
>
> $ git apply /home/rijkers/download/pgpatches/0091/merge/20101206/merge_204_2010DEC06.patch
> /home/rijkers/download/pgpatches/0091/merge/20101206/merge_204_2010DEC06.patch:481: trailing whitespace.
> /home/rijkers/download/pgpatches/0091/merge/20101206/merge_204_2010DEC06.patch:482: trailing whitespace. if (IsA(plan, ModifyTable)
> /home/rijkers/download/pgpatches/0091/merge/20101206/merge_204_2010DEC06.patch:550: trailing whitespace. /*print the action qual*/
> /home/rijkers/download/pgpatches/0091/merge/20101206/merge_204_2010DEC06.patch:556: trailing whitespace. (act_plan->operation == CMD_INSERT ||
> /home/rijkers/download/pgpatches/0091/merge/20101206/merge_204_2010DEC06.patch:560: trailing whitespace.
> error: patch failed: src/backend/optimizer/plan/planner.c:739
> error: src/backend/optimizer/plan/planner.c: patch does not apply

Maybe I'm doing something wrong, but I've never had good luck with git apply. I took this patch and applied it to the 12/15 copy of HEAD I had checked out (trying to minimize drift in there since the patch was created) using:

patch -p1 < merge_204_2010DEC06.patch

There was one trivial conflict, for which it produced src/backend/optimizer/plan/planner.c.rej, and that fix was straightforward to apply by hand. The result is now sitting as the merge204 branch in my github repo: https://github.com/greg2ndQuadrant/postgres/tree/merge204 if you did want to try this out.

-- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us
Re: [HACKERS] pg_streamrecv for 9.1?
On 29-12-2010 07:47, Magnus Hagander wrote:
> Would people be interested in putting pg_streamrecv (http://github.com/mhagander/pg_streamrecv) in bin/ or contrib/ for 9.1? I think it would make sense to do so.

+1 but...

> It could/should then also become the default tool for doing base-backup-over-libpq, assuming me or Heikki (or somebody else) finishes off the patch for that before 9.1.

I think that the base backup feature is more important than simply streaming chunks of the WAL (SR already does this). Talking about the base backup over libpq, it is something we should implement to satisfy the people who ask for an easy replication setup. IIRC, Dimitri already coded a base backup over libpq tool [1], but it is written in Python.

[1] https://github.com/dimitri/pg_basebackup/

-- Euler Taveira de Oliveira http://www.timbira.com/
[HACKERS] Re: new patch of MERGE (merge_204) a question about duplicated ctid
I did some basic testing of the latest update here, but quickly hit a problem that wasn't in the previous version. Attached is the standalone test script that used to work, but now fails like this:

psql:simple.sql:12: ERROR: the vars in merge action tlist of qual should only belongs to the source table or target table

This test case is intended to implement the common UPSERT situation that is one of the main requests that MERGE is intended to satisfy, using this syntax:

MERGE INTO Stock t
 USING (VALUES(10,100)) AS s(item_id,balance)
 ON s.item_id=t.item_id
 WHEN MATCHED THEN UPDATE SET balance=t.balance + s.balance
 WHEN NOT MATCHED THEN INSERT VALUES(s.item_id,s.balance);

If you can suggest an alternate way to express this that works with the new patch, I might switch to that and retry. I was never 100% sure this was the right way to write this, and I don't have another database with MERGE support here to try against. (Aside: if someone else does, I'd be really curious to see if the attached test case works or not on another database system. I think we need to include compatibility testing with other MERGE implementations into the test mix here soon.)

Regardless, this failure suggests that you need to add this sort of test to the regression test set. We need to have an example of an UPSERT using constant data in there to make sure this continues to work in the future. This is a good week for me in terms of having time for PostgreSQL hacking, so if you can suggest something here or update the patch I'll try it soon afterwards.
-- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us

DROP TABLE Stock;
CREATE TABLE Stock(item_id int UNIQUE, balance int);
INSERT INTO Stock VALUES (10, 2200);
INSERT INTO Stock VALUES (20, 1900);
SELECT * FROM Stock ORDER BY item_id;

MERGE INTO Stock t
 USING (VALUES(10,100)) AS s(item_id,balance)
 ON s.item_id=t.item_id
 WHEN MATCHED THEN UPDATE SET balance=t.balance + s.balance
 WHEN NOT MATCHED THEN INSERT VALUES(s.item_id,s.balance);

SELECT * FROM Stock ORDER BY item_id;

MERGE INTO Stock t
 USING (VALUES(30,2000)) AS s(item_id,balance)
 ON s.item_id=t.item_id
 WHEN MATCHED THEN UPDATE SET balance=t.balance + s.balance
 WHEN NOT MATCHED THEN INSERT VALUES(s.item_id,s.balance);

SELECT * FROM Stock ORDER BY item_id;
Re: [HACKERS] Why is sorting on two columns so slower than sorting on one column?
On Mon, Dec 27, 2010 at 3:58 AM, Simon Riggs si...@2ndquadrant.com wrote:
> I think the answer is that only the first column comparison is optimised. Second and subsequent comparisons are not optimised.

What sort of optimization are you referring to here?

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] small table left outer join big table
On Tue, Dec 28, 2010 at 5:13 AM, Jie Li jay23j...@gmail.com wrote:
> Hi, Please see the following plan:
>
> postgres=# explain select * from small_table left outer join big_table using (id);
>                               QUERY PLAN
> Hash Left Join  (cost=126408.00..142436.98 rows=371 width=12)
>   Hash Cond: (small_table.id = big_table.id)
>   ->  Seq Scan on small_table  (cost=0.00..1.09 rows=9 width=8)
>   ->  Hash  (cost=59142.00..59142.00 rows=410 width=8)
>         ->  Seq Scan on big_table  (cost=0.00..59142.00 rows=410 width=8)
> (5 rows)
>
> Here I have a puzzle, why not choose the small table to build the hash table? It can avoid multiple batches thus save significant I/O cost, isn't it?

Yeah, you'd think. Can you post a full reproducible test case?

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] small table left outer join big table
On Wed, 2010-12-29 at 07:17 -0500, Robert Haas wrote:
>> Here I have a puzzle, why not choose the small table to build the hash table? It can avoid multiple batches thus save significant I/O cost, isn't it?
> Yeah, you'd think. Can you post a full reproducible test case?

It's not a bug, that's the way it currently works. We don't need a test case for that.

I agree that the optimisation would be a useful one. It allows you to ask the query "Show me sales for each of my stores" efficiently, rather than being forced to request the inner join query "Show me the sales for each of my stores for which there have been sales", which is a much less useful query.

-- Simon Riggs http://www.2ndQuadrant.com/books/ PostgreSQL Development, 24x7 Support, Training and Services
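[Editor's note: for readers who want to reproduce the plan shape discussed in this thread, a minimal sketch along these lines should do it; the table names come from the thread, but the column definitions and row counts are invented for illustration.]

```sql
-- Hypothetical reproduction of the scenario above: a tiny outer table
-- left-joined to a large inner table. Because it is a LEFT OUTER JOIN,
-- the planner hashes the nullable (big) side to preserve outer-join
-- semantics, so the small table cannot simply be chosen as the hash side.
CREATE TABLE small_table (id int, val int);
CREATE TABLE big_table (id int, val int);

INSERT INTO small_table SELECT g, g FROM generate_series(1, 9) g;
INSERT INTO big_table SELECT g % 1000, g FROM generate_series(1, 1000000) g;

ANALYZE small_table;
ANALYZE big_table;

EXPLAIN SELECT * FROM small_table LEFT OUTER JOIN big_table USING (id);
```

With statistics like these, the plan should show a Hash Left Join with big_table under the Hash node, matching the plan posted at the start of the thread.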
Re: [HACKERS] pg_streamrecv for 9.1?
On Wed, Dec 29, 2010 at 13:03, Euler Taveira de Oliveira eu...@timbira.com wrote:
> On 29-12-2010 07:47, Magnus Hagander wrote:
>> Would people be interested in putting pg_streamrecv (http://github.com/mhagander/pg_streamrecv) in bin/ or contrib/ for 9.1? I think it would make sense to do so.
> +1 but...
>> It could/should then also become the default tool for doing base-backup-over-libpq, assuming me or Heikki (or somebody else) finishes off the patch for that before 9.1.
> I think that the base backup feature is more important than simply streaming chunks of the WAL (SR already does this). Talking about the base backup over libpq, it is something we should implement to satisfy the people who ask for an easy replication setup.

Yes, definitely. But that also needs server-side support.

> IIRC, Dimitri already coded a base backup over libpq tool [1] but it is written in Python.

Yeah, the WIP patch Heikki posted is similar, except it uses tar format and is implemented natively in the backend with no need for pl/pythonu to be installed.

-- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
Re: [HACKERS] small table left outer join big table
Excerpts from Robert Haas's message of Wed Dec 29 09:17:17 -0300 2010:
> On Tue, Dec 28, 2010 at 5:13 AM, Jie Li jay23j...@gmail.com wrote:
>> Hi, Please see the following plan:
>>
>> postgres=# explain select * from small_table left outer join big_table using (id);
>>                               QUERY PLAN
>> Hash Left Join  (cost=126408.00..142436.98 rows=371 width=12)
>>   Hash Cond: (small_table.id = big_table.id)
>>   ->  Seq Scan on small_table  (cost=0.00..1.09 rows=9 width=8)
>>   ->  Hash  (cost=59142.00..59142.00 rows=410 width=8)
>>         ->  Seq Scan on big_table  (cost=0.00..59142.00 rows=410 width=8)
>> (5 rows)
>>
>> Here I have a puzzle, why not choose the small table to build the hash table? It can avoid multiple batches thus save significant I/O cost, isn't it?
> Yeah, you'd think. Can you post a full reproducible test case?

Also, what version is this?

-- Álvaro Herrera alvhe...@commandprompt.com The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Re: [HACKERS] Re: new patch of MERGE (merge_204) a question about duplicated ctid
On 2010-12-29 2:14 PM, Greg Smith wrote:
> MERGE INTO Stock t
>  USING (VALUES(10,100)) AS s(item_id,balance)
>  ON s.item_id=t.item_id
>  WHEN MATCHED THEN UPDATE SET balance=t.balance + s.balance
>  WHEN NOT MATCHED THEN INSERT VALUES(s.item_id,s.balance);
>
> If you can suggest an alternate way to express this that works with the new patch, I might switch to that and retry. I was never 100% sure this was the right way to write this, and I don't have another database with MERGE support here to try against.

As far as I can tell, this should work. I played around with the patch and the problem seems to be the VALUES:

MERGE INTO Stock t
 USING (SELECT 30, 2000) AS s(item_id,balance)
 ON s.item_id=t.item_id
 WHEN MATCHED THEN UPDATE SET balance=t.balance + s.balance
 WHEN NOT MATCHED THEN INSERT VALUES(s.item_id,s.balance);
MERGE 1

Regards, Marko Tiikkaja
[HACKERS] Avoiding rewrite in ALTER TABLE ALTER TYPE
ALTER TABLE ALTER TYPE always rewrites the table heap and its indexes. In some cases, we can determine that doing so is unhelpful, and that the conversion shall always succeed:

CREATE DOMAIN loosedom AS text;
CREATE TABLE t (c varchar(2));
ALTER TABLE t ALTER c TYPE varchar(4);
ALTER TABLE t ALTER c TYPE text;
ALTER TABLE t ALTER c TYPE loosedom;

In other cases, we can determine that the rewrite is unhelpful, but a cast could still throw an error:

CREATE DOMAIN tightdom AS text CHECK (value LIKE '%/');
CREATE TABLE t (c text);
ALTER TABLE t ALTER c TYPE xml USING c::xml;
ALTER TABLE t ALTER c TYPE varchar(64);
ALTER TABLE t ALTER c TYPE tightdom;

I wish to replace table rewrites with table verification scans where possible, then skip those verification scans where possible. I've noted three subproblems that require distinct treatment. In the following discussion, "tuple" is the tuple on disk, and "tuple'" is the tuple we would form during a rewriting ALTER TABLE ALTER TYPE. Likewise "col" and "col'" for individual columns.

1. The table heap rewrite itself

To legitimately leave the table heap intact, we must confirm datumIsEqual(col, col') for every tuple and column. Without any specific configuration to suggest when this might be true, we can always determine its truth through a _verification scan_ of the table. If the datumIsEqual test fails for any tuple during the verification scan, terminate the scan and proceed with a table rewrite.

When should we attempt the verification scan in the first place, as opposed to proceeding directly to a rewrite? For some time, I thought it best to automatically attempt it when the type change degenerates to a single binary coercion. Cases like text->xml and timestamp->timestamptz [timezone=UTC] are more difficult to identify mechanically, so I supposed to let the user identify them via a new keyword to ALTER TABLE ALTER TYPE. Having thought on it more, though, it actually seems best to attempt the verification scan *every* time.
In most ineligible conversions, an inequality will appear very early, so the scan is effectively O(1) in the negative case. A notable exception is something like char(6)->varchar(6) in a table with a billion tuples having length(col) = 6 and one with length(col) = 5. The verification scan might read most of the table before finding the one tuple that forces a rewrite. That isn't a particularly regular scenario in my experience, so the "just do the right thing" aspect of preceding every potential rewrite with a verification scan seems to win out.

Certain very popular type changes (see introduction) can be _exempt_ from the verification scan: we can determine that they will always succeed. To capture that, I propose extending CREATE CAST with the notion of an exemptor function:

CREATE CAST (source_type AS target_type)
    { WITH FUNCTION function_name (argument_type [, ...]) [ WITH EXEMPTOR function_name ] |
      WITHOUT FUNCTION |
      WITH INOUT }
    [ AS ASSIGNMENT | AS IMPLICIT ]

The exemptor shall have this signature:

exemptor_func(
    integer, -- source_typmod
    integer  -- dest_typmod
) RETURNS boolean

The exemptor shall return true iff datumIsEqual(x, x::target_type(dest_typmod)) for every x in source_type most recently coerced to source_type(source_typmod). When the cast is WITHOUT FUNCTION (a binary coercion) and target_type lacks a length coercion cast, the cast has an implicit exemption, and an exemptor is superfluous: code can assume an exemptor that always returns true. Use of WITH EXEMPTOR mainly makes sense alongside multi-arg WITH FUNCTION (should probably raise an error upon other uses).

When ATPrepAlterColumnType determines that a transformation expression is merely a series of casts (or perhaps some stronger condition), and every such cast has an exemptor or implicit exemption, it will call those exemptors with the old and new typmods.
If the exemptors all return true and this continues to hold for all other ALTER TYPE subcommands of this ALTER TABLE, we will skip the verification scan and table rewrite. I will add exemptors to length coercion casts of types other than character and bit, which are exempt only in the no-change case. The candidates that are not pure length coercions, int8->bit and int4->bit, would not benefit.

2. CHECK constraints, index predicates, and expression indexes

CHECK constraints, index predicates, and indexed expressions can change behavior without limitation when the types of columns they reference change. An exempt typmod change alone will not affect them. While there are some other safe cases (CHECK (length(col) <= 2) over a varchar->text conversion), I don't have a general framework in mind for identifying them. For now, an exempt change that includes a type change, such as varchar(2)->text, will still require a verification scan whenever a CHECK constraint, index predicate, or indexed expression references the column. The verification scan will confirm
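[Editor's note: to make the exemptor idea concrete, here is a hypothetical sketch of what an exemptor for varchar's length coercion might look like. The WITH EXEMPTOR syntax and the function below come from the proposal above and do not exist in released PostgreSQL; the function name and its expression as a SQL-level function are purely illustrative - a real implementation would be a C-level callback.]

```sql
-- Hypothetical exemptor for varchar length coercion, per the proposal:
-- return true when every value satisfying the old typmod necessarily
-- satisfies the new one, i.e. the new limit is unconstrained (-1) or at
-- least as large as the old one, so the on-disk datum cannot change.
CREATE FUNCTION varchar_exemptor(source_typmod integer,
                                 dest_typmod integer)
RETURNS boolean AS $$
SELECT dest_typmod = -1
       OR (source_typmod <> -1 AND dest_typmod >= source_typmod);
$$ LANGUAGE sql IMMUTABLE;

-- Under the proposed grammar it would be attached roughly like:
-- CREATE CAST (varchar AS varchar)
--     WITH FUNCTION varchar(varchar, integer, boolean)
--     WITH EXEMPTOR varchar_exemptor;
```

Under this sketch, varchar(2)->varchar(4) and varchar(2)->varchar (unconstrained) would skip both the rewrite and the verification scan, while varchar(4)->varchar(2) would not be exempt and would fall back to the verification scan.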
Re: [HACKERS] small table left outer join big table
On Wed, Dec 29, 2010 at 7:34 AM, Simon Riggs si...@2ndquadrant.com wrote:
> On Wed, 2010-12-29 at 07:17 -0500, Robert Haas wrote:
>>> Here I have a puzzle, why not choose the small table to build the hash table? It can avoid multiple batches thus save significant I/O cost, isn't it?
>> Yeah, you'd think. Can you post a full reproducible test case?
> It's not a bug, that's the way it currently works. We don't need a test case for that. I agree that the optimisation would be a useful one. It allows you to ask the query "Show me sales for each of my stores" efficiently, rather than being forced to request the inner join query "Show me the sales for each of my stores for which there have been sales", which is a much less useful query.

Oh, you're right. I missed the fact that it's a left join.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] pg_dump --split patch
On Wed, Dec 29, 2010 at 2:27 AM, Joel Jacobson j...@gluefinance.com wrote:
> [description of split stuff]

So, how different (or not) is this to the directory format that was coming out of the desire of a parallel pg_dump?

a.

-- Aidan Van Dyk Create like a god, ai...@highrise.ca command like a king, http://www.highrise.ca/ work like a slave.
Re: [HACKERS] pg_dump --split patch
2010/12/29 Aidan Van Dyk ai...@highrise.ca:
> On Wed, Dec 29, 2010 at 2:27 AM, Joel Jacobson j...@gluefinance.com wrote:
>> [description of split stuff]
> So, how different (or not) is this to the directory format that was coming out of the desire of a parallel pg_dump?

Not sure what format you are referring to? Custom, tar or plain text? I noticed there are two undocumented formats as well, append and file. I tried both of these undocumented formats, but neither produced any directory structure of the dumped objects. Could you please explain how to use the directory format, if such a format already exists? I can't find it in the documentation or the source code of HEAD.

-- Best regards, Joel Jacobson Glue Finance E: j...@gluefinance.com T: +46 70 360 38 01 Postal address: Glue Finance AB Box 549 114 11 Stockholm Sweden Visiting address: Glue Finance AB Birger Jarlsgatan 14 114 34 Stockholm Sweden
Re: [HACKERS] Streaming replication as a separate permissions
On Wed, Dec 29, 2010 at 5:09 AM, Magnus Hagander mag...@hagander.net wrote:
> Ok, here's an updated patch that does both these and includes documentation and regression test changes. With that, I think we're good to go. I've applied this version (with some minor typo-fixes).

Do you think we could have worded this a bit better:

<entry>Prepare for performing on-line backup (restricted to superusers or replication roles)</entry>

to say 'restricted to superusers _and_ replication roles'? Saying 'restricted to superusers _or_ replication roles' may mean that at any time we allow one or the other, but not both (the reader might assume that the decision is based on some other GUC). Using 'and' would mean that we allow it for both of those roles.

Any specific reason NOREPLICATION_P and REPLICATION_P use the _P suffix? AIUI, that suffix is used in gram.y to tag a token to mean it belongs to the Parser, and to avoid conflict with the same token elsewhere; NULL_P is a good example.

In pg_authid.h, 8 spaces are used between 'bool' and 'rolreplication'; instead, tabs should have been used as in the surrounding code.

Regards, -- gurjeet.singh @ EnterpriseDB - The Enterprise Postgres Company http://www.EnterpriseDB.com singh.gurj...@{ gmail | yahoo }.com Twitter/Skype: singh_gurjeet Mail sent from my BlackLaptop device
Re: [HACKERS] pg_dump --split patch
On Wed, Dec 29, 2010 at 8:31 AM, Joel Jacobson j...@gluefinance.com wrote: 2010/12/29 Aidan Van Dyk ai...@highrise.ca On Wed, Dec 29, 2010 at 2:27 AM, Joel Jacobson j...@gluefinance.com wrote: description of split stuff So, how different (or not) is this to the directory format that was coming out of the desire of a parallel pg_dump? Not sure what format you are referring to? Custom, tar or plain text? I noticed there are two undocumented formats as well, append and file. I tried both of these undocumented formats, but it did not procude any directory structure of the dumped objects. Could you please explain how to use the directory format is such a format already exists? I can't find it in the documentation nor the source code of HEAD. It is still being discussed as a patch to pg_dump. Google for directory archive format for pg_dump, specifically in archives.postgresql.org. AFAIK, that applies to parallel dumps of data (may help in --schema-only dumps too), and what you are trying is for schema. Regards -- gurjeet.singh @ EnterpriseDB - The Enterprise Postgres Company http://www.EnterpriseDB.com singh.gurj...@{ gmail | yahoo }.com Twitter/Skype: singh_gurjeet Mail sent from my BlackLaptop device
Re: [HACKERS] pg_dump --split patch
On Wed, Dec 29, 2010 at 9:11 AM, Gurjeet Singh singh.gurj...@gmail.com wrote: On Wed, Dec 29, 2010 at 8:31 AM, Joel Jacobson j...@gluefinance.com wrote: 2010/12/29 Aidan Van Dyk ai...@highrise.ca On Wed, Dec 29, 2010 at 2:27 AM, Joel Jacobson j...@gluefinance.com wrote: description of split stuff So, how different (or not) is this to the directory format that was coming out of the desire of a parallel pg_dump? Not sure what format you are referring to? Custom, tar or plain text? I noticed there are two undocumented formats as well, append and file. I tried both of these undocumented formats, but it did not procude any directory structure of the dumped objects. Could you please explain how to use the directory format is such a format already exists? I can't find it in the documentation nor the source code of HEAD. It is still being discussed as a patch to pg_dump. Google for directory archive format for pg_dump, specifically in archives.postgresql.org. Specifically: Message-ID: aanlktimueltxwrsqdqnwxik_k1y3ych1u-9nghzqp...@mail.gmail.com AFAIK, that applies to parallel dumps of data (may help in --schema-only dumps too), and what you are trying is for schema. Right, but one of the things it does is break the dump into parts, and put them in a directory/file organization. Both are doing it for different reasons, but doing pretty much the same thing. But can the layout/organization of Joachim's patch be made human-friendly in the vein of Joel's vision? a. -- Aidan Van Dyk Create like a god, ai...@highrise.ca command like a king, http://www.highrise.ca/ work like a slave. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Streaming replication as a separate permissions
On Wed, Dec 29, 2010 at 15:05, Gurjeet Singh singh.gurj...@gmail.com wrote: On Wed, Dec 29, 2010 at 5:09 AM, Magnus Hagander mag...@hagander.net wrote: Ok, here's an updated patch that does both these and includes documentation and regression test changes. With that, I think we're good to go. I've applied this version (with some minor typo-fixes). Do you think we could have worded these a bit better entryPrepare for performing on-line backup (restricted to superusers or replication roles)/entry to say 'restricted to superusers _and_ replication roles'. Saying 'restricted to superusers _or_ replication roles' may mean that at any time we allow one or the other, but not both (reader might assume that decision is based on some other GUC). Uh, not sure, actually. I would read the and as meaning you needed *both*, which isn't true. We do allow, at any time, one or the other - *or* both. Any specific reason NOREPLICATION_P and REPLICATION_P use the _P suffix? Um, I just copied it off a similar entry elsewhere. I saw no comment about what _P actually means, and I can't say I know. I know very little about the bison files :-) AIUI, that suffix is used in gram.y to tag a token to mean it belongs to Parser, and to avoid conflict with the same token elsewhere; NULL_P is a good example. In pg_authid.h, 8 spaces used between 'bool' and 'rolreplication', instead tabs should have been used as the surrounding code. Bleh. Well, pgindent will fix that. -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Libpq PGRES_COPY_BOTH - version compatibility
On Tue, Dec 28, 2010 at 16:15, Tom Lane t...@sss.pgh.pa.us wrote: Magnus Hagander mag...@hagander.net writes: On Tue, Dec 28, 2010 at 13:18, Robert Haas robertmh...@gmail.com wrote: Adding a #define to our headers that you can test for seems like the way to go. That's kind of what I was going for ;) I don't see the point. You're going to need a *run time* test on PQserverVersion to figure out what the server will return, no? I need *both*. Also, if you really do need to figure out which PG headers you're compiling against, looking at catversion.h is the accepted way to do it. There's no need for yet another symbol. This file is, AFAIK, not included with client installs? It's definitely not present in the libpq-dev package on debian. It's a backend development file, no? -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] small table left outer join big table
Robert Haas robertmh...@gmail.com writes: On Wed, Dec 29, 2010 at 7:34 AM, Simon Riggs si...@2ndquadrant.com wrote: It's not a bug, that's the way it currently works. We don't need a test case for that. Oh, you're right. I missed the fact that it's a left join. The only thing that struck me as curious about it was that the OP didn't get a nestloop-with-inner-indexscan plan. That would be explainable if there was no index on the large table's id column ... but columns named like that usually have indexes. I can't get all *that* excited about complicating hash joins as proposed. The query is still fundamentally going to be slow because you won't get out of having to seqscan the large table. The only way to make it really fast is to not read all of the large table, and nestloop-with-inner-indexscan is the only plan type with a hope of doing that. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_dump --split patch
Aidan Van Dyk ai...@highrise.ca writes: On Wed, Dec 29, 2010 at 9:11 AM, Gurjeet Singh singh.gurj...@gmail.com wrote: AFAIK, that applies to parallel dumps of data (may help in --schema-only dumps too), and what you are trying is for schema. Right, but one of the things it does is break the dump in to parts, and put them in a directory/file organization. Both are doing it for different reasons, but doing pretty much the same thing. But can the layout/organization of Joachim's patch can be made human friendly in the vein of Joel's vision? I think they're fundamentally different things, because the previously proposed patch is an extension of the machine-readable archive format, and has to remain so because of the expectation that people will want to use parallel restore with it. Joel is arguing for a split-up of the text dump format. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Libpq PGRES_COPY_BOTH - version compatibility
Magnus Hagander mag...@hagander.net writes: On Tue, Dec 28, 2010 at 16:15, Tom Lane t...@sss.pgh.pa.us wrote: Also, if you really do need to figure out which PG headers you're compiling against, looking at catversion.h is the accepted way to do it. There's no need for yet another symbol. This file is, AFAIK, not included with client installs? It's definitely not present in the libpq-dev package on debian. It's a backend development file, no? [ shrug... ] We can't be held responsible for stupid packaging decisions by distros. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fixing pg_upgrade's check of available binaries
Tom Lane wrote: I've been fooling around with creating upgrade-in-place support for the Fedora/RHEL RPMs. What I want to have is a separate postgresql-upgrade RPM containing just the minimum possible set of previous-release files, together with pg_upgrade itself. Experimenting with this convinced me that pg_upgrade is a few bricks shy of a load in its tests for whether the old and new clusters have the right binaries available: * it insists on pg_dumpall and psql being present in the old cluster, though they are not in fact called * it fails to check for pg_resetxlog, even though it needs it in both old and new clusters * it fails to check for pg_config, which it does need in the new cluster. It does not however really need it in the old cluster, because it has no use for the old cluster's --pkglibdir path. I propose the attached patch to clean these things up. Any objections? Looks good to me. I was not super-strict about checking binaries because I assumed a full install on both clusters, but it doesn't hurt to make that more specific. -- Bruce Momjian br...@momjian.ushttp://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Libpq PGRES_COPY_BOTH - version compatibility
On Wed, Dec 29, 2010 at 16:12, Tom Lane t...@sss.pgh.pa.us wrote: Magnus Hagander mag...@hagander.net writes: On Tue, Dec 28, 2010 at 16:15, Tom Lane t...@sss.pgh.pa.us wrote: Also, if you really do need to figure out which PG headers you're compiling against, looking at catversion.h is the accepted way to do it. There's no need for yet another symbol. This file is, AFAIK, not included with client installs? It's definitely not present in the libpq-dev package on debian. It's a backend development file, no? [ shrug... ] We can't be held responsible for stupid packaging decisions by distros. Running make install in src/interfaces/libpq does not install catversion.h. If it's required to know which version of the libpq headers are in use, it should be, shouldn't it? We can be held responsible for the packaging decisions if they use *our* make install commands, imho. -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] SSI SLRU strategy choices
Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: I'm not sure how you arrived at that number, though. http://git.postgresql.org/gitweb?p=postgresql.git;a=blob;f=src/include/access/slru.h;h=710cca70acd67e03e5f3a255b048a719ae4c4709 The way I read this, each segment is (BLCKSZ * SLRU_PAGES_PER_SEGMENT) long, which is (8kB * 32), or 256kB. The number of files is limited to 64k because of the segment file naming. So total space is limited to 16GB. When an SLRU is used to store xids for random access, that's 4 bytes per entry, so 2^32 entries are possible, but SLRU code considers it a problem for the space to become more than half full. With the eight-byte entries I need, there are 2^31 slots for entries, with the ability to use 2^30 before it becomes half full and SLRU complains. Does that look right to you, or have I misunderstood something? The only issue I can see with that is that you allocate those 8 bytes for every xid, even if it's a non-serializable transaction or a subtransaction. But the overhead is probably not significant in practice. Right. And it avoids having to sequentially search for the desired xid. A sequential search seems to me like it would get into O(N^2) performance under extreme load, whereas this approach has a couple performance plateaus at O(1) which will be, I think, the normal case, and only goes to O(N) performance under extreme load. -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
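Kevin's capacity arithmetic can be double-checked with a few lines (constants as quoted from slru.h: an 8kB BLCKSZ, 32 pages per segment, and 64k possible segment file names):

```python
# Sanity-check the SLRU capacity arithmetic from the discussion above.
BLCKSZ = 8 * 1024              # bytes per page
SLRU_PAGES_PER_SEGMENT = 32    # pages per segment file
MAX_SEGMENTS = 64 * 1024       # limit imposed by segment file naming

segment_bytes = BLCKSZ * SLRU_PAGES_PER_SEGMENT
total_bytes = segment_bytes * MAX_SEGMENTS

print(segment_bytes // 1024)        # 256 (kB per segment)
print(total_bytes // 2**30)         # 16 (GB of total SLRU space)
print(total_bytes // 4 == 2**32)    # True: 4-byte entries -> 2^32 slots
print(total_bytes // 8 == 2**31)    # True: 8-byte entries -> 2^31 slots
print((total_bytes // 8) // 2 == 2**30)  # True: usable before "half full"
```

This reproduces the figures in the message: 256kB segments, a 16GB ceiling, and 2^31 slots (2^30 usable) for eight-byte entries.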
Re: [HACKERS] pg_dump --split patch
2010/12/29 Tom Lane t...@sss.pgh.pa.us I think they're fundamentally different things, because the previously proposed patch is an extension of the machine-readable archive format, and has to remain so because of the expectation that people will want to use parallel restore with it. Joel is arguing for a split-up of the text dump format. Yes, exactly. My patch is of course also a lot smaller :-)

pg_dump-directory.diff.: 112 853 bytes
pg-dump-split-plain-text-files-9.1devel.patch..: 5 579 bytes

I just tried the pg_dump-directory.diff patch. The only thing it has in common with my patch is that it writes data to different files, and only the data is split into different files; the schema appears to go into the single TOC file.

Example, pg_dump-directory.diff:

$ ./pg_dump -f /crypt/dirpatch -F d -s glue
$ ls -la /crypt/dirpatch/
TOC (1 file)
$ rm -rf /crypt/dirpatch
$ ./pg_dump -f /crypt/dirpatch -F d glue
$ ls /crypt/dirpatch/
6503.dat 6504.dat ...lots of files... 6871.dat 6872.dat 6873.dat 6874.dat TOC

Example, pg_dump --split patch:

$ pg_dump -f /crypt/splitpatch -F p --split -s glue
$ ls /crypt/splitpatch*
/crypt/splitpatch (file)
/crypt/splitpatch-split: (directory)
myschema1 myschema2 public
$ ls /crypt/splitpatch-split/public/
AGGREGATE CONSTRAINT FK_CONSTRAINT FUNCTION INDEX SEQUENCE TABLE TRIGGER TYPE VIEW
$ ls /crypt/splitpatch-split/public/FUNCTION/
myfunc.sql otherfunc.sql
$ cat /crypt/splitpatch
--
-- PostgreSQL database dump
--
SET statement_timeout = 0;
SET client_encoding = 'UTF8';
SET standard_conforming_strings = off;
SET check_function_bodies = false;
SET client_min_messages = warning;
SET escape_string_warning = off;
...etc...
\i /crypt/splitpatch-split/public/FUNCTION/myfunc.sql
\i /crypt/splitpatch-split/public/FUNCTION/otherfunc.sql

-- Best regards, Joel Jacobson Glue Finance
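The on-disk layout Joel's transcript shows can be summarized as a simple path mapping, schema/OBJECT_TYPE/name.sql under a -split directory. A hypothetical sketch of that mapping (illustration only, not code from the patch):

```python
# Hypothetical helper mirroring the layout of the --split patch, as
# seen in the example above: <dump>-split/<schema>/<OBJECT_TYPE>/<name>.sql
def split_path(dump_file: str, schema: str, objtype: str, name: str) -> str:
    return f"{dump_file}-split/{schema}/{objtype}/{name}.sql"

print(split_path("/crypt/splitpatch", "public", "FUNCTION", "myfunc"))
# /crypt/splitpatch-split/public/FUNCTION/myfunc.sql
```

One file per object under a per-schema, per-object-type directory is what makes the split dump friendly to version control and human diffing, which is the point of the proposal.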
Re: [HACKERS] Avoiding rewrite in ALTER TABLE ALTER TYPE
On Dec 29, 2010, at 7:56 AM, Noah Misch n...@leadboat.com wrote: ALTER TABLE ALTER TYPE always rewrites the table heap and its indexes. In some cases, we can determine that doing so is unhelpful, and that the conversion shall always succeed: CREATE DOMAIN loosedom AS text; CREATE TABLE t (c varchar(2)); ALTER TABLE t ALTER c TYPE varchar(4); ALTER TABLE t ALTER c TYPE text; ALTER TABLE t ALTER c TYPE loosedom; In other cases, we can determine that the rewrite is unhelpful, but a cast could still throw an error: CREATE DOMAIN tightdom AS text CHECK (value LIKE '%/'); CREATE TABLE t (c text); ALTER TABLE t ALTER c TYPE xml USING c::xml; ALTER TABLE t ALTER c TYPE varchar(64); ALTER TABLE t ALTER c TYPE tightdom; I wish to replace table rewrites with table verification scans where possible, then skip those verification scans where possible. Seems like a good idea. Having thought on it more, though, it actually seems best to attempt the verification scan *every* time. In most ineligible conversions, an inequality will appear very early, so the scan is effectively O(1) in the negative case. A notable exception is something like char(6)-varchar(6) in a table with a billion tuples having length(col) = 6 and one with length(col) = 5. The verification scan might read most of the table before finding the one tuple that forces a rewrite. That isn't a particularly regular scenario in my experience, so the just do the right thing aspect of preceding every potential rewrite with a verification scan seems to win out. I think this scenario will be more common than you might think. Tables don't contain random data; they contain data that the DBA thinks is valid. The situation where the data is mostly as you expect but with a few kooky rows is, in my experience, extremely common. And it makes the worst case a LOT worse. 
I really doubt this is worth the complexity anyway - converting between two types that are mostly-but-not-quite binary compatible seems like an edge case in every sense. Certain very popular type changes (see introduction) can be _exempt_ from the verification scan: we can determine that they will always succeed. To capture that, I propose extending CREATE CAST with the notion of an exemptor function: CREATE CAST (source_type AS target_type) { WITH FUNCTION function_name (argument_type [, ...]) [ WITH EXEMPTOR function_name ] | WITHOUT FUNCTION | WITH INOUT } [ AS ASSIGNMENT | AS IMPLICIT ] The exemptor shall have this signature: exemptor_func( integer, -- source_typmod integer -- dest_typmod ) RETURNS boolean The exemptor shall return true iff datumIsEqual(x, x::target_type(dest_typmod)) for every x in source_type most recently coerced to source_type(source_typmod). When the cast is WITHOUT FUNCTION (a binary coercion) and target_type lacks a length coercion cast, the cast has an implicit exemption, and an exemptor is superfluous: code can assume an exemptor that always returns true. Use of WITH EXEMPTOR mainly makes sense alongside multi-arg WITH FUNCTION (should probably raise an error upon other uses). I am not sure whether it's worth trying to be general here. Maybe we should just hard-code the known cases involving core datatypes. ...Robert -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
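To make the proposed contract concrete, here is a sketch of what an exemptor for varchar length coercions might decide. This is an illustration only: the real exemptor would be a C function registered through CREATE CAST, and the VARHDRSZ-offset typmod encoding is assumed from PostgreSQL's varchar convention (typmod = declared length + 4, or -1 for unconstrained).

```python
# Illustrative sketch (not PostgreSQL's API) of an "exemptor" for a
# varchar(n) -> varchar(m) cast. Per the proposed contract, it returns
# True iff every value valid under the source typmod is unchanged by
# coercion to the destination typmod -- i.e. iff the limit never shrinks.
VARHDRSZ = 4

def varchar_exemptor(source_typmod: int, dest_typmod: int) -> bool:
    if dest_typmod == -1:        # destination is unconstrained
        return True
    if source_typmod == -1:      # source unconstrained, dest constrained:
        return False             # some value could exceed the new limit
    src_len = source_typmod - VARHDRSZ
    dst_len = dest_typmod - VARHDRSZ
    return dst_len >= src_len    # widening (or equal) needs no rewrite

def typmod(n: int) -> int:       # varchar(n) stores n + VARHDRSZ
    return n + VARHDRSZ

print(varchar_exemptor(typmod(2), typmod(4)))  # True: varchar(2) -> varchar(4)
print(varchar_exemptor(typmod(4), typmod(2)))  # False: values may be truncated
print(varchar_exemptor(typmod(4), -1))         # True: varchar(4) -> unconstrained
```

Under this contract the varchar(2) -> varchar(4) and varchar -> text examples from the introduction are exempt from both rewrite and verification scan, while a narrowing change still forces at least a check.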
Re: [HACKERS] Avoiding rewrite in ALTER TABLE ALTER TYPE
Noah Misch n...@leadboat.com writes: ALTER TABLE ALTER TYPE always rewrites the table heap and its indexes. In some cases, we can determine that doing so is unhelpful, and that the conversion shall always succeed: I wish to replace table rewrites with table verification scans where possible, then skip those verification scans where possible. This has been discussed before; have you read the previous threads? I really really dislike the notion of a verification scan: it's basically work that is going to be useless if it fails. I think your argument that it will usually fail quickly is quite unconvincing, and in any case the situations where it is useful at all are too thin on the ground to be worth the code space to implement it. It seems sufficient to me to skip the rewrite in cases of provable binary compatibility, with possibly an extra check for safe changes of typmod. With respect to the latter, I agree a type-specific function to compare the typmods would be the way to go, although exemptor seems a pretty badly chosen name for it. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] small table left outer join big table
- Original Message - From: Alvaro Herrera alvhe...@commandprompt.com To: Robert Haas robertmh...@gmail.com Cc: Jie Li jay23j...@gmail.com; pgsql-hackers pgsql-hackers@postgresql.org Sent: Wednesday, December 29, 2010 8:39 PM Subject: Re: [HACKERS] small table left outer join big table Excerpts from Robert Haas's message of mié dic 29 09:17:17 -0300 2010: On Tue, Dec 28, 2010 at 5:13 AM, Jie Li jay23j...@gmail.com wrote: Hi, Please see the following plan:

postgres=# explain select * from small_table left outer join big_table using (id);
                                 QUERY PLAN
------------------------------------------------------------------------------
 Hash Left Join  (cost=126408.00..142436.98 rows=371 width=12)
   Hash Cond: (small_table.id = big_table.id)
   ->  Seq Scan on small_table  (cost=0.00..1.09 rows=9 width=8)
   ->  Hash  (cost=59142.00..59142.00 rows=410 width=8)
         ->  Seq Scan on big_table  (cost=0.00..59142.00 rows=410 width=8)
(5 rows)

Here I have a puzzle, why not choose the small table to build hash table? It can avoid multiple batches thus save significant I/O cost, isn't it? Yeah, you'd think. Can you post a full reproducible test case? Also, what version is this? -- Álvaro Herrera alvhe...@commandprompt.com The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support The version is 9.0.1. I believe the latest version works in the same way. Thanks, Li Jie -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] small table left outer join big table
Thank you for all your comments. I think the condition for this optimization is that the small table fits into memory. If not, it doesn't work, since both tables still need to be written to disk. But if it does, we can avoid all I/O costs in the hash join process. Thanks, Li Jie - Original Message - From: Robert Haas robertmh...@gmail.com To: Simon Riggs si...@2ndquadrant.com Cc: Jie Li jay23j...@gmail.com; pgsql-hackers pgsql-hackers@postgresql.org Sent: Wednesday, December 29, 2010 8:59 PM Subject: Re: [HACKERS] small table left outer join big table On Wed, Dec 29, 2010 at 7:34 AM, Simon Riggs si...@2ndquadrant.com wrote: On Wed, 2010-12-29 at 07:17 -0500, Robert Haas wrote: Here I have a puzzle, why not choose the small table to build hash table? It can avoid multiple batches thus save significant I/O cost, isn't it? Yeah, you'd think. Can you post a full reproducible test case? It's not a bug, that's the way it currently works. We don't need a test case for that. I agree that the optimisation would be a useful one. It allows you to ask the query Show me sales for each of my stores efficiently, rather than being forced to request the inner join query Show me the sales for each of my stores for which there have been sales, which is a much less useful query. Oh, you're right. I missed the fact that it's a left join. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] small table left outer join big table
- Original Message - From: Tom Lane t...@sss.pgh.pa.us To: Robert Haas robertmh...@gmail.com Cc: Simon Riggs si...@2ndquadrant.com; Jie Li jay23j...@gmail.com; pgsql-hackers pgsql-hackers@postgresql.org Sent: Wednesday, December 29, 2010 10:59 PM Subject: Re: [HACKERS] small table left outer join big table Robert Haas robertmh...@gmail.com writes: On Wed, Dec 29, 2010 at 7:34 AM, Simon Riggs si...@2ndquadrant.com wrote: It's not a bug, that's the way it currently works. We don't need a test case for that. Oh, you're right. I missed the fact that it's a left join. The only thing that struck me as curious about it was that the OP didn't get a nestloop-with-inner-indexscan plan. That would be explainable if there was no index on the large table's id column ... but columns named like that usually have indexes. I can't get all *that* excited about complicating hash joins as proposed. The query is still fundamentally going to be slow because you won't get out of having to seqscan the large table. The only way to make it really fast is to not read all of the large table, and nestloop-with-inner-indexscan is the only plan type with a hope of doing that. regards, tom lane Yes, there is no index on the joined column; otherwise, nestloop-with-inner-indexscan would be preferred. But why can't outer join be as clever as inner join? Anyway, if we unfortunately don't have an available index, we have no choice but to rely on hash join, right? Thanks, Li Jie -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
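The optimization Li Jie is asking about (build the hash table on the small outer relation, probe with the big one) needs extra per-row bookkeeping to preserve LEFT JOIN semantics; that bookkeeping is part of the complication being weighed in this thread. A toy sketch, not the executor's implementation:

```python
# Toy left outer hash join that builds on the SMALL (outer) side and
# probes with the big side. A matched flag per outer row preserves
# LEFT JOIN semantics: unmatched outer rows are emitted NULL-extended.
def hash_left_join(small, big):
    # Build phase: hash the small outer relation entirely in memory.
    table = {}
    for row in small:
        table.setdefault(row[0], []).append([row, False])
    out = []
    # Probe phase: a single pass over the big relation, no batching.
    for row in big:
        for entry in table.get(row[0], []):
            out.append(entry[0] + row[1:])
            entry[1] = True          # mark the outer row as matched
    # Emit outer rows that never matched, padded with NULLs (None).
    width = len(big[0]) - 1 if big else 0
    for bucket in table.values():
        for row, matched in bucket:
            if not matched:
                out.append(row + (None,) * width)
    return out

small = [(1, 'a'), (2, 'b'), (3, 'c')]
big = [(1, 'x'), (1, 'y'), (3, 'z'), (4, 'w')]
print(sorted(hash_left_join(small, big)))
# [(1, 'a', 'x'), (1, 'a', 'y'), (2, 'b', None), (3, 'c', 'z')]
```

The payoff is that only the small side has to fit in memory, so no batches are spilled; the cost is the matched flag and the extra pass at the end, which PostgreSQL's hash join (which hashes the inner side) does not need.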
Re: [HACKERS] Avoiding rewrite in ALTER TABLE ALTER TYPE
Robert Haas robertmh...@gmail.com writes: On Dec 29, 2010, at 7:56 AM, Noah Misch n...@leadboat.com wrote: The exemptor shall have this signature: exemptor_func( integer, -- source_typmod integer -- dest_typmod ) RETURNS boolean The exemptor shall return true iff datumIsEqual(x, x::target_type(dest_typmod)) for every x in source_type most recently coerced to source_type(source_typmod). I am not sure whether it's worth trying to be general here. Maybe we should just hard-code the known cases involving core datatypes. I find the idea of hard coding to be pretty icky. However, the elephant in the room here is the possibility of typmod getting replaced by some other representation. It would make life simpler if we didn't invent this additional type-specific API until that dust has settled. So maybe the plan should be hard-coding in the short term and add an API later. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] and it's not a bunny rabbit, either
On 29.12.2010 13:17, Robert Haas wrote: Did you read the whole thread? Ah, sorry: I've had to change some of the heap_open(rv) calls to relation_open(rv) to avoid having the former throw the wrong error message before the latter kicks in. I think there might be stylistic objections to that, but I'm not sure what else to propose. I'm actually pretty suspicious that many of the heap_open(rv) calls I *didn't* change are either already a little iffy or likely to become so once the SQL/MED stuff for foreign tables goes in. They make it easy to forget that we've got a whole pile of relkinds and you actually need to really think about which ones you can handle. Hmm, I believe the idea of heap_open is to check that the relation is backed by a heap that you can read with heap_beginscan+heap_next. At the moment that includes normal tables, sequences and toast tables. Foreign tables would not fall into that category. Yeah, you're right that most of the callers of heap_open actually want a tighter check than that. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] and it's not a bunny rabbit, either
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes: Hmm, I believe the idea of heap_open is to check that the relation is backed by a heap that you can read with heap_beginscan+heap_next. At the moment that includes normal tables, sequences and toast tables. Foreign tables would not fall into that category. I don't believe that that definition is documented anyplace; if we decide that's what we want it to mean, some code comments would be in order. Yeah, you're right that most of the callers of heap_open actually want to a tighter check than that. I think probably most of the physical calls of heap_open are actually associated with system catalog accesses, and the fact that the code says heap_open not relation_open has got more to do with copy-and-paste than any real thought about what we're specifying. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_streamrecv for 9.1?
On Wed, Dec 29, 2010 at 11:47:53AM +0100, Magnus Hagander wrote: Would people be interested in putting pg_streamrecv (http://github.com/mhagander/pg_streamrecv) in bin/ or contrib/ for 9.1? I think it would make sense to do so. +1 for bin/ Cheers, David. -- David Fetter da...@fetter.org http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fet...@gmail.com iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Avoiding rewrite in ALTER TABLE ALTER TYPE
On Wed, Dec 29, 2010 at 11:16:23AM -0500, Tom Lane wrote: Noah Misch n...@leadboat.com writes: ALTER TABLE ALTER TYPE always rewrites the table heap and its indexes. In some cases, we can determine that doing so is unhelpful, and that the conversion shall always succeed: I wish to replace table rewrites with table verification scans where possible, then skip those verification scans where possible. This has been discussed before; have you read the previous threads? It would help a lot if, in future, you include references to the previous threads you have in mind rather than simply mention that they exist. Saying, in effect, 'search the archives with our not-super-great search technology using keywords you didn't think of', comes off as pretty dismissive if not downright hostile. Cheers, David. -- David Fetter da...@fetter.org http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fet...@gmail.com iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] SSI memory mitigation false positive degradation
On 26.12.2010 21:40, Kevin Grittner wrote: To recap, I've had an open question on the Serializable Wiki page[1] since January about how we should handle long-running transactions. The algorithm published by Cahill et al requires keeping some transaction information in memory for all committed transactions which overlapped a still-running transaction. Since we need to keep this in shared memory, and the structures must have a finite allocation, there's an obvious looming limit, even if the allocation is relatively generous. Looking at the predicate lock splitting, it occurs to me that it's possible for a non-serializable transaction to be canceled if it needs to split a predicate lock held by a concurrent serializable transaction, and you run out of space in the shared memory predicate lock area. Any chance of upgrading the lock to a relation lock, or killing the serializable transaction instead? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_streamrecv for 9.1?
David Fetter da...@fetter.org writes: On Wed, Dec 29, 2010 at 11:47:53AM +0100, Magnus Hagander wrote: Would people be interested in putting pg_streamrecv (http://github.com/mhagander/pg_streamrecv) in bin/ or contrib/ for 9.1? I think it would make sense to do so. +1 for bin/ Is it really stable enough for bin/? My impression of the state of affairs is that there is nothing whatsoever about replication that is really stable yet. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] SSI memory mitigation false positive degradation
Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: Looking at the predicate lock splitting, it occurs to me that it's possible for a non-serializable transaction to be canceled if it needs to split a predicate lock held by a concurrent serializable transaction, and you run out of space in the shared memory predicate lock area. Good point. We don't want that, for sure. Any chance of upgrading the lock to a relation lock, or killing the serializable transaction instead? Absolutely. Good suggestion. Thanks! -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] and it's not a bunny rabbit, either
On Dec 29, 2010, at 12:49 PM, Tom Lane t...@sss.pgh.pa.us wrote: Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes: Hmm, I believe the idea of heap_open is to check that the relation is backed by a heap that you can read with heap_beginscan+heap_next. At the moment that includes normal tables, sequences and toast tables. Foreign tables would not fall into that category. I don't believe that that definition is documented anyplace; if we decide that's what we want it to mean, some code comments would be in order. The existing comments mention that callers must check that the return value is not a view, if they care. So if there is currently a single coherent definition for what heap_open is supposed to do, it's clearly NOT the one Heikki proposes. My guess is that reality is closer to your theory of what got cut-and-pasted. ...Robert
Re: [HACKERS] pg_streamrecv for 9.1?
On Dec 29, 2010, at 1:01 PM, Tom Lane t...@sss.pgh.pa.us wrote: Is it really stable enough for bin/? My impression of the state of affairs is that there is nothing whatsoever about replication that is really stable yet. Well, that's not stopping us from shipping a core feature called replication. I'll defer to others on how mature pg_streamrecv is, but if it's no worse than replication in general I think putting it in bin/ is the right thing to do. ...Robert
Re: [HACKERS] TODO item for pg_ctl and server detection
Bruce Momjian wrote: Yes, that was my calculus too. I realized that we create session ids by merging the process id and backend start time, so I went ahead and added the postmaster start time epoch to the postmaster.pid file. While there is no way to pass back the postmaster start time from PQping, I added code to pg_ctl to make sure the time in the postmaster.pid file is not _before_ pg_ctl started running. We only check PQping() after we have started the postmaster ourselves, so it fits our needs. Tom suggested that there might be clock skew between pg_ctl and the postmaster, so I added a 2-second slop in checking the postmaster start time. Tom also wanted the connection information to be output all at once, but that causes a problem with detecting pre-9.1 servers so I avoided it. Updated patch attached. -- Bruce Momjian br...@momjian.ushttp://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml index cda7f64..86bc5a6 100644 *** /tmp/pgdiff.14558/cXBdee_storage.sgml Wed Dec 29 13:42:38 2010 --- doc/src/sgml/storage.sgml Wed Dec 29 12:11:03 2010 *** last started with/entry *** 117,124 row entryfilenamepostmaster.pid//entry entryA lock file recording the current postmaster process id (PID), ! cluster data directory, port number, Unix domain socket directory, ! and shared memory segment ID/entry /row /tbody --- 117,125 row entryfilenamepostmaster.pid//entry entryA lock file recording the current postmaster process id (PID), ! postmaster start time, cluster data directory, port number, user-specified ! Unix domain socket directory, first valid listen_address host, and ! 
shared memory segment ID/entry /row /tbody diff --git a/src/backend/port/ipc_test.c b/src/backend/port/ipc_test.c index a003dc9..461a7a6 100644 *** /tmp/pgdiff.14558/o3NlPc_ipc_test.c Wed Dec 29 13:42:38 2010 --- src/backend/port/ipc_test.c Wed Dec 29 12:11:03 2010 *** on_exit_reset(void) *** 104,110 } void ! RecordSharedMemoryInLockFile(unsigned long id1, unsigned long id2) { } --- 104,110 } void ! AddToLockFile(int target_line, const char *str) { } diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c index d970eb2..ff77099 100644 *** /tmp/pgdiff.14558/mac69b_sysv_shmem.c Wed Dec 29 13:42:38 2010 --- src/backend/port/sysv_shmem.c Wed Dec 29 12:11:03 2010 *** InternalIpcMemoryCreate(IpcMemoryKey mem *** 198,206 /* Register on-exit routine to detach new segment before deleting */ on_shmem_exit(IpcMemoryDetach, PointerGetDatum(memAddress)); ! /* Record key and ID in lockfile for data directory. */ ! RecordSharedMemoryInLockFile((unsigned long) memKey, ! (unsigned long) shmid); return memAddress; } --- 198,214 /* Register on-exit routine to detach new segment before deleting */ on_shmem_exit(IpcMemoryDetach, PointerGetDatum(memAddress)); ! /* ! * Append record key and ID in lockfile for data directory. Format ! * to try to keep it the same length. ! */ ! { ! char line[32]; ! ! sprintf(line, %9lu %9lu\n, (unsigned long) memKey, ! (unsigned long) shmid); ! AddToLockFile(LOCK_FILE_LINES, line); ! } return memAddress; } diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c index a46a323..c1e553a 100644 *** /tmp/pgdiff.14558/GKd28a_postmaster.c Wed Dec 29 13:42:38 2010 --- src/backend/postmaster/postmaster.c Wed Dec 29 12:11:03 2010 *** PostmasterMain(int argc, char *argv[]) *** 483,489 int status; char *userDoption = NULL; int i; ! MyProcPid = PostmasterPid = getpid(); MyStartTime = time(NULL); --- 483,490 int status; char *userDoption = NULL; int i; ! bool connection_line_output = false; ! 
MyProcPid = PostmasterPid = getpid(); MyStartTime = time(NULL); *** PostmasterMain(int argc, char *argv[]) *** 860,869 --- 861,882 UnixSocketDir, ListenSocket, MAXLISTEN); else + { status = StreamServerPort(AF_UNSPEC, curhost, (unsigned short) PostPortNumber, UnixSocketDir, ListenSocket, MAXLISTEN); + /* must supply a valid listen_address for PQping() */ + if (!connection_line_output) + { + char line[MAXPGPATH + 2]; + + sprintf(line, %s\n, curhost); + AddToLockFile(LOCK_FILE_LINES - 1, line); + connection_line_output = true; + } + } + if (status == STATUS_OK) success++; else *** PostmasterMain(int argc, char *argv[]) *** 880,885 --- 893,902 pfree(rawstring); } + /* Supply an empty listen_address line for PQping() */ + if (!connection_line_output) +
Re: [HACKERS] SSI memory mitigation false positive degradation
Kevin Grittner kevin.gritt...@wicourts.gov wrote: Any chance of upgrading the lock to a relation lock, or killing the serializable transaction instead? Absolutely. Good suggestion. Thanks! I pushed a TODO SSI comment at the appropriate point with my ideas on how best to fix this. I want to stick with the SLRU changes for now, rather than risk flushing brain cache on the topic just now. If Dan (or anyone else, for that matter) wants to fix this, feel free; just post first, as will I if nobody beats me to it. There are actually two spots in PredicateLockPageSplit and one in PredicateLockPageCombine where this needs to be addressed. I can't think of any other functions where we're vulnerable to having an impact on non-serializable transactions. We sure want to plug those -- I see it as critical to acceptance that we can honor the promise of no impact on any transactions at REPEATABLE READ or less strict isolation levels. -Kevin
Re: [HACKERS] Avoiding rewrite in ALTER TABLE ALTER TYPE
On Dec 29, 2010, at 11:16 AM, Tom Lane t...@sss.pgh.pa.us wrote: I really really dislike the notion of a verification scan: it's basically work that is going to be useless if it fails. I think it has potential in cases like text to xml. In that case it'll either work or fail, with no possibility of requiring a do-over. Scanning the whole table is a whole lot cheaper than rewriting it. But I agree with your assessment of the optimistic verification scan case. ...Robert
Re: [HACKERS] Libpq PGRES_COPY_BOTH - version compatibility
On Dec 29, 2010, at 10:14 AM, Magnus Hagander mag...@hagander.net wrote: We can be held responsible for the packaging decisions if they use *our* make install commands, imho. Yep. ...Robert
Re: [HACKERS] Avoiding rewrite in ALTER TABLE ALTER TYPE
Robert Haas robertmh...@gmail.com writes: On Dec 29, 2010, at 11:16 AM, Tom Lane t...@sss.pgh.pa.us wrote: I really really dislike the notion of a verification scan: it's basically work that is going to be useless if it fails. I think it has potential in cases like text to xml. In that case it'll either work or fail, with no possibility of requiring a do-over. Scanning the whole table is a whole lot cheaper than rewriting it. I don't believe avoiding the write part (but not the read part, nor the XML syntax verification part) is a sufficiently compelling argument to justify having that code path. There are not enough distinct datatypes sharing binary representations to make this a worthwhile thing to worry over. Basically, I believe that the only use-case that will have more than epsilon number of users is "I want to make this varchar(5) into varchar(10), or possibly text". We can fix that case without adding a boatload more code that we'll have to maintain. I do have some interest in the idea of having a type-specific function that can recognize no-op typmod changes, but I would envision that as being an expression evaluation optimization: let the planner throw away the call to the length-checking function when it isn't going to do anything. It's not by any means only useful in ALTER COLUMN TYPE --- and in fact probably doesn't even need any bespoke code there, if we put it into expression_planner() instead. regards, tom lane
Re: [HACKERS] Streaming replication as a separate permissions
Excerpts from Magnus Hagander's message of mié dic 29 11:40:34 -0300 2010: On Wed, Dec 29, 2010 at 15:05, Gurjeet Singh singh.gurj...@gmail.com wrote: Any specific reason NOREPLICATION_P and REPLICATION_P use the _P suffix? Um, I just copied it off a similar entry elsewhere. I saw no comment about what _P actually means, and I can't say I know. I know very little about the bison files :-) Some lexer keywords have a _P suffix because otherwise they'd collide with some symbol in Windows header files or something like that. It's old stuff, but I think you, Magnus, were around at that time. -- Álvaro Herrera alvhe...@commandprompt.com The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Re: [HACKERS] pg_streamrecv for 9.1?
On Wed, Dec 29, 2010 at 1:42 PM, Robert Haas robertmh...@gmail.com wrote: On Dec 29, 2010, at 1:01 PM, Tom Lane t...@sss.pgh.pa.us wrote: Is it really stable enough for bin/? My impression of the state of affairs is that there is nothing whatsoever about replication that is really stable yet. Well, that's not stopping us from shipping a core feature called replication. I'll defer to others on how mature pg_streamrecv is, but if it's no worse than replication in general I think putting it in bin/ is the right thing to do. As the README says, it is not self-contained (through no fault of its own), and one should typically set archive_command to guarantee zero WAL loss: "TODO: Document some ways of setting up an archive_command that works well together with pg_streamrecv." I think implementing just that TODO might make it a candidate. I have neither used it nor read the code, but if it works as advertised then it is definitely a +1 from me; no preference of bin/ or contrib/, since the community will have to maintain it anyway. Regards, -- gurjeet.singh @ EnterpriseDB - The Enterprise Postgres Company http://www.EnterpriseDB.com singh.gurj...@{ gmail | yahoo }.com Twitter/Skype: singh_gurjeet Mail sent from my BlackLaptop device
Re: [HACKERS] writable CTEs
On Tue, Dec 28, 2010 at 11:07:59PM +, Peter Geoghegan wrote: On 28 December 2010 20:07, Peter Eisentraut pete...@gmx.net wrote: The phrase common table expression does not appear anywhere in the SQL standard. The standard uses the grammar symbol with clause. I think we're losing sight of the issue a bit here. No one is proposing that we call WITH queries common table expressions. As I think we all agree, the term WITH query and common table expression are not synonymous. A WITH query is comprised of one or more common table expressions, plus a conventional SELECT query. As of 9.1, the thing appended to the CTE(s) can be a conventional DML query (SELECT, INSERT, UPDATE or DELETE). I'm hoping to expand this in future versions. :) All that I'm asking is that we /specify/ that the subqueries already mentioned in the docs are common table expressions. +1 Cheers, David. -- David Fetter da...@fetter.org http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fet...@gmail.com iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
Re: [HACKERS] and it's not a bunny rabbit, either
Robert Haas robertmh...@gmail.com writes: The existing comments mention that callers must check that the return value is not a view, if they care. So if there is currently a single coherent definition for what heap_open is supposed to do, it's clearly NOT the one Heikki proposes. My guess is that reality is closer to your theory of what got cut-and-pasted. Well, reality is that in the beginning there was heap_open and index_open, and nothing else. And there weren't views, so basically those two functions covered all the interesting types of relations. We got to the current state of affairs by a series of whatever were the least invasive code changes at the time; nobody's ever taken a step back and tried to define what heap_open ought to allow from the standpoint of first principles. In practice I think it would make sense if heap_open accepts all relation types on which you can potentially do either a heapscan or indexscan (offhand those should be the same set of relkinds, I think; so this is the same in effect as Heikki's proposal, but phrased differently). So it would have to start rejecting views, and we'd need to go looking for the consequences of that. regards, tom lane
Re: [HACKERS] and it's not a bunny rabbit, either
Excerpts from Tom Lane's message of mié dic 29 16:29:45 -0300 2010: In practice I think it would make sense if heap_open accepts all relation types on which you can potentially do either a heapscan or indexscan (offhand those should be the same set of relkinds, I think; so this is the same in effect as Heikki's proposal, but phrased differently). So it would have to start rejecting views, and we'd need to go looking for the consequences of that. This seems a very good idea, but I think we shouldn't let it sink the current patch. -- Álvaro Herrera alvhe...@commandprompt.com The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Re: [HACKERS] Extensions, patch v16
Oleg Bartunov wrote: Hi there, it's clear we need versions, probably, major.minor would be enough. The problem I see is how to keep .so in sync with .sql ? Should we store .sql in database ? Also, we need permissions for extension, since we have open/closed extensions. Don't people normally define the version number in the Makefile and pass the version string into the C code and perhaps a psql variable? --- Oleg On Sat, 11 Dec 2010, David E. Wheeler wrote: On Dec 11, 2010, at 1:09 PM, David Fetter wrote: Why is it in the makefile at all? If the makefile does need to know it, why don't we have it scrape the number out of the control file? Or even more to the point, since when do we need version numbers in extensions? We *absolutely* need version numbers in extensions. People will want to have a certain version, or a certain minimum version, etc., etc., etc., just as they do for any other software. Seriously, are you OK? One of the biggest mistakes in the creation of CPAN was allowing modules without versions. It makes figuring out what to upgrade extremely difficult. Learning from that, PGXN requires version numbers for all extensions. Best, David Regards, Oleg _ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: o...@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83 -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Re: [HACKERS] SSI SLRU strategy choices
Excerpts from Kevin Grittner's message of mié dic 29 12:20:20 -0300 2010: http://git.postgresql.org/gitweb?p=postgresql.git;a=blob;f=src/include/access/slru.h;h=710cca70acd67e03e5f3a255b048a719ae4c4709 The way I read this, each segment is (BLCKSZ * SLRU_PAGES_PER_SEGMENT) long, which is (8kB * 32), or 256kB. The number of files is limited to 64k because of the segment file naming scheme. So total space is limited to 16GB. When an SLRU is used to store xids for random access, that's 4 bytes per entry, so 2^32 entries are possible, but SLRU code considers it a problem for the space to become more than half full. With the eight byte entries I need, there are 2^31 slots for entries, with the ability to use 2^30 before it becomes half full and SLRU complains. If these limitations become a problem, you can always change them. A couple of zeroes at the start of the pg_clog filenames aren't going to bother anyone, I don't think. Not so sure about your new proposed design's space usage. -- Álvaro Herrera alvhe...@commandprompt.com The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Re: [HACKERS] Extensions, patch v16
On Dec 29, 2010, at 12:00 PM, Bruce Momjian wrote: Don't people normally define the version number in the Makefile and pass the version string into the C code and perhaps a psql variable? There is no standard pattern AFAIK. A best practice would be welcome here. David
Re: [HACKERS] and it's not a bunny rabbit, either
Alvaro Herrera alvhe...@commandprompt.com writes: Excerpts from Tom Lane's message of mié dic 29 16:29:45 -0300 2010: In practice I think it would make sense if heap_open accepts all relation types on which you can potentially do either a heapscan or indexscan (offhand those should be the same set of relkinds, I think; so this is the same in effect as Heikki's proposal, but phrased differently). So it would have to start rejecting views, and we'd need to go looking for the consequences of that. This seems a very good idea, but I think we shouldn't let it sink the current patch. No, but possibly regularizing what heap_open is defined to do would make Robert's patch simpler. regards, tom lane
Re: [HACKERS] Anyone for SSDs?
Tom Lane wrote: Jeff Janes jeff.ja...@gmail.com writes: Of course if you do a full table scan because there are no better options, then it scans sequentially. But you have to scan the pages in *some* order, and it is hard to see how something other than sequential would be systematically better. In fact, if sequential *isn't* the best order for reading the whole file, the filesystem has lost its marbles completely; because that is the order in which most files are read, so files ought to be laid out on disk (or whatever storage device) to be read most quickly that way. Plus kernel read-ahead helps with sequential access too because the kernel can guess the next blocks to be requested --- hard to do that with random I/O. SSDs have fast access but still benefit from read-ahead. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Re: [HACKERS] Anyone for SSDs?
Vaibhav Kaushal wrote: On Fri, 2010-12-10 at 18:07 -0800, Josh Berkus wrote: On 12/10/10 5:06 PM, Daniel Loureiro wrote: An quicksort method in sequential disk its just awful to be thinking in a non SSD world, but its possible in an SSD. So, code it. Shouldn't be hard to write a demo comparison. I don't believe that SSDs make quicksort-on-disk feasible, but would be happy to be proven wrong. I too do not believe it in normal case. However, considering the 'types' of SSDs, it may be feasible! Asking for 'the next page and getting it' has a time delay in the process. While on a regular HDD with spindles, the question is where is that page located, with SSDs, the question disappears, because the access time is uniform in case of SSDs. Also, the access time is about 100 times faster, which would change quite a few things about the whole process. What _is_ interesting is that Postgres often has sequential and random/disk ways of doing things, and by reducing random_page_cost when using SSDs, you automatically use more random operations, so in a way the Postgres code was already prepared for SSD usage. Surprisingly, we had to change very little. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Re: [HACKERS] SSI SLRU strategy choices
Alvaro Herrera alvhe...@commandprompt.com wrote: If these limitations become a problem, you can always change them. A couple of zeroes at the start of the pg_clog filenames aren't going to bother anyone, I don't think. Not so sure about your new proposed design's space usage. I guess that's a call the community can make now -- if a serializable transaction which is not flagged as read only remains open long enough for over a billion other transactions to commit, is it OK for the old transaction to be automatically canceled? Is it worth messing with the SLRU limits to double that? Beyond a certain point you have transaction ID wrap-around, so at that point this would be the least of your troubles -- canceling the old transaction might even be helpful. I thought that was at 2 billion, but Heikki was saying it's at 1 billion in an earlier post. -Kevin
Re: [HACKERS] Extensions, patch v16
Bruce Momjian br...@momjian.us writes: Oleg Bartunov wrote: it's clear we need versions, probably, major.minor would be enough. The problem I see is how to keep .so in sync with .sql ? Should we store .sql in database ? Don't people normally define the version number in the Makefile and pass the version string into the C code and perhaps a psql variable? We had a long discussion upthread of what version numbers to keep where. IMHO the Makefile is about the *least* useful place to put a version number; the more so if you want more than one. What we seem to need is a version number in the .sql file itself (so that we can tell whether we need to take action to update the extension's catalog entries). I'm not convinced yet whether there needs to be another version number embedded in the .so file --- it may well be that the PG major version number embedded with PG_MODULE_MAGIC is sufficient. Personally I'd forget the notion of major.minor numbers here; all that will accomplish is to complicate storage and comparison of the numbers. We just need a simple integer that gets bumped whenever the extension's SQL script changes. regards, tom lane
Re: [HACKERS] 9.1alpha3 release notes help
Josh Berkus j...@agliodbs.com writes: On 12/27/10 7:35 PM, Josh Berkus wrote: On 12/27/10 1:45 PM, Peter Eisentraut wrote: I'm unable to produce any really exciting release notes for alpha3. I have produced a draft here: http://wiki.postgresql.org/wiki/Alpha_release_notes_draft Please edit the bullet points if you have some idea. I'll see what I can do. Expanded with my edits on the wiki. Well, pg_execute_from_file is the name of a patch that went in, but after review, the function of the same name has been taken out. As a result we do not have pg_execute_from_file() nor pg_execute_sql_file() in alpha3 and we won't have it later with extensions. There's the configuration file parsing merge that we might want to tell users about. You no longer have to follow different quoting rules in the recovery.conf and postgresql.conf files. Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
Re: [HACKERS] pg_dump --split patch
Joel Jacobson j...@gluefinance.com writes: Solution: I propose a new option to pg_dump, --split, which dumps each object to a separate file in a user friendly directory structure: Please have a look at getddl: https://github.com/dimitri/getddl Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
Re: [HACKERS] Anyone for SSDs?
Bruce Momjian wrote: Vaibhav Kaushal wrote: On Fri, 2010-12-10 at 18:07 -0800, Josh Berkus wrote: On 12/10/10 5:06 PM, Daniel Loureiro wrote: An quicksort method in sequential disk its just awful to be thinking in a non SSD world, but its possible in an SSD. So, code it. Shouldn't be hard to write a demo comparison. I don't believe that SSDs make quicksort-on-disk feasible, but would be happy to be proven wrong. I too do not believe it in normal case. However, considering the 'types' of SSDs, it may be feasible! Asking for 'the next page and getting it' has a time delay in the process. While on a regular HDD with spindles, the question is where is that page located, with SSDs, the question disappears, because the access time is uniform in case of SSDs. Also, the access time is about 100 times faster, which would change quite a few things about the whole process. What _is_ interesting is that Postgres often has sequential and random/disk ways of doing things, and by reducing random_page_cost when using SSDs, you automatically use more random operations, so in a way the Postgres code was already prepared for SSD usage. Surprisingly, we had to change very little. To add to this very late reply, we basically had random methods to do things (in RAM), and sequential/random methods for disk. By changing random_page_cost, we favor doing random things on disk. The big question is whether there are random things we have never implemented on disk that now make sense --- off hand, I can't think of any. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Re: [HACKERS] Extensions, patch v16
On Dec 29, 2010, at 12:23 PM, Tom Lane wrote: We had a long discussion upthread of what version numbers to keep where. IMHO the Makefile is about the *least* useful place to put a version number; the more so if you want more than one. What we seem to need is a version number in the .sql file itself (so that we can tell whether we need to take action to update the extension's catalog entries). I'm not convinced yet whether there needs to be another version number embedded in the .so file --- it may well be that the PG major version number embedded with PG_MODULE_MAGIC is sufficient. For contrib maybe, but not 3rd-party extensions. Personally I'd forget the notion of major.minor numbers here; all that will accomplish is to complicate storage and comparison of the numbers. We just need a simple integer that gets bumped whenever the extension's SQL script changes. That won't be very flexible for third-party extensions. FWIW, for PGXN I mandated semantic version numbers (http://semver.org/), mainly because they're quite close to Pg core version numbers. I also created a basic data type for them: https://github.com/theory/pgxn-manager/blob/master/sql/02-types.sql#L70 Best, David
Re: [HACKERS] SSI SLRU strategy choices
Kevin Grittner kevin.gritt...@wicourts.gov wrote: if a serializable transaction which is not flagged as read only remains open long enough for over a billion other transactions to commit Maybe a clarification and example would be useful. We're talking about going through a billion transactions which were assigned a TransactionId, not all database transactions. An example of how you could hit that is with a sustained commit rate of 5000 transactions per second which are modifying data while a single read write transaction stays open for 2.3 days. -Kevin
Re: [HACKERS] small table left outer join big table
On Wed, 2010-12-29 at 09:59 -0500, Tom Lane wrote: Robert Haas robertmh...@gmail.com writes: On Wed, Dec 29, 2010 at 7:34 AM, Simon Riggs si...@2ndquadrant.com wrote: It's not a bug, that's the way it currently works. We don't need a test case for that. Oh, you're right. I missed the fact that it's a left join. The only thing that struck me as curious about it was that the OP didn't get a nestloop-with-inner-indexscan plan. That would be explainable if there was no index on the large table's id column ... but columns named like that usually have indexes. I can't get all *that* excited about complicating hash joins as proposed. The query is still fundamentally going to be slow because you won't get out of having to seqscan the large table. The only way to make it really fast is to not read all of the large table, and nestloop-with-inner-indexscan is the only plan type with a hope of doing that. Seq scanning the big table isn't bad... we've gone to a lot of trouble to make it easy to do this, especially with many users. Maintaining many large indexes is definitely bad, all that random I/O is going to suck badly. Seems like an interesting and relatively easy optimisation to me. Not sure if this is a request for a feature, or a proposal to write the optimisation. I hope it's the latter. -- Simon Riggs http://www.2ndQuadrant.com/books/ PostgreSQL Development, 24x7 Support, Training and Services
Re: [HACKERS] and it's not a bunny rabbit, either
On Wed, Dec 29, 2010 at 3:01 PM, Tom Lane t...@sss.pgh.pa.us wrote: Alvaro Herrera alvhe...@commandprompt.com writes: Excerpts from Tom Lane's message of mié dic 29 16:29:45 -0300 2010: In practice I think it would make sense if heap_open accepts all relation types on which you can potentially do either a heapscan or indexscan (offhand those should be the same set of relkinds, I think; so this is the same in effect as Heikki's proposal, but phrased differently). So it would have to start rejecting views, and we'd need to go looking for the consequences of that. This seems a very good idea, but I think we shouldn't let it sink the current patch. No, but possibly regularizing what heap_open is defined to do would make Robert's patch simpler. I think that any meaning we assign to heap_open is going to be 95% arbitrary and of little practical help. There are at least six different sets of object classes to which particular commands or ALTER TABLE subcommands can be legally applied: (1) tables only, (2) views only, (3) tables and views, (4) tables and indexes, (5) tables and composite types, (6) tables, views, and composite types. Adding foreign tables promises to add several more combinations immediately and likely more down the line; if we add materialized views, that'll add a bunch more. There's not really any single definition that's going to be a silver bullet. I think the best thing to do might be to get rid of heap_open(rv) and always use relation_openrv plus an appropriate relkind test. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_primary_conninfo
Magnus Hagander mag...@hagander.net writes: On Tue, Dec 28, 2010 at 18:12, Robert Haas robertmh...@gmail.com wrote: Although maybe now that we've made recovery.conf use the GUC lexer we oughta continue in that vein and expose those parameters as PGC_INTERNAL GUCs rather than inventing a new function for it... That's definitely another option that I wouldn't object to if people prefer that way. +1. Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Extensions, patch v16
On Wed, Dec 29, 2010 at 3:23 PM, Tom Lane t...@sss.pgh.pa.us wrote: Bruce Momjian br...@momjian.us writes: Oleg Bartunov wrote: it's clear we need versions, probably, major.minor would be enough. The problem I see is how to keep .so in sync with .sql ? Should we store .sql in database ? Don't people normally define the version number in the Makefile and pass the version string into the C code and perhaps a psql variable? We had a long discussion upthread of what version numbers to keep where. IMHO the Makefile is about the *least* useful place to put a version number; the more so if you want more than one. What we seem to need is a version number in the .sql file itself (so that we can tell whether we need to take action to update the extension's catalog entries). I'm not convinced yet whether there needs to be another version number embedded in the .so file --- it may well be that the PG major version number embedded with PG_MODULE_MAGIC is sufficient. Personally I'd forget the notion of major.minor numbers here; all that will accomplish is to complicate storage and comparison of the numbers. We just need a simple integer that gets bumped whenever the extension's SQL script changes. I think there are really two tasks here: 1. Identify whether a newer set of SQL definitions than the one installed is available. If so, the extension is a candidate for an upgrade. 2. Identify whether the installed version of the SQL definitions is compatible with the installed shared object. If it's not, we'd like the shared library load (or at a minimum, any use of the shared library) to fail when attempted, rather than attempting to plunge blindly onward and then crashing. As to point #2, what an extension author would typically do (I hope) when publishing a new shared object is to make it backward-compatible with some number of previous versions of the SQL definitions, but not necessarily all the way to the beginning of time. 
So the typical upgrade sequence would be to install the new .so, and then upgrade the SQL definitions. You'd want an interlock, though, in case someone tried to use a set of SQL definitions that were either too new or too old for the corresponding shared library. The "too new" case could occur if someone installed a new version of the shared library, upgraded the SQL definitions, and then put the old shared library file back. The "too old" case could occur if the extension were upgraded through many releases in a single step, such that whatever backward-compatibility support existed in the shared library didn't reach back far enough to cater to the ancient SQL definitions. In either case, you want to throw an error when someone tries to use the module, rather than soldiering on blindly and crashing. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] writable CTEs
On Tue, Dec 28, 2010 at 07:09:14AM -0500, Robert Haas wrote: On Tue, Dec 28, 2010 at 12:45 AM, David Fetter da...@fetter.org wrote: I don't see how people can be relying on links to 9.1-to-be's documentation. Well, it's always handy when the filenames are the same across versions. Ever looked at the 9.0 documentation for something and then modified the URL to see what it looked like in 8.1 or something? I have occasionally wondered if it would be possible to include in the SGML references to the same (or equivalent) page in previous versions, so that each page in the docs could reference the equivalent page in other versions. That would be extremely useful when trying to work out what works in your particular version. I've never done anything serious with SGML, but it'd be cool if it was possible. Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ Patriotism is when love of your own people comes first; nationalism, when hate for people other than your own comes first. - Charles de Gaulle signature.asc Description: Digital signature
Re: [HACKERS] pg_streamrecv for 9.1?
Magnus Hagander mag...@hagander.net writes: Would people be interested in putting pg_streamrecv (http://github.com/mhagander/pg_streamrecv) in bin/ or contrib/ for 9.1? I think it would make sense to do so. +1 for having that in core, only available for the roles WITH REPLICATION I suppose? I think that the base backup feature is more important than simply streaming chunks of WAL (SR already does this). Talking about the base backup over libpq, it is something we should implement to satisfy the people who ask for an easy replication setup. Yes, definitely. But that also needs server-side support. Yeah, but it's already in core for 9.1, we have pg_read_binary_file() there. We could propose a contrib module for previous versions implementing the function in C, that should be pretty easy to code. The only reason I didn't do that for pg_basebackup is that I wanted a self-contained python script, so that offering a public git repo is all I needed as far as distributing the tool goes. Yeah, the WIP patch Heikki posted is similar, except it uses tar format and is implemented natively in the backend with no need for pl/pythonu to be installed. As of HEAD the dependency on pl/whatever is easily removed. The included C tool would need to have a parallel option from the get-go if at all possible, but if you look at the pg_basebackup prototype, it would be good to drop the wrong pg_xlog support in there and rely on a proper archiving setup on the master. Do you want to work on internal archive and restore commands over libpq in the same effort too? I think this tool should be either a one-time client or a daemon with support for:

- running a base backup when receiving a signal
- continuous WAL streaming from a master
- accepting standby connections and streaming to them
- one-time libpq streaming of a WAL file, either way

Maybe we don't need to daemonize the tool from the get-go, but if you're going parallel for the base-backup case you're almost there, aren't you? 
Also having internal commands for archive and restore commands that rely on this daemon running would be great too. I'd offer more help if it wasn't for finishing the extension patches, I'm currently working on 'alter extension ... upgrade', including how to upgrade from pre-9.1 extensions. But if that flies quicker than I want, count me in for more than only user specs :) Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Extensions, patch v16
On Dec 29, 2010, at 1:27 PM, Robert Haas wrote: I think there are really two tasks here: 1. Identify whether a newer set of SQL definitions than the one installed is available. If so, the extension is a candidate for an upgrade. 2. Identify whether the installed version of the SQL definitions is compatible with the installed shared object. If it's not, we'd like the shared library load (or at a minimum, any use of the shared library) to fail when attempted, rather than attempting to plunge blindly onward and then crashing. 3. Check dependencies for one extension on other extensions. As to point #2, what an extension author would typically do (I hope) when publishing a new shared object is to make it backward-compatible with some number of previous versions of the SQL definitions, but not necessarily all the way to the beginning of time. So the typical upgrade sequence would be to install the new .so, and then upgrade the SQL definitions. You'd want an interlock, though, in case someone tried to use a set of SQL definitions that were either too new or too old for the corresponding shared library. The "too new" case could occur if someone installed a new version of the shared library, upgraded the SQL definitions, and then put the old shared library file back. The "too old" case could occur if the extension were upgraded through many releases in a single step, such that whatever backward-compatibility support existed in the shared library didn't reach back far enough to cater to the ancient SQL definitions. In either case, you want to throw an error when someone tries to use the module, rather than soldiering on blindly and crashing. Yeah, makes sense. Best, David -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] and it's not a bunny rabbit, either
On Wed, Dec 29, 2010 at 4:09 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: On 29.12.2010 06:54, Robert Haas wrote: With the patch:

rhaas=# cluster v;
ERROR: views do not support CLUSTER

"do not support" sounds like a missing feature, rather than a nonsensical command. How about something like "CLUSTER cannot be used on views"? In the latest version of this patch, I created four translatable strings per object type:

blats do not support %s (where %s is an SQL command)
blats do not support constraints
blats do not support rules
blats do not support triggers

It's reasonable enough to write "CLUSTER cannot be used on views", but "constraints cannot be used on views" seems more awkward to me. Or do we think that's OK? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Upgrading Extension, version numbers (was: [HACKERS] Extensions, patch v16)
Robert Haas robertmh...@gmail.com writes: On Wed, Dec 29, 2010 at 3:23 PM, Tom Lane t...@sss.pgh.pa.us wrote: We had a long discussion upthread of what version numbers to keep where. IMHO the Makefile is about the *least* useful place to put a version number; the more so if you want more than one. What we seem to need is a version number in the .sql file itself (so that we can tell whether we need to take action to update the extension's catalog entries). I'm not convinced yet whether there needs to be another version number embedded in the .so file --- it may well be that the PG major version number embedded with PG_MODULE_MAGIC is sufficient. In the .sql file? You mean something like: ALTER EXTENSION ... SET VERSION '...'; It's currently managed in the .control file of the extension, which allows us to list available extensions and their version number without having to parse the .sql script from the C code... Personally I'd forget the notion of major.minor numbers here; all that will accomplish is to complicate storage and comparison of the numbers. We just need a simple integer that gets bumped whenever the extension's SQL script changes. For contrib, as you wish. Now for third-party extensions, I don't see us having any authority on what people will use internally in their companies, etc. 1. Identify whether a newer set of SQL definitions than the one installed is available. If so, the extension is a candidate for an upgrade. Well, it's currently (WIP in the upgrade branch of my repo) easier than that, really. You have the control file on the file system and you have the extension's entry in the catalogs. 
http://git.postgresql.org/gitweb?p=postgresql-extension.git;a=shortlog;h=refs/heads/upgrade

What upgrade means here is running a given SQL script, that you choose depending on the current and next version strings, following a scheme that has been extensively discussed in another thread, and is currently implemented like this:

# lo
comment = 'managing Large Objects'
version = '9.1devel'
relocatable = true
upgrade_from_null = 'null = lo.upgrade.sql'

Here, any property that begins with 'upgrade_from_' is considered as an upgrade setup and the part after the prefix is not considered. The value is meant to have two parts separated by '=': the first is either null or a regexp matched against the currently installed version number, the second part is the upgrade script name to use at ALTER EXTENSION ... UPGRADE. We support the 'null' version number to be able to upgrade from existing code which is not organized as an extension yet. The aim is to be able to:

CREATE EMPTY EXTENSION lo; -- version is null here
ALTER EXTENSION lo UPGRADE;

And run a script containing lines that will look like this:

alter domain @extsch...@.lo set extension lo;
alter function @extsch...@.lo_oid(lo) set extension lo;
alter function @extsch...@.lo_manage() set extension lo;

Note that we always need to support the placeholder here, because of course following dependencies at this point isn't possible. 2. Identify whether the installed version of the SQL definitions is compatible with the installed shared object. If it's not, we'd like the shared library load (or at a minimum, any use of the shared library) to fail when attempted, rather than attempting to plunge blindly onward and then crashing. Well, the way I see things, it's already too late and there's nothing we can easily do to prevent that. What I mean is that the user will typically upgrade the OS-level package first, then apply the upgrade on the database(s). 
$ apt-get install postgresql-9.1-prefix
$ psql -U postgres -c 'alter extension prefix upgrade' somedb

At the time you tell PostgreSQL about the new extension, the shared object file has been in place for some time already, and the upgrade SQL script has not been run yet. What I hope extension authors will do is document whether any upgrade requires a restart or will otherwise be responsible for instability in the server for backends started with the newer .so before the upgrade script has been run. So that users/DBAs will know whether the upgrade calls for a maintenance window. I could see us trying to shoehorn such information into the control file too, but would ERRORing out on LOAD be any better than taking the compatibility chance? Knowing that the compatibility in most cases depends a lot on the actual call paths? Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] and it's not a bunny rabbit, either
On Wed, Dec 29, 2010 at 04:53:47PM -0500, Robert Haas wrote: On Wed, Dec 29, 2010 at 4:09 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: On 29.12.2010 06:54, Robert Haas wrote: With the patch:

rhaas=# cluster v;
ERROR: views do not support CLUSTER

"do not support" sounds like a missing feature, rather than a nonsensical command. How about something like "CLUSTER cannot be used on views"? In the latest version of this patch, I created four translatable strings per object type:

blats do not support %s (where %s is an SQL command)
blats do not support constraints
blats do not support rules
blats do not support triggers

It's reasonable enough to write "CLUSTER cannot be used on views", but "constraints cannot be used on views" seems more awkward to me. Or do we think that's OK? That particular one looks good insofar as it describes reality. With predicate locks, etc., it may become untrue later, though :) Cheers, David. -- David Fetter da...@fetter.org http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fet...@gmail.com iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] understanding minimum recovery ending location
Howdy, I am hoping someone can help me better understand what the minimum recovery ending location of pg_controldata represents with regards to 9.0 hot standbys. When I look at any of our 8.4 (or lower) installs this number is almost always somewhere in the past of the xlog timeline (presuming the server has been up for any length of time), which makes sense because we've done enough replay that we're covered for consistent slave recovery purposes. However, on the hot standby machines the number seems to normally be at some point in the future. Below are the relevant bits of pg_controldata output from a hot standby I was looking at earlier today:

Database cluster state:               in archive recovery
pg_control last modified:             Wed 29 Dec 2010 04:54:34 PM GMT
Latest checkpoint location:           56/21020CA8
Prior checkpoint location:            56/1E36D780
Latest checkpoint's REDO location:    56/1F599068
Time of latest checkpoint:            Wed 29 Dec 2010 04:46:09 PM GMT
Minimum recovery ending location:     56/24B88500
Backup start location:                0/0
Current wal_level setting:            hot_standby

As you can see, the minimum recovery ending location is clearly ahead of the most recent checkpoint activity. Now, it's not always like this, but most of the time it is, and since the minimum recovery ending location seems to be regularly updating going forward, it tends to make me think that either I misunderstand what this means, or it means something different in this context. 
Partially because I can query on the hot standby machine already, so I know I have a recoverable slave, but even more so in the context of pg_controldata on the master:

Database cluster state:               in production
pg_control last modified:             Wed 29 Dec 2010 04:54:04 PM GMT
Latest checkpoint location:           56/234325B0
Prior checkpoint location:            56/21020CA8
Latest checkpoint's REDO location:    56/220558A8
Time of latest checkpoint:            Wed 29 Dec 2010 04:51:09 PM GMT
Minimum recovery ending location:     0/0
Backup start location:                0/0
Current wal_level setting:            hot_standby

As you can see, the master's checkpoint information is actually ahead of the slave's (as expected), but even in this scenario, the slave is saying that the minimum recovery ending location is actually in the future compared to the latest checkpoint and redo locations of the master. This seems like a bug to me (how can it possibly be required that the minimum recovery ending location would be at a point in the xlog timeline that may never exist?), but I am guessing this field is actually reporting something different in this context, so I am hoping someone can help clarify it for me? Robert Treat http://www.xzilla.net
Re: [HACKERS] understanding minimum recovery ending location
On 30.12.2010 00:19, Robert Treat wrote: Howdy, I am hoping someone can help me better understand what the minimum recovery ending location of pg_controldata represents with regards to 9.0 hot standbys. When I look at any of our 8.4 (or lower) installs this number is almost always somewhere in the past of the xlog timeline (presuming the server has been up for any length of time), which makes sense because we've done enough replay that we're covered for consistent slave recovery purposes. However, on the hot standby machines the number seems to normally be at some point in the future. Below are the relevant bits of pg_controldata output from a hot standby I was looking at earlier today:

Database cluster state:               in archive recovery
pg_control last modified:             Wed 29 Dec 2010 04:54:34 PM GMT
Latest checkpoint location:           56/21020CA8
Prior checkpoint location:            56/1E36D780
Latest checkpoint's REDO location:    56/1F599068
Time of latest checkpoint:            Wed 29 Dec 2010 04:46:09 PM GMT
Minimum recovery ending location:     56/24B88500
Backup start location:                0/0
Current wal_level setting:            hot_standby

As you can see, the minimum recovery ending location is clearly ahead of the most recent checkpoint activity. Minimum recovery ending location means: how far does the (standby) server need to replay WAL before it's safe to start it up. It is continuously updated as the recovery progresses, as data pages are flushed to disk. The reason is that if you kill the server during recovery, you have to recover up to the same point again, or the database wouldn't be consistent. Specifically, the WAL records for any data page changes that were already flushed to disk from the buffer cache before killing recovery must be re-replayed. So in practice the minimum recovery ending location follows somewhat behind the last replayed WAL record. 
-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Avoiding rewrite in ALTER TABLE ALTER TYPE
On Wed, Dec 29, 2010 at 10:56:39AM -0500, Robert Haas wrote: On Dec 29, 2010, at 7:56 AM, Noah Misch n...@leadboat.com wrote: Having thought on it more, though, it actually seems best to attempt the verification scan *every* time. In most ineligible conversions, an inequality will appear very early, so the scan is effectively O(1) in the negative case. A notable exception is something like char(6)->varchar(6) in a table with a billion tuples having length(col) = 6 and one with length(col) = 5. The verification scan might read most of the table before finding the one tuple that forces a rewrite. That isn't a particularly regular scenario in my experience, so the "just do the right thing" aspect of preceding every potential rewrite with a verification scan seems to win out. I think this scenario will be more common than you might think. Tables don't contain random data; they contain data that the DBA thinks is valid. The situation where the data is mostly as you expect but with a few kooky rows is, in my experience, extremely common. Perhaps. A few kooky rows is indeed common, but we're talking about a specific breed of kookiness: 99.9% of the rows have identical bits after an ALTER TYPE transformation expression, and 0.1% have different bits. Is that common? In case it was not obvious, I'll note that any error thrown by a transformation expression during the verification scan still aborts the ALTER TABLE. A varchar(20)->varchar(10) that finds an 11-char string will fail permanently during the verification scan. Indeed, the primary value of the verification scan is to distinguish positive and error, not positive and negative. 
Expanding on my introduction, none of the following can yield a negative verification scan; the result is always positive or an error:

CREATE DOMAIN loosedom AS text;
CREATE DOMAIN tightdom AS text CHECK (value LIKE '%/');
CREATE TABLE t (c varchar(6));
INSERT INTO t VALUES ('abc/'),('de/');
ALTER TABLE t ALTER c TYPE varchar(8);
ALTER TABLE t ALTER c TYPE text;
ALTER TABLE t ALTER c TYPE loosedom;
ALTER TABLE t ALTER c TYPE xml USING c::xml;
ALTER TABLE t ALTER c TYPE varchar(64);
ALTER TABLE t ALTER c TYPE tightdom;

Adding a bpchar into the mix makes a negative verification scan possible, as does a USING clause having a truncating effect. Continuing the example, these can and would get a negative verification scan:

ALTER TABLE t ALTER c TYPE character(6);
ALTER TABLE t ALTER c TYPE varchar(5) USING c::varchar(5);

Plenty of academic USING clause examples exist:

ALTER TABLE t ALTER c TYPE varchar(8) USING CASE c WHEN 'de/' THEN 'foo' ELSE c END;

Verification scans for conversions between fundamentally different types will generally end in the negative at the first tuple. Consider {timestamp,bigint,real,numeric,varbit}->text, int->bigint, interval->reltime, etc. I can't think of a decent non-academic example where heterogeneous conversions like these will dupe the verification scan for even a handful of tuples. Granted, one can flip that around as an argument for declaring the conversions that have a chance. And it makes the worst case a LOT worse. I suppose the absolute worst case would involve an ALTER TABLE adding brutally expensive CHECK constraints, such that the cost of computing those constraints would dominate the cost of both the verification scan and the rewrite, yielding a 100% slower ALTER TABLE run. A more realistic bad case might be a table much larger than memory with no indexes, and the verification scan adds a full seq scan for nothing. A crude test here has rewriting such a table taking 7x as long as a seq scan on it. 
By the back of that envelope, we'll take about 15% more time. The right mix of expensive expressions will raise that percentage, and the presence of indexes will drop it. Remember though, we're still only talking about the relatively-rare cases that even can get a negative verification scan. I really doubt this is worth the complexity anyway - we'd have the verification scan regardless of how we choose when to use it, because how else would we implement no-rewrite varchar(8)->varchar(4) or text->xml? An unconditional verification scan is merely the most trivial algorithm for deciding when to employ it. Those conversions are semantically similar to adding CHECK constraints, and in that sense we already have an initial verification scan implementation: ATRewriteTable(..., InvalidOid, ...). Converting between two types that are mostly-but-not-quite binary compatible seems like an edge case in every sense. Yes. Indeed, that's the intuitive basis for my hypothesis that the verification scan will usually either succeed or fail early. I don't advocate this approach to pick up edge cases, but to pick up reasonable cases _without explicit annotations_ showing them to be achievable. Take the text->xml example, certainly of genuine value if not top-frequency. I see three ways to ensure we do a
[HACKERS] SLRU API tweak
Attached is a small patch to avoid putting an opaque structure into the slru.h file and using it in an external function call where external callers must always specify NULL.

-Kevin

*** a/src/backend/access/transam/clog.c
--- b/src/backend/access/transam/clog.c
***************
*** 445,451 ****
  BootStrapCLOG(void)
  	slotno = ZeroCLOGPage(0, false);
  	/* Make sure it's written out */
! 	SimpleLruWritePage(ClogCtl, slotno, NULL);
  	Assert(!ClogCtl->shared->page_dirty[slotno]);
  	LWLockRelease(CLogControlLock);
--- 445,451 ----
  	slotno = ZeroCLOGPage(0, false);
  	/* Make sure it's written out */
! 	SimpleLruWritePage(ClogCtl, slotno);
  	Assert(!ClogCtl->shared->page_dirty[slotno]);
  	LWLockRelease(CLogControlLock);
***************
*** 698,704 ****
  clog_redo(XLogRecPtr lsn, XLogRecord *record)
  	LWLockAcquire(CLogControlLock, LW_EXCLUSIVE);
  	slotno = ZeroCLOGPage(pageno, false);
! 	SimpleLruWritePage(ClogCtl, slotno, NULL);
  	Assert(!ClogCtl->shared->page_dirty[slotno]);
  	LWLockRelease(CLogControlLock);
--- 698,704 ----
  	LWLockAcquire(CLogControlLock, LW_EXCLUSIVE);
  	slotno = ZeroCLOGPage(pageno, false);
! 	SimpleLruWritePage(ClogCtl, slotno);
  	Assert(!ClogCtl->shared->page_dirty[slotno]);
  	LWLockRelease(CLogControlLock);
*** a/src/backend/access/transam/multixact.c
--- b/src/backend/access/transam/multixact.c
***************
*** 1454,1460 ****
  BootStrapMultiXact(void)
  	slotno = ZeroMultiXactOffsetPage(0, false);
  	/* Make sure it's written out */
! 	SimpleLruWritePage(MultiXactOffsetCtl, slotno, NULL);
  	Assert(!MultiXactOffsetCtl->shared->page_dirty[slotno]);
  	LWLockRelease(MultiXactOffsetControlLock);
--- 1454,1460 ----
  	slotno = ZeroMultiXactOffsetPage(0, false);
  	/* Make sure it's written out */
! 	SimpleLruWritePage(MultiXactOffsetCtl, slotno);
  	Assert(!MultiXactOffsetCtl->shared->page_dirty[slotno]);
  	LWLockRelease(MultiXactOffsetControlLock);
***************
*** 1465,1471 ****
  BootStrapMultiXact(void)
  	slotno = ZeroMultiXactMemberPage(0, false);
  	/* Make sure it's written out */
! 	SimpleLruWritePage(MultiXactMemberCtl, slotno, NULL);
  	Assert(!MultiXactMemberCtl->shared->page_dirty[slotno]);
  	LWLockRelease(MultiXactMemberControlLock);
--- 1465,1471 ----
  	slotno = ZeroMultiXactMemberPage(0, false);
  	/* Make sure it's written out */
! 	SimpleLruWritePage(MultiXactMemberCtl, slotno);
  	Assert(!MultiXactMemberCtl->shared->page_dirty[slotno]);
  	LWLockRelease(MultiXactMemberControlLock);
***************
*** 1986,1992 ****
  multixact_redo(XLogRecPtr lsn, XLogRecord *record)
  	LWLockAcquire(MultiXactOffsetControlLock, LW_EXCLUSIVE);
  	slotno = ZeroMultiXactOffsetPage(pageno, false);
! 	SimpleLruWritePage(MultiXactOffsetCtl, slotno, NULL);
  	Assert(!MultiXactOffsetCtl->shared->page_dirty[slotno]);
  	LWLockRelease(MultiXactOffsetControlLock);
--- 1986,1992 ----
  	LWLockAcquire(MultiXactOffsetControlLock, LW_EXCLUSIVE);
  	slotno = ZeroMultiXactOffsetPage(pageno, false);
! 	SimpleLruWritePage(MultiXactOffsetCtl, slotno);
  	Assert(!MultiXactOffsetCtl->shared->page_dirty[slotno]);
  	LWLockRelease(MultiXactOffsetControlLock);
***************
*** 2001,2007 ****
  multixact_redo(XLogRecPtr lsn, XLogRecord *record)
  	LWLockAcquire(MultiXactMemberControlLock, LW_EXCLUSIVE);
  	slotno = ZeroMultiXactMemberPage(pageno, false);
! 	SimpleLruWritePage(MultiXactMemberCtl, slotno, NULL);
  	Assert(!MultiXactMemberCtl->shared->page_dirty[slotno]);
  	LWLockRelease(MultiXactMemberControlLock);
--- 2001,2007 ----
  	LWLockAcquire(MultiXactMemberControlLock, LW_EXCLUSIVE);
  	slotno = ZeroMultiXactMemberPage(pageno, false);
! 	SimpleLruWritePage(MultiXactMemberCtl, slotno);
  	Assert(!MultiXactMemberCtl->shared->page_dirty[slotno]);
  	LWLockRelease(MultiXactMemberControlLock);
*** a/src/backend/access/transam/slru.c
--- b/src/backend/access/transam/slru.c
***************
*** 78,83 ****
--- 78,85 ----
  	int			segno[MAX_FLUSH_BUFFERS];	/* their log seg#s */
  } SlruFlushData;
+ 
+ typedef struct SlruFlushData *SlruFlush;
+ 
  /*
   * Macro to mark a buffer slot most recently used. Note multiple evaluation
   * of arguments!
***************
*** 123,128 ****
--- 125,131 ----
  static int	slru_errno;
  
  static void SimpleLruZeroLSNs(SlruCtl ctl, int slotno);
  static void SimpleLruWaitIO(SlruCtl ctl, int slotno);
+ static
Re: [HACKERS] estimating # of distinct values
Well, but that's not 7%, that's 7x! And the theorem says 'greater or equal' so this is actually the minimum - you can get a much bigger difference with lower probability. So you can easily get an estimate that is a few orders of magnitude off. FWIW, based on query performance, estimates which are up to 5X off are tolerable, and anything within 3X is considered accurate. Above 5X the probability of bad query plans becomes problematically high. Of course, if you're doing cross-column stats, the accuracy of each individual column becomes critical since estimation errors could be multiplicative in the worst case (i.e. if colA is 3X off and colB is 0.3X then colA-colB will be 9X off). Anyway, I look forward to your experiments with stream-based estimators. -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
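Josh's compounding-error point can be sketched numerically. This is a toy model only (the helper names and numbers are mine, not anything from the planner): treat each column's estimate as a ratio of estimated to actual rows; in the worst case the magnitudes of the per-column errors multiply.

```python
def times_off(err):
    """How many times off an estimate is, regardless of direction
    (err = estimated / actual). 3.0 and 0.333... are both ~3X off."""
    return max(err, 1.0 / err)

def worst_case_combined(err_a, err_b):
    """Worst case when two independent per-column misestimates
    compound (e.g. a two-column predicate): magnitudes multiply."""
    return times_off(err_a) * times_off(err_b)

# colA estimated 3X too high, colB at 0.3X of actual (~3.3X too low):
# combined worst case is ~10X, in the ballpark of the 9X quoted above.
print(worst_case_combined(3.0, 0.3))
```

Under the "within 3X is accurate, above 5X is dangerous" rule of thumb, two individually tolerable 3X errors can already push a combined estimate well past the danger line.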
Re: [HACKERS] Anyone for SSDs?
On Wed, Dec 29, 2010 at 3:34 PM, Bruce Momjian br...@momjian.us wrote: Bruce Momjian wrote: Vaibhav Kaushal wrote: On Fri, 2010-12-10 at 18:07 -0800, Josh Berkus wrote: On 12/10/10 5:06 PM, Daniel Loureiro wrote: A quicksort method on sequential disk is just awful to contemplate in a non-SSD world, but it's possible on an SSD. So, code it. Shouldn't be hard to write a demo comparison. I don't believe that SSDs make quicksort-on-disk feasible, but would be happy to be proven wrong. I too do not believe it in the normal case. However, considering the 'types' of SSDs, it may be feasible! Asking for 'the next page' and getting it has a time delay in the process. While on a regular HDD with spindles the question is where that page is located, with SSDs the question disappears, because the access time is uniform. Also, the access time is about 100 times faster, which would change quite a few things about the whole process. What _is_ interesting is that Postgres often has sequential and random/disk ways of doing things, and by reducing random_page_cost when using SSDs, you automatically use more random operations, so in a way the Postgres code was already prepared for SSD usage. Surprisingly, we had to change very little. To add to this very late reply, we basically had random methods to do things (in RAM), and sequential/random methods for disk. By changing random_page_cost, we favor doing random things on disk. The big question is whether there are random things we have never implemented on disk that now make sense --- off hand, I can't think of any. The idea of avoiding quicksort when we know we need to spill to disk is the type of thing I wonder whether we should investigate, if you figure that spill to disk means SSDs so it's not so much of a performance hit. This reminds me of some performance testing we did maybe a year, year and a half ago, trying to see how best to get performance by adding some SSDs into one of our servers.
Basically speed increased as we changed things like so:
- put entire $pgdata on sata
- put entire $pgdata on ssd
- put xlogs on ssd, pgdata on sata
- put pgdata and xlogs on sata, put arc on ssd, crank up postgres's memory settings
(arc being zfs's adaptive replacement cache, so basically giving the server a second, very large level of memory to work with, and then configuring postgres to make use of it.) It wasn't terribly obvious to me why this ended up outperforming the initial idea of putting everything on ssd, but my impression was that the more you could force postgres into making decisions as if it was dealing with fast storage rather than slow storage, the better off you'd be (and that random_page_cost alone is not inclusive enough to do this for you). Robert Treat http://www.xzilla.net
Re: [HACKERS] Avoiding rewrite in ALTER TABLE ALTER TYPE
On Wed, Dec 29, 2010 at 11:16:23AM -0500, Tom Lane wrote: Noah Misch n...@leadboat.com writes: ALTER TABLE ALTER TYPE always rewrites the table heap and its indexes. In some cases, we can determine that doing so is unhelpful, and that the conversion shall always succeed: I wish to replace table rewrites with table verification scans where possible, then skip those verification scans where possible. This has been discussed before; have you read the previous threads? I cited two threads I had read on the subject. Were there other important ones? I really really dislike the notion of a verification scan: it's basically work that is going to be useless if it fails. I think your argument that it will usually fail quickly is quite unconvincing, and in any case the situations where it is useful at all are too thin on the ground to be worth the code space to implement it. It seems sufficient to me to skip the rewrite in cases of provable binary compatibility, with possibly an extra check for safe changes of typmod. With respect to the latter, I agree a type-specific function to compare the typmods would be the way to go, although exemptor seems a pretty badly chosen name for it. I have attempted to expand on these problems in my reply to Robert.
Re: [HACKERS] sepgsql contrib module
(2010/12/27 17:53), Simon Riggs wrote: On Fri, 2010-12-24 at 11:53 +0900, KaiGai Kohei wrote: The attached patch is the modular version of SE-PostgreSQL. Looks interesting. Couple of thoughts... Docs don't mention row-level security. If we don't have it, I think we should say that clearly. Indeed, it is a good idea for the document to clearly mention which features are not implemented in this version - not only row-level security, but DDL permissions and so on. I'd like to revise it soon. I think we need a Guide to Security Labels section in the docs. Very soon, because it's hard to know what is being delivered and what is not. Should it describe what a security label is and what it is for? OK, I'd like to add this section here. Is the pg_seclabel table secure? Looks like the labels will be available to read. If we want to control visibility of each label, we need row-level granularity here. How do we tell if sepgsql is installed? Check for the existence of the sepgsql.* GUC variables. What happens if someone alters the configuration so that the sepgsql plugin is no longer installed? Does the hidden data become visible? Yes. If the sepgsql plugin is uninstalled, the hidden data becomes visible. But that is no great concern: since only a person who is allowed to edit postgresql.conf can uninstall it, it cannot be uninstalled at run time. (An exception is loading a malicious module, but we will be able to hook this operation in a future version.) Thanks, -- KaiGai Kohei kai...@kaigai.gr.jp
Re: [HACKERS] Avoiding rewrite in ALTER TABLE ALTER TYPE
On Wed, Dec 29, 2010 at 02:01:28PM -0500, Tom Lane wrote: Robert Haas robertmh...@gmail.com writes: On Dec 29, 2010, at 11:16 AM, Tom Lane t...@sss.pgh.pa.us wrote: I really really dislike the notion of a verification scan: it's basically work that is going to be useless if it fails. I think it has potential in cases like text to xml. In that case it'll either work or fail, with no possibility of requiring a do-over. Scanning the whole table is a whole lot cheaper than rewriting it. I don't believe avoiding the write part (but not the read part, nor the XML syntax verification part) is a sufficiently compelling argument to justify having that code path. There are not enough distinct datatypes sharing binary representations to make this a worthwhile thing to worry over. Basically, I believe that the only use-case that will have more than epsilon number of users is I want to make this varchar(5) into varchar(10), or possibly text. We can fix that case without adding a boatload more code that we'll have to maintain. Those are certainly the big ones, but I've also hit these in the field: SET timezone = 'UTC'; CREATE DOMAIN state AS text CHECK (VALUE ~ '[A-Z]{2}'); CREATE TABLE t (c0 text, c1 text, c2 timestamp); ALTER TABLE t ALTER c0 TYPE state, ALTER c1 TYPE varchar(6), ALTER c2 TYPE timestamptz; It felt normal, but I very well may have been in that epsilon unawares. I would be disappointed to end the project with no way to avoid rewrites for them. That being said, there are certainly more ways to achieve all of those than the one I have proposed. I do have some interest in the idea of having a type-specific function that can recognize no-op typmod changes, but I would envision that as being an expression evaluation optimization: let the planner throw away the call to the length-checking function when it isn't going to do anything. 
It's not by any means only useful in ALTER COLUMN TYPE --- and in fact probably doesn't even need any bespoke code there, if we put it into expression_planner() instead. That sounds significantly better. Ignorant question: how often will expression_planner have the old typmod available? Also, would this be worth attempting unconditionally, or might some callers use the expression too few times and suffer from the overhead of deciding whether to use it? Perhaps the order of patches I proposed was faulty and should have placed the most important use cases first, like this:

1. Modify ATPrepAlterColumnType to detect a bare RelabelType transformation expression and conditionally skip/downgrade ATRewriteTable. This would cover varchar -> text and similar domain changes, but a full reindex remains.
2. In the no-rewrite case, modify the vicinity of finish_heap_swap to narrow us from reindexing the entire table to merely reindexing those indexes touched by the operator class changes.
3. Further skip reindex operations when the operator class has changed but the operator family has not.
4. Add pg_cast.castexemptor and modify CREATE CAST to populate it. Define an exemptor (or however we name it) for the varchar length coercion cast. Modify expression_planner to use it to strip out superfluous length coercion casts. This would cover varchar(10) -> varchar(20) and similar domain changes.
5. Define exemptors for time, timetz, timestamp, timestamptz, and interval.
6. Define exemptor for bit varying.
7. Define exemptor for numeric.
8. (subject to further discussion) Add the verification scan to the vicinity of ATRewriteTable and skip the table rewrite when it finds no changes. This covers varchar(20) -> varchar(10) and similar domain changes, timestamp -> timestamptz, and text -> xml.

Thanks, nm
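The three-way outcome Noah is steering towards - skip the scan, verify only, or fully rewrite - can be modelled in miniature. The sketch below is purely illustrative Python: the type rules and the binary-compatibility table are assumptions for the example, not the proposed catalog machinery.

```python
def alter_column_strategy(old_type, new_type):
    """Pick the cheapest safe strategy for ALTER COLUMN TYPE.

    Types are (name, typmod) pairs; typmod is the max length, or
    None for unlimited (as with text). Returns 'none' (skip the
    scan), 'verify' (scan may error but never rewrites), or
    'rewrite' (full table rewrite).
    """
    name_old, mod_old = old_type
    name_new, mod_new = new_type
    # Assumed binary-compatible pairs for this toy model only.
    binary_compatible = {('varchar', 'text')}

    if name_old == name_new:
        if mod_new is None or (mod_old is not None and mod_new >= mod_old):
            return 'none'      # widening, e.g. varchar(5) -> varchar(10)
        return 'verify'        # narrowing, e.g. varchar(20) -> varchar(10)
    if (name_old, name_new) in binary_compatible:
        return 'none'          # varchar -> text: same bits, no check needed
    return 'rewrite'           # anything else: transform and rewrite

print(alter_column_strategy(('varchar', 5), ('varchar', 10)))   # none
print(alter_column_strategy(('varchar', 20), ('varchar', 10)))  # verify
print(alter_column_strategy(('varchar', 5), ('text', None)))    # none
print(alter_column_strategy(('text', None), ('xml', None)))     # rewrite
```

In the real system the typmod comparison would live in a per-type exemptor function and the compatibility knowledge in the casts catalog; this only shows the shape of the decision.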
[HACKERS] Vacuum of newly activated 8.3.12 standby receives warnings page xxx is uninitialized --- fixing
We have been seeing these warnings recently whenever a standby is brought up (typically to check it is ok). Sometimes they are coupled with corrupted indexes which require a REINDEX to fix. Initially I thought these uninitialized pages were due to primary crashes or hardware issues, however I've now managed to come up with a recipe to generate them on demand on my workstation. Pitrtools appears to be an essential part of the recipe - at this stage I'm not sure if it is actually doing something directly to cause this or merely tickling some Postgres recovery bug. The essential triggering element seems to be performing a base backup while the system is busy. Here's the description:

1/ Patch 8.3's pgbench using the attached diff, and initialize a scale 100 dataset
2/ Get the Pitrtools primary and standby configs set up (examples attached)
3/ Start pgbench with at least 4 clients and 20 transactions
4/ After history has approx 1 rows initiate backup from the standby
5/ After history has approx 14 rows bring up the standby and perform a VACUUM

Typically I'm seeing a large number of consecutive uninitialized pages in the accounts table. What is also very interesting is that if I set up the standby in a more bare-bones manner (i.e. manually running pg_start_backup and rsync + pg_standby) then I can *never* elicit any uninitialized pages. I'm frankly puzzled about what Pitrtools is doing that is different - I only noticed it using rsync compression (-z) and doing rsync backups via pulling from the standby rather than pushing from the primary (I'm in the process of trying these variations out in the bare-bones case). Just as I'm writing this I see Pitrtools rsync's pg_xlog - I wonder if there are timing issues which mean that recovery might use some (corrupted) logs from there before the (clean) archived ones arrive (will check).
Some more detail about the system: Postgres 8.3.12 on Ubuntu Lucid x86_64 and Debian Lenny (lxc guests), rsync 3, Pitrtools 1.2-1

Postgres config changes:
autovacuum = off # prevent any relation truncation
max_fsm_pages = 2 # encourage new page creation

Pitrtools Steps:

primary:
$ grep archive_command postgresql.conf
archive_command = 'cmd_archiver -C /etc/pitrtools/cmd_archiver.ini -F %p'

standby:
$ cmd_standby -C /etc/pitrtools/cmd_standby.ini -B
$ cmd_standby -C /etc/pitrtools/cmd_standby.ini -Astop_basebackup
$ cp /etc/postgresql/8.3/main/pg_hba.conf \
  /var/lib/postgresql/8.3/main/pg_hba.conf
$ cp /etc/postgresql/8.3/main/postgresql.conf \
  /var/lib/postgresql/8.3/main/postgresql.conf
$ cmd_standby -C /etc/pitrtools/cmd_standby.ini -S
$ cmd_standby -C /etc/pitrtools/cmd_standby.ini -F999

Bare Bones Steps:

primary:
$ grep archive_command postgresql.conf
archive_command = 'rsync %p standby:/var/lib/postgresql/archive'
$ psql -c "SELECT pg_start_backup('backup');"
$ rsync --exclude pg_xlog/\* --exclude postmaster.pid -a * \
  standby:/var/lib/postgresql/8.3/main
$ psql -c "SELECT pg_stop_backup();"

standby:
$ grep restore_command recovery.conf
restore_command = '/usr/lib/postgresql/8.3/bin/pg_standby -t /tmp/trigger.5432 /var/lib/postgresql/archive %f %p %r'
$ /etc/init.d/postgresql-8.3 start
$ touch /tmp/trigger.5432

regards Mark

P.S.: cc'ing Pg Hackers as a variation of this topic has come up there several times.

pgbench.diff.gz Description: GNU Zip compressed data
cmd_archiver.ini.gz Description: GNU Zip compressed data
cmd_standby.ini.gz Description: GNU Zip compressed data
Re: [HACKERS] sepgsql contrib module
On Thu, 2010-12-30 at 09:26 +0900, KaiGai Kohei wrote: What happens if someone alters the configuration so that the sepgsql plugin is no longer installed. Does the hidden data become visible? Yes. If sepgsql plugin is uninstalled, the hidden data become visible. But no matter. Since only a person who is allowed to edit postgresql.conf can uninstall it, we cannot uninstall it in run-time. (An exception is loading a malicious module, but we will be able to hook this operation in the future version.) IMHO all security labels should be invisible if the provider is not installed correctly. That at least prevents us from accidentally de-installing a module and having top secret data be widely available. If you have multiple providers configured, you need to be careful not to allow a provider that incorrectly implements the plugin API, so that prior plugins are no longer effective. -- Simon Riggs http://www.2ndQuadrant.com/books/ PostgreSQL Development, 24x7 Support, Training and Services
Re: [HACKERS] sepgsql contrib module
(2010/12/30 9:34), Simon Riggs wrote: On Thu, 2010-12-30 at 09:26 +0900, KaiGai Kohei wrote: What happens if someone alters the configuration so that the sepgsql plugin is no longer installed. Does the hidden data become visible? Yes. If sepgsql plugin is uninstalled, the hidden data become visible. But no matter. Since only a person who is allowed to edit postgresql.conf can uninstall it, we cannot uninstall it in run-time. (An exception is loading a malicious module, but we will be able to hook this operation in the future version.) IMHO all security labels should be invisible if the provider is not installed correctly. Probably, that needs row-level granularity to control the visibility of individual pg_seclabel entries, because all the providers share the same system catalog. So, I don't think this mechanism is feasible right now. That at least prevents us from accidentally de-installing a module and having top secret data be widely available. If you have multiple providers configured, you need to be careful not to allow a provider that incorrectly implements the plugin API, so that prior plugins are no longer effective. Yep. It is the responsibility of the DBA who sets up security providers. The DBA has to install only trusted, well-debugged modules (not limited to security providers) to avoid trouble. Thanks, -- KaiGai Kohei kai...@kaigai.gr.jp
Re: [HACKERS] Re: new patch of MERGE (merge_204) a question about duplicated ctid
Marko Tiikkaja wrote: As far as I can tell, this should work. I played around with the patch and the problem seems to be the VALUES:

MERGE INTO Stock t USING (SELECT 30, 2000) AS s(item_id,balance)
ON s.item_id=t.item_id
WHEN MATCHED THEN UPDATE SET balance=t.balance + s.balance
WHEN NOT MATCHED THEN INSERT VALUES(s.item_id,s.balance)
;
MERGE 1

Good catch...while I think the VALUES syntax should work, that is a useful workaround so I could keep testing. I rewrote like this (original syntax commented out):

MERGE INTO Stock t
-- USING (VALUES(10,100)) AS s(item_id,balance)
USING (SELECT 10,100) AS s(item_id,balance)
ON s.item_id=t.item_id
WHEN MATCHED THEN UPDATE SET balance=t.balance + s.balance
WHEN NOT MATCHED THEN INSERT VALUES(s.item_id,s.balance)
;

And that got me back again to concurrent testing. Moving on to the next two problems...the basic MERGE feature seems to have stepped backwards a bit too. I'm now seeing these quite often:

ERROR: duplicate key value violates unique constraint pgbench_accounts_pkey
DETAIL: Key (aid)=(176641) already exists.
STATEMENT: MERGE INTO pgbench_accounts t USING (SELECT 176641,1+(176641 / 100)::integer,168,'') AS s(aid,bid,balance,filler) ON s.aid=t.aid WHEN MATCHED THEN UPDATE SET abalance=abalance + s.balance WHEN NOT MATCHED THEN INSERT VALUES(s.aid,s.bid,s.balance,s.filler);

On my concurrent pgbench test, which had been working before.
Possibly causing that, the following assertion is tripping:

TRAP: FailedAssertion(!(epqstate->origslot != ((void *) 0)), File: execMain.c, Line: 1762)

That's coming from the following code:

void
EvalPlanQualFetchRowMarks(EPQState *epqstate)
{
	ListCell   *l;

	Assert(epqstate->origslot != NULL);

	foreach(l, epqstate->rowMarks)

Stepping back to summarize...here's a list of issues I know about with the current v204 code:

1) VALUES syntax doesn't work anymore
2) Assertion failure in EvalPlanQualFetchRowMarks
3) Duplicate key bug (possibly a direct result of #2)
4) Attempts to use MERGE in a function spit back ERROR: table is not a known function
5) The ctid junk attr handling needs to be reviewed more carefully, based on author request.

I've attached the current revisions of all my testing code in hopes that Boxuan might try and replicate these (this makes it simple to replicate #1 through #3), and therefore confirm whether changes made do better.

-- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD
PostgreSQL Training, Services and Support www.2ndQuadrant.us
PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books

DROP TABLE Stock;
CREATE TABLE Stock(item_id int UNIQUE, balance int);
INSERT INTO Stock VALUES (10, 2200);
INSERT INTO Stock VALUES (20, 1900);
SELECT * FROM Stock ORDER BY item_id;

MERGE INTO Stock t
-- USING (VALUES(10,100)) AS s(item_id,balance)
USING (SELECT 10,100) AS s(item_id,balance)
ON s.item_id=t.item_id
WHEN MATCHED THEN UPDATE SET balance=t.balance + s.balance
WHEN NOT MATCHED THEN INSERT VALUES(s.item_id,s.balance)
;
SELECT * FROM Stock ORDER BY item_id;

MERGE INTO Stock t
-- USING (VALUES(30,2000)) AS s(item_id,balance)
USING (SELECT 30,2000) AS s(item_id,balance)
ON s.item_id=t.item_id
WHEN MATCHED THEN UPDATE SET balance=t.balance + s.balance
WHEN NOT MATCHED THEN INSERT VALUES(s.item_id,s.balance)
;
SELECT * FROM Stock ORDER BY item_id;

\set nbranches :scale
\set ntellers 10 * :scale
\set naccounts 10 * :scale
\setrandom aid 1 :naccounts
\setrandom bid 1 :nbranches
\setrandom tid 1 :ntellers
\setrandom delta -5000 5000
MERGE INTO pgbench_accounts t USING (SELECT :aid,1+(:aid / 100)::integer,:delta,'') AS s(aid,bid,balance,filler) ON s.aid=t.aid WHEN MATCHED THEN UPDATE SET abalance=abalance + s.balance WHEN NOT MATCHED THEN INSERT VALUES(s.aid,s.bid,s.balance,s.filler);
-- This syntax worked with MERGE v203 patch, but isn't compatible with v204
--MERGE INTO pgbench_accounts t USING (VALUES (:aid,1+(:aid / 100)::integer,:delta,'')) AS s(aid,bid,balance,filler) ON s.aid=t.aid WHEN MATCHED THEN UPDATE SET abalance=abalance + s.balance WHEN NOT MATCHED THEN INSERT VALUES(s.aid,s.bid,s.balance,s.filler);

test-merge.sh Description: Bourne shell script
Re: [HACKERS] Re: new patch of MERGE (merge_204) a question about duplicated ctid
On 2010-12-30 4:39 AM +0200, Greg Smith wrote: And that got me back again to concurrent testing. Moving onto next two problems...the basic MERGE feature seems to have stepped backwards a bit too. I'm now seeing these quite often: ERROR: duplicate key value violates unique constraint pgbench_accounts_pkey DETAIL: Key (aid)=(176641) already exists. STATEMENT: MERGE INTO pgbench_accounts t USING (SELECT 176641,1+(176641 / 100)::integer,168,'') AS s(aid,bid,balance,filler) ON s.aid=t.aid WHEN MATCHED THEN UPDATE SET abalance=abalance + s.balance WHEN NOT MATCHED THEN INSERT VALUES(s.aid,s.bid,s.balance,s.filler); On my concurrent pgbench test, which had been working before. I have no idea why it worked in the past, but the patch was never designed to work for UPSERT. This has been discussed in the past and some people thought that that's not a huge deal. Regards, Marko Tiikkaja
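For readers following along, the duplicate-key failure Greg hit is the classic MERGE-versus-upsert race: two sessions each evaluate NOT MATCHED against their own snapshot before either has inserted. A minimal single-process sketch of the interleaving (Python; the "table" is just a dict standing in for pgbench_accounts, and the helper names are mine, not PostgreSQL internals):

```python
class DuplicateKeyError(Exception):
    """Stands in for the unique-index violation raised by the server."""

def merge_decide(table, key):
    """Phase 1 of MERGE: the snapshot says MATCHED or NOT MATCHED."""
    return 'MATCHED' if key in table else 'NOT MATCHED'

def merge_apply(table, key, delta, decision):
    """Phase 2: act on the (possibly stale) phase-1 decision."""
    if decision == 'MATCHED':
        table[key] += delta
    else:
        if key in table:  # the unique index catches the stale decision
            raise DuplicateKeyError(f'Key (aid)=({key}) already exists.')
        table[key] = delta

accounts = {}
# Two concurrent sessions both decide before either applies:
d1 = merge_decide(accounts, 176641)      # NOT MATCHED
d2 = merge_decide(accounts, 176641)      # NOT MATCHED (stale by apply time)
merge_apply(accounts, 176641, 168, d1)   # session 1 inserts
try:
    merge_apply(accounts, 176641, 25, d2)  # session 2: duplicate key
except DuplicateKeyError as e:
    print('ERROR:', e)
```

This is why "MERGE works" and "MERGE is a safe concurrent UPSERT" are different claims: the latter needs the decide and apply phases to be atomic with respect to other sessions, which the v204 patch does not attempt.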
Re: [HACKERS] small table left outer join big table
On Wed, Dec 29, 2010 at 3:58 PM, Simon Riggs si...@2ndquadrant.com wrote: On Wed, 2010-12-29 at 09:59 -0500, Tom Lane wrote: Robert Haas robertmh...@gmail.com writes: On Wed, Dec 29, 2010 at 7:34 AM, Simon Riggs si...@2ndquadrant.com wrote: It's not a bug, that's the way it currently works. We don't need a test case for that. Oh, you're right. I missed the fact that it's a left join. The only thing that struck me as curious about it was that the OP didn't get a nestloop-with-inner-indexscan plan. That would be explainable if there was no index on the large table's id column ... but columns named like that usually have indexes. I can't get all *that* excited about complicating hash joins as proposed. The query is still fundamentally going to be slow because you won't get out of having to seqscan the large table. The only way to make it really fast is to not read all of the large table, and nestloop-with-inner-indexscan is the only plan type with a hope of doing that. Seq scanning the big table isn't bad... we've gone to a lot of trouble to make it easy to do this, especially with many users. Maintaining many large indexes is definitely bad, all that random I/O is going to suck badly. Seems like an interesting and relatively simple optimisation to me. Not sure if this is a request for a feature, or a proposal to write the optimisation. I hope it's the latter. Thanks for your comments. Yeah, I'm excited to write code for PostgreSQL, but I'm new here and not yet familiar with the code or the patch submission routine. I will try to learn in the near future. So for the moment, it is a request for a feature, and I'm looking forward to any pgsql-hackers working on this. Thanks, Li Jie
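To illustrate Tom's point about the one plan type that avoids reading the whole big table: a nestloop with an inner index scan probes only the matching big-table rows for each small-table row. A toy sketch of that plan shape (Python; the index is a plain dict, None stands for the NULL-extended side of the left join, and all names here are mine, not executor code):

```python
def nestloop_left_join(small_rows, big_index):
    """Nestloop with inner index scan, outer side = the small table.

    For each outer row we probe an index on the big table instead of
    seqscanning it; big-table rows with no matching outer row are
    never read at all. big_index maps join key -> list of big rows.
    """
    for outer in small_rows:
        # LEFT JOIN semantics: no index hit -> one NULL-extended row.
        matches = big_index.get(outer['id'], [None])
        for inner in matches:
            yield (outer, inner)

small = [{'id': 1}, {'id': 3}]
big_index = {1: [{'id': 1, 'v': 'a'}], 2: [{'id': 2, 'v': 'b'}]}
for row in nestloop_left_join(small, big_index):
    print(row)
```

Note the row for id=2 in the big table is never touched; a hash join driven by a seqscan of the big table would have read it regardless, which is exactly the cost the thread is debating.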
Re: [HACKERS] Avoiding rewrite in ALTER TABLE ALTER TYPE
On Wed, Dec 29, 2010 at 6:46 PM, Noah Misch n...@leadboat.com wrote: I think this scenario will be more common than you might think. Tables don't contain random data; they contain data that the DBA thinks is valid. The situation where the data is mostly as you expect but with a few kooky rows is, in my experience, extremely common. Perhaps. A few kooky rows is indeed common, but we're talking about a specific breed of kookiness: 99.9% of the rows have identical bits after an ALTER TYPE transformation expression, and 0.1% have different bits. Is that common? I think it's common enough to be worth worrying about. Expanding on my introduction, none of the following can yield a negative verification scan; the result is always positive or an error:

CREATE DOMAIN loosedom AS text;
CREATE DOMAIN tightdom AS text CHECK (value LIKE '%/');
CREATE TABLE t (c varchar(6));
INSERT INTO t VALUES ('abc/'),('de/');
ALTER TABLE t ALTER c TYPE varchar(8);
ALTER TABLE t ALTER c TYPE text;
ALTER TABLE t ALTER c TYPE loosedom;
ALTER TABLE t ALTER c TYPE xml USING c::xml;
ALTER TABLE t ALTER c TYPE varchar(64);
ALTER TABLE t ALTER c TYPE tightdom;

+1 for trying to optimize these cases (but maybe after we optimize the varchar -> text and varchar(less) -> varchar(more) cases to skip the scan altogether). Adding a bpchar into the mix makes a negative verification scan possible, as does a USING clause having a truncating effect. Continuing the example, these can and would get a negative verification scan (in case it was not obvious, I'll note that any error thrown by a transformation expression still aborts the whole operation):

ALTER TABLE t ALTER c TYPE character(6);
ALTER TABLE t ALTER c TYPE varchar(5) USING c::varchar(5);

Plenty of academic USING clause examples exist:

ALTER TABLE t ALTER c TYPE varchar(8) USING CASE c WHEN 'de/' THEN 'foo' ELSE c END;

I am not really convinced that there's much value in optimizing these cases. They're not likely to arise very often in practice, and DBAs like predictability.
There's tangible value in being able to say this is going to scan your table at most once - it might rewrite it, or it might just verify what's there, or it might decide no scan is necessary, but the absolute worst case is one scan with rewrite. That's simple to understand and simple to document and probably simpler to code too, and I think it covers very nearly all of the cases people are likely to care about in practice. Yes. Indeed, that's the intuitive basis for my hypothesis that the verification scan will usually either fail early or succeed outright. I don't advocate this approach to pick up edge cases, but to pick up reasonable cases _without explicit annotations_ showing them to be achievable. Take the text -> xml example, certainly of genuine value if not top-frequency. I see three ways to ensure we do a verification scan for it:

1. Have the user tell us: ALTER TABLE t ALTER c TYPE xml USING c::xml IMPLICIT
2. Mark the text -> xml cast as possibly no-rewrite and look for that
3. Do a verification scan every time

I think for any pair of types (T1, T2) we should first determine whether we can skip the scan altogether. If yes, we're done. If no, then we should have a way of determining whether a verify-only scan is guaranteed to be sufficient (in your terminology, the verification scan is guaranteed to return either positive or error, not negative). If yes, then we do a verification scan. If no, we do a rewrite. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
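The verify-only scan being debated can be sketched in a few lines. This is an illustrative model only (my own helper, not ATRewriteTable): apply the transformation expression to each stored value, report a negative result on the first value whose bits would change, and let the transform itself raise for invalid input, as text -> xml would.

```python
def verification_scan(rows, transform):
    """Verify-only scan: check that the transformation is a bitwise
    no-op for every stored value. Returns True (the rewrite can be
    skipped) or False (a rewrite is required after all); the
    transform may also raise, which aborts the whole operation.
    """
    for value in rows:
        if transform(value) != value:
            return False   # negative result: fall back to a rewrite
    return True            # positive result: skip the rewrite

# varchar(6) -> varchar(8): widening never changes the stored bits.
print(verification_scan(['abc/', 'de/'], lambda v: v[:8]))   # True
# varchar(6) -> varchar(3): truncation changes 'abc/' -> negative.
print(verification_scan(['abc/', 'de/'], lambda v: v[:3]))   # False
```

The cost model in the thread falls out directly: a positive scan saves the whole rewrite, a negative scan wastes one read pass before the rewrite, and an error aborts either way - which is why Noah argues the scan usually either fails early or pays for itself.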