Re: [HACKERS] Moving ExecInsertIndexTuples and friends to new file
On 04/24/2015 06:30 AM, Stephen Frost wrote: * Peter Geoghegan (p...@heroku.com) wrote: On Thu, Apr 23, 2015 at 12:05 PM, Heikki Linnakangas hlinn...@iki.fi wrote: While looking at Peter's INSERT ... ON CONFLICT patch, I started to feel that ExecInsertIndexTuples() and friends would deserve a file of their own, and not be buried in the middle of execUtils.c. I propose that we split execUtils.c into two, moving ExecOpenIndices(), ExecCloseIndices() ExecInsertIndexTuples() and related functions into a new file called executor/execIndexing.c. That split makes a lot of sense to me. No objections here. Ok, moved. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] adding more information about process(es) cpu and memory usage
On Thu, Apr 23, 2015 at 9:28 PM, Tom Lane t...@sss.pgh.pa.us wrote: The reason nobody's gotten around to that in the last fifteen years is that per-process rusage isn't actually all that interesting; there's too much that happens in background daemons, for instance. There's *some* stuff that happens in background daemons, but if you want to measure user and system time consumed by a particular query, this would actually be a pretty handy way to do that, I think. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] [committers] pgsql: RLS fixes, new hooks, and new test module
On 24 April 2015 at 04:31, Stephen Frost sfr...@snowman.net wrote: Christian, * Christian Ullrich (ch...@chrullrich.net) wrote: * Stephen Frost wrote: RLS fixes, new hooks, and new test module The buildfarm says that with -DCLOBBER_CACHE_ALWAYS, the RLS violations get blamed on the wrong tables. Mostly, they are catalogs (I have seen pg_opclass, pg_am, and pg_amproc), but some also come up with binary garbage instead, e.g. - http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=markhor&dt=2015-04-23%2000%3A00%3A12 - http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=friarbird&dt=2015-04-23%2004%3A20%3A00 - http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=tick&dt=2015-04-22%2019%3A56%3A53 Yup, thanks, Robert pointed this out on another thread and I'm looking into it. Ah, yes it looks like a couple of places in get_row_security_policies() ought to be making a copy of the relation name when setting it on the WCO. Regards, Dean
Re: [HACKERS] Improving vacuum/VM/etc
On Thu, Apr 23, 2015 at 3:09 PM, Jim Nasby jim.na...@bluetreble.com wrote: Unfortunately, the states I came up with using existing semantics don't look hugely useful[4], but if we take Robert's idea and make all-visible mean all-frozen, we can do much better: 0: Newly inserted tuples Tracking this state allows us to aggressively set hint bits. Who is us? And what do you mean by aggressively? As things stand, any process that has to touch a tuple always sets any applicable hint bits. 1: Newly deleted There are tuples that have been deleted but not pruned. There may also be newly inserted tuples that need hinting (state 0). Similar to state 0, we'd want to be fairly aggressive with these pages, because as soon as the deleting XID is committed and older than all snapshots we can prune. Because we can prune without hitting indexes, this is still a fairly cheap operation, though not as cheap as 0. What behavior difference would you foresee between state 0 and state 1? 2: Fully hinted, not frozen This is the really painful state to clean up, because we have to deal with indexes. We must enter this state after being in 1. Neither the fact that a page is fully hinted nor the fact that it is or is not frozen implies anything about dealing with indexes. We need to deal with indexes because the page contains either dead tuples (as a result of an aborted insert, a committed delete, or an aborted or committed update) or dead line pointers (as a result of pruning dead tuples). 3: All-visible-frozen Every tuple on the page is visible and frozen. Pages in this state need no maintenance at all. We might be able to enter this state directly from state 0. BENEFITS This tracking should help at least 3 problems: the need to set hint bits after insert, SELECT queries doing pruning (Simon's recent complaint), and needing to scan an entire table for freezing. 
The improvement in hinting and pruning is based on the idea that normally there would not be a lot of pages in state 0 or 1, and pages that were in those states are very likely to still be in disk cache (if not shared buffers). That means we can have a background process (or 2) that is very aggressive at targeting pages in these states. OK, I agree that a background process could be useful. Whenever it sees a dirty page, it could attempt to aggressively set hint bits, prune, mark all-visible, and freeze the page before that page gets evicted. However, that doesn't require the sort of state map you're proposing here. I think your statement about pages that were in those states are still likely to be in the disk cache is not really true. I mean, if we're doing OLTP, yes. But not if we're bulk-loading. Not needing to scan everything that's frozen is thanks to state 3. I think it's OK (at least for now) if only vacuum puts pages into this state, which means it can actually freeze the tuples when it does it (thanks to 37484ad we won't lose forensic data doing this). That means there's no extra work necessary by a foreground process that's dirtying a page. Did you notice the discussion on the other thread about this increasing WAL volume by a factor of 113? Because of 37484ad, I think as part of this we should also deprecate vacuum_freeze_min_age, or at least change its behavior. AFAIK the only objection to aggressive freezing was loss of forensic data, and that's gone now. So vacuum (and presumably the bg process(es) that handle state 0 and 1) should freeze tuples if it would allow the whole page to be frozen. Possibly it should just do it any time it's dirtying the page. (We could actually do this right now; it would let us eliminate the GUC, but I'm not sure there'd be other benefit without the rest of this.) Reducing vacuum_freeze_min_age certainly seems worth considering. I don't know how to judge whether it's a good idea, though.
You're balancing less I/O later against a lot more WAL right now. DOWNSIDES This does mean doubling the size of the VM. It would still be 32,000 times smaller than the heap with 8k pages (and 128,000 times smaller with the common warehouse 32k page size), so I suspect this is a non-issue, but it's worth mentioning. It might have some effect on an almost entirely read-only system; but I suspect in most other cases the other benefits will outweigh this. I don't think that's a problem. This approach still does nothing to help the index related activity in vacuum. My gut says state 2 should be further split; but I'm not sure why. Perhaps if we had another state we could do something more intelligent with index cleanup... I can't really follow why you've got these states to begin with. 0, 1, and 2 are all pretty much the same. The useful distinction AFAICS is between not-all-visible, all-visible, and all-visible-plus-frozen. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
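The four per-page states debated above can be sketched as a toy classifier. This is purely illustrative (the flag names and the function are invented for this sketch, not PostgreSQL code); it just makes the proposed state machine concrete:

```python
# Toy model of the four per-page states proposed in this thread (not
# PostgreSQL code; the boolean flags are invented for illustration).
# The map would store the derived 2-bit state per heap page.

def page_state(has_unhinted, has_unpruned_dead, all_frozen):
    """Return the proposed 2-bit state for a heap page.

    0: newly inserted tuples still need hint bits
    1: newly deleted tuples await pruning (may also need hinting)
    2: fully hinted but not frozen -> index cleanup still required
    3: all-visible and all-frozen -> no maintenance needed at all
    """
    if has_unpruned_dead:
        return 1   # pruning is cheap (no index work), so target these eagerly
    if has_unhinted:
        return 0   # only hint-bit setting needed
    if all_frozen:
        return 3   # vacuum can skip this page entirely
    return 2       # the expensive case: dead line pointers, indexes involved

# A freshly loaded page: inserted tuples, nothing deleted, not frozen.
assert page_state(True, False, False) == 0
# After a delete commits, the page needs pruning first.
assert page_state(False, True, False) == 1
# Fully hinted and frozen: vacuum never needs to visit it again.
assert page_state(False, False, True) == 3
```

Robert's counterpoint is visible in the model: states 0, 1, and 2 all mean "needs some maintenance", which is why he argues the only distinctions worth storing are not-all-visible / all-visible / all-visible-plus-frozen.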
[HACKERS] Bug in planner
Hi! I ran into a planner error: ERROR: could not find RelOptInfo for given relids A test case is attached; all supported versions of pgsql and HEAD are affected. I suppose there is something wrong in pull_up_sublinks_qual_recurse() near the conversion of NOT EXISTS into a join, because preventing that conversion makes the query work. But I'm not sure. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/

DROP TABLE IF EXISTS t1;
DROP TABLE IF EXISTS t2;
DROP TABLE IF EXISTS t3;
DROP TABLE IF EXISTS t4;
DROP TABLE IF EXISTS t5;

CREATE TABLE t1 ( c1 int );
CREATE TABLE t2 ( c1 int, c2 int, c3 int );
CREATE TABLE t3 ( c1 int, c2 int );
CREATE TABLE t4 ( c1 int, c2 int );
CREATE TABLE t5 ( c1 int );

SELECT *
FROM t1
WHERE NOT (
    EXISTS (
        SELECT 1
        FROM (
            SELECT t2.c2 AS c1
            FROM t2
            LEFT OUTER JOIN t3 ON ( t2.c3 = t3.c1 )
            LEFT OUTER JOIN (
                SELECT t5.c1 AS c1
                FROM t4
                LEFT OUTER JOIN t5 ON ( t4.c2 = t5.c1 )
            ) a1 ON ( a1.c1 = t3.c2 )
        ) a2
        WHERE t1.c1 = a2.c1
    )
);
Re: [HACKERS] Freeze avoidance of very large table.
On Thu, Apr 23, 2015 at 9:03 PM, Michael Paquier michael.paqu...@gmail.com wrote: On Thu, Apr 23, 2015 at 10:42 PM, Robert Haas wrote: On Thu, Apr 23, 2015 at 4:19 AM, Simon Riggs wrote: We only need a freeze/backup map for larger relations. So if we map 1000 blocks per map page, we skip having a map at all when size < 1000. Agreed. We might also want to map multiple blocks per map slot - e.g. one slot per 32 blocks. That would keep the map quite small even for very large relations, and would not compromise efficiency that much since reading 256kB sequentially probably takes only a little longer than reading 8kB. I think the idea of integrating the freeze map into the VM fork is also worth considering. Then, the incremental backup map could be optional; if you don't want incremental backup, you can shut it off and have less overhead. When I read that I think about something configurable at relation-level. There are cases where you may want to have more granularity of this information at block level by having the VM slots track fewer blocks than 32, and vice-versa. What are those cases? To me that sounds like making things complicated to no obvious benefit. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
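The "one slot per 32 blocks" figure above is easy to sanity-check with some back-of-the-envelope arithmetic (a sketch only; the real on-disk map layout would of course differ):

```python
# Back-of-the-envelope numbers for the "one map slot per 32 blocks" idea.
BLOCK_SIZE = 8 * 1024          # default 8kB heap pages
BLOCKS_PER_SLOT = 32

# Reading (or skipping) the blocks behind one map slot means touching:
bytes_per_slot = BLOCK_SIZE * BLOCKS_PER_SLOT
assert bytes_per_slot == 256 * 1024      # the 256kB sequential read cited above

# For a 1TB relation, a 1-bit-per-slot map would be tiny:
table_bytes = 1 << 40
blocks = table_bytes // BLOCK_SIZE       # 134,217,728 blocks
slots = blocks // BLOCKS_PER_SLOT        # 4,194,304 slots
map_bytes = slots // 8                   # 1 bit per slot
print(map_bytes)                         # 524288 bytes, i.e. 512kB for 1TB of heap
```

That 32:1 coarsening is why the map stays negligible even for very large relations, at the cost of 256kB granularity when deciding what to skip.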
Re: [HACKERS] PL/pgSQL, RAISE and error context
2015-04-24 12:11 GMT+02:00 Joel Jacobson j...@trustly.com: Entering the discussion because this is a huge pain for me in my daily work as well. This is not a reply to any specific post in this thread, but my first message in the thread. I see great value in providing both a GUC and a new RAISE syntax. The different benefits of the two are maybe obvious, but perhaps worth pointing out: GUC: Good because you don't have to change any existing code. RAISE syntax: Good because you can control exactly what message should be emitted or not be emitted at that line of code. I think preserving backwards compatibility is very important. Not changing the default is not a problem for me, as long as it can be overridden. Whatever the default behaviour is, I think the need expressed by all users in this thread boils down to one of these two sentences: I want CONTEXT to be (DISPLAYED|SUPPRESSED) for (ALL|ONLY THIS LINE) RAISE (NOTICE|WARNING|ERROR) OR I don't want to change the default current behaviour of CONTEXT So we basically need a boolean setting value, where: NULL means the default behaviour TRUE means DISPLAY CONTEXT FALSE means SUPPRESS CONTEXT And the (ALL|ONLY THIS) part translates into using, * a GUC to change behaviour for ALL lines of code, * or using the RAISE syntax to change the behaviour of ONLY THIS line of code. And then we have the different message levels, for which CONTEXT is sometimes desirable in some situations: * The RAISE syntax allows controlling any message level in a natural way, as the message level is part of the syntax. * Allowing the same control using a GUC would mean the message level would need to be part of the GUC key name, which means either adding multiple GUCs, one for each message level, or only allowing control of the most important one and ignoring the possible need to control the other message levels. If it would be possible to somehow combine multiple message levels in the same GUC, that would solve the latter problem.
We already have comma separated values for many GUCs, so maybe we could use that approach here as well. It looks like adding these two GUCs would meet the demands of all users: suppress_context_messages (enum) display_context_messages (enum) This proposal looks very practical - it can be a very good starting point - and it doesn't block any further discussion about enhancing the RAISE statement, which I would like to have too (but that can be a separate issue). I like it. Regards Pavel This would allow doing something crazy as: suppress_context_messages = warning,error display_context_messages = notice
Re: [HACKERS] [committers] pgsql: RLS fixes, new hooks, and new test module
Dean, * Dean Rasheed (dean.a.rash...@gmail.com) wrote: On 24 April 2015 at 04:31, Stephen Frost sfr...@snowman.net wrote: * Christian Ullrich (ch...@chrullrich.net) wrote: * Stephen Frost wrote: RLS fixes, new hooks, and new test module The buildfarm says that with -DCLOBBER_CACHE_ALWAYS, the RLS violations get blamed on the wrong tables. Mostly, they are catalogs (I have seen pg_opclass, pg_am, and pg_amproc), but some also come up with binary garbage instead, e.g. - http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=markhor&dt=2015-04-23%2000%3A00%3A12 - http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=friarbird&dt=2015-04-23%2004%3A20%3A00 - http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=tick&dt=2015-04-22%2019%3A56%3A53 Yup, thanks, Robert pointed this out on another thread and I'm looking into it. Ah, yes it looks like a couple of places in get_row_security_policies() ought to be making a copy of the relation name when setting it on the WCO. Yup, that was the same conclusion that I came to last night, just hadn't finished testing before needing sleep. :) Will push a fix shortly for it. Thanks! Stephen
Re: [HACKERS] PL/pgSQL, RAISE and error context
2015-04-24 13:16 GMT+02:00 Robert Haas robertmh...@gmail.com: On Fri, Apr 24, 2015 at 6:11 AM, Joel Jacobson j...@trustly.com wrote: Entering the discussion because this is a huge pain for me in my daily work as well. This is not a reply to any specific post in this thread, but my first message in the thread. I see great value in providing both a GUC and a new RAISE syntax. The different benefits of the two are maybe obvious, but perhaps worth pointing out: GUC: Good because you don't have to change any existing code. RAISE syntax: Good because you can control exactly what message should be emitted or not be emitted at that line of code. I think preserving backwards compatibility is very important. Not changing the default is not a problem for me, as long as it can be overridden. Whatever the default behaviour is, I think the need expressed by all users in this thread boils down to one of these two sentences: I want CONTEXT to be (DISPLAYED|SUPPRESSED) for (ALL|ONLY THIS LINE) RAISE (NOTICE|WARNING|ERROR) OR I don't want to change the default current behaviour of CONTEXT So we basically need a boolean setting value, where: NULL means the default behaviour TRUE means DISPLAY CONTEXT FALSE means SUPPRESS CONTEXT And the (ALL|ONLY THIS) part translates into using, * a GUC to change behaviour for ALL lines of code, * or using the RAISE syntax to change the behaviour of ONLY THIS line of code. And then we have the different message levels, for which CONTEXT is sometimes desirable in some situations: * The RAISE syntax allows controlling any message level in a natural way, as the message level is part of the syntax. * Allowing the same control using a GUC would mean the message level would need to be part of the GUC key name, which means either adding multiple GUCs, one for each message level, or only allowing control of the most important one and ignoring the possible need to control the other message levels.
If it would be possible to somehow combine multiple message levels in the same GUC, that would solve the latter problem. We already have comma separated values for many GUCs, so maybe we could use that approach here as well. It looks like adding these two GUCs would meet the demands of all users: suppress_context_messages (enum) display_context_messages (enum) This would allow doing something crazy as: suppress_context_messages = warning,error display_context_messages = notice This is a very flexible proposal, but it's a tremendous amount of machinery for what's really a very minor issue. If we added two GUCs for every comparably important issue, we'd have about 40,000 of them. It can be a PL/pgSQL-only GUC. PL/pgSQL is probably the most problematic case. I suggest we add the RAISE syntax first, because everybody agrees on that. Then, we can argue about the other stuff. There is agreement about it - but I am expecting a harder discussion about what the default will be, and the discussion about syntax will not be simple either. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] PL/pgSQL, RAISE and error context
Entering the discussion because this is a huge pain for me in my daily work as well. This is not a reply to any specific post in this thread, but my first message in the thread. I see great value in providing both a GUC and a new RAISE syntax. The different benefits of the two are maybe obvious, but perhaps worth pointing out: GUC: Good because you don't have to change any existing code. RAISE syntax: Good because you can control exactly what message should be emitted or not be emitted at that line of code. I think preserving backwards compatibility is very important. Not changing the default is not a problem for me, as long as it can be overridden. Whatever the default behaviour is, I think the need expressed by all users in this thread boils down to one of these two sentences: I want CONTEXT to be (DISPLAYED|SUPPRESSED) for (ALL|ONLY THIS LINE) RAISE (NOTICE|WARNING|ERROR) OR I don't want to change the default current behaviour of CONTEXT So we basically need a boolean setting value, where: NULL means the default behaviour TRUE means DISPLAY CONTEXT FALSE means SUPPRESS CONTEXT And the (ALL|ONLY THIS) part translates into using, * a GUC to change behaviour for ALL lines of code, * or using the RAISE syntax to change the behaviour of ONLY THIS line of code. And then we have the different message levels, for which CONTEXT is sometimes desirable in some situations: * The RAISE syntax allows controlling any message level in a natural way, as the message level is part of the syntax. * Allowing the same control using a GUC would mean the message level would need to be part of the GUC key name, which means either adding multiple GUCs, one for each message level, or only allowing control of the most important one and ignoring the possible need to control the other message levels. If it would be possible to somehow combine multiple message levels in the same GUC, that would solve the latter problem.
We already have comma separated values for many GUCs, so maybe we could use that approach here as well. It looks like adding these two GUCs would meet the demands of all users: suppress_context_messages (enum) display_context_messages (enum) This would allow doing something crazy as: suppress_context_messages = warning,error display_context_messages = notice
Re: [HACKERS] PL/pgSQL, RAISE and error context
On Fri, Apr 24, 2015 at 6:11 AM, Joel Jacobson j...@trustly.com wrote: Entering the discussion because this is a huge pain for me in my daily work as well. This is not a reply to any specific post in this thread, but my first message in the thread. I see great value in providing both a GUC and a new RAISE syntax. The different benefits of the two are maybe obvious, but perhaps worth pointing out: GUC: Good because you don't have to change any existing code. RAISE syntax: Good because you can control exactly what message should be emitted or not be emitted at that line of code. I think preserving backwards compatibility is very important. Not changing the default is not a problem for me, as long as it can be overridden. Whatever the default behaviour is, I think the need expressed by all users in this thread boils down to one of these two sentences: I want CONTEXT to be (DISPLAYED|SUPPRESSED) for (ALL|ONLY THIS LINE) RAISE (NOTICE|WARNING|ERROR) OR I don't want to change the default current behaviour of CONTEXT So we basically need a boolean setting value, where: NULL means the default behaviour TRUE means DISPLAY CONTEXT FALSE means SUPPRESS CONTEXT And the (ALL|ONLY THIS) part translates into using, * a GUC to change behaviour for ALL lines of code, * or using the RAISE syntax to change the behaviour of ONLY THIS line of code. And then we have the different message levels, for which CONTEXT is sometimes desirable in some situations: * The RAISE syntax allows controlling any message level in a natural way, as the message level is part of the syntax. * Allowing the same control using a GUC would mean the message level would need to be part of the GUC key name, which means either adding multiple GUCs, one for each message level, or only allowing control of the most important one and ignoring the possible need to control the other message levels. If it would be possible to somehow combine multiple message levels in the same GUC, that would solve the latter problem.
We already have comma separated values for many GUCs, so maybe we could use that approach here as well. It looks like adding these two GUCs would meet the demands of all users: suppress_context_messages (enum) display_context_messages (enum) This would allow doing something crazy as: suppress_context_messages = warning,error display_context_messages = notice This is a very flexible proposal, but it's a tremendous amount of machinery for what's really a very minor issue. If we added two GUCs for every comparably important issue, we'd have about 40,000 of them. I suggest we add the RAISE syntax first, because everybody agrees on that. Then, we can argue about the other stuff. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] tablespaces inside $PGDATA considered harmful
On Thu, Apr 23, 2015 at 4:30 PM, Andres Freund and...@anarazel.de wrote: On 2015-04-23 16:26:09 -0400, Robert Haas wrote: But pg_upgrade automates all that, so you can't use pg_upgrade in that case. If we add a GUC as I suggested, you can still use pg_upgrade. But we also have to live with data directories being in a shit state forever onward. We won't really be able to remove the option realistically. It's not that hard to just move the tablespace out of the data directory while the server is shut down. As long as you move it on the same partition, it's even fast. OK, fair point. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Custom/Foreign-Join-APIs (Re: [HACKERS] [v9.5] Custom Plan API)
Hi Ashutosh, Thanks for the review. On 2015/04/22 19:28, Ashutosh Bapat ashutosh.ba...@enterprisedb.com wrote: Tests --- 1. The postgres_fdw test is re/setting enable_mergejoin at various places. The goal of these tests seems to be to test the sanity of foreign plans generated. So, it might be better to reset enable_mergejoin (and maybe all of enable_hashjoin, enable_nestloop_join etc.) to false at the beginning of the testcase and set them again at the end. That way, we will also make sure that foreign plans are chosen irrespective of future planner changes. I have a different, rather opposite opinion about it. I disabled the other join types only as far as needed for the tests to pass, because I worry about oversights coming from planner changes. I hope to eliminate enable_foo from the test script by making the costing model smarter. 2. In the patch, I see that the inheritance testcases have been deleted from postgres_fdw.sql, is that intentional? I do not see those being replaced anywhere else. It was an accidental removal; I restored the tests for the inheritance feature. 3. We need one test for each join type (or at least for INNER and LEFT OUTER) where there are unsafe-to-push conditions in the ON clause along with safe-to-push conditions. For an INNER join, the join should get pushed down with the safe conditions, and for an OUTER join it shouldn't be. Same goes for the WHERE clause, in which case the join will be pushed down but the unsafe-to-push conditions will be applied locally. Currently INNER JOINs with unsafe join conditions are not pushed down, so such a test is not in the suite. As you say, in theory, INNER JOINs can be pushed down even when they have push-down-unsafe join conditions, because such conditions can be evaluated on the local side against rows retrieved without those conditions. 4. All the tests have ORDER BY, LIMIT in them, so the setref code is being exercised. But, something like aggregates would test the setref code better.
So, we should add at least one test like select avg(ft1.c1 + ft2.c2) from ft1 join ft2 on (ft1.c1 = ft2.c1). Added an aggregate case, and also added a UNION case for Append. 5. It will be good to add some test which contains a join between a few foreign and a few local tables to see whether we are able to push down the largest possible foreign join tree to the foreign server. Code --- In classifyConditions(), the code is now appending RestrictInfo::clause rather than RestrictInfo itself. But the callers of classifyConditions() have not changed. Is this change intentional? Yes, the purpose of the change is to let appendConditions (formerly appendWhereClause) handle JOIN ON clauses, which are lists of Expr. The functions which consume the lists produced by this function handle expressions as well as RestrictInfo, so you may not have noticed it. Because of this change, we might be missing some optimizations, e.g. in function postgresGetForeignPlan()

793         if (list_member_ptr(fpinfo->remote_conds, rinfo))
794             remote_conds = lappend(remote_conds, rinfo->clause);
795         else if (list_member_ptr(fpinfo->local_conds, rinfo))
796             local_exprs = lappend(local_exprs, rinfo->clause);
797         else if (is_foreign_expr(root, baserel, rinfo->clause))
798             remote_conds = lappend(remote_conds, rinfo->clause);
799         else
800             local_exprs = lappend(local_exprs, rinfo->clause);

Finding a RestrictInfo in remote_conds avoids another call to is_foreign_expr(). So with this change, I think we are doing an extra call to is_foreign_expr(). Hm, it seems better to revert my change and make appendConditions downcast the given information into RestrictInfo or Expr according to the node tag. The function get_jointype_name() returns an empty string for unsupported join types. Instead of that it should throw an error, if some code path accidentally calls the function with an unsupported join type, e.g. SEMI_JOIN. Agreed, fixed.
While deparsing the SQL with rowmarks, the placement of the FOR UPDATE/SHARE clause in the original query is not being honored, which means that we will end up locking rows which are not part of the join result even when the join is pushed to the foreign server. E.g. take the following query (it uses the tables created in the postgres_fdw.sql tests) contrib_regression=# explain verbose select * from ft1 join ft2 on (ft1.c1 = ft2.c1) for update of ft1; QUERY PLAN
Re: [HACKERS] Fwd: [GENERAL] 4B row limit for CLOB tables
On Thu, Apr 23, 2015 at 11:24 PM, Tom Lane t...@sss.pgh.pa.us wrote: Having said all that, if we did try to fix it today, I'd imagine changing TOAST value identifiers to int64 and inventing a new TOAST pointer format for use when 32 bits isn't wide enough for the ID. But I think we're best advised to hold off doing that until the need becomes pressing. Just out of curiosity, has anyone thought about inventing a new TOAST pointer format on the grounds that our TOAST pointers are unreasonably large? IIUC, a TOAST pointer right now is 18 bytes: 16 for a varatt_external, and then that gets embedded in a varattrib_1b_e with a va_header byte and a va_tag byte. Eliminating one or both of va_rawsize and va_extsize from the TOAST pointer itself seems like it could save quite a bit of space on disk. Maybe you could even find a way to get rid of va_toastrelid; after all, at the point when you first acquire a pointer to the tuple, you surely know what relation it's a part of. You'd probably want to force de-TOASTing (or converting to a more expressive form of TOAST pointer, anyway) when you extracted the column from the tuple, which might be hard to arrange. But the benefits could be pretty significant. Suppose you have a table where each tuple is 4K untoasted, with all but 100 bytes of that in a single column. So, as stored, you've got 100 bytes of regular stuff plus an 18-byte TOAST header. If you could trim 2 of the above-mentioned 4-byte fields out of the TOAST header, that would reduce the size of the main relation fork by almost 7%. If you could trim all 3 of them out, you'd save more than 10%. That's not nothing, and the benefits could be even larger for rows that contain multiple TOAST pointers. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
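The percentages quoted above check out with simple arithmetic (illustrative only; real on-disk tuples involve alignment and padding that this sketch ignores):

```python
# Rough arithmetic behind the space savings claimed above.
toast_pointer = 18   # 1-byte va_header + 1-byte va_tag + 16-byte varatt_external
row_other = 100      # the non-TOASTed part of the example tuple
row = row_other + toast_pointer   # 118 bytes in the main relation fork

# Trimming va_rawsize and va_extsize (4 bytes each) from the pointer:
saved_two = 2 * 4 / row
# Additionally trimming va_toastrelid (a 4-byte Oid):
saved_three = 3 * 4 / row

print(round(saved_two * 100, 1))    # 6.8 -> "almost 7%"
print(round(saved_three * 100, 1))  # 10.2 -> "more than 10%"
```

So the claimed 7% and 10% figures correspond to 8/118 and 12/118 of the example tuple, and rows with multiple TOAST pointers would save proportionally more.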
Re: Custom/Foreign-Join-APIs (Re: [HACKERS] [v9.5] Custom Plan API)
On Wed, Apr 22, 2015 at 10:48 PM, Kouhei Kaigai kai...@ak.jp.nec.com wrote:

+ else if (scan->scanrelid == 0 &&
+          (IsA(scan, ForeignScan) || IsA(scan, CustomScan)))
+     varno = INDEX_VAR;

Suppose scan->scanrelid == 0 but the scan type is something else? Is that legal? Is varno == 0 the correct outcome in that case? Right now, no other scan type has the capability to return tuples with flexible types/attributes beyond the static definition. I think it is a valid restriction that only foreign/custom-scan can have scanrelid == 0. But the code as you've written it doesn't enforce any such restriction. It just spends CPU cycles testing for a condition which, to the best of your knowledge, will never happen. If it's really a can't-happen condition, how about checking it via an Assert()?

else if (scan->scanrelid == 0)
{
    Assert(IsA(scan, ForeignScan) || IsA(scan, CustomScan));
    varno = INDEX_VAR;
}

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Replication identifiers, take 4
On 24/04/15 14:32, Andres Freund wrote: On 2015-04-20 11:26:29 +0300, Heikki Linnakangas wrote: On 04/17/2015 11:54 AM, Andres Freund wrote: I've attached a rebased patch, that adds decision about origin logging to the relevant XLogInsert() callsites for external 2 byte identifiers and removes the pad-reusing version in the interest of moving forward. Putting aside the 2 vs. 4 byte identifier issue, let's discuss naming: I just realized that it talks about replication identifier as the new fundamental concept. The system table is called pg_replication_identifier. But that's like talking about index identifiers, instead of just indexes, and calling the system table pg_index_oid. The important concept this patch actually adds is the *origin* of each transaction. That term is already used in some parts of the patch. I think we should roughly do a search-replace of replication identifier -> replication origin to the patch. Or even transaction origin. Attached is a patch that does this, and some more, renaming. That was more work than I'd imagined. I've also made the internal naming in origin.c more consistent/simpler and did a bunch of other cleanup. There are a few oversights in the renaming: doc/src/sgml/func.sgml: +Return the replay position for the passed in replication +identifier. The parameter <parameter>flush</parameter> src/include/replication/origin.h: + * replication_identifier.h +extern PGDLLIMPORT RepOriginId replident_sesssion_origin; +extern PGDLLIMPORT XLogRecPtr replident_sesssion_origin_lsn; +extern PGDLLIMPORT TimestampTz replident_sesssion_origin_timestamp; (these are then used in multiple places in the code and also mentioned in the comment above replorigin_advance) src/backend/replication/logical/origin.c: + ereport(ERROR, + (errcode(ERRCODE_OBJECT_IN_USE), +errmsg(replication identiefer + default: + elog(PANIC, replident_redo: unknown op code + * This function may only be called if a origin was setup with + * replident_session_setup().
I also think the replident_checkpoint file should be renamed to replorigin_checkpoint. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fwd: [GENERAL] 4B row limit for CLOB tables
* Robert Haas (robertmh...@gmail.com) wrote: On Thu, Apr 23, 2015 at 11:24 PM, Tom Lane t...@sss.pgh.pa.us wrote: Having said all that, if we did try to fix it today, I'd imagine changing TOAST value identifiers to int64 and inventing a new TOAST pointer format for use when 32 bits isn't wide enough for the ID. But I think we're best advised to hold off doing that until the need becomes pressing. Just out of curiosity, has anyone thought about inventing a new TOAST pointer format on the grounds that our TOAST pointers are unreasonably large? I'd not thought about it, but sure sounds like a good idea from here. Would be particularly great if we were able to do this and increase the number of supported toast pointers and avoid having to go hunting for unused identifiers due to wrapping. Thanks! Stephen
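To put the size trade-off being discussed in perspective, here is a small sketch (Python used purely for illustration, not PostgreSQL code) of the payload of an on-disk external TOAST pointer — four 32-bit fields, mirroring the varatt_external struct — next to a hypothetical variant with a 64-bit value ID as Tom describes. The "wide" layout is an assumption for illustration, not an actual PostgreSQL format:

```python
import struct

# Current layout (mirrors varatt_external in postgres.h):
# va_rawsize, va_extsize (int32) and va_valueid, va_toastrelid (uint32 Oids).
CURRENT = struct.Struct("<iiII")

# Hypothetical wider format with a 64-bit value id ("Q"); not a real
# PostgreSQL format, just to show the size cost of int64 identifiers.
WIDE = struct.Struct("<iiQI")

def payload_sizes():
    return CURRENT.size, WIDE.size

print(payload_sizes())  # (16, 20)
```

So widening only the value ID costs four bytes per pointer, which is why a new pointer format would presumably be used only when 32 bits is not wide enough.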
Re: [HACKERS] PL/pgSQL, RAISE and error context
On Fri, Apr 24, 2015 at 1:16 PM, Robert Haas robertmh...@gmail.com wrote: This would allow doing something crazy as: suppress_context_messages = warning,error display_context_messages = notice This is a very flexible proposal, but it's a tremendous amount of machinery for what's really a very minor issue. If we added two GUCs for every comparably important issue, we'd have about 40,000 of them. I agree. The one-dimensional GUC syntax is not well suited for multi-dimensional config settings. And that's a good thing, mostly, I think. It would be a nightmare if the config file values could be in JSON format; it's good they are simple. But I'm thinking maybe we could improve the config file syntax for the general case when you have multiple things you want to control, in this case the message levels, and for each such thing, you want to turn something on/off, in this case the CONTEXT. Maybe we could simply use plus + and minus - to mean on and off? Example: context_messages = -warning, -error, +notice
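A tiny sketch of how such a +/- list could be interpreted (Python used for illustration; `parse_toggle_list` is a made-up name, not anything in the server):

```python
def parse_toggle_list(value):
    """Parse a setting like '-warning, -error, +notice' into {name: bool}.
    A '+' prefix turns the item on, '-' turns it off."""
    result = {}
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        if item[0] in "+-":
            result[item[1:]] = (item[0] == "+")
        else:
            result[item] = True  # a bare name defaults to "on"
    return result

print(parse_toggle_list("-warning, -error, +notice"))
```

The appeal of this shape is that one setting covers all the message levels at once, instead of needing a separate suppress/display GUC pair per behavior.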
Re: Custom/Foreign-Join-APIs (Re: [HACKERS] [v9.5] Custom Plan API)
On Wed, Apr 22, 2015 at 10:48 PM, Kouhei Kaigai kai...@ak.jp.nec.com wrote: + else if (scan->scanrelid == 0 && + (IsA(scan, ForeignScan) || IsA(scan, CustomScan))) + varno = INDEX_VAR; Suppose scan->scanrelid == 0 but the scan type is something else? Is that legal? Is varno == 0 the correct outcome in that case? Right now, no other scan type has the capability to return tuples with flexible type/attributes beyond their static definition. I think it is a valid restriction that only foreign/custom-scan can have scanrelid == 0. But the code as you've written it doesn't enforce any such restriction. It just spends CPU cycles testing for a condition which, to the best of your knowledge, will never happen. If it's really a can't-happen condition, how about checking it via an Assert()? else if (scan->scanrelid == 0) { Assert(IsA(scan, ForeignScan) || IsA(scan, CustomScan)); varno = INDEX_VAR; } Thanks for your suggestion. I'd like to use this idea on the next patch. -- NEC Business Creation Division / PG-Strom Project KaiGai Kohei kai...@ak.jp.nec.com
Re: [HACKERS] Parallel Seq Scan
On Thu, Apr 23, 2015 at 2:26 AM, Robert Haas robertmh...@gmail.com wrote: On Wed, Apr 22, 2015 at 8:48 AM, Amit Kapila amit.kapil...@gmail.com wrote: I have implemented this idea (note that I have to expose a new API shm_mq_from_handle, as TupleQueueFunnel stores shm_mq_handle* and we need the shm_mq* to call shm_mq_detach) and apart from this I have fixed other problems reported on this thread: 1. Execution of initPlan by master backend and then pass the required PARAM_EXEC parameter values to workers. 2. Avoid consuming dsm's by freeing the parallel context after the last tuple is fetched. 3. Allow execution of Result node in worker backend as that can be added as a gating filter on top of PartialSeqScan. 4. Merged parallel heap scan descriptor patch To apply the patch, please follow below sequence: HEAD Commit-Id: 4d930eee parallel-mode-v9.patch [1] assess-parallel-safety-v4.patch [2] (don't forget to run fixpgproc.pl in the patch) parallel_seqscan_v14.patch (Attached with this mail) Thanks, this version looks like an improvement. However, I still see some problems: - I believe the separation of concerns between ExecFunnel() and ExecEndFunnel() is not quite right. If the scan is shut down before it runs to completion (e.g. because of LIMIT), then I think we'll call ExecEndFunnel() before ExecFunnel() hits the TupIsNull(slot) path. I think you probably need to create a static subroutine that is called both as soon as TupIsNull(slot) and also from ExecEndFunnel(), in each case cleaning up whatever resources remain. Right, will fix as per suggestion. - InitializeParallelWorkers() still mixes together general parallel executor concerns with concerns specific to parallel sequential scan (e.g. EstimatePartialSeqScanSpace). Here we are doing 2 things: the first one is for the planned statement, and the second one is node-specific, which in this case is the parallel heap scan descriptor. 
So if I understand correctly, you want that we remove the second one and have a recursive function to achieve the same. - shm_mq_from_handle() is probably reasonable, but can we rename it shm_mq_get_queue()? Okay, will change. - It's hard to believe this is right: + if (parallelstmt->inst_options) + receiver = None_Receiver; Really? Flush the tuples if there are *any instrumentation options whatsoever*? At the very least, that doesn't look too future-proof, but I'm suspicious that it's outright incorrect. The instrumentation info is for an explain statement where we don't need tuples, and it is set the same way for it as well; refer ExplainOnePlan(). What makes you feel this is incorrect? - I think ParallelStmt probably shouldn't be defined in parsenodes.h. That file is included in a lot of places, and adding all of those extra #includes there doesn't seem like a good idea for modularity reasons even if you don't care about partial rebuilds. Something that includes a shm_mq obviously isn't a parse node in any meaningful sense anyway. How about tcop/tcopprot.h? - I don't think you need both setup cost and startup cost. Starting up more workers isn't particularly more expensive than starting up fewer of them, because most of the overhead is in waiting for them to actually start, and if the number of workers is reasonable, then they'll all be doing that in parallel with each other. I suggest removing parallel_startup_cost and keeping parallel_setup_cost. There is some work (like creation of shm queues, launching of workers) which is done proportional to the number of workers during setup time. I have kept 2 parameters to distinguish such work. I think you have a point that the start of some or all workers could be parallel, but I feel that still is work proportional to the number of workers. For future parallel operations also such a parameter could be useful where we need to set up IPC between workers or some other stuff where work is proportional to workers. 
- In cost_funnel(), I don't think it's right to divide the run cost by nWorkers + 1. Suppose we've got a plan that looks like this:

Funnel
-> Hash Join
   -> Partial Seq Scan on a
   -> Hash
      -> Seq Scan on b

The sequential scan on b is going to get executed once per worker, whereas the effort for the sequential scan on a is going to be divided over all the workers. So the right way to cost this is as follows: (a) The cost of the partial sequential scan on a is equal to the cost of a regular sequential scan, plus a little bit of overhead to account for communication via the ParallelHeapScanDesc, divided by the number of workers + 1. (b) The cost of the remaining nodes under the funnel works normally. (c) The cost of the funnel is equal to the cost of the hash join plus number of tuples multiplied by per-tuple communication overhead plus a large fixed overhead reflecting the time it takes the workers to start. IIUC, the change for this would be to remove the change related to run cost
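The arithmetic in (a) and (c) above can be sketched like this (Python used only for illustration; the function names and parameters are made up, not the actual costing code in costsize.c):

```python
def partial_scan_cost(seqscan_cost, parallel_scan_overhead, nworkers):
    # (a) a regular seq scan plus a little overhead for coordination via
    # the ParallelHeapScanDesc, divided among the workers plus the leader.
    return (seqscan_cost + parallel_scan_overhead) / (nworkers + 1)

def funnel_cost(subplan_cost, ntuples, per_tuple_comm_cost, worker_startup_cost):
    # (c) the subplan's cost, plus per-tuple transfer overhead,
    # plus a large fixed overhead for starting the workers.
    return subplan_cost + ntuples * per_tuple_comm_cost + worker_startup_cost

print(partial_scan_cost(1000.0, 10.0, 3))       # 252.5
print(funnel_cost(500.0, 1000, 0.25, 100.0))    # 850.0
```

The key point of the proposal is that only the partial scan's cost is divided by (nWorkers + 1); nodes like the inner seq scan on b run once per worker, so their cost passes through undivided in (b).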
Re: [HACKERS] anole - test case sha2 fails on all branches
Hi Noah The build at commit 36e5247 also failed. With new buildfarm client 4.15, we used additional configure options like --with-gssapi --with-libxml --with-ldap --with-libxslt. The build at commit 36e5247 and last commit is successful without these options. I'll check which of these libs is causing an issue and if updating it resolves it. Thanks. On Fri, Apr 24, 2015 at 7:55 AM, Noah Misch n...@leadboat.com wrote: On Thu, Apr 23, 2015 at 10:54:45AM -0400, Tom Lane wrote: I wrote: Given that anole is the only one reporting this, I'm not sure that we should immediately blame Postgres itself. I have a vague recollection that we've seen this symptom before and traced it to a bug in some supporting library. Is anole using any particularly out-of-date versions of openssl, kerberos, etc? A bit of digging in the archives suggests that my hindbrain remembered this: http://www.postgresql.org/message-id/flat/4f5a8404.8020...@dunslane.net The specifics probably don't apply to anole, but the conclusion that inconsistent openssl header and library files triggered the bug might. A library problem is plausible. anole builds without OpenSSL, and I have no guess for which remaining library could be at fault. I could not reproduce this in an HP-UX IA-64 build configured as follows (no HP compiler available): ./configure --enable-debug --enable-cassert --enable-depend --without-readline --without-zlib CC='gcc -pthread -mlp64' Sandeep, I suggest trying a build at commit 36e5247, the last REL9_4_STABLE commit known good on anole. If that build fails, you'll know there's an environmental problem, like a broken dependency library. If that build succeeds, please use git bisect to find which commit broke things, and report the commit hash here. Thanks, nm -- Sandeep Thakkar
Re: [HACKERS] Moving ExecInsertIndexTuples and friends to new file
On 04/24/2015 09:36 AM, Heikki Linnakangas wrote: On 04/24/2015 06:30 AM, Stephen Frost wrote: * Peter Geoghegan (p...@heroku.com) wrote: On Thu, Apr 23, 2015 at 12:05 PM, Heikki Linnakangas hlinn...@iki.fi wrote: While looking at Peter's INSERT ... ON CONFLICT patch, I started to feel that ExecInsertIndexTuples() and friends would deserve a file of their own, and not be buried in the middle of execUtils.c. I propose that we split execUtils.c into two, moving ExecOpenIndices(), ExecCloseIndices() ExecInsertIndexTuples() and related functions into a new file called executor/execIndexing.c. That split makes a lot of sense to me. No objections here. Ok, moved. I wrote a little overview text on how unique and exclusion constraints are enforced. Most of the information can be gleaned from comments elsewhere, but I think it's helpful to have it in one place. Makes it easier to compare how unique and exclusion constraints work. The harmless deadlocks with exclusion constraints are not explained elsewhere AFAICS. This is also in preparation for Peter's INSERT ON CONFLICT patch. That will add another section explaining how the deadlocks and livelocks are avoided. That's easier to understand after you grok the potential for deadlocks with exclusion constraints. This also removes a comment from 1989 claiming that the code should be moved elsewhere. I think the code is in the right place. - Heikki diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c index de8fcdb..a697682 100644 --- a/src/backend/executor/execIndexing.c +++ b/src/backend/executor/execIndexing.c @@ -1,7 +1,55 @@ /*- * * execIndexing.c - * executor support for maintaining indexes + * routines for inserting index tuples and enforcing unique and + * exclusive constraints. + * + * ExecInsertIndexTuples() is the main entry point. It's called after + * inserting a tuple to the heap, and it inserts corresponding index tuples + * into all indexes. 
At the same time, it enforces any unique and + * exclusion constraints: + * + * Unique Indexes + * -- + * + * Enforcing a unique constraint is straightforward. When the index AM + * inserts the tuple to the index, it also checks that there are no + * conflicting tuples in the index already. It does so atomically, so that + * even if two backends try to insert the same key concurrently, only one + * of them will succeed. All the logic to ensure atomicity, and to wait + * for in-progress transactions to finish, is handled by the index AM. + * + * If a unique constraint is deferred, we request the index AM to not + * throw an error if a conflict is found. Instead, we make note that there + * was a conflict and return the list of indexes with conflicts to the + * caller. The caller must re-check them later, by calling index_insert() + * with the UNIQUE_CHECK_EXISTING option. + * + * Exclusion Constraints + * - + * + * Exclusion constraints are different from unique indexes in that when the + * tuple is inserted to the index, the index AM does not check for + * duplicate keys at the same time. After the insertion, we perform a + * separate scan on the index to check for conflicting tuples, and if one + * is found, we throw an error and the transaction is aborted. If the + * conflicting tuple's inserter or deleter is in-progress, we wait for it + * to finish first. + * + * There is a chance of deadlock, if two backends insert a tuple at the + * same time, and then perform the scan to check for conflicts. They will + * find each other's tuple, and both try to wait for each other. The + * deadlock detector will detect that, and abort one of the transactions. + * That's fairly harmless, as one of them was bound to abort with a + * duplicate key error anyway, although you get a different error + * message. + * + * If an exclusion constraint is deferred, we still perform the conflict + * checking scan immediately after inserting the index tuple. 
But instead + * of throwing an error if a conflict is found, we return that information + * to the caller. The caller must re-check them later by calling + * check_exclusion_constraint(). + * * * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group * Portions Copyright (c) 1994, Regents of the University of California @@ -134,13 +182,10 @@ ExecCloseIndices(ResultRelInfo *resultRelInfo) * This routine takes care of inserting index tuples * into all the relations indexing the result relation * when a heap tuple is inserted into the result relation. - * Much of this code should be moved into the genam - * stuff as it only exists here because the genam stuff - * doesn't provide the functionality needed by the - * executor.. -cim 9/27/89 * - * This returns a list of index OIDs for any unique or exclusion - * constraints that are deferred and that had + * Unique
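The insert-first-then-check ordering described in the patch comment above is exactly what creates the mutual-wait window. A toy single-process simulation (Python, purely illustrative — this is not the executor code, and the helper names are invented) shows how two concurrent inserters can each find the other's tuple in the conflict scan:

```python
# Shared "index": each backend inserts its entry before scanning for conflicts,
# mirroring the ordering used for exclusion constraints.
index = []

def insert_tuple(backend, key):
    index.append((backend, key))

def conflicting_backends(backend, key):
    # The post-insert conflict scan: find other backends' overlapping ranges.
    overlaps = lambda a, b: a[0] < b[1] and b[0] < a[1]
    return [b for (b, k) in index if b != backend and overlaps(k, key)]

# Both backends insert overlapping ranges before either runs its check...
insert_tuple("A", (1, 5))
insert_tuple("B", (3, 8))
# ...so each scan sees the other's in-progress tuple. In the real system each
# backend would wait for the other to finish, and the deadlock detector must
# break the tie by aborting one of them.
print(conflicting_backends("A", (1, 5)))  # ['B']
print(conflicting_backends("B", (3, 8)))  # ['A']
```

This is why the deadlock is called harmless: one of the two transactions was going to fail with a conflict error anyway, so aborting it via the deadlock detector merely changes the error message.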
Re: [HACKERS] Replication identifiers, take 4
On 2015-04-20 11:26:29 +0300, Heikki Linnakangas wrote: On 04/17/2015 11:54 AM, Andres Freund wrote: I've attached a rebased patch, that adds decision about origin logging to the relevant XLogInsert() callsites for external 2 byte identifiers and removes the pad-reusing version in the interest of moving forward. Putting aside the 2 vs. 4 byte identifier issue, let's discuss naming: I just realized that it talks about replication identifier as the new fundamental concept. The system table is called pg_replication_identifier. But that's like talking about index identifiers, instead of just indexes, and calling the system table pg_index_oid. The important concept this patch actually adds is the *origin* of each transaction. That term is already used in some parts of the patch. I think we should roughly do a search-replace of replication identifier -> replication origin to the patch. Or even transaction origin. Attached is a patch that does this, and some more, renaming. That was more work than I'd imagined. I've also made the internal naming in origin.c more consistent/simpler and did a bunch of other cleanup. I'm pretty happy with this state. Greetings, Andres Freund From fc406a87f2e2ac08a7ac112ef6b75be1e8256a16 Mon Sep 17 00:00:00 2001 From: Andres Freund and...@anarazel.de Date: Fri, 24 Apr 2015 14:25:56 +0200 Subject: [PATCH] Introduce replication progress tracking infrastructure. (v2.0) When implementing a replication solution on top of logical decoding, two related problems exist: * How to safely keep track of replication progress * How to change replication behavior, based on the origin of a row; e.g. to avoid loops in bi-directional replication setups The solution to these problems, as implemented in this commit, consists of three parts: 1) 'replication origins', which identify nodes in a replication setup. 2) 'replication progress tracking', which remembers, for each replication origin, how far replay has progressed in an efficient and crash safe manner. 
3) The ability to filter out changes performed on the behest of a replication origin during logical decoding; this allows complex replication topologies. Most of this could also be implemented in userspace, e.g. by inserting additional rows containing origin information, but that ends up being much less efficient and more complicated. We don't want to require various replication solutions to reimplement logic for this independently. The infrastructure is intended to be generic enough to be reusable. This infrastructure also replaces the 'nodeid' infrastructure of commit timestamps. It is intended to provide all former capabilities, except that there are only 2^16 different origins; but now they integrate with logical decoding. Additionally more functionality is accessible via SQL. Since the commit timestamp infrastructure has also been introduced in 9.5 that's not a problem. For now the number of origins for which the replication progress can be tracked is determined by the max_replication_slots GUC. That GUC is not a perfect match to configure this, but there doesn't seem to be sufficient reason to introduce a separate new one. Bumps both catversion and wal page magic. 
Author: Andres Freund, with contributions from Petr Jelinek and Craig Ringer Reviewed-By: Heikki Linnakangas, Robert Haas, Steve Singer Discussion: 20150216002155.gi15...@awork2.anarazel.de, 20140923182422.ga15...@alap3.anarazel.de, 20131114172632.ge7...@alap2.anarazel.de --- contrib/test_decoding/Makefile |3 +- contrib/test_decoding/expected/replorigin.out | 141 +++ contrib/test_decoding/sql/replorigin.sql| 64 + contrib/test_decoding/test_decoding.c | 28 + doc/src/sgml/catalogs.sgml | 123 ++ doc/src/sgml/filelist.sgml |1 + doc/src/sgml/func.sgml | 201 ++- doc/src/sgml/logicaldecoding.sgml | 35 +- doc/src/sgml/postgres.sgml |1 + doc/src/sgml/replication-origins.sgml | 93 ++ src/backend/access/heap/heapam.c| 19 + src/backend/access/rmgrdesc/Makefile|4 +- src/backend/access/rmgrdesc/replorigindesc.c| 61 + src/backend/access/rmgrdesc/xactdesc.c | 24 +- src/backend/access/transam/commit_ts.c | 53 +- src/backend/access/transam/rmgr.c |1 + src/backend/access/transam/xact.c | 72 +- src/backend/access/transam/xlog.c |8 + src/backend/access/transam/xloginsert.c | 27 +- src/backend/access/transam/xlogreader.c |6 + src/backend/catalog/Makefile|2 +- src/backend/catalog/catalog.c |8 +- src/backend/catalog/system_views.sql|7 + src/backend/replication/logical/Makefile|3 +- src/backend/replication/logical/decode.c| 49 +-
Re: [HACKERS] a fast bloat measurement tool (was Re: Measuring relation free space)
On Fri, Apr 24, 2015 at 8:46 AM, Abhijit Menon-Sen a...@2ndquadrant.com wrote: At 2015-04-24 08:35:40 +0530, amit.kapil...@gmail.com wrote: Just stick a PG_RETURN_NULL() at the end? That should also work. OK, updated patch attached with just that one change. Patch looks good to me, will mark it as Ready for Committer if Tomas or anyone else doesn't have more to add in terms of review. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Moving ExecInsertIndexTuples and friends to new file
On Fri, Apr 24, 2015 at 6:38 AM, Heikki Linnakangas hlinn...@iki.fi wrote: I wrote a little overview text on how unique and exclusion constraints are enforced. Most of the information can be gleaned from comments elsewhere, but I think it's helpful to have it in one place. Makes it easier to compare how unique and exclusion constraints work. The harmless deadlocks with exclusion constraints are not explained elsewhere AFAICS. FWIW, both Jeff Davis and Tom Lane were well aware of this issue back when exclusion constraints went in - it was judged to be acceptable at the time, which I agree with. I happened to discuss this with Jeff in New York recently. I agree that it should definitely be documented like this (and the fact that ordinary unique indexes are unaffected, too). This is also in preparation for Peter's INSERT ON CONFLICT patch. That will add another section explaining how the deadlocks and livelocks are avoided. That's easier to understand after you grok the potential for deadlocks with exclusion constraints. Makes sense. This also removes a comment from 1989 claiming that the code should be moved elsewhere. I think the code is in the right place. +1 -- Peter Geoghegan
Re: [HACKERS] improving speed of make check-world
On 4/23/15 1:22 PM, Jeff Janes wrote: Something about this commit (dcae5faccab64776376d354d) broke make check in parallel conditions when started from a clean directory. It fails with a different error each time, one example: make -j4 check > /dev/null In file included from gram.y:14515: scan.c: In function 'yy_try_NUL_trans': scan.c:10307: warning: unused variable 'yyg' /usr/bin/ld: tab-complete.o: No such file: No such file or directory collect2: ld returned 1 exit status make[3]: *** [psql] Error 1 make[2]: *** [all-psql-recurse] Error 2 make[2]: *** Waiting for unfinished jobs make[1]: *** [all-bin-recurse] Error 2 make: *** [all-src-recurse] Error 2 make: *** Waiting for unfinished jobs I think the problem is that check depends on all, but now also depends on temp-install, which in turn runs install and all. With a sufficient amount of parallelism, you end up running two alls on top of each other. It seems this can be fixed by removing the check: all dependency. Try removing that in the top-level GNUmakefile.in and see if the problem goes away. For completeness, we should then also remove it in the other makefiles.
Re: [HACKERS] pg_dump: largeobject behavior issues (possible bug)
On 04/24/2015 03:41 PM, Tom Lane wrote: Given that large objects don't have any individual dependencies, one could envision fixing this by replacing the individual large-object DumpableObjects by a single placeholder to participate in the sort phase, and then when it's time to dump that, scan the large objects using a cursor and create/print/delete the information separately for each one. This would likely involve some rather painful refactoring in pg_dump however. Andrew G mentioned something about using a cursor for the main query that pulls the info. He said that it wasn't a solution but may be a bandaid (my words). Is that something we may want to look into as a stop gap? regards, tom lane -- The most kicking donkey PostgreSQL Infrastructure company in existence. The oldest, the most experienced, the consulting company to the stars. Command Prompt, Inc. http://www.commandprompt.com/ +1 -503-667-4564 - 24x7 - 365 - Proactive and Managed Professional Services!
Re: [HACKERS] INSERT ... ON CONFLICT IGNORE (and UPDATE) 3.0
Peter, * Peter Geoghegan (p...@heroku.com) wrote: I came up with something that is along the lines of what we discussed. I'll wait for you to push Dean's code, and rebase on top of that before submitting what I came up with. Maybe this can be rolled into my next revision if I work through Andres' most recent feedback without much delay. This is done (finally!). Please take a look and let me know if you have any questions or concerns. Happy to chat again also, of course. Thanks! Stephen
Re: [HACKERS] Fwd: [GENERAL] 4B row limit for CLOB tables
On 24/04/15 05:24, Tom Lane wrote: Stephen Frost sfr...@snowman.net writes: * Bruce Momjian (br...@momjian.us) wrote: On Sun, Feb 1, 2015 at 03:54:03PM +0100, Álvaro Hernández Tortosa wrote: The problem here is that performance degrades exponentially, or worse. Speaking here from experience, we already tested this for a very similar case (table creation, where two oids are consumed from a global sequence when inserting to pg_class). Have a look at http://www.slideshare.net/nosys/billion-tables-project-nycpug-2013, slides 43-45. We tested this scenario there and showed that table creations per second dropped from 10K to a few per second and then to a few per day. In the graphs you can't even tell that more tables were being created. At around 8K tables from the theoretical limit of 4B oids consumed, the process basically stopped (doing more insertions). We don't report the maximum number of tables per database, or the maximum number of TOAST values. Agreed? For my 2c, this limitation is a surprise to users and therefore we should add documentation to point out that it exists, unless we're going to actually fix it (which is certainly what I'd prefer to see...). TBH, I've got very little enthusiasm for fixing this given the number of reports of trouble from the field, which so far as I recall is zero. Álvaro's case came up through intentionally trying to create an unreasonable number of tables, not from real usage. This thread likewise appears to contain lots of speculation and no reports of anyone hitting a problem in practice. It is certainly true that this was a very synthetic case. I envision, however, certain use cases where we may hit a very large number of tables: - Massive multitenancy - Aggressive partitioning - Massive multitenancy with aggressive partitioning - Software dynamically generated tables, like those created by ToroDB (https://github.com/torodb/torodb). 
In ToroDB we generate tables depending only on the input data, so we may end up having as many as required by the datasource. For example, a general purpose json datastore may generate several tables per document inserted. Certainly this is likely to become an issue at some point in the future, but I'm not finding it very compelling to worry about now. By the time it does become an issue, we may have additional considerations or use cases that should inform a solution; which seems to me to be a good argument not to try to fix it in advance of real problems. Perhaps, I understand this argument, and it makes sense. However, on the other side, given the long time it may take from patch to commit and then release version to companies finally using it in production, I'd rather try to fix it soon, as there are already reports and use cases that may hit it, rather than wait three years until it explodes in our faces. After all, 640Kb RAM is enough, right? So maybe 2B tables is not that far in the horizon. Who knows. Regards, Álvaro -- Álvaro Hernández Tortosa --- 8Kdata
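The slowdown Álvaro measured follows from how a wrapping 32-bit counter behaves once the space is nearly full: each allocation must probe already-used values until it finds a free one. A hypothetical sketch of that dynamic (Python for illustration only; the real GetNewOidWithIndex probes an index per candidate rather than a set):

```python
def allocate_id(used, next_id, limit=2**32):
    """Advance a wrapping 32-bit counter until an unused value is found.
    Returns (allocated_value, number_of_probes)."""
    probes = 0
    candidate = next_id
    while True:
        probes += 1
        if candidate not in used:
            return candidate, probes
        candidate = (candidate + 1) % limit or 1  # skip 0 (InvalidOid)

# With a nearly-empty space, one probe suffices; with a dense run of
# consumed values, every allocation pays for scanning the whole run.
print(allocate_id(set(), 42))       # (42, 1)
print(allocate_id({5, 6, 7}, 5))    # (8, 4)
```

As the fraction of consumed identifiers approaches 1, the expected probe count grows without bound, which matches the observed drop from 10K creations per second to a few per day.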
Re: [HACKERS] Row security violation error is misleading
Dean, * Dean Rasheed (dean.a.rash...@gmail.com) wrote: On 21 April 2015 at 20:50, Stephen Frost sfr...@snowman.net wrote: Thanks a lot for this. Please take a look at the attached. I've given this a quick read-through, and it looks good to me. The interaction of permissive and restrictive policies from hooks matches my expectations, and it's a definite improvement having tests for RLS hooks. The only thing I spotted was that the file comment for test_rls_hooks.c needs updating. Fixed! Is there any documentation for hooks? If not, perhaps that's something we should be considering too. We don't generally document hooks beyond in the source code.. I'm happy to expand on what's there, if anyone feels it'd be helpful to do so. Having the test module is a form of documentation also, of course. Thanks! Stephen
Re: [HACKERS] INSERT ... ON CONFLICT IGNORE (and UPDATE) 3.0
On Fri, Apr 24, 2015 at 5:50 PM, Stephen Frost sfr...@snowman.net wrote: * Peter Geoghegan (p...@heroku.com) wrote: I came up with something that is along the lines of what we discussed. I'll wait for you to push Dean's code, and rebase on top of that before submitting what I came up with. Maybe this can be rolled into my next revision if I work through Andres' most recent feedback without much delay. This is done (finally!). Please take a look and let me know if you have any questions or concerns. Happy to chat again also, of course. Great, thanks. I didn't actually wait for you (as earlier indicated) before posting the new approach to RLS in V3.4. However, I have some decent tests for the new behaviors (I did test-driven development), so I think that when I rebase on top of the new RLS stuff tomorrow, I'll find that it won't be too difficult. -- Peter Geoghegan
Re: [HACKERS] pg_dump: largeobject behavior issues (possible bug)
Andrew Dunstan and...@dunslane.net writes: On 04/23/2015 04:04 PM, Andrew Gierth wrote: The relevant code is getBlobs in pg_dump.c, which queries the whole of pg_largeobject_metadata without using a cursor (so the PGresult is already huge thanks to having 100 million rows), and then mallocs a BlobInfo array and populates it from the PGresult, also using pg_strdup for the oid string, owner name, and ACL if any. I'm surprised this hasn't come up before. I have a client that I persuaded to convert all their LOs to bytea fields because of problems with pg_dump handling millions of LOs, and kept them on an older postgres version until they made that change. Yeah, this was brought up when we added per-large-object metadata; it was obvious that that patch would cause pg_dump to choke on large numbers of large objects. The (perhaps rather lame) argument was that you wouldn't have that many of them. Given that large objects don't have any individual dependencies, one could envision fixing this by replacing the individual large-object DumpableObjects by a single placeholder to participate in the sort phase, and then when it's time to dump that, scan the large objects using a cursor and create/print/delete the information separately for each one. This would likely involve some rather painful refactoring in pg_dump however. regards, tom lane
Re: [HACKERS] Row security violation error is misleading
Dean, * Dean Rasheed (dean.a.rash...@gmail.com) wrote: The second patch [2] is the one that is actually relevant to this thread. This patch is primarily to apply the RLS checks earlier, before an update is attempted, more like a regular permissions check. This adds a new enum to classify the kinds of WCO, a side benefit of which is that it allows different error messages when RLS checks are violated, as opposed to WITH CHECK OPTIONs on views. I've gone ahead and pushed this, please take a look and test it and let me know if you see any issues or have any questions or concerns. Thanks! Stephen
Re: [HACKERS] Feedback on getting rid of VACUUM FULL
Jim Nasby wrote: It looks like the biggest complaint (aside from allowing a limited number of tuples to be moved) is in [1] and [2], where Tom is saying that you can't simply call heap_update() like this without holding an exclusive lock on the table. Is that because we're not actually changing the tuple? That's nonsense -- obviously UPDATE can do heap_update without an exclusive lock on the table, so the explanation must be something else. I think his actual complaint was that you can't remove the old tuple until concurrent readers of the table have already finished scanning it, or you get into a situation where they might need to read the page in which the original version resided, but your mini-vacuum already removed it. So before removing it you need to wait until they are all finished. This is the reason I mentioned CREATE INDEX CONCURRENTLY: if you wait until those transactions are all gone (like CIC does), you are then free to remove the old versions of the tuple, because you know that all readers have a snapshot new enough to see the new version of the tuple. Another issue is both HOT and KeyUpdate; I think we need to completely ignore/over-ride that stuff for this. You don't need anything for HOT, because cross-page updates are never HOT. Not sure what you mean about KeyUpdate, but yeah you might need something there -- obviously, you don't want to create multixacts unnecessarily. Instead of adding forcefsm, I think it would be more useful to accept a target block number. That way we can actually control where the new tuple goes. Whatever makes the most sense, I suppose. (Maybe we shouldn't consider this a tweaked heap_update -- which is already complex enough -- but a separate heapam entry point.) 
-- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services
Re: [HACKERS] Feedback on getting rid of VACUUM FULL
On 4/24/15 5:30 PM, Alvaro Herrera wrote: Jim Nasby wrote: It looks like the biggest complaint (aside from allowing a limited number of tuples to be moved) is in [1] and [2], where Tom is saying that you can't simply call heap_update() like this without holding an exclusive lock on the table. Is that because we're not actually changing the tuple? That's nonsense -- obviously UPDATE can do heap_update without an exclusive lock on the table, so the explanation must be something else. I think his actual complaint was that you can't remove the old tuple until concurrent readers of the table have already finished scanning it, or you get into a situation where they might need to read the page in which the original version resided, but your mini-vacuum already removed it. So before removing it you need to wait until they are all finished. This is the reason I mentioned CREATE INDEX CONCURRENTLY: if you wait until those transactions are all gone (like CIC does), you are then free to remove the old versions of the tuple, because you know that all readers have a snapshot new enough to see the new version of the tuple. Except I don't see anywhere in the patch that's actually removing the old tuple... Another issue is both HOT and KeyUpdate; I think we need to completely ignore/over-ride that stuff for this. You don't need anything for HOT, because cross-page updates are never HOT. Not sure what you mean about KeyUpdate, but yeah you might need something there -- obviously, you don't want to create multixacts unnecessarily. If I'm not mistaken, if there's enough room left on the page then HeapSatisfiesHOTandKeyUpdate() will say this tuple satisfies HOT. So we'd have to do something to over-ride that, and I don't think the current patch does that. (It might force it to a new page anyway, but it does nothing with satisfies_hot, which I suspect isn't safe.) Instead of adding forcefsm, I think it would be more useful to accept a target block number. 
That way we can actually control where the new tuple goes. Whatever makes the most sense, I suppose. (Maybe we shouldn't consider this a tweaked heap_update -- which is already complex enough -- but a separate heapam entry point.) Yeah, I thought about creating heap_move, but I suspect that would still have to worry about a lot of this other stuff anyway. Far more likely for a change to be missed in heap_move than heap_update too. I am tempted to add a SQL heap_move function though, assuming it's not much extra work. -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com
Re: [HACKERS] INSERT ... ON CONFLICT IGNORE (and UPDATE) 3.0
* Peter Geoghegan (p...@heroku.com) wrote: On Fri, Apr 24, 2015 at 5:50 PM, Stephen Frost sfr...@snowman.net wrote: * Peter Geoghegan (p...@heroku.com) wrote: I came up with something that is along the lines of what we discussed. I'll wait for you to push Dean's code, and rebase on top of that before submitting what I came up with. Maybe this can be rolled into my next revision if I work through Andres' most recent feedback without much delay. This is done (finally!). Please take a look and let me know if you have any questions or concerns. Happy to chat again also, of course. Great, thanks. I didn't actually wait for you (as earlier indicated) before posting the new approach to RLS in V3.4. However, I have some decent tests for the new behaviors (I did test-driven development), so I think that when I rebase on top of the new RLS stuff tomorrow, I'll find that it won't be too difficult. Fantastic, glad to hear it! Thanks! Stephen
Re: [HACKERS] PL/pgSQL, RAISE and error context
2015-04-24 16:02 GMT+02:00 Joel Jacobson j...@trustly.com: On Fri, Apr 24, 2015 at 1:16 PM, Robert Haas robertmh...@gmail.com wrote: This would allow doing something as crazy as: suppress_context_messages = warning,error display_context_messages = notice This is a very flexible proposal, but it's a tremendous amount of machinery for what's really a very minor issue. If we added two GUCs for every comparably important issue, we'd have about 40,000 of them. I agree. The one-dimensional GUC syntax is not well suited for multi-dimensional config settings. And that's a good thing, mostly, I think. It would be a nightmare if the config file values could be in JSON format; it's good that they are simple. But I'm thinking maybe we could improve the config file syntax for the general case when you have multiple things you want to control, in this case the message levels, and for each such thing, you want to turn something on/off, in this case the CONTEXT. Maybe we could simply use plus + and minus - to mean on and off? Example: context_messages = -warning, -error, +notice I prefer your first proposal - and there is a precedent for plpgsql - plpgsql_extra_checks It is clean for anybody. +-identifiers look like horrible httpd config. :) Regards Pavel
Re: [HACKERS] a fast bloat measurement tool (was Re: Measuring relation free space)
On 04/24/15 14:58, Amit Kapila wrote: On Fri, Apr 24, 2015 at 8:46 AM, Abhijit Menon-Sen a...@2ndquadrant.com wrote: At 2015-04-24 08:35:40 +0530, amit.kapil...@gmail.com wrote: Just stick a PG_RETURN_NULL() at the end? That should also work. OK, updated patch attached with just that one change. Patch looks good to me, will mark it as Ready for Committer if Tomas or anyone else doesn't have more to add in terms of review. FWIW, I'm OK with the patch as is. Your reviews were spot on. regards -- Tomas Vondra http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services
Re: [HACKERS] Fwd: [GENERAL] 4B row limit for CLOB tables
On 4/24/15 7:11 PM, Álvaro Hernández Tortosa wrote: On 24/04/15 05:24, Tom Lane wrote: ... TBH, I've got very little enthusiasm for fixing this given the number of reports of trouble from the field, which so far as I recall is zero. Álvaro's case came up through intentionally trying to create an unreasonable number of tables, not from real usage. This thread likewise appears to contain lots of speculation and no reports of anyone hitting a problem in practice. It is certainly true that this was a very synthetic case. I envision, however, certain use cases where we may hit a very large number of tables: The original case has NOTHING to do with the number of tables and everything to do with the number of toasted values a table can have. If you have to toast 4B attributes in a single relation it will fail. In reality, if you get anywhere close to that things will fall apart due to OID conflicts. This case isn't nearly as insane as 4B tables. A table storing 10 text fields each of which is 2K would hit this limit with only 400M rows. If my math is right that's only 8TB; certainly not anything insane space-wise or rowcount-wise. Perhaps it's still not worth fixing, but I think it's definitely worth documenting. -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com
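A quick sanity check of Jim's arithmetic; the table shape here (10 toastable text columns of roughly 2 kB each, 400 million rows) is just the hypothetical from the message above:

```sql
-- Each out-of-line value consumes one OID from the table's toast OID space,
-- which is bounded by the 32-bit OID counter.
SELECT 400000000::bigint * 10        AS toast_values_needed,  -- 4 billion
       2::numeric ^ 32               AS oid_limit,            -- ~4.29 billion
       400000000::bigint * 10 * 2048 AS approx_heap_bytes;    -- ~8 * 10^12, i.e. ~8 TB
```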
Re: [HACKERS] INSERT ... ON CONFLICT IGNORE (and UPDATE) 3.0
On Thu, Apr 23, 2015 at 6:07 PM, Peter Geoghegan p...@heroku.com wrote: Curious about what you think about my proposal to go back to putting the inference specification WHERE clause within the parenthesis, since this solves several problems, including clarifying to users that the predicate is part of the inference clause. I've *provisionally* pushed code that goes back to the old way, Andres: https://github.com/petergeoghegan/postgres/commit/2a5d80b27d2c5832ad26dde4651c64dd2004f401 Perhaps this is the least worst way, after all. -- Peter Geoghegan
Re: [HACKERS] Reducing tuple overhead
On Sat, Apr 25, 2015 at 1:58 AM, Jim Nasby jim.na...@bluetreble.com wrote: On 4/23/15 10:40 PM, Amit Kapila wrote: I agree with you, and I think one of the major reasons for bloat is that the Index segment doesn't have visibility information, due to which clearing of the Index needs to be tied to the heap. Now if we can move transaction information to the page level, then we can even think of having it in the Index segment as well, and then the Index can delete/prune its tuples on its own, which can reduce the bloat in the index significantly, and there is a benefit to Vacuum as well. I don't see how putting visibility at the page level helps indexes at all. We could already put XMIN in indexes if we wanted, but it won't help, because... We can do that by putting transaction info at the tuple level in the index as well, but that would be a huge increase in the size of the index unless we devise a way to have a variable index tuple header rather than a fixed one. Now this has some downsides as well, like Delete needing to traverse the Index segment as well to delete-mark the tuples, but I think the upsides of reducing bloat can certainly outweigh the downsides. ... which isn't possible. You can not go from a heap tuple to an index tuple. We will have access to the index value during delete, so why do you think that we need linkage between heap and index tuple to perform the Delete operation? I think we need to think more about how to design Delete ... by CTID, but that should be doable. This has been discussed in the past. I have tried to search the archives, but I am not finding what the exact problem was. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
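For what it's worth, SQL can already address a heap tuple by its physical location; the open question in the thread is doing the equivalent *inside an index*, without a heap-to-index back-link. A minimal illustration (the table and values are made up):

```sql
CREATE TABLE demo (id int, val text);
INSERT INTO demo VALUES (1, 'a'), (2, 'b');

-- ctid is the tuple's (block, offset) address in the heap.
SELECT ctid, id FROM demo;

-- A tuple can be deleted by that address directly.
DELETE FROM demo WHERE ctid = '(0,2)';
```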
Re: [HACKERS] PL/pgSQL, RAISE and error context
On Fri, Apr 24, 2015 at 6:07 PM, Pavel Stehule pavel.steh...@gmail.com wrote: Example: context_messages = -warning, -error, +notice I prefer your first proposal - and there is a precedent for plpgsql - plpgsql_extra_checks It is clean for anybody. +-identifiers look like horrible httpd config. :) I have to agree on that :) Just thought this is the best we can do if we want to reduce the number of GUCs to a minimum.
Re: [HACKERS] tablespaces inside $PGDATA considered harmful
On Wed, Apr 22, 2015 at 10:41:02PM -0400, Bruce Momjian wrote: josh=# create tablespace tbl2 location '/home/josh/pg94/data/pg_xlog/'; CREATE TABLESPACE It really seems like we ought to block *THAT*. Of course, if we block tablespace creation in PGDATA generally, then that's covered. I have developed the attached patch to warn about creating tablespaces inside the data directory. The case this doesn't catch is referencing a symbolic link that points to the same directory. We can't make it an error so people can use pg_upgrade with these setups. This would be for 9.5 only. OK, based on later discussions, I have updated my 9.5 patch to have pg_upgrade also display a warning (the warning will also appear in the pg_upgrade logs, but I doubt the user will see it), e.g.: Setting next OID for new cluster ok Sync data directory to disk ok Creating script to analyze new cluster ok WARNING: user-defined tablespace locations should not be inside the data directory, e.g. /u/pgsql.old/data/pg_tblspc Upgrade Complete Optimizer statistics are not transferred by pg_upgrade so, once you start the new server, consider running: ./analyze_new_cluster.sh Could not create a script to delete the old cluster's data files because user-defined tablespaces exist in the old cluster directory. The old cluster's contents must be deleted manually. This way, both pg_dump and pg_upgrade will issue warnings, though, of course, those warnings can be ignored. I am hopeful these two warnings will be sufficient and we will not need to make these errors, with the possible inconvenience that would cause. I am still afraid that someone will ignore the new errors pg_dump would generate and lose data. I just don't remember enough cases where we threw new errors on _data_ restore.
Frankly, those using pg_upgrade already will have to move the old tablespaces out of the old cluster if they ever want to delete those clusters, so I am hopeful these additional warnings will help eliminate this practice, which is already cumbersome and useless. I am not planning to revisit this for 9.6. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
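As an aside, existing setups can be checked for this situation from SQL; a rough query (assuming the connecting role may read data_directory) might look like:

```sql
-- List user-defined tablespaces whose location falls under the data directory.
SELECT spcname, pg_tablespace_location(oid) AS location
FROM pg_tablespace
WHERE pg_tablespace_location(oid) LIKE
      current_setting('data_directory') || '/%';
```

This does not catch the symlink case Bruce mentions, for the same reason the patch doesn't.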
Re: [HACKERS] PL/pgSQL, RAISE and error context
2015-04-24 19:16 GMT+02:00 Joel Jacobson j...@trustly.com: On Fri, Apr 24, 2015 at 6:07 PM, Pavel Stehule pavel.steh...@gmail.com wrote: Example: context_messages = -warning, -error, +notice I prefer your first proposal - and there is a precedent for plpgsql - plpgsql_extra_checks It is clean for anybody. +-identifiers look like horrible httpd config. :) I have to agree on that :) Just thought this is the best we can do if we want to reduce the number of GUCs to a minimum. It looks like the KDE vs. GNOME discussion. A GUC that has a simple effect on behavior without a performance impact should not be a problem - like log_lock_wait, log_min_duration and similar. I am sure we would use it. The problematic GUCs are the performance-, planner-, bgwriter- and checkpoint-related ones.
Re: [HACKERS] adding more information about process(es) cpu and memory usage
On 4/24/15 6:29 AM, Robert Haas wrote: On Thu, Apr 23, 2015 at 9:28 PM, Tom Lane t...@sss.pgh.pa.us wrote: The reason nobody's gotten around to that in the last fifteen years is that per-process rusage isn't actually all that interesting; there's too much that happens in background daemons, for instance. There's *some* stuff that happens in background daemons, but if you want to measure user and system time consumed by a particular query, this would actually be a pretty handy way to do that, I think. I more often am wondering what a running backend is doing OS-wise, but being able to see what happened when it finished would definitely be better than what's available now. -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com
Re: [HACKERS] Typo in a comment in set_rel_size()
Amit Langote langote_amit...@lab.ntt.co.jp writes: Attached fixes what I suppose is a typo: * so set up a single dummy path for it. Here we only check this for * regular baserels; if it's an otherrel, CE was already checked in -* set_append_rel_pathlist(). +* set_append_rel_size(). * It's not a typo; the comment was correct when written. But I evidently missed updating it when set_append_rel_pathlist() got split into two functions. Applied, thanks for noticing! regards, tom lane
Re: [HACKERS] Feedback on getting rid of VACUUM FULL
Heikki Linnakangas wrote: Hmm, AFAICT pg_reorg is much more complex, writing stuff to a temp table and swapping relfilenodes afterwards. More like the VACUUM REWRITE that's been discussed. For the kicks, I looked at what it would take to write a utility like that. It turns out to be quite trivial, patch attached. It uses the same principle as VACUUM FULL, scans from the end, moving tuples to lower-numbered pages until it can't do it anymore. It requires a small change to heap_update(), to override the preference to store the new tuple on the same page as the old one, but other than that, it's all in the external module. More than five years have passed since Heikki posted this, and we still haven't found a solution to the problem -- which nevertheless keeps biting people to the point that multiple user-space implementations of similar techniques are out there. I think what we need here is something that does heap_update to tuples at the end of the table, moving them to earlier pages; then wait for old snapshots to die (the infrastructure for which we have now, thanks to CREATE INDEX CONCURRENTLY); then truncate the empty pages. Of course, there are lots of details to resolve. It doesn't really matter that this runs for long: a process doing this for hours might be better than an AccessExclusiveLock on the table for a much shorter period. Are there any takers? -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services
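The user-space implementations alluded to above typically approximate this with a no-op UPDATE: rewriting the tuples stored on the table's last pages creates new versions that can land on lower-numbered pages, after which a plain VACUUM can truncate the empty tail. A rough sketch, assuming a hypothetical table t with a column id -- and subject to the very caveat Heikki's patch addresses, namely that heap_update() prefers to keep the new version on the same page when there is room:

```sql
-- Rewrite the tuples at the physical end of the table.
UPDATE t SET id = id
WHERE ctid = ANY (ARRAY(SELECT ctid FROM t ORDER BY ctid DESC LIMIT 1000));

-- Once no snapshot can still see the old versions,
-- VACUUM can truncate the now-empty trailing pages.
VACUUM t;
```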
Re: [HACKERS] Ignoring some binaries generated in src/test
Michael Paquier michael.paqu...@gmail.com writes: A couple of binaries in src/test, that are not part of the main make flow, can be built but they are actually not ignored in the tree: examples/testlibpq examples/testlibpq2 examples/testlibpq3 examples/testlibpq4 examples/testlo examples/testlo64 locale/test-ctype thread/thread_test I recall that some of them were target for removal, still shouldn't they have their own entries in a .gitignore, like in the patch attached? Perhaps, but if we're going to support doing a make in those subdirectories, I think it would also be appropriate to fix src/test/Makefile so that clean and related targets recurse to those subdirectories. The current logic in src/test/Makefile, particularly the way that the modules subdirectory is handled, seems pretty ugly/convoluted anyway. I wonder why it was done that way rather than just ensuring that modules/ doesn't do anything for make install? We'd still need special cases for examples/ et al because we don't want them built during make all, but I think just adding them to ALWAYS_SUBDIRS might suffice. regards, tom lane
Re: [HACKERS] INSERT ... ON CONFLICT IGNORE (and UPDATE) 3.0
On 04/24/2015 02:52 AM, Peter Geoghegan wrote: I found a bug that seems to be down to commit e3144183562d08e347f832f0b29daefe8bac617b on Github: commit e3144183562d08e347f832f0b29daefe8bac617b Author: Heikki Linnakangas heikki.linnakan...@iki.fi Date: Thu Apr 23 18:38:11 2015 +0300 Minor cleanup of check_exclusion_or_unique_constraint. To improve readability. At least, that's what a git bisect session showed. Ok, I see now that I totally screwed up the conditions on waitMode in that commit. I just pushed a fix to your github repository. - Heikki
Re: [HACKERS] INSERT ... ON CONFLICT IGNORE (and UPDATE) 3.0
On Fri, Apr 24, 2015 at 12:31 AM, Heikki Linnakangas hlinn...@iki.fi wrote: Ok, I see now that I totally screwed up the conditions on waitMode in that commit. I just pushed a fix to your github repository. I can confirm that the commit you just pushed prevents the implementation from attempting to lock an invisible tuple, where previously it would reliably fail given the same testcase. Thanks -- Peter Geoghegan
[HACKERS] collate.linux.utf8 test coverage
The optional buildfarm module that runs this test was broken by commit dcae5faccab64776376d354decda0017c648bb53. Since nobody has responded to my complaint about this, I have disabled it on crake, the only buildfarm machine that has actually been running the test, so we now have no buildfarm coverage for it. cheers andrew
Re: [HACKERS] Improving vacuum/VM/etc
On 4/24/15 6:50 AM, Robert Haas wrote: Thanks for looking at this. On Thu, Apr 23, 2015 at 3:09 PM, Jim Nasby jim.na...@bluetreble.com wrote: Unfortunately, the states I came up with using existing semantics don't look hugely useful[4], but if we take Robert's idea and make all-visible mean all-frozen, we can do much better: 0: Newly inserted tuples Tracking this state allows us to aggressively set hint bits. Who is us? And what do you mean by aggressively? As things stand, any process that has to touch a tuple always sets any applicable hint bits. A background process that will proactively hint tuples before a foreground process needs to. But see also below... 1: Newly deleted There are tuples that have been deleted but not pruned. There may also be newly inserted tuples that need hinting (state 0). Similar to state 0, we'd want to be fairly aggressive with these pages, because as soon as the deleting XID is committed and older than all snapshots we can prune. Because we can prune without hitting indexes, this is still a fairly cheap operation, though not as cheap as 0. What behavior difference would you foresee between state 0 and state 1? Below. 2: Fully hinted, not frozen This is the really painful state to clean up, because we have to deal with indexes. We must enter this state after being in 1. Neither the fact that a page is fully hinted nor the fact that it is or is not frozen implies anything about dealing with indexes. We need to deal with indexes because the page contains either dead tuples (as a result of an aborted insert, a committed delete, or an aborted or committed update) or dead line pointers (as a result of pruning dead tuples). The idea I was shooting for is that the worst-case scenario in cleanup is dealing with indexes, which we need to do any time a tuple becomes dead. That's why I made 1 a separate state from 0, but it occurs to me now that I wasn't very clear about this. 
My goal here is that there are two separate paths for a page to be in: either it needs index vacuuming at some point, or it doesn't. If a page is in state 0, once we can make the page all-visible/frozen it can go into state 3 and *we never have to clean it again*. OTOH, if a tuple is marked dead (non-HOT), then we can be aggressive about hinting (and pruning, if there were HOT updates as well), but no matter what we must eventually include that page in index cleanup. So once a page enters state 1 or 2, it may never move to state 0 or 3 without an index scan pass. OK, I agree that a background process could be useful. Whenever it sees a dirty page, it could attempt to aggressively set hint bits, prune, mark all-visible, and freeze the page before that page gets evicted. However, that doesn't require the sort of state map you're proposing here. I think your statement about pages that were in those states are still likely to be in the disk cache is not really true. I mean, if we're doing OLTP, yes. But not if we're bulk-loading. Right, but at least we'd know we had a table with a load of unhinted or newly dead tuples. That means there's cleanup work we can do without needing an index pass. Not needing to scan everything that's frozen is thanks to state 3. I think it's OK (at least for now) if only vacuum puts pages into this state, which means it can actually freeze the tuples when it does it (thanks to 37484ad we won't lose forensic data doing this). That means there's no extra work necessary by a foreground process that's dirtying a page. Did you notice the discussion on the other thread about this increasing WAL volume by a factor of 113? Yeah, though I'd forgotten about it. :( I wonder if there's some way we can reduce that. I just looked at what we WAL log for a freeze and it appears to only be xl_heap_freeze_tuple, which if my math is correct is 12 bytes (11 ignoring alignment). I don't understand how that can be 113 times worse than a plain vacuum. 
I can't really follow why you've got these states to begin with. 0, 1, and 2 are all pretty much the same. The useful distinction AFAICS is between not-all-visible, all-visible, and all-visible-plus-frozen. Index scanning is probably the most expensive part of cleanup, so it seems like it would be useful to be able to track that as visible/frozen. (What would probably be more useful is a way to directly link a heap tuple to any index tuples pointing at it, but that would certainly be a lot harder to do.) There's also the idea of being proactive about hinting and pruning, instead of foisting that onto later foreground processes or hoping that vacuum comes along. Certainly the most obvious part is doing that before buffers are evicted, but it's not uncommon for the OS cache to be 10x larger (or more). Even if we can't hit these pages before they're all the way on disk, if we at least know there's a pile of them we can do something before a foreground process (or at least let the
Re: [HACKERS] Feedback on getting rid of VACUUM FULL
On 4/24/15 2:04 PM, Alvaro Herrera wrote: Heikki Linnakangas wrote: Hmm, AFAICT pg_reorg is much more complex, writing stuff to a temp table and swapping relfilenodes afterwards. More like the VACUUM REWRITE that's been discussed. For the kicks, I looked at what it would take to write a utility like that. It turns out to be quite trivial, patch attached. It uses the same principle as VACUUM FULL, scans from the end, moving tuples to lower-numbered pages until it can't do it anymore. It requires a small change to heap_update(), to override the preference to store the new tuple on the same page as the old one, but other than that, it's all in the external module. More than five years have passed since Heikki posted this, and we still haven't found a solution to the problem -- which neverthless keeps biting people to the point that multiple user-space implementations of similar techniques are out there. I think what we need here is something that does heap_update to tuples at the end of the table, moving them to earlier pages; then wait for old snapshots to die (the infrastructure for which we have now, thanks to CREATE INDEX CONCURRENTLY); then truncate the empty pages. Of course, there are lots of details to resolve. It doesn't really matter that this runs for long: a process doing this for hours might be better than AccessExclusiveLock on the table for a much shorter period. Are there any takers? Honestly, I'd prefer we exposed some way to influence where a new tuple gets put, and perhaps better ways of accessing tuples on a specific page. That would make it a lot easier to handle this in userspace, but it would also make it easier to do things like concurrent clustering. Or just organizing a table however you wanted. That said, why not just pull what Heikki did into contrib, and add the necessary mode to heap_update? -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! 
http://BlueTreble.com
Re: [HACKERS] Replication identifiers, take 4
On April 24, 2015 10:26:23 PM GMT+02:00, Peter Eisentraut pete...@gmx.net wrote: On 4/24/15 8:32 AM, Andres Freund wrote: On 2015-04-20 11:26:29 +0300, Heikki Linnakangas wrote: On 04/17/2015 11:54 AM, Andres Freund wrote: I've attached a rebased patch, that adds decision about origin logging to the relevant XLogInsert() callsites for external 2 byte identifiers and removes the pad-reusing version in the interest of moving forward. Putting aside the 2 vs. 4 byte identifier issue, let's discuss naming: I just realized that it talks about replication identifier as the new fundamental concept. The system table is called pg_replication_identifier. But that's like talking about index identifiers, instead of just indexes, and calling the system table pg_index_oid. The important concept this patch actually adds is the *origin* of each transaction. That term is already used in some parts of the patch. I think we should roughly do a search-replace of replication identifier - replication origin to the patch. Or even transaction origin. Attached is a patch that does this, and some more, renaming. That was more work than I'd imagined. I've also made the internal naming in origin.c more consistent/simpler and did a bunch of other cleanup. I'm pretty happy with this state. Shouldn't this be backed up by pg_dump(all?)? Given it deals with LSNs and is, quite fundamentally due to concurrency, non-transactional, I doubt it's worth it. The other side's slots also aren't going to be backed up, as pg_dump obviously can't know about them. So the represented data won't make much sense. Andres --- Please excuse brevity and formatting - I am writing this on my mobile phone.
Re: [HACKERS] Ignoring some binaries generated in src/test
Tom Lane wrote: The current logic in src/test/Makefile, particularly the way that the modules subdirectory is handled, seems pretty ugly/convoluted anyway. I wonder why it was done that way rather than just ensuring that modules/ doesn't do anything for make install? Because we do want to have the Makefile in src/test/modules install the modules if make install is invoked there. That way, you can run make -C src/test/modules install installcheck, and it works. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services
Re: [HACKERS] Freeze avoidance of very large table.
On 4/24/15 6:52 AM, Robert Haas wrote:
> On Thu, Apr 23, 2015 at 9:03 PM, Michael Paquier michael.paqu...@gmail.com wrote:
>> On Thu, Apr 23, 2015 at 10:42 PM, Robert Haas wrote:
>>> On Thu, Apr 23, 2015 at 4:19 AM, Simon Riggs wrote:
>>>> We only need a freeze/backup map for larger relations. So if we map
>>>> 1000 blocks per map page, we skip having a map at all when size <
>>>> 1000 blocks.
>>>
>>> Agreed. We might also want to map multiple blocks per map slot -
>>> e.g. one slot per 32 blocks. That would keep the map quite small
>>> even for very large relations, and would not compromise efficiency
>>> that much, since reading 256kB sequentially probably takes only a
>>> little longer than reading 8kB. I think the idea of integrating the
>>> freeze map into the VM fork is also worth considering. Then, the
>>> incremental backup map could be optional; if you don't want
>>> incremental backup, you can shut it off and have less overhead.
>>
>> When I read that, I think about something configurable at relation
>> level. There are cases where you may want more granularity of this
>> information at the block level, by having the VM slots track fewer
>> than 32 blocks, and vice versa.
>
> What are those cases? To me that sounds like making things complicated
> to no obvious benefit.

Tables that get few/no dead tuples, like bulk-insert tables. You'll have large sections of blocks with the same visibility.

I suspect the added code to allow setting 1 bit for multiple pages without having to lock all those pages simultaneously will probably outweigh making this a reloption anyway.

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
Re: [HACKERS] Replication identifiers, take 4
On 4/24/15 8:32 AM, Andres Freund wrote:
> On 2015-04-20 11:26:29 +0300, Heikki Linnakangas wrote:
>> On 04/17/2015 11:54 AM, Andres Freund wrote:
>>> I've attached a rebased patch that adds the decision about origin
>>> logging to the relevant XLogInsert() callsites for external 2-byte
>>> identifiers, and removes the pad-reusing version in the interest of
>>> moving forward.
>>
>> Putting aside the 2 vs. 4 byte identifier issue, let's discuss naming:
>> I just realized that it talks about "replication identifier" as the
>> new fundamental concept. The system table is called
>> pg_replication_identifier. But that's like talking about "index
>> identifiers" instead of just "indexes", and calling the system table
>> pg_index_oid. The important concept this patch actually adds is the
>> *origin* of each transaction. That term is already used in some parts
>> of the patch. I think we should roughly do a search-replace of
>> "replication identifier" -> "replication origin" on the patch. Or
>> even "transaction origin".
>
> Attached is a patch that does this, and some more, renaming. That was
> more work than I'd imagined. I've also made the internal naming in
> origin.c more consistent/simpler and did a bunch of other cleanup.
> I'm pretty happy with this state.

Shouldn't this be backed up by pg_dump(all?)?
Re: [HACKERS] Reducing tuple overhead
On 4/23/15 10:40 PM, Amit Kapila wrote:
> I agree with you, and I think one of the major reasons for bloat is
> that the index segment doesn't have visibility information, due to
> which clearing of the index needs to be tied to the heap. Now if we
> can move transaction information to the page level, then we can even
> think of having it in the index segment as well, and then the index
> can delete/prune its tuples on its own, which can reduce the bloat in
> the index significantly; and there is a benefit to vacuum as well.

I don't see how putting visibility at the page level helps indexes at all. We could already put XMIN in indexes if we wanted, but it won't help, because...

> Now this has some downsides as well, like a delete needs to traverse
> the index segment as well to delete-mark the tuples, but I think the
> upsides of reducing bloat can certainly outweigh the downsides.

... that isn't possible. You can not go from a heap tuple to an index tuple. This has been discussed in the past. If we could do that, then vacuum would become REALLY cheap compared to today.

BTW, before actually tackling anything we should try to get more data from Robert/Jan about where the extra 80% came from. We don't know if it's indexes or what.

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
Re: [HACKERS] Feedback on getting rid of VACUUM FULL
Jim Nasby wrote:
> Honestly, I'd prefer we exposed some way to influence where a new
> tuple gets put, and perhaps better ways of accessing tuples on a
> specific page. That would make it a lot easier to handle this in
> userspace, but it would also make it easier to do things like
> concurrent clustering. Or just organizing a table however you wanted.

That's great and all, but it doesn't help people who have already, for whatever reason, run into severe bloat and cannot take long enough downtime to run VACUUM FULL.

> That said, why not just pull what Heikki did into contrib, and add the
> necessary mode to heap_update?

Sure, that's what I suggest. We just need to fix the bugs and (as Tom puts it) infelicities.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Ignoring some binaries generated in src/test
Alvaro Herrera alvhe...@2ndquadrant.com writes:
> Tom Lane wrote:
>> The current logic in src/test/Makefile, particularly the way that the
>> modules subdirectory is handled, seems pretty ugly/convoluted anyway.
>> I wonder why it was done that way rather than just ensuring that
>> modules/ doesn't do anything for "make install"?
>
> Because we do want the Makefile in src/test/modules to install the
> modules if "make install" is invoked there. That way, you can run
> "make -C src/test/modules install installcheck", and it works.

OK. I still wonder if there isn't a better way to get that effect, but I left it alone for now. I committed Michael's new .gitignore files and fixed the Makefiles so that "make clean" and friends clean up properly.

regards, tom lane
Re: [HACKERS] Feedback on getting rid of VACUUM FULL
On 4/24/15 3:34 PM, Alvaro Herrera wrote:
> Jim Nasby wrote:
>> Honestly, I'd prefer we exposed some way to influence where a new
>> tuple gets put, and perhaps better ways of accessing tuples on a
>> specific page. That would make it a lot easier to handle this in
>> userspace, but it would also make it easier to do things like
>> concurrent clustering. Or just organizing a table however you wanted.
>
> That's great and all, but it doesn't help people who have already, for
> whatever reason, run into severe bloat and cannot take long enough
> downtime to run VACUUM FULL.
>
>> That said, why not just pull what Heikki did into contrib, and add
>> the necessary mode to heap_update?
>
> Sure, that's what I suggest. We just need to fix the bugs and (as Tom
> puts it) infelicities.

It looks like the biggest complaint (aside from allowing only a limited number of tuples to be moved) is in [1] and [2], where Tom says that you can't simply call heap_update() like this without holding an exclusive lock on the table. Is that because we're not actually changing the tuple?

Another issue is both HOT and KeyUpdate; I think we need to completely ignore/override that stuff for this.

Instead of adding forcefsm, I think it would be more useful to accept a target block number. That way we can actually control where the new tuple goes. For this particular case we'd presumably go with the normal FSM page-selection logic, but someone could choose to do something more sophisticated if they wanted.

[1] http://postgresql.org/message-id/3409.1253147...@sss.pgh.pa.us
[2] http://postgresql.org/message-id/3631.1253149...@sss.pgh.pa.us

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com