Re: [HACKERS] Moving ExecInsertIndexTuples and friends to new file

2015-04-24 Thread Heikki Linnakangas

On 04/24/2015 06:30 AM, Stephen Frost wrote:

* Peter Geoghegan (p...@heroku.com) wrote:

On Thu, Apr 23, 2015 at 12:05 PM, Heikki Linnakangas hlinn...@iki.fi wrote:

While looking at Peter's INSERT ... ON CONFLICT patch, I started to feel
that ExecInsertIndexTuples() and friends would deserve a file of their own,
and not be buried in the middle of execUtils.c. I propose that we split
execUtils.c into two, moving ExecOpenIndices(), ExecCloseIndices()
ExecInsertIndexTuples() and related functions into a new file called
executor/execIndexing.c.


That split makes a lot of sense to me.


No objections here.


Ok, moved.

- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] adding more information about process(es) cpu and memory usage

2015-04-24 Thread Robert Haas
On Thu, Apr 23, 2015 at 9:28 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 The reason nobody's gotten around to that in the last fifteen years is
 that per-process rusage isn't actually all that interesting; there's
 too much that happens in background daemons, for instance.

There's *some* stuff that happens in background daemons, but if you
want to measure the user and system time consumed by a particular query,
this would actually be a pretty handy way to do that, I think.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] [committers] pgsql: RLS fixes, new hooks, and new test module

2015-04-24 Thread Dean Rasheed
On 24 April 2015 at 04:31, Stephen Frost sfr...@snowman.net wrote:
 Christian,

 * Christian Ullrich (ch...@chrullrich.net) wrote:
 * Stephen Frost wrote:

 RLS fixes, new hooks, and new test module

 The buildfarm says that with -DCLOBBER_CACHE_ALWAYS, the RLS
 violations get blamed on the wrong tables. Mostly, they are catalogs
 (I have seen pg_opclass, pg_am, and pg_amproc), but some also come
 up with binary garbage instead, e.g.

 - 
 http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=markhor&dt=2015-04-23%2000%3A00%3A12
 - 
 http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=friarbird&dt=2015-04-23%2004%3A20%3A00
 - 
 http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=tick&dt=2015-04-22%2019%3A56%3A53

 Yup, thanks, Robert pointed this out on another thread and I'm looking
 into it.


Ah, yes it looks like a couple of places in
get_row_security_policies() ought to be making a copy of the relation
name when setting it on the WCO.

Regards,
Dean




Re: [HACKERS] Improving vacuum/VM/etc

2015-04-24 Thread Robert Haas
On Thu, Apr 23, 2015 at 3:09 PM, Jim Nasby jim.na...@bluetreble.com wrote:
 Unfortunately, the states I came up with using existing semantics don't look
 hugely useful[4], but if we take Robert's idea and make all-visible mean
 all-frozen, we can do much better:

 0: Newly inserted tuples
 Tracking this state allows us to aggressively set hint bits.

Who is "us"?  And what do you mean by "aggressively"?  As things
stand, any process that has to touch a tuple always sets any
applicable hint bits.

 1: Newly deleted
 There are tuples that have been deleted but not pruned. There may also be
 newly inserted tuples that need hinting (state 0).

 Similar to state 0, we'd want to be fairly aggressive with these pages,
 because as soon as the deleting XID is committed and older than all
 snapshots we can prune. Because we can prune without hitting indexes, this
 is still a fairly cheap operation, though not as cheap as 0.

What behavior difference would you foresee between state 0 and state 1?

 2: Fully hinted, not frozen
 This is the really painful state to clean up, because we have to deal with
 indexes. We must enter this state after being in 1.

Neither the fact that a page is fully hinted nor the fact that it is
or is not frozen implies anything about dealing with indexes.  We need
to deal with indexes because the page contains either dead tuples (as
a result of an aborted insert, a committed delete, or an aborted or
committed update) or dead line pointers (as a result of pruning dead
tuples).

 3: All-visible-frozen
 Every tuple on the page is visible and frozen. Pages in this state need no
 maintenance at all. We might be able to enter this state directly from state
 0.


 BENEFITS
 This tracking should help at least 3 problems: the need to set hint bits
 after insert, SELECT queries doing pruning (Simon's recent complaint), and
 needing to scan an entire table for freezing.

 The improvement in hinting and pruning is based on the idea that normally
 there would not be a lot of pages in state 0 or 1, and pages that were in
 those states are very likely to still be in disk cache (if not shared
 buffers). That means we can have a background process (or 2) that is very
 aggressive at targeting pages in these states.

OK, I agree that a background process could be useful.  Whenever it
sees a dirty page, it could attempt to aggressively set hint bits,
prune, mark all-visible, and freeze the page before that page gets
evicted.  However, that doesn't require the sort of state map you're
proposing here.

I think your statement that pages that were in those states are
still likely to be in the disk cache is not really true.  I mean, if
we're doing OLTP, yes.  But not if we're bulk-loading.

 Not needing to scan everything that's frozen is thanks to state 3. I think
 it's OK (at least for now) if only vacuum puts pages into this state, which
 means it can actually freeze the tuples when it does it (thanks to 37484ad
 we won't lose forensic data doing this). That means there's no extra work
 necessary by a foreground process that's dirtying a page.

Did you notice the discussion on the other thread about this
increasing WAL volume by a factor of 113?

 Because of 37484ad, I think as part of this we should also deprecate
 vacuum_freeze_min_age, or at least change its behavior. AFAIK the only
 objection to aggressive freezing was loss of forensic data, and that's gone
 now. So vacuum (and presumably the bg process(es) that handle state 0 and 1)
 should freeze tuples if it would allow the whole page to be frozen. Possibly
 it should just do it any time it's dirtying the page. (We could actually do
 this right now; it would let us eliminate the GUC, but I'm not sure there'd
 be other benefit without the rest of this.)

Reducing vacuum_freeze_min_age certainly seems worth considering.  I
don't know how to judge whether it's a good idea, though.  You're
balancing less I/O later against a lot more WAL right now.

 DOWNSIDES
 This does mean doubling the size of the VM. It would still be 32,000 times
 smaller than the heap with 8k pages (and 128,000 times smaller with the
 common warehouse 32k page size), so I suspect this is a non-issue, but it's
 worth mentioning. It might have some effect on an almost entirely read-only
 system; but I suspect in most other cases the other benefits will outweigh
 this.

I don't think that's a problem.

 This approach still does nothing to help the index related activity in
 vacuum. My gut says state 2 should be further split; but I'm not sure why.
 Perhaps if we had another state we could do something more intelligent with
 index cleanup...

I can't really follow why you've got these states to begin with.  0,
1, and 2 are all pretty much the same.  The useful distinction AFAICS
is between not-all-visible, all-visible, and all-visible-plus-frozen.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



[HACKERS] Bug in planner

2015-04-24 Thread Teodor Sigaev

Hi!

I ran into a planner error:
ERROR:  could not find RelOptInfo for given relids

A test case is attached; all supported versions of pgsql and HEAD are
affected.


I suppose there is something wrong in pull_up_sublinks_qual_recurse() near
the code converting NOT EXISTS into a join, because preventing such a
conversion makes the query work. But I'm not sure.




--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/
DROP TABLE IF EXISTS t1;
DROP TABLE IF EXISTS t2;
DROP TABLE IF EXISTS t3;
DROP TABLE IF EXISTS t4;
DROP TABLE IF EXISTS t5;

CREATE TABLE t1 (
c1 int
);

CREATE TABLE t2 (
c1 int,
c2 int,
c3 int
);

CREATE TABLE t3 (
c1 int,
c2 int
);

CREATE TABLE t4 (
c1 int,
c2 int
);

CREATE TABLE t5 (
c1 int
);

SELECT
*
FROM
t1
WHERE
NOT (
EXISTS (
SELECT
1
FROM (
SELECT
t2.c2 AS c1
FROM
t2 
LEFT OUTER JOIN t3 ON (t2.c3 = t3.c1 )
LEFT OUTER JOIN (
SELECT
t5.c1 AS c1
FROM
t4
LEFT OUTER JOIN t5 ON ( t4.c2 = t5.c1 )
)
a1 ON ( a1.c1 = t3.c2  )
) a2
WHERE
t1.c1 = a2.c1 
)
);



Re: [HACKERS] Freeze avoidance of very large table.

2015-04-24 Thread Robert Haas
On Thu, Apr 23, 2015 at 9:03 PM, Michael Paquier
michael.paqu...@gmail.com wrote:
 On Thu, Apr 23, 2015 at 10:42 PM, Robert Haas wrote:
 On Thu, Apr 23, 2015 at 4:19 AM, Simon Riggs  wrote:
 We only need a freeze/backup map for larger relations. So if we map 1000
 blocks per map page, we skip having a map at all when size < 1000.

 Agreed.  We might also want to map multiple blocks per map slot - e.g.
 one slot per 32 blocks.  That would keep the map quite small even for
 very large relations, and would not compromise efficiency that much
 since reading 256kB sequentially probably takes only a little longer
 than reading 8kB.

 I think the idea of integrating the freeze map into the VM fork is
 also worth considering.  Then, the incremental backup map could be
 optional; if you don't want incremental backup, you can shut it off
 and have less overhead.

 When I read that, I think about something configurable at
 relation level.  There are cases where you may want more
 granularity of this information at the block level by having the VM slots
 track fewer than 32 blocks, and vice versa.

What are those cases?  To me that sounds like making things
complicated to no obvious benefit.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] PL/pgSQL, RAISE and error context

2015-04-24 Thread Pavel Stehule
2015-04-24 12:11 GMT+02:00 Joel Jacobson j...@trustly.com:

 Entering the discussion because this is a huge pain for me in my daily
 work as well.

 This is not a reply to any specific post in this thread, but my first
 message in the thread.

 I see a great value in providing both a GUC and a new RAISE syntax.
 The different benefits of the two are maybe obvious, but perhaps worth
 pointing out:
 GUC: Good because you don't have to change any existing code.
 RAISE syntax: Good because you can control exactly what message should
 be emitted or not be emitted at that line of code.

 I think preserving backwards compatibility is very important.
 Not changing the default is not a problem for me, as long as it can be
 overridden.

 Whatever the default behaviour is, I think the need expressed by all
 users in this thread boils down to any of these two sentences:

 I want CONTEXT to be (DISPLAYED|SUPPRESSED) for (ALL|ONLY THIS LINE)
 RAISE (NOTICE|WARNING|ERROR)
 OR
 I don't want to change the default current behaviour of CONTEXT

 So we basically need a boolean setting value, where:
 NULL means the default behaviour
 TRUE means DISPLAY CONTEXT
 FALSE means SUPPRESS CONTEXT

 And the (ALL|ONLY THIS) part translates into using,
 * a GUC to change behaviour for ALL lines of code,
 * or using the RAISE syntax to change the behaviour of ONLY THIS line of
 code.

 And then we have the different message levels, for which CONTEXT is
 sometimes desirable in some situations:
 * The RAISE syntax allows controlling any message level in a natural
 way, as the message level is part of the syntax.
 * Allowing the same control using GUC would mean the message level
 would need to be part of the GUC key name, which means either add
 multiple GUCs, one for each message level, or only allow controlling
 the most important one and ignore the possibly need to control the
 other message levels.

 If it would be possible to somehow combine multiple message levels in
 the same GUC, that would solve the latter problem.

 We already have comma separated values for many GUCs, so maybe we
 could use that approach here as well.

 It looks like adding these two GUCs would meet the demands of all users:




 suppress_context_messages (enum)
 display_context_messages (enum)



This proposal looks very practical - it can be a very good starting point -
and it doesn't block any further discussion about enhancing the RAISE
statement, which I would like to have too (but that can be a separate
issue). I like it.

Regards

Pavel




 This would allow doing something crazy as:

 suppress_context_messages = warning,error
 display_context_messages = notice



Re: [HACKERS] [committers] pgsql: RLS fixes, new hooks, and new test module

2015-04-24 Thread Stephen Frost
Dean,

* Dean Rasheed (dean.a.rash...@gmail.com) wrote:
 On 24 April 2015 at 04:31, Stephen Frost sfr...@snowman.net wrote:
  * Christian Ullrich (ch...@chrullrich.net) wrote:
  * Stephen Frost wrote:
  RLS fixes, new hooks, and new test module
 
  The buildfarm says that with -DCLOBBER_CACHE_ALWAYS, the RLS
  violations get blamed on the wrong tables. Mostly, they are catalogs
  (I have seen pg_opclass, pg_am, and pg_amproc), but some also come
  up with binary garbage instead, e.g.
 
  - 
  http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=markhor&dt=2015-04-23%2000%3A00%3A12
  - 
  http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=friarbird&dt=2015-04-23%2004%3A20%3A00
  - 
  http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=tick&dt=2015-04-22%2019%3A56%3A53
 
  Yup, thanks, Robert pointed this out on another thread and I'm looking
  into it.
 
 
 Ah, yes it looks like a couple of places in
 get_row_security_policies() ought to be making a copy of the relation
 name when setting it on the WCO.

Yup, that was the same conclusion that I came to last night, just hadn't
finished testing before needing sleep. :)  Will push a fix shortly for
it.

Thanks!

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] PL/pgSQL, RAISE and error context

2015-04-24 Thread Pavel Stehule
2015-04-24 13:16 GMT+02:00 Robert Haas robertmh...@gmail.com:

 On Fri, Apr 24, 2015 at 6:11 AM, Joel Jacobson j...@trustly.com wrote:
  Entering the discussion because this is a huge pain for me in my daily
  work as well.
 
  This is not a reply to any specific post in this thread, but my first
  message in the thread.
 
  I see a great value in providing both a GUC and a new RAISE syntax.
  The different benefits of the two are maybe obvious, but perhaps worth
  pointing out:
  GUC: Good because you don't have to change any existing code.
  RAISE syntax: Good because you can control exactly what message should
  be emitted or not be emitted at that line of code.
 
  I think preserving backwards compatibility is very important.
  Not changing the default is not a problem for me, as long as it can be
  overridden.
 
  Whatever the default behaviour is, I think the need expressed by all
  users in this thread boils down to any of these two sentences:
 
  I want CONTEXT to be (DISPLAYED|SUPPRESSED) for (ALL|ONLY THIS LINE)
  RAISE (NOTICE|WARNING|ERROR)
  OR
  I don't want to change the default current behaviour of CONTEXT
 
  So we basically need a boolean setting value, where:
  NULL means the default behaviour
  TRUE means DISPLAY CONTEXT
  FALSE means SUPPRESS CONTEXT
 
  And the (ALL|ONLY THIS) part translates into using,
  * a GUC to change behaviour for ALL lines of code,
  * or using the RAISE syntax to change the behaviour of ONLY THIS line of
 code.
 
  And then we have the different message levels, for which CONTEXT is
  sometimes desirable in some situations:
  * The RAISE syntax allows controlling any message level in a natural
  way, as the message level is part of the syntax.
  * Allowing the same control using GUC would mean the message level
  would need to be part of the GUC key name, which means either add
  multiple GUCs, one for each message level, or only allow controlling
  the most important one and ignore the possibly need to control the
  other message levels.
 
  If it would be possible to somehow combine multiple message levels in
  the same GUC, that would solve the latter problem.
 
  We already have comma separated values for many GUCs, so maybe we
  could use that approach here as well.
 
  It looks like adding these two GUCs would meet the demands of all users:
 
  suppress_context_messages (enum)
  display_context_messages (enum)
 
  This would allow doing something crazy as:
 
  suppress_context_messages = warning,error
  display_context_messages = notice

 This is a very flexible proposal, but it's a tremendous amount of
 machinery for what's really a very minor issue.  If we added two GUCs
 for every comparably important issue, we'd have about 40,000 of them.


It can be a PL/pgSQL-only GUC. That is probably the most problematic case.


 I suggest we add the RAISE syntax first, because everybody agrees on
 that.  Then, we can argue about the other stuff.


There is agreement about it - but I expect a harder discussion about
what the default will be, and the discussion about the syntax will not be
simple either.





 --
 Robert Haas
 EnterpriseDB: http://www.enterprisedb.com
 The Enterprise PostgreSQL Company



Re: [HACKERS] PL/pgSQL, RAISE and error context

2015-04-24 Thread Joel Jacobson
Entering the discussion because this is a huge pain for me in my daily
work as well.

This is not a reply to any specific post in this thread, but my first
message in the thread.

I see a great value in providing both a GUC and a new RAISE syntax.
The different benefits of the two are maybe obvious, but perhaps worth
pointing out:
GUC: Good because you don't have to change any existing code.
RAISE syntax: Good because you can control exactly what message should
be emitted or not be emitted at that line of code.

I think preserving backwards compatibility is very important.
Not changing the default is not a problem for me, as long as it can be
overridden.

Whatever the default behaviour is, I think the need expressed by all
users in this thread boils down to any of these two sentences:

I want CONTEXT to be (DISPLAYED|SUPPRESSED) for (ALL|ONLY THIS LINE)
RAISE (NOTICE|WARNING|ERROR)
OR
I don't want to change the default current behaviour of CONTEXT

So we basically need a boolean setting value, where:
NULL means the default behaviour
TRUE means DISPLAY CONTEXT
FALSE means SUPPRESS CONTEXT

And the (ALL|ONLY THIS) part translates into using,
* a GUC to change behaviour for ALL lines of code,
* or using the RAISE syntax to change the behaviour of ONLY THIS line of code.

And then we have the different message levels, for which CONTEXT is
sometimes desirable in some situations:
* The RAISE syntax allows controlling any message level in a natural
way, as the message level is part of the syntax.
* Allowing the same control using GUC would mean the message level
would need to be part of the GUC key name, which means either add
multiple GUCs, one for each message level, or only allow controlling
the most important one and ignore the possible need to control the
other message levels.

If it would be possible to somehow combine multiple message levels in
the same GUC, that would solve the latter problem.

We already have comma separated values for many GUCs, so maybe we
could use that approach here as well.

It looks like adding these two GUCs would meet the demands of all users:

suppress_context_messages (enum)
display_context_messages (enum)

This would allow doing something as crazy as:

suppress_context_messages = warning,error
display_context_messages = notice




Re: [HACKERS] PL/pgSQL, RAISE and error context

2015-04-24 Thread Robert Haas
On Fri, Apr 24, 2015 at 6:11 AM, Joel Jacobson j...@trustly.com wrote:
 Entering the discussion because this is a huge pain for me in my daily
 work as well.

 This is not a reply to any specific post in this thread, but my first
 message in the thread.

 I see a great value in providing both a GUC and a new RAISE syntax.
 The different benefits of the two are maybe obvious, but perhaps worth
 pointing out:
 GUC: Good because you don't have to change any existing code.
 RAISE syntax: Good because you can control exactly what message should
 be emitted or not be emitted at that line of code.

 I think preserving backwards compatibility is very important.
 Not changing the default is not a problem for me, as long as it can be
 overridden.

 Whatever the default behaviour is, I think the need expressed by all
 users in this thread boils down to any of these two sentences:

 I want CONTEXT to be (DISPLAYED|SUPPRESSED) for (ALL|ONLY THIS LINE)
 RAISE (NOTICE|WARNING|ERROR)
 OR
 I don't want to change the default current behaviour of CONTEXT

 So we basically need a boolean setting value, where:
 NULL means the default behaviour
 TRUE means DISPLAY CONTEXT
 FALSE means SUPPRESS CONTEXT

 And the (ALL|ONLY THIS) part translates into using,
 * a GUC to change behaviour for ALL lines of code,
 * or using the RAISE syntax to change the behaviour of ONLY THIS line of code.

 And then we have the different message levels, for which CONTEXT is
 sometimes desirable in some situations:
 * The RAISE syntax allows controlling any message level in a natural
 way, as the message level is part of the syntax.
 * Allowing the same control using GUC would mean the message level
 would need to be part of the GUC key name, which means either add
 multiple GUCs, one for each message level, or only allow controlling
 the most important one and ignore the possibly need to control the
 other message levels.

 If it would be possible to somehow combine multiple message levels in
 the same GUC, that would solve the latter problem.

 We already have comma separated values for many GUCs, so maybe we
 could use that approach here as well.

 It looks like adding these two GUCs would meet the demands of all users:

 suppress_context_messages (enum)
 display_context_messages (enum)

 This would allow doing something crazy as:

 suppress_context_messages = warning,error
 display_context_messages = notice

This is a very flexible proposal, but it's a tremendous amount of
machinery for what's really a very minor issue.  If we added two GUCs
for every comparably important issue, we'd have about 40,000 of them.

I suggest we add the RAISE syntax first, because everybody agrees on
that.  Then, we can argue about the other stuff.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] tablespaces inside $PGDATA considered harmful

2015-04-24 Thread Robert Haas
On Thu, Apr 23, 2015 at 4:30 PM, Andres Freund and...@anarazel.de wrote:
 On 2015-04-23 16:26:09 -0400, Robert Haas wrote:
 But pg_upgrade automates all that, so you can't use pg_upgrade in that
 case.  If we add a GUC as I suggested, you can still use pg_upgrade.

 But we also have to live with data directories being in a shit state
 forever onward. We won't really be able to remove the option
 realistically.

 It's not that hard to just move the tablespace out of the data directory
 while the server is shut down. As long as you move it on the same partition, it's
 even fast.

OK, fair point.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Custom/Foreign-Join-APIs (Re: [HACKERS] [v9.5] Custom Plan API)

2015-04-24 Thread Shigeru HANADA
Hi Ashutosh,

Thanks for the review.

On 2015/04/22 19:28, Ashutosh Bapat ashutosh.ba...@enterprisedb.com wrote:
 Tests
 ---
 1. The postgres_fdw test is re/setting enable_mergejoin at various places. The 
 goal of these tests seems to be to test the sanity of the foreign plans 
 generated. So, it might be better to reset enable_mergejoin (and maybe all 
 of enable_hashjoin, enable_nestloop, etc.) to false at the beginning of 
 the testcase and set them again at the end. That way, we will also make sure 
 that foreign plans are chosen irrespective of future planner changes.

I have a different, rather opposite opinion about it.  I disabled other join 
types only as far as needed for the tests to pass, because I worry about 
oversights coming from planner changes.  I hope to eliminate enable_foo from 
the test script by making the costing model smarter.

 2. In the patch, I see that the inheritance testcases have been deleted from 
 postgres_fdw.sql, is that intentional? I do not see those being replaced 
 anywhere else.

It was an accidental removal; I restored the tests for the inheritance feature.

 3. We need one test for each join type (or at least for INNER and LEFT OUTER) 
 where there are unsafe to push conditions in ON clause along-with 
 safe-to-push conditions. For INNER join, the join should get pushed down with 
 the safe conditions and for OUTER join it shouldn't be. Same goes for WHERE 
 clause, in which case the join will be pushed down but the unsafe-to-push 
 conditions will be applied locally.

Currently INNER JOINs with unsafe join conditions are not pushed down, so such 
a test is not in the suite.  As you say, in theory INNER JOINs can be pushed 
down even when they have push-down-unsafe join conditions, because such 
conditions can be evaluated on the local side against rows retrieved without 
those conditions.

 4. All the tests have ORDER BY, LIMIT in them, so the setref code is being 
 exercised. But, something like aggregates would test the setref code better. 
 So, we should add at least one test like select avg(ft1.c1 + ft2.c2) from ft1 
 join ft2 on (ft1.c1 = ft2.c1).

Added an aggregate case, and also added an UNION case for Append.

 5. It will be good to add some tests which contain joins between a few foreign 
 and a few local tables to see whether we are able to push down the largest 
 possible foreign join tree to the foreign server. 
 





 Code
 ---
 In classifyConditions(), the code is now appending RestrictInfo::clause 
 rather than RestrictInfo itself. But the callers of classifyConditions() have 
 not changed. Is this change intentional?

Yes, the purpose of the change is to let appendConditions (formerly named 
appendWhereClause) handle JOIN ON clauses, which are lists of Expr.

 The functions which consume the lists produced by this function handle 
 expressions as well as RestrictInfo, so you may not have noticed it. Because of 
 this change, we might be missing some optimizations e.g. in function 
 postgresGetForeignPlan()
  793 if (list_member_ptr(fpinfo->remote_conds, rinfo))
  794 remote_conds = lappend(remote_conds, rinfo->clause);
  795 else if (list_member_ptr(fpinfo->local_conds, rinfo))
  796 local_exprs = lappend(local_exprs, rinfo->clause);
  797 else if (is_foreign_expr(root, baserel, rinfo->clause))
  798 remote_conds = lappend(remote_conds, rinfo->clause);
  799 else
  800 local_exprs = lappend(local_exprs, rinfo->clause);
 Finding a RestrictInfo in remote_conds avoids another call to 
 is_foreign_expr(). So with this change, I think we are doing an extra call to 
 is_foreign_expr().
 

Hm, it seems better to revert my change and make appendConditions downcast 
the given node into RestrictInfo or Expr according to its node tag.

 The function get_jointype_name() returns an empty string for unsupported join 
 types. Instead of that it should throw an error, if some code path 
 accidentally calls the function with unsupported join type e.g. SEMI_JOIN.

Agreed, fixed.

 While deparsing the SQL with rowmarks, the placement of FOR UPDATE/SHARE 
 clause in the original query is not being honored, which means that we will 
 end up locking the rows which are not part of the join result even when the 
 join is pushed to the foreign server. E.g., take the following query (it uses 
 the tables created in postgres_fdw.sql tests)
 contrib_regression=# explain verbose select * from ft1 join ft2 on (ft1.c1 = 
 ft2.c1) for update of ft1;
  QUERY PLAN

Re: [HACKERS] Fwd: [GENERAL] 4B row limit for CLOB tables

2015-04-24 Thread Robert Haas
On Thu, Apr 23, 2015 at 11:24 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Having said all that, if we did try to fix it today, I'd imagine changing
 TOAST value identifiers to int64 and inventing a new TOAST pointer format
 for use when 32 bits isn't wide enough for the ID.  But I think we're best
 advised to hold off doing that until the need becomes pressing.

Just out of curiosity, has anyone thought about inventing a new TOAST
pointer format on the grounds that our TOAST pointers are unreasonably
large? IIUC, a TOAST pointer right now is 18 bytes: 16 for a
varatt_external, and then that gets embedded in a varattrib_1b_e with
a va_header byte and a va_tag byte.  Eliminating one or both of
va_rawsize and va_extsize from the TOAST pointer itself seems like it
could save quite a bit of space on disk.  Maybe you could even find a
way to get rid of va_toastrelid; after all, at the point when you
first acquire a pointer to the tuple, you surely know what relation
it's a part of.  You'd probably want to force de-TOASTing (or
converting to a more expressive form of TOAST pointer, anyway) when
you extracted the column from the tuple, which might be hard to
arrange.

But the benefits could be pretty significant.  Suppose you have a
table where each tuple is 4K untoasted, with all but 100 bytes of that
in a single column. So, as stored, you've got 100 bytes of regular
stuff plus an 18-byte TOAST header.  If you could trim 2 of the
above-mentioned 4-byte fields out of the TOAST header, that would
reduce the size of the main relation fork by almost 7%.  If you could
trim all 3 of them out, you'd save more than 10%.  That's not nothing,
and the benefits could be even larger for rows that contain multiple
TOAST pointers.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Custom/Foreign-Join-APIs (Re: [HACKERS] [v9.5] Custom Plan API)

2015-04-24 Thread Robert Haas
On Wed, Apr 22, 2015 at 10:48 PM, Kouhei Kaigai kai...@ak.jp.nec.com wrote:
 +   else if (scan->scanrelid == 0 &&
 +(IsA(scan, ForeignScan) || IsA(scan, CustomScan)))
 +   varno = INDEX_VAR;

 Suppose scan->scanrelid == 0 but the scan type is something else?  Is
 that legal?  Is varno == 0 the correct outcome in that case?

 Right now, no other scan type has the capability to return tuples
 with a type/attributes more flexible than the static definition.
 I think it is a valid restriction that only foreign/custom-scans
 can have scanrelid == 0.

But the code as you've written it doesn't enforce any such
restriction.  It just spends CPU cycles testing for a condition which,
to the best of your knowledge, will never happen.

If it's really a can't happen condition, how about checking it via an Assert()?

else if (scan->scanrelid == 0)
{
Assert(IsA(scan, ForeignScan) || IsA(scan, CustomScan));
varno = INDEX_VAR;
}

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Replication identifiers, take 4

2015-04-24 Thread Petr Jelinek

On 24/04/15 14:32, Andres Freund wrote:

On 2015-04-20 11:26:29 +0300, Heikki Linnakangas wrote:

On 04/17/2015 11:54 AM, Andres Freund wrote:

I've attached a rebased patch, that adds decision about origin logging
to the relevant XLogInsert() callsites for external 2 byte identifiers
and removes the pad-reusing version in the interest of moving forward.


Putting aside the 2 vs. 4 byte identifier issue, let's discuss naming:

I just realized that it talks about replication identifier as the new
fundamental concept. The system table is called pg_replication_identifier.
But that's like talking about index identifiers, instead of just indexes,
and calling the system table pg_index_oid.

The important concept this patch actually adds is the *origin* of each
transaction. That term is already used in some parts of the patch. I think
we should roughly do a search-replace of replication identifier ->
replication origin to the patch. Or even transaction origin.


Attached is a patch that does this, and some more, renaming. That was
more work than I'd imagined.  I've also made the internal naming in
origin.c more consistent/simpler and did a bunch of other cleanup.



There are few oversights in the renaming:

doc/src/sgml/func.sgml:
+Return the replay position for the passed in replication
+identifier. The parameter <parameter>flush</parameter>

src/include/replication/origin.h:
+ * replication_identifier.h

+extern PGDLLIMPORT RepOriginId replident_sesssion_origin;
+extern PGDLLIMPORT XLogRecPtr replident_sesssion_origin_lsn;
+extern PGDLLIMPORT TimestampTz replident_sesssion_origin_timestamp;

(these are used then in multiple places in code afterwards and also 
mentioned in comment above replorigin_advance)


src/backend/replication/logical/origin.c:
+   ereport(ERROR,
+   (errcode(ERRCODE_OBJECT_IN_USE),
+errmsg(replication identiefer

+   default:
+   elog(PANIC, replident_redo: unknown op code

+ * This function may only be called if a origin was setup with
+ * replident_session_setup().

I also think the replident_checkpoint file should be renamed to 
replorigin_checkpoint.



--
 Petr Jelinek  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Fwd: [GENERAL] 4B row limit for CLOB tables

2015-04-24 Thread Stephen Frost
* Robert Haas (robertmh...@gmail.com) wrote:
 On Thu, Apr 23, 2015 at 11:24 PM, Tom Lane t...@sss.pgh.pa.us wrote:
  Having said all that, if we did try to fix it today, I'd imagine changing
  TOAST value identifiers to int64 and inventing a new TOAST pointer format
  for use when 32 bits isn't wide enough for the ID.  But I think we're best
  advised to hold off doing that until the need becomes pressing.
 
 Just out of curiosity, has anyone thought about inventing a new TOAST
 pointer format on the grounds that our TOAST pointers are unreasonably
 large?

I'd not thought about it, but sure sounds like a good idea from here.

Would be particularly great if we were able to do this and increase the
number of supported toast pointers and avoid having to go hunting for
unused identifiers due to wrapping.

Thanks!

Stephen




Re: [HACKERS] PL/pgSQL, RAISE and error context

2015-04-24 Thread Joel Jacobson
On Fri, Apr 24, 2015 at 1:16 PM, Robert Haas robertmh...@gmail.com wrote:
 This would allow doing something crazy as:

 suppress_context_messages = warning,error
 display_context_messages = notice

 This is a very flexible proposal, but it's a tremendous amount of
 machinery for what's really a very minor issue.  If we added two GUCs
 for every comparably important issue, we'd have about 40,000 of them.

I agree. The one-dimensional GUC syntax is not well suited for
multi-dimensional config settings. And that's a good thing mostly I
think. It would be a nightmare if the config file values could be in JSON
format; it's good that they are simple.

But I'm thinking maybe we could improve the config file syntax for the
general case when you have multiple things you want to control, in
this case the message levels, and for each such thing, you want to
turn something on/off, in this case the CONTEXT. Maybe we could simply
use plus + and minus - to mean on and off?

Example:

context_messages = -warning, -error, +notice
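A minimal sketch of how such a sign-prefixed setting could be parsed (the function name and semantics are hypothetical, not an actual GUC implementation):

```python
def parse_signed_list(value):
    """Parse a setting like '-warning, -error, +notice' into a dict
    mapping each item to True ('+', on) or False ('-', off)."""
    result = {}
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        if item[0] not in "+-":
            raise ValueError(f"item '{item}' must start with + or -")
        result[item[1:]] = (item[0] == "+")
    return result

# CONTEXT would be shown for NOTICE but suppressed for WARNING/ERROR:
settings = parse_signed_list("-warning, -error, +notice")
print(settings)  # {'warning': False, 'error': False, 'notice': True}
```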




Re: Custom/Foreign-Join-APIs (Re: [HACKERS] [v9.5] Custom Plan API)

2015-04-24 Thread Kouhei Kaigai
 On Wed, Apr 22, 2015 at 10:48 PM, Kouhei Kaigai kai...@ak.jp.nec.com wrote:
  +   else if (scan->scanrelid == 0 &&
  +            (IsA(scan, ForeignScan) || IsA(scan, CustomScan)))
  +       varno = INDEX_VAR;
 
  Suppose scan->scanrelid == 0 but the scan type is something else?  Is
  that legal?  Is varno == 0 the correct outcome in that case?
 
  Right now, no other scan type has the capability to return tuples
  with a flexible type/attributes beyond the static definition.
  I think it is a valid restriction that only a foreign/custom-scan
  can have scanrelid == 0.
 
 But the code as you've written it doesn't enforce any such
 restriction.  It just spends CPU cycles testing for a condition which,
 to the best of your knowledge, will never happen.
 
 If it's really a can't happen condition, how about checking it via an 
 Assert()?
 
 else if (scan->scanrelid == 0)
 {
 Assert(IsA(scan, ForeignScan) || IsA(scan, CustomScan));
 varno = INDEX_VAR;
 }

Thanks for your suggestion. I'd like to use this idea on the next patch.

--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei kai...@ak.jp.nec.com



Re: [HACKERS] Parallel Seq Scan

2015-04-24 Thread Amit Kapila
On Thu, Apr 23, 2015 at 2:26 AM, Robert Haas robertmh...@gmail.com wrote:

 On Wed, Apr 22, 2015 at 8:48 AM, Amit Kapila amit.kapil...@gmail.com
wrote:
  I have implemented this idea (note that I have to expose a new API
  shm_mq_from_handle as TupleQueueFunnel stores shm_mq_handle* and
  we need the shm_mq* to call shm_mq_detach) and apart from this I have fixed other
  problems reported on this thread:
 
  1. Execution of initPlan by master backend and then pass the
  required PARAM_EXEC parameter values to workers.
  2. Avoid consuming dsm's by freeing the parallel context after
  the last tuple is fetched.
  3. Allow execution of Result node in worker backend as that can
  be added as a gating filter on top of PartialSeqScan.
  4. Merged parallel heap scan descriptor patch
 
  To apply the patch, please follow below sequence:
 
  HEAD Commit-Id: 4d930eee
  parallel-mode-v9.patch [1]
  assess-parallel-safety-v4.patch [2]  (don't forget to run fixpgproc.pl
in
  the patch)
  parallel_seqscan_v14.patch (Attached with this mail)

 Thanks, this version looks like an improvement.  However, I still see
 some problems:

 - I believe the separation of concerns between ExecFunnel() and
 ExecEndFunnel() is not quite right.  If the scan is shut down before
 it runs to completion (e.g. because of LIMIT), then I think we'll call
 ExecEndFunnel() before ExecFunnel() hits the TupIsNull(slot) path.  I
 think you probably need to create a static subroutine that is called
 both as soon as TupIsNull(slot) and also from ExecEndFunnel(), in each
 case cleaning up whatever resources remain.


Right, will fix as per suggestion.

 - InitializeParallelWorkers() still mixes together general parallel
 executor concerns with concerns specific to parallel sequential scan
 (e.g. EstimatePartialSeqScanSpace).

Here we are doing 2 things: the first one is for the planned statement, and
the second one is node-specific, which in this case is the parallel heap
scan descriptor. So if I understand correctly, you want us to remove the
second one and have a recursive function to achieve the same.


 - shm_mq_from_handle() is probably reasonable, but can we rename it
 shm_mq_get_queue()?


Okay, will change.

 - It's hard to believe this is right:

 +   if (parallelstmt-inst_options)
 +   receiver = None_Receiver;

 Really?  Flush the tuples if there are *any instrumentation options
 whatsoever*?  At the very least, that doesn't look too future-proof,
 but I'm suspicious that it's outright incorrect.


The instrumentation info is for the EXPLAIN statement, where we don't need
tuples, and it is set the same way for it as well; refer to ExplainOnePlan().
What makes you feel this is incorrect?

 - I think ParallelStmt probably shouldn't be defined in parsenodes.h.
 That file is included in a lot of places, and adding all of those
 extra #includes there doesn't seem like a good idea for modularity
 reasons even if you don't care about partial rebuilds.  Something that
 includes a shm_mq obviously isn't a parse node in any meaningful
 sense anyway.


How about tcop/tcopprot.h?

 - I don't think you need both setup cost and startup cost.  Starting
 up more workers isn't particularly more expensive than starting up
 fewer of them, because most of the overhead is in waiting for them to
 actually start, and the number of workers is reasonable, then they're
 all be doing that in parallel with each other.  I suggest removing
 parallel_startup_cost and keeping parallel_setup_cost.


There is some work (like creation of shm queues, launching of workers)
which is done in proportion to the number of workers during setup time. I
have kept 2 parameters to distinguish such work.  I think you have a
point that the start of some or all workers could happen in parallel, but I
feel that is still work proportional to the number of workers.  For future
parallel operations also, such a parameter could be useful where we need
to set up IPC between workers or some other stuff where work is proportional
to the number of workers.

 - In cost_funnel(), I don't think it's right to divide the run cost by
 nWorkers + 1.  Suppose we've got a plan that looks like this:

 Funnel
 -> Hash Join
   -> Partial Seq Scan on a
   -> Hash
     -> Seq Scan on b

 The sequential scan on b is going to get executed once per worker,
 whereas the effort for the sequential scan on a is going to be divided
 over all the workers.  So the right way to cost this is as follows:

 (a) The cost of the partial sequential scan on a is equal to the cost
 of a regular sequential scan, plus a little bit of overhead to account
 for communication via the ParallelHeapScanDesc, divided by the number
 of workers + 1.
 (b) The cost of the remaining nodes under the funnel works normally.
 (c) The cost of the funnel is equal to the cost of the hash join plus
 number of tuples multiplied by per-tuple communication overhead plus a
 large fixed overhead reflecting the time it takes the workers to
 start.
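The costing rule sketched in (a)-(c) above can be written out numerically. (An illustrative sketch only; the function names and constants here are made up to mirror the description, not actual planner code.)

```python
def partial_seqscan_cost(seqscan_cost, comm_overhead, nworkers):
    # (a) a regular seq scan plus a little ParallelHeapScanDesc
    #     communication overhead, divided over workers + leader
    return (seqscan_cost + comm_overhead) / (nworkers + 1)

def funnel_cost(subplan_cost, ntuples, per_tuple_comm, setup_cost):
    # (c) the subplan's cost, plus per-tuple communication overhead,
    #     plus a large fixed overhead for worker startup
    return subplan_cost + ntuples * per_tuple_comm + setup_cost

# Example: scan on "a" costs 1000 with 3 workers; the hash join and the
# seq scan on "b" under the funnel add 500 and are costed normally (b).
scan_a = partial_seqscan_cost(1000.0, 10.0, nworkers=3)
join = scan_a + 500.0
total = funnel_cost(join, ntuples=10000, per_tuple_comm=0.01, setup_cost=100.0)
print(scan_a, total)  # 252.5 952.5
```

Note that only the partial scan's cost shrinks with more workers; the seq scan on "b" is paid once per worker, which is exactly why dividing the whole run cost by nworkers + 1 is wrong.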


IIUC, the change for this would be to remove the change related to
run cost 

Re: [HACKERS] anole - test case sha2 fails on all branches

2015-04-24 Thread Sandeep Thakkar
Hi Noah

The build at commit 36e5247 also failed. With the new buildfarm client 4.15, we
used additional configure options like --with-gssapi --with-libxml
--with-ldap --with-libxslt. The build at commit 36e5247 and at the last
commit succeeds without these options. I'll check which of these
libs is causing an issue and if updating it resolves it.

Thanks.

On Fri, Apr 24, 2015 at 7:55 AM, Noah Misch n...@leadboat.com wrote:

 On Thu, Apr 23, 2015 at 10:54:45AM -0400, Tom Lane wrote:
  I wrote:
   Given that anole is the only one reporting this, I'm not sure that we
   should immediately blame Postgres itself.  I have a vague recollection
   that we've seen this symptom before and traced it to a bug in some
   supporting library.  Is anole using any particularly out-of-date
   versions of openssl, kerberos, etc?
 
  A bit of digging in the archives suggests that my hindbrain remembered
 this:
  http://www.postgresql.org/message-id/flat/4f5a8404.8020...@dunslane.net
 
  The specifics probably don't apply to anole, but the conclusion that
  inconsistent openssl header and library files triggered the bug might.

 A library problem is plausible.  anole builds without OpenSSL, and I have no
 guess for which remaining library could be at fault.  I could not reproduce
 this in an HP-UX IA-64 build configured as follows (no HP compiler
 available):
   ./configure --enable-debug --enable-cassert --enable-depend
 --without-readline --without-zlib CC='gcc -pthread -mlp64'

 Sandeep, I suggest trying a build at commit 36e5247, the last REL9_4_STABLE
 commit known good on anole.  If that build fails, you'll know there's an
 environmental problem, like a broken dependency library.  If that build
 succeeds, please use git bisect to find which commit broke things, and
 report the commit hash here.

 Thanks,
 nm




-- 
Sandeep Thakkar


Re: [HACKERS] Moving ExecInsertIndexTuples and friends to new file

2015-04-24 Thread Heikki Linnakangas

On 04/24/2015 09:36 AM, Heikki Linnakangas wrote:

On 04/24/2015 06:30 AM, Stephen Frost wrote:

* Peter Geoghegan (p...@heroku.com) wrote:

On Thu, Apr 23, 2015 at 12:05 PM, Heikki Linnakangas hlinn...@iki.fi wrote:

While looking at Peter's INSERT ... ON CONFLICT patch, I started to feel
that ExecInsertIndexTuples() and friends would deserve a file of their own,
and not be buried in the middle of execUtils.c. I propose that we split
execUtils.c into two, moving ExecOpenIndices(), ExecCloseIndices()
ExecInsertIndexTuples() and related functions into a new file called
executor/execIndexing.c.


That split makes a lot of sense to me.


No objections here.


Ok, moved.


I wrote a little overview text on how unique and exclusion constraints 
are enforced. Most of the information can be gleaned from comments 
elsewhere, but I think it's helpful to have it in one place. Makes it 
easier to compare how unique and exclusion constraints work. The 
harmless deadlocks with exclusion constraints are not explained 
elsewhere AFAICS.


This is also in preparation for Peter's INSERT ON CONFLICT patch. That 
will add another section explaining how the deadlocks and livelocks are 
avoided. That's easier to understand after you grok the potential for 
deadlocks with exclusion constraints.


This also removes a comment from 1989 claiming that the code should be 
moved elsewhere. I think the code is in the right place.


- Heikki

diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index de8fcdb..a697682 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -1,7 +1,55 @@
 /*-
  *
  * execIndexing.c
- *	  executor support for maintaining indexes
+ *	  routines for inserting index tuples and enforcing unique and
+ *	  exclusion constraints.
+ *
+ * ExecInsertIndexTuples() is the main entry point.  It's called after
+ * inserting a tuple to the heap, and it inserts corresponding index tuples
+ * into all indexes.  At the same time, it enforces any unique and
+ * exclusion constraints:
+ *
+ * Unique Indexes
+ * --
+ *
+ * Enforcing a unique constraint is straightforward.  When the index AM
+ * inserts the tuple to the index, it also checks that there are no
+ * conflicting tuples in the index already.  It does so atomically, so that
+ * even if two backends try to insert the same key concurrently, only one
+ * of them will succeed.  All the logic to ensure atomicity, and to wait
+ * for in-progress transactions to finish, is handled by the index AM.
+ *
+ * If a unique constraint is deferred, we request the index AM to not
+ * throw an error if a conflict is found.  Instead, we make note that there
+ * was a conflict and return the list of indexes with conflicts to the
+ * caller.  The caller must re-check them later, by calling index_insert()
+ * with the UNIQUE_CHECK_EXISTING option.
+ *
+ * Exclusion Constraints
+ * -
+ *
+ * Exclusion constraints are different from unique indexes in that when the
+ * tuple is inserted to the index, the index AM does not check for
+ * duplicate keys at the same time.  After the insertion, we perform a
+ * separate scan on the index to check for conflicting tuples, and if one
+ * is found, we throw an error and the transaction is aborted.  If the
+ * conflicting tuple's inserter or deleter is in-progress, we wait for it
+ * to finish first.
+ *
+ * There is a chance of deadlock, if two backends insert a tuple at the
+ * same time, and then perform the scan to check for conflicts.  They will
+ * find each other's tuple, and both try to wait for each other.  The
+ * deadlock detector will detect that, and abort one of the transactions.
+ * That's fairly harmless, as one of them was bound to abort with a
+ * duplicate key error anyway, although you get a different error
+ * message.
+ *
+ * If an exclusion constraint is deferred, we still perform the conflict
+ * checking scan immediately after inserting the index tuple.  But instead
+ * of throwing an error if a conflict is found, we return that information
+ * to the caller.  The caller must re-check them later by calling
+ * check_exclusion_constraint().
+ *
  *
  * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -134,13 +182,10 @@ ExecCloseIndices(ResultRelInfo *resultRelInfo)
  *		This routine takes care of inserting index tuples
  *		into all the relations indexing the result relation
  *		when a heap tuple is inserted into the result relation.
- *		Much of this code should be moved into the genam
- *		stuff as it only exists here because the genam stuff
- *		doesn't provide the functionality needed by the
- *		executor.. -cim 9/27/89
  *
- *		This returns a list of index OIDs for any unique or exclusion
- *		constraints that are deferred and that had
+ *		Unique 

Re: [HACKERS] Replication identifiers, take 4

2015-04-24 Thread Andres Freund
On 2015-04-20 11:26:29 +0300, Heikki Linnakangas wrote:
 On 04/17/2015 11:54 AM, Andres Freund wrote:
 I've attached a rebased patch, that adds decision about origin logging
 to the relevant XLogInsert() callsites for external 2 byte identifiers
 and removes the pad-reusing version in the interest of moving forward.
 
 Putting aside the 2 vs. 4 byte identifier issue, let's discuss naming:
 
 I just realized that it talks about replication identifier as the new
 fundamental concept. The system table is called pg_replication_identifier.
 But that's like talking about index identifiers, instead of just indexes,
 and calling the system table pg_index_oid.
 
 The important concept this patch actually adds is the *origin* of each
 transaction. That term is already used in some parts of the patch. I think
 we should roughly do a search-replace of replication identifier ->
 replication origin to the patch. Or even transaction origin.

Attached is a patch that does this, and some more, renaming. That was
more work than I'd imagined.  I've also made the internal naming in
origin.c more consistent/simpler and did a bunch of other cleanup.

I'm pretty happy with this state.

Greetings,

Andres Freund
From fc406a87f2e2ac08a7ac112ef6b75be1e8256a16 Mon Sep 17 00:00:00 2001
From: Andres Freund and...@anarazel.de
Date: Fri, 24 Apr 2015 14:25:56 +0200
Subject: [PATCH] Introduce replication progress tracking infrastructure.
 (v2.0)

When implementing a replication solution on top of logical decoding, two
related problems exist:
* How to safely keep track of replication progress
* How to change replication behavior, based on the origin of a row;
  e.g. to avoid loops in bi-directional replication setups

The solution to these problems, as implemented in this commit, consists
of three parts:

1) 'replication origins', which identify nodes in a replication setup.
2) 'replication progress tracking', which remembers, for each
   replication origin, how far replay has progressed in an efficient and
   crash safe manner.
3) The ability to filter out changes performed on the behest of a
   replication origin during logical decoding; this allows complex
   replication topologies.

Most of this could also be implemented in userspace, e.g. by inserting
additional rows containing origin information, but that ends up being much
less efficient and more complicated.  We don't want to require various
replication solutions to reimplement logic for this independently. The
infrastructure is intended to be generic enough to be reusable.

This infrastructure also replaces the 'nodeid' infrastructure of commit
timestamps. It is intended to provide all former capabilities, except
that there are only 2^16 different origins; but now they integrate with
logical decoding. Additionally more functionality is accessible via SQL.
Since the commit timestamp infrastructure has also been introduced in
9.5 that's not a problem.

For now the number of origins for which the replication progress can be
tracked is determined by the max_replication_slots GUC. That GUC is not
a perfect match to configure this, but there doesn't seem to be
sufficient reason to introduce a separate new one.

Bumps both catversion and wal page magic.

Author: Andres Freund, with contributions from Petr Jelinek and Craig Ringer
Reviewed-By: Heikki Linnakangas, Robert Haas, Steve Singer
Discussion: 20150216002155.gi15...@awork2.anarazel.de,
20140923182422.ga15...@alap3.anarazel.de,
20131114172632.ge7...@alap2.anarazel.de
---
 contrib/test_decoding/Makefile  |3 +-
 contrib/test_decoding/expected/replorigin.out   |  141 +++
 contrib/test_decoding/sql/replorigin.sql|   64 +
 contrib/test_decoding/test_decoding.c   |   28 +
 doc/src/sgml/catalogs.sgml  |  123 ++
 doc/src/sgml/filelist.sgml  |1 +
 doc/src/sgml/func.sgml  |  201 ++-
 doc/src/sgml/logicaldecoding.sgml   |   35 +-
 doc/src/sgml/postgres.sgml  |1 +
 doc/src/sgml/replication-origins.sgml   |   93 ++
 src/backend/access/heap/heapam.c|   19 +
 src/backend/access/rmgrdesc/Makefile|4 +-
 src/backend/access/rmgrdesc/replorigindesc.c|   61 +
 src/backend/access/rmgrdesc/xactdesc.c  |   24 +-
 src/backend/access/transam/commit_ts.c  |   53 +-
 src/backend/access/transam/rmgr.c   |1 +
 src/backend/access/transam/xact.c   |   72 +-
 src/backend/access/transam/xlog.c   |8 +
 src/backend/access/transam/xloginsert.c |   27 +-
 src/backend/access/transam/xlogreader.c |6 +
 src/backend/catalog/Makefile|2 +-
 src/backend/catalog/catalog.c   |8 +-
 src/backend/catalog/system_views.sql|7 +
 src/backend/replication/logical/Makefile|3 +-
 src/backend/replication/logical/decode.c|   49 +-
 

Re: [HACKERS] a fast bloat measurement tool (was Re: Measuring relation free space)

2015-04-24 Thread Amit Kapila
On Fri, Apr 24, 2015 at 8:46 AM, Abhijit Menon-Sen a...@2ndquadrant.com
wrote:

 At 2015-04-24 08:35:40 +0530, amit.kapil...@gmail.com wrote:
 
   Just stick a PG_RETURN_NULL() at the end?
 
  That should also work.

 OK, updated patch attached with just that one change.


Patch looks good to me, will mark it as Ready for Committer if Tomas
or anyone else doesn't have more to add in terms of review.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Moving ExecInsertIndexTuples and friends to new file

2015-04-24 Thread Peter Geoghegan
On Fri, Apr 24, 2015 at 6:38 AM, Heikki Linnakangas hlinn...@iki.fi wrote:
 I wrote a little overview text on how unique and exclusion constraints are
 enforced. Most of the information can be gleaned from comments elsewhere,
 but I think it's helpful to have it in one place. Makes it easier to compare
 how unique and exclusion constraints work. The harmless deadlocks with
 exclusion constraints are not explained elsewhere AFAICS.

FWIW, both Jeff Davis and Tom Lane were well aware of this issue back
when exclusion constraints went in - it was judged to be acceptable at
the time, which I agree with. I happened to discuss this with Jeff in
New York recently. I agree that it should definitely be documented
like this (and the fact that ordinary unique indexes are unaffected,
too).

 This is also in preparation for Peter's INSERT ON CONFLICT patch. That will
 add another section explaining how the deadlocks and livelocks are avoided.
 That's easier to understand after you grok the potential for deadlocks with
 exclusion constraints.

Makes sense.

 This also removes a comment from 1989 claiming that the code should be moved
 elsewhere. I think the code is in the right place.

+1

-- 
Peter Geoghegan




Re: [HACKERS] improving speed of make check-world

2015-04-24 Thread Peter Eisentraut
On 4/23/15 1:22 PM, Jeff Janes wrote:
 Something about this commit (dcae5faccab64776376d354d) broke make
 check in parallel conditions when started from a clean directory.  It
 fails with a different error each time, one example:
 
 make -j4 check  /dev/null
 
 In file included from gram.y:14515:
 scan.c: In function 'yy_try_NUL_trans':
 scan.c:10307: warning: unused variable 'yyg'
 /usr/bin/ld: tab-complete.o: No such file: No such file or directory
 collect2: ld returned 1 exit status
 make[3]: *** [psql] Error 1
 make[2]: *** [all-psql-recurse] Error 2
 make[2]: *** Waiting for unfinished jobs
 make[1]: *** [all-bin-recurse] Error 2
 make: *** [all-src-recurse] Error 2
 make: *** Waiting for unfinished jobs

I think the problem is that "check" depends on "all", but now also
depends on "temp-install", which in turn runs "install" and "all".  With a
sufficient amount of parallelism, you end up running two "all"s on top
of each other.

It seems this can be fixed by removing the check: all dependency.  Try
removing that in the top-level GNUmakefile.in and see if the problem
goes away.  For completeness, we should then also remove it in the other
makefiles.





Re: [HACKERS] pg_dump: largeobject behavior issues (possible bug)

2015-04-24 Thread Joshua D. Drake

On 04/24/2015 03:41 PM, Tom Lane wrote:


Given that large objects don't have any individual dependencies,
one could envision fixing this by replacing the individual large-object
DumpableObjects by a single placeholder to participate in the sort phase,
and then when it's time to dump that, scan the large objects using a
cursor and create/print/delete the information separately for each one.
This would likely involve some rather painful refactoring in pg_dump
however.


Andrew G mentioned something about using a cursor for the main query
that pulls the info. He said that it wasn't a solution but might be a
band-aid (my words). Is that something we may want to look into as a
stopgap?




regards, tom lane




--
The most kicking donkey PostgreSQL Infrastructure company in existence.
The oldest, the most experienced, the consulting company to the stars.
Command Prompt, Inc. http://www.commandprompt.com/ +1 -503-667-4564 -
24x7 - 365 - Proactive and Managed Professional Services!




Re: [HACKERS] INSERT ... ON CONFLICT IGNORE (and UPDATE) 3.0

2015-04-24 Thread Stephen Frost
Peter,

* Peter Geoghegan (p...@heroku.com) wrote:
 I came up with something that is along the lines of what we discussed.
 I'll wait for you to push Dean's code, and rebase on top of that
 before submitting what I came up with. Maybe this can be rolled into
 my next revision if I work through Andres' most recent feedback
 without much delay.

This is done (finally!).  Please take a look and let me know if you have
any questions or concerns.  Happy to chat again also, of course.

Thanks!

Stephen




Re: [HACKERS] Fwd: [GENERAL] 4B row limit for CLOB tables

2015-04-24 Thread Álvaro Hernández Tortosa


On 24/04/15 05:24, Tom Lane wrote:

Stephen Frost sfr...@snowman.net writes:

* Bruce Momjian (br...@momjian.us) wrote:

On Sun, Feb  1, 2015 at 03:54:03PM +0100, Álvaro Hernández Tortosa wrote:

The problem here is that performance degrades exponentially, or
worse. Speaking here from experience, we already tested this for a
very similar case (table creation, where two oids are consumed from
a global sequence when inserting to pg_class). Have a look at
http://www.slideshare.net/nosys/billion-tables-project-nycpug-2013,
slides 43-45. We tested there this scenario and shown that table
creations per second dropped from 10K to a few per second and then
to a few per day. In the graphs you can't even realize there were
more tables been created. At around 8K tables from the theoretical
limit of 4B oids consumed, the process basically stopped (doing more
insertions).

We don't report the maximum number of tables per database, or the
maximum number of TOAST values.  Agreed?

For my 2c, this limitation is a surprise to users and therefore we
should add documentation to point out that it exists, unless we're going
to actually fix it (which is certainly what I'd prefer to see...).

TBH, I've got very little enthusiasm for fixing this given the number
of reports of trouble from the field, which so far as I recall is zero.
Álvaro's case came up through intentionally trying to create an
unreasonable number of tables, not from real usage.  This thread likewise
appears to contain lots of speculation and no reports of anyone hitting
a problem in practice.


It is certainly true that this was a very synthetic case. I 
envision, however, certain use cases where we may hit a very large 
number of tables:


- Massive multitenancy
- Aggressive partitioning
- Massive multitenancy with aggressive partitioning
- Software dynamically generated tables, like those created by ToroDB 
(https://github.com/torodb/torodb). In ToroDB we generate tables 
depending only on the input data, so we may end up having as many as 
required by the datasource. For example, a general purpose json 
datastore may generate several tables per document inserted.




Certainly this is likely to become an issue at some point in the future,
but I'm not finding it very compelling to worry about now.  By the time
it does become an issue, we may have additional considerations or use
cases that should inform a solution; which seems to me to be a good
argument not to try to fix it in advance of real problems.  Perhaps,


I understand this argument, and it makes sense. However, on the 
other side, given the long time it may take from patch to commit and 
then from released version to companies finally using it in production, I'd 
rather try to fix it soon, as there are already reports and use cases 
that may hit it, rather than wait three years until it explodes in our 
faces. After all, 640KB of RAM is enough, right? So maybe 2B tables is not 
that far off on the horizon. Who knows.


Regards,

Álvaro


--
Álvaro Hernández Tortosa


---
8Kdata





Re: [HACKERS] Row security violation error is misleading

2015-04-24 Thread Stephen Frost
Dean,

* Dean Rasheed (dean.a.rash...@gmail.com) wrote:
 On 21 April 2015 at 20:50, Stephen Frost sfr...@snowman.net wrote:
  Thanks a lot for this.  Please take a look at the attached.
 
 I've given this a quick read-through, and it looks good to me. The
 interaction of permissive and restrictive policies from hooks matches
 my expections, and it's a definite improvement having tests for RLS
 hooks.
 
 The only thing I spotted was that the file comment for
 test_rls_hooks.c needs updating.

Fixed!

 Is there any documentation for hooks? If not, perhaps that's something
 we should be considering too.

We don't generally document hooks beyond the source code itself.  I'm happy
to expand on what's there, if anyone feels it'd be helpful to do so.
Having the test module is a form of documentation also, of course.

Thanks!

Stephen




Re: [HACKERS] INSERT ... ON CONFLICT IGNORE (and UPDATE) 3.0

2015-04-24 Thread Peter Geoghegan
On Fri, Apr 24, 2015 at 5:50 PM, Stephen Frost sfr...@snowman.net wrote:
 * Peter Geoghegan (p...@heroku.com) wrote:
 I came up with something that is along the lines of what we discussed.
 I'll wait for you to push Dean's code, and rebase on top of that
 before submitting what I came up with. Maybe this can be rolled into
 my next revision if I work through Andres' most recent feedback
 without much delay.

 This is done (finally!).  Please take a look and let me know if you have
 any questions or concerns.  Happy to chat again also, of course.

Great, thanks.

I didn't actually wait for you (as earlier indicated) before posting
the new approach to RLS in V3.4. However, I have some decent tests for
the new behaviors (I did test-driven development), so I think that
when I rebase on top of the new RLS stuff tomorrow, I'll find that it
won't be too difficult.

-- 
Peter Geoghegan




Re: [HACKERS] pg_dump: largeobject behavior issues (possible bug)

2015-04-24 Thread Tom Lane
Andrew Dunstan and...@dunslane.net writes:
 On 04/23/2015 04:04 PM, Andrew Gierth wrote:
 The relevant code is getBlobs in pg_dump.c, which queries the whole of
 pg_largeobject_metadata without using a cursor (so the PGresult is
 already huge thanks to having 100 million rows), and then mallocs a
 BlobInfo array and populates it from the PGresult, also using pg_strdup
 for the oid string, owner name, and ACL if any.

 I'm surprised this hasn't come up before. I have a client that I 
 persuaded to convert all their LOs to bytea fields because of problems 
 with pg_dump handling millions of LOs, and kept them on an older 
 postgres version until they made that change.

Yeah, this was brought up when we added per-large-object metadata; it was
obvious that that patch would cause pg_dump to choke on large numbers of
large objects.  The (perhaps rather lame) argument was that you wouldn't
have that many of them.

Given that large objects don't have any individual dependencies,
one could envision fixing this by replacing the individual large-object
DumpableObjects by a single placeholder to participate in the sort phase,
and then when it's time to dump that, scan the large objects using a
cursor and create/print/delete the information separately for each one.
This would likely involve some rather painful refactoring in pg_dump
however.
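(A hedged sketch of the batching idea, in Python rather than pg_dump's C, with all names invented for illustration: stream the metadata in fixed-size batches, the way a server-side cursor with FETCH would, so memory stays bounded by the batch size instead of by the total row count.)

```python
# Hypothetical sketch, not pg_dump's actual code: fetch rows in fixed-size
# batches instead of materializing one huge result set up front.

def fetch_in_batches(rows, batch_size=1000):
    """Yield lists of at most batch_size rows; memory stays O(batch_size)."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

def lo_metadata(n):
    """Simulate pg_largeobject_metadata rows with a generator, so nothing
    is held in memory all at once; tuples are (oid, owner, acl)."""
    for oid in range(16384, 16384 + n):
        yield (oid, "owner", None)

processed = 0
for batch in fetch_in_batches(lo_metadata(2500), batch_size=1000):
    # pg_dump would create, print, and discard each blob's info here
    processed += len(batch)
```

The point of the refactoring would be that the per-blob work happens inside the loop, so only one batch of metadata ever exists at a time.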

regards, tom lane




Re: [HACKERS] Row security violation error is misleading

2015-04-24 Thread Stephen Frost
Dean,

* Dean Rasheed (dean.a.rash...@gmail.com) wrote:
 The second patch [2] is the one that is actually relevant to this
 thread. This patch is primarily to apply the RLS checks earlier,
 before an update is attempted, more like a regular permissions check.
 This adds a new enum to classify the kinds of WCO, a side benefit of
 which is that it allows different error messages when RLS checks are
 violated, as opposed to WITH CHECK OPTIONs on views.

I've gone ahead and pushed this, please take a look and test it and let
me know if you see any issues or have any questions or concerns.

Thanks!

Stephen




Re: [HACKERS] Feedback on getting rid of VACUUM FULL

2015-04-24 Thread Alvaro Herrera
Jim Nasby wrote:

 It looks like the biggest complaint (aside from allowing a limited number of
 tuples to be moved) is in [1] and [2], where Tom is saying that you can't
 simply call heap_update() like this without holding an exclusive lock on the
 table. Is that because we're not actually changing the tuple?

That's nonsense -- obviously UPDATE can do heap_update without an
exclusive lock on the table, so the explanation must be something else.
I think his actual complaint was that you can't remove the old tuple
until concurrent readers of the table have already finished scanning it,
or you get into a situation where they might need to read the page in
which the original version resided, but your mini-vacuum already removed
it.  So before removing it you need to wait until they are all finished.
This is the reason I mentioned CREATE INDEX CONCURRENTLY: if you wait
until those transactions are all gone (like CIC does), you are then free
to remove the old versions of the tuple, because you know that all
readers have a snapshot new enough to see the new version of the tuple.

 Another issue is both HOT and KeyUpdate; I think we need to completely
 ignore/over-ride that stuff for this.

You don't need anything for HOT, because cross-page updates are never
HOT.  Not sure what you mean about KeyUpdate, but yeah you might need
something there -- obviously, you don't want to create multixacts
unnecessarily.

 Instead of adding forcefsm, I think it would be more useful to accept a
 target block number. That way we can actually control where the new tuple
 goes.

Whatever makes the most sense, I suppose.  (Maybe we shouldn't consider
this a tweaked heap_update -- which is already complex enough -- but a
separate heapam entry point.)

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Feedback on getting rid of VACUUM FULL

2015-04-24 Thread Jim Nasby

On 4/24/15 5:30 PM, Alvaro Herrera wrote:

Jim Nasby wrote:


It looks like the biggest complaint (aside from allowing a limited number of
tuples to be moved) is in [1] and [2], where Tom is saying that you can't
simply call heap_update() like this without holding an exclusive lock on the
table. Is that because we're not actually changing the tuple?


That's nonsense -- obviously UPDATE can do heap_update without an
exclusive lock on the table, so the explanation must be something else.
I think his actual complaint was that you can't remove the old tuple
until concurrent readers of the table have already finished scanning it,
or you get into a situation where they might need to read the page in
which the original version resided, but your mini-vacuum already removed
it.  So before removing it you need to wait until they are all finished.
This is the reason I mentioned CREATE INDEX CONCURRENTLY: if you wait
until those transactions are all gone (like CIC does), you are then free
to remove the old versions of the tuple, because you know that all
readers have a snapshot new enough to see the new version of the tuple.


Except I don't see anywhere in the patch that's actually removing the 
old tuple...



Another issue is both HOT and KeyUpdate; I think we need to completely
ignore/over-ride that stuff for this.


You don't need anything for HOT, because cross-page updates are never
HOT.  Not sure what you mean about KeyUpdate, but yeah you might need
something there -- obviously, you don't want to create multixacts
unnecessarily.


If I'm not mistaken, if there's enough room left on the page then 
HeapSatisfiesHOTandKeyUpdate() will say this tuple satisfies HOT. So 
we'd have to do something to over-ride that, and I don't think the 
current patch does that. (It might force it to a new page anyway, but it 
does nothing with satisfies_hot, which I suspect isn't safe.)



Instead of adding forcefsm, I think it would be more useful to accept a
target block number. That way we can actually control where the new tuple
goes.


Whatever makes the most sense, I suppose.  (Maybe we shouldn't consider
this a tweaked heap_update -- which is already complex enough -- but a
separate heapam entry point.)


Yeah, I thought about creating heap_move, but I suspect that would still 
have to worry about a lot of this other stuff anyway. Far more likely 
for a change to be missed in heap_move than heap_update too.


I am tempted to add a SQL heap_move function though, assuming it's not 
much extra work.

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com




Re: [HACKERS] INSERT ... ON CONFLICT IGNORE (and UPDATE) 3.0

2015-04-24 Thread Stephen Frost
* Peter Geoghegan (p...@heroku.com) wrote:
 On Fri, Apr 24, 2015 at 5:50 PM, Stephen Frost sfr...@snowman.net wrote:
  * Peter Geoghegan (p...@heroku.com) wrote:
  I came up with something that is along the lines of what we discussed.
  I'll wait for you to push Dean's code, and rebase on top of that
  before submitting what I came up with. Maybe this can be rolled into
  my next revision if I work through Andres' most recent feedback
  without much delay.
 
  This is done (finally!).  Please take a look and let me know if you have
  any questions or concerns.  Happy to chat again also, of course.
 
 Great, thanks.
 
 I didn't actually wait for you (as earlier indicated) before posting
 the new approach to RLS in V3.4. However, I have some decent tests for
 the new behaviors (I did test-driven development), so I think that
 when I rebase on top of the new RLS stuff tomorrow, I'll find that it
 won't be too difficult.

Fantastic, glad to hear it!

Thanks!

Stephen




Re: [HACKERS] PL/pgSQL, RAISE and error context

2015-04-24 Thread Pavel Stehule
2015-04-24 16:02 GMT+02:00 Joel Jacobson j...@trustly.com:

 On Fri, Apr 24, 2015 at 1:16 PM, Robert Haas robertmh...@gmail.com
 wrote:
  This would allow doing something crazy as:
 
  suppress_context_messages = warning,error
  display_context_messages = notice
 
  This is a very flexible proposal, but it's a tremendous amount of
  machinery for what's really a very minor issue.  If we added two GUCs
  for every comparably important issue, we'd have about 40,000 of them.

 I agree. The one-dimensional GUC syntax is not well suited for
 multi-dimensional config settings. And that's a good thing, mostly, I
 think. It would be a nightmare if the config file values could be in
 JSON format; it's good that they are simple.

 But I'm thinking maybe we could improve the config file syntax for the
 general case when you have multiple things you want to control, in
 this case the message levels, and for each such thing, you want to
 turn something on/off, in this case the CONTEXT. Maybe we could simply
 use plus + and minus - to mean on and off?

 Example:

 context_messages = -warning, -error, +notice
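(For illustration only — the +/- syntax above is merely a proposal in this thread, not an existing GUC; the function name and dict representation below are invented — a parser for it would be tiny:)

```python
# Sketch of parsing the proposed +/- syntax into per-level on/off flags.

def parse_context_messages(value):
    flags = {}
    for item in value.split(","):
        item = item.strip()
        if item.startswith("+"):
            flags[item[1:]] = True      # show CONTEXT for this level
        elif item.startswith("-"):
            flags[item[1:]] = False     # suppress CONTEXT for this level
        else:
            raise ValueError("expected +level or -level: %r" % item)
    return flags

flags = parse_context_messages("-warning, -error, +notice")
```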


I prefer your first proposal - and there is a precedent for plpgsql -
plpgsql_extra_checks

It is clean for anybody. +-identifiers looks like horrible httpd config. :)

Regards

Pavel


Re: [HACKERS] a fast bloat measurement tool (was Re: Measuring relation free space)

2015-04-24 Thread Tomas Vondra



On 04/24/15 14:58, Amit Kapila wrote:

On Fri, Apr 24, 2015 at 8:46 AM, Abhijit Menon-Sen a...@2ndquadrant.com wrote:
 
  At 2015-04-24 08:35:40 +0530, amit.kapil...@gmail.com wrote:
  
Just stick a PG_RETURN_NULL() at the end?
  
   That should also work.
 
  OK, updated patch attached with just that one change.
 

Patch looks good to me, will mark it as Ready for Committer if Tomas
or anyone else doesn't have more to add in terms of review.


FWIW, I'm OK with the patch as is. Your reviews were spot on.

regards

--
Tomas Vondra   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Fwd: [GENERAL] 4B row limit for CLOB tables

2015-04-24 Thread Jim Nasby

On 4/24/15 7:11 PM, Álvaro Hernández Tortosa wrote:

On 24/04/15 05:24, Tom Lane wrote:

...

TBH, I've got very little enthusiasm for fixing this given the number
of reports of trouble from the field, which so far as I recall is zero.
Álvaro's case came up through intentionally trying to create an
unreasonable number of tables, not from real usage.  This thread likewise
appears to contain lots of speculation and no reports of anyone hitting
a problem in practice.


 It is certainly true that this was a very synthetic case. I
envision, however, certain use cases where we may hit a very large
number of tables:


The original case has NOTHING to do with the number of tables and 
everything to do with the number of toasted values a table can have. If 
you have to toast 4B attributes in a single relation it will fail. In 
reality, if you get anywhere close to that things will fall apart due to 
OID conflicts.


This case isn't nearly as insane as 4B tables. A table storing 10 text 
fields each of which is 2K would hit this limit with only 400M rows. If 
my math is right that's only 8TB; certainly not anything insane 
space-wise or rowcount-wise.
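(A quick sanity check of those figures, assuming 32-bit TOAST value OIDs and that every 2K value really is stored out of line:)

```python
# Back-of-the-envelope check: TOAST values are addressed by 32-bit OIDs,
# so a single table's toast table holds at most ~4.29B out-of-line values.

fields_per_row = 10            # toasted text columns per row
value_size = 2 * 1024          # 2K per value (assumed large enough to toast)
oid_limit = 2**32              # ~4.29B distinct OIDs

rows_to_exhaust = oid_limit // fields_per_row        # ~429M rows
heap_bytes = 400_000_000 * fields_per_row * value_size
heap_tb = heap_bytes / 1e12                          # ~8.2 TB: "only 8TB"
```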


Perhaps it's still not fixing, but I think it's definitely worth 
documenting.

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com




Re: [HACKERS] INSERT ... ON CONFLICT IGNORE (and UPDATE) 3.0

2015-04-24 Thread Peter Geoghegan
On Thu, Apr 23, 2015 at 6:07 PM, Peter Geoghegan p...@heroku.com wrote:
 Curious about what you think about my proposal to go back to putting
 the inference specification WHERE clause within the parenthesis, since
 this solves several problems, including clarifying to users that the
 predicate is part of the inference clause.

I've *provisionally* pushed code that goes back to the old way,
Andres: 
https://github.com/petergeoghegan/postgres/commit/2a5d80b27d2c5832ad26dde4651c64dd2004f401

Perhaps this is the least worst way, after all.
-- 
Peter Geoghegan




Re: [HACKERS] Reducing tuple overhead

2015-04-24 Thread Amit Kapila
On Sat, Apr 25, 2015 at 1:58 AM, Jim Nasby jim.na...@bluetreble.com wrote:

 On 4/23/15 10:40 PM, Amit Kapila wrote:

 I agree with you and what I think one of the major reasons of bloat is
that
 Index segment doesn't have visibility information due to which clearing
of
 Index needs to be tied along with heap.  Now if we can move transaction
 information at page level, then we can even think of having it in Index
 segment as well and then Index can delete/prune it's tuples on it's own
 which can reduce the bloat in index significantly and there is a benefit
 to Vacuum as well.


 I don't see how putting visibility at the page level helps indexes at
all. We could already put XMIN in indexes if we wanted, but it won't help,
because...


We can do that by putting transaction info at the tuple level in the index
as well, but that would be a huge increase in the size of the index unless
we devise a way to have a variable index tuple header rather than a fixed
one.

 Now this has some downsides as well like Delete
 needs to traverse Index segment as well to Delete mark the tuples, but
 I think the upsides of reducing bloat can certainly outweigh the
downsides.


 ... which isn't possible. You can not go from a heap tuple to an index
tuple.

We will have access to the index value during delete, so why do you
think we need linkage between heap and index tuples to perform the
Delete operation?  I think we need to think more about how to design
Delete-by-CTID, but that should be doable.

 This has been discussed in the past.

I have tried searching the archives, but I am not finding what the
exact problem is.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] PL/pgSQL, RAISE and error context

2015-04-24 Thread Joel Jacobson
On Fri, Apr 24, 2015 at 6:07 PM, Pavel Stehule pavel.steh...@gmail.com wrote:
 Example:

 context_messages = -warning, -error, +notice


 I prefer your first proposal - and there is a precedent for plpgsql -
 plpgsql_extra_checks

 It is clean for anybody. +-identifiers looks like horrible httpd config. :)

I have to agree on that :) Just thought this is the best we can do if
we want to reduce the number of GUCs to a minimum.




Re: [HACKERS] tablespaces inside $PGDATA considered harmful

2015-04-24 Thread Bruce Momjian
On Wed, Apr 22, 2015 at 10:41:02PM -0400, Bruce Momjian wrote:
  josh=# create tablespace tbl2 location '/home/josh/pg94/data/pg_xlog/';
  CREATE TABLESPACE
  
  It really seems like we ought to block *THAT*.  Of course, if we block
  tablespace creation in PGDATA generally, then that's covered.
 
 I have developed the attached patch to warn about creating tablespaces
 inside the data directory.  The case this doesn't catch is referencing a
 symbolic link that points to the same directory.  We can't make it an
 error so people can use pg_upgrade these setups.  This would be for 9.5
 only.

OK, based on later discussions, I have updated my 9.5 patch to have
pg_upgrade also display a warning (the warning will also appear in the
pg_upgrade logs, but I doubt the user will see it), e.g.:

Setting next OID for new cluster                            ok
Sync data directory to disk                                 ok
Creating script to analyze new cluster                      ok

WARNING:  user-defined tablespace locations should not be inside the 
data directory, e.g. /u/pgsql.old/data/pg_tblspc

Upgrade Complete

Optimizer statistics are not transferred by pg_upgrade so,
once you start the new server, consider running:
./analyze_new_cluster.sh

Could not create a script to delete the old cluster's data
files because user-defined tablespaces exist in the old cluster
directory.  The old cluster's contents must be deleted manually.

This way, both pg_dump and pg_upgrade will issue warnings, though, of
course, those warnings can be ignored.  I am hopeful these two warnings
will be sufficient and we will not need make these errors, with the
possible inconvenience it will cause.  I am still afraid that someone
will ignore the new errors pg_dump would generate and lose data.  I just
don't remember enough cases where we threw new errors on _data_ restore.

Frankly, those using pg_upgrade already will have to move the old
tablespaces out of the old cluster if they ever want to delete those
clusters, so I am hopeful these additional warnings will help eliminate
this practice, which is already cumbersome and useless.  I am not
planning to revisit this for 9.6.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +




Re: [HACKERS] PL/pgSQL, RAISE and error context

2015-04-24 Thread Pavel Stehule
2015-04-24 19:16 GMT+02:00 Joel Jacobson j...@trustly.com:

 On Fri, Apr 24, 2015 at 6:07 PM, Pavel Stehule pavel.steh...@gmail.com
 wrote:
  Example:
 
  context_messages = -warning, -error, +notice
 
 
  I prefer your first proposal - and there is a precedent for plpgsql -
  plpgsql_extra_checks
 
  It is clean for anybody. +-identifiers looks like horrible httpd config.
 :)

 I have to agree on that :) Just thought this is the best we can do if
 we want to reduce the number of GUCs to a minimum.


It looks like the KDE vs. GNOME discussion.

A GUC that simply affects behavior, without any performance impact, should
not be a problem - like log_lock_wait, log_min_duration and similar. I am
sure we would use it.

The problematic GUCs are the performance-, planner-, bgwriter-, and
checkpoint-related ones.


Re: [HACKERS] adding more information about process(es) cpu and memory usage

2015-04-24 Thread Jim Nasby

On 4/24/15 6:29 AM, Robert Haas wrote:

On Thu, Apr 23, 2015 at 9:28 PM, Tom Lane t...@sss.pgh.pa.us wrote:

The reason nobody's gotten around to that in the last fifteen years is
that per-process rusage isn't actually all that interesting; there's
too much that happens in background daemons, for instance.


There's *some* stuff that happens in background daemons, but if you
want to measure user and system time consumed by a particular query,
this would actually be a pretty handy way to do that, I think.


I more often am wondering what a running backend is doing OS-wise, but 
being able to see what happened when it finished would definitely be 
better than what's available now.

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com




Re: [HACKERS] Typo in a comment in set_rel_size()

2015-04-24 Thread Tom Lane
Amit Langote langote_amit...@lab.ntt.co.jp writes:
 Attached fixes what I suppose is a typo:

  * so set up a single dummy path for it.  Here we only check this for
  * regular baserels; if it's an otherrel, CE was already checked in
 -* set_append_rel_pathlist().
 +* set_append_rel_size().
  *

It's not a typo; the comment was correct when written.  But I evidently
missed updating it when set_append_rel_pathlist() got split into two
functions.  Applied, thanks for noticing!

regards, tom lane




Re: [HACKERS] Feedback on getting rid of VACUUM FULL

2015-04-24 Thread Alvaro Herrera
Heikki Linnakangas wrote:

 Hmm, AFAICT pg_reorg is much more complex, writing stuff to a temp table
 and swapping relfilenodes afterwards. More like the VACUUM REWRITE
 that's been discussed.
 
 For the kicks, I looked at what it would take to write a utility like
 that. It turns out to be quite trivial, patch attached. It uses the same
 principle as VACUUM FULL, scans from the end, moving tuples to
 lower-numbered pages until it can't do it anymore. It requires a small
 change to heap_update(), to override the preference to store the new
 tuple on the same page as the old one, but other than that, it's all in
 the external module.

More than five years have passed since Heikki posted this, and we still
haven't found a solution to the problem -- which neverthless keeps
biting people to the point that multiple user-space implementations of
similar techniques are out there.

I think what we need here is something that does heap_update to tuples
at the end of the table, moving them to earlier pages; then wait for old
snapshots to die (the infrastructure for which we have now, thanks to
CREATE INDEX CONCURRENTLY); then truncate the empty pages.  Of course,
there are lots of details to resolve.  It doesn't really matter that
this runs for long: a process doing this for hours might be better than
AccessExclusiveLock on the table for a much shorter period.
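(A rough simulation of that idea in plain Python — not heapam code; pages are modeled as bare tuple counts, and the snapshot wait is only marked by a comment:)

```python
# Toy model of move-then-wait-then-truncate: relocate tuples from the
# tail of the table into free space on earlier pages, then truncate the
# now-empty tail pages.

def compact(pages, fillfactor):
    """pages: list of tuple counts per page; fillfactor: max tuples/page."""
    lo, hi = 0, len(pages) - 1
    while lo < hi:
        free = fillfactor - pages[lo]
        if free <= 0:
            lo += 1                       # this page is full, move on
            continue
        moved = min(free, pages[hi])      # heap_update tuples off the tail
        pages[lo] += moved
        pages[hi] -= moved
        if pages[hi] == 0:
            hi -= 1
    # ... the real thing would wait here for all snapshots that predate
    # the moves to go away (as CREATE INDEX CONCURRENTLY does) before
    # the old tuple versions can be removed ...
    while pages and pages[-1] == 0:
        pages.pop()                       # truncate the empty tail
    return pages

compact([10, 2, 0, 5, 3], fillfactor=10)  # -> [10, 10]
```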

Are there any takers?

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Ignoring some binaries generated in src/test

2015-04-24 Thread Tom Lane
Michael Paquier michael.paqu...@gmail.com writes:
 A couple of binaries in src/test, that are not part of the main make
 flow, can be built but they are actually not ignored in the tree:
 examples/testlibpq
 examples/testlibpq2
 examples/testlibpq3
 examples/testlibpq4
 examples/testlo
 examples/testlo64
 locale/test-ctype
 thread/thread_test
 I recall that some of them were target for removal, still shouldn't
 they have their own entries in a .gitignore, like in the patch
 attached?

Perhaps, but if we're going to support doing a make in those
subdirectories, I think it would also be appropriate to fix
src/test/Makefile so that clean and related targets recurse to
those subdirectories.

The current logic in src/test/Makefile, particularly the way that
the modules subdirectory is handled, seems pretty ugly/convoluted
anyway.  I wonder why it was done that way rather than just ensuring
that modules/ doesn't do anything for make install?

We'd still need special cases for examples/ et al because we don't want
them built during make all, but I think just adding them to
ALWAYS_SUBDIRS might suffice.

regards, tom lane




Re: [HACKERS] INSERT ... ON CONFLICT IGNORE (and UPDATE) 3.0

2015-04-24 Thread Heikki Linnakangas

On 04/24/2015 02:52 AM, Peter Geoghegan wrote:

I found a bug that seems to be down to commit
e3144183562d08e347f832f0b29daefe8bac617b on Github:


commit e3144183562d08e347f832f0b29daefe8bac617b
Author: Heikki Linnakangas heikki.linnakan...@iki.fi
Date:   Thu Apr 23 18:38:11 2015 +0300

 Minor cleanup of check_exclusion_or_unique_constraint.

 To improve readability.



At least, that's what a git bisect session showed.


Ok, I see now that I totally screwed up the conditions on waitMode in 
that commit. I just pushed a fix to your github repository.


- Heikki





Re: [HACKERS] INSERT ... ON CONFLICT IGNORE (and UPDATE) 3.0

2015-04-24 Thread Peter Geoghegan
On Fri, Apr 24, 2015 at 12:31 AM, Heikki Linnakangas hlinn...@iki.fi wrote:
 Ok, I see now that I totally screwed up the conditions on waitMode in that
 commit. I just pushed a fix to your github repository.

I can confirm that the commit you just pushed prevents the
implementation from attempting to lock an invisible tuple, where
previously it would reliably fail given the same testcase.

Thanks
-- 
Peter Geoghegan




[HACKERS] collate.linux.utf8 test coverage

2015-04-24 Thread Andrew Dunstan
The optional buildfarm module that runs this test was broken by commit 
dcae5faccab64776376d354decda0017c648bb53


Since nobody has responded to my complaint about this, I have disabled it on 
crake, the only buildfarm machine that has actually been running the test, so 
we now have no buildfarm coverage for it.

cheers

andrew





Re: [HACKERS] Improving vacuum/VM/etc

2015-04-24 Thread Jim Nasby

On 4/24/15 6:50 AM, Robert Haas wrote:

Thanks for looking at this.


On Thu, Apr 23, 2015 at 3:09 PM, Jim Nasby jim.na...@bluetreble.com wrote:

Unfortunately, the states I came up with using existing semantics don't look
hugely useful[4], but if we take Robert's idea and make all-visible mean
all-frozen, we can do much better:

0: Newly inserted tuples
Tracking this state allows us to aggressively set hint bits.


Who is us?  And what do you mean by aggressively?  As things
stand, any process that has to touch a tuple always sets any
applicable hint bits.


A background process that will proactively hint tuples before a 
foreground process needs to. But see also below...



1: Newly deleted
There are tuples that have been deleted but not pruned. There may also be
newly inserted tuples that need hinting (state 0).

Similar to state 0, we'd want to be fairly aggressive with these pages,
because as soon as the deleting XID is committed and older than all
snapshots we can prune. Because we can prune without hitting indexes, this
is still a fairly cheap operation, though not as cheap as 0.


What behavior difference would you foresee between state 0 and state 1?


Below.


2: Fully hinted, not frozen
This is the really painful state to clean up, because we have to deal with
indexes. We must enter this state after being in 1.


Neither the fact that a page is fully hinted nor the fact that it is
or is not frozen implies anything about dealing with indexes.  We need
to deal with indexes because the page contains either dead tuples (as
a result of an aborted insert, a committed delete, or an aborted or
committed update) or dead line pointers (as a result of pruning dead
tuples).


The idea I was shooting for is that the worst-case scenario in cleanup 
is dealing with indexes, which we need to do any time a tuple becomes 
dead. That's why I made 1 a separate state from 0, but it occurs to me 
now that I wasn't very clear about this.


My goal here is that there are two separate paths for a page to be in: 
either it needs index vacuuming at some point, or it doesn't. If a page 
is in state 0, once we can make the page all-visible/frozen it can go 
into state 3 and *we never have to clean it again*.


OTOH, if a tuple is marked dead (non-HOT), then we can be aggressive 
about hinting (and pruning, if there were HOT updates as well), but no 
matter what we must eventually include that page in index cleanup.


So once a page enters state 1 or 2, it may never move to state 0 or 3 
without an index scan pass.
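(Sketching the proposed state map as a toy transition table — these states and events come from this thread only, nothing PostgreSQL actually implements:)

```python
# States 0-3 as discussed above; the key property is that only "freeze"
# from state 0 and "index_vacuum" from state 2 reach state 3.

NEW, NEWLY_DELETED, HINTED_NOT_FROZEN, ALL_VISIBLE_FROZEN = 0, 1, 2, 3

TRANSITIONS = {
    (NEW, "delete"): NEWLY_DELETED,
    (NEW, "freeze"): ALL_VISIBLE_FROZEN,           # no index pass ever needed
    (NEWLY_DELETED, "prune+hint"): HINTED_NOT_FROZEN,
    (HINTED_NOT_FROZEN, "index_vacuum"): ALL_VISIBLE_FROZEN,
    (ALL_VISIBLE_FROZEN, "delete"): NEWLY_DELETED,  # page becomes dirty again
}

def step(state, event):
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"illegal transition: state {state}, event {event!r}")

# A page that only ever saw inserts skips index cleanup entirely,
# while a page that saw a delete must pass through index_vacuum.
```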



OK, I agree that a background process could be useful.  Whenever it
sees a dirty page, it could attempt to aggressively set hint bits,
prune, mark all-visible, and freeze the page before that page gets
evicted.  However, that doesn't require the sort of state map you're
proposing here.



I think your statement about pages that were in those states are
still likely to be in the disk cache is not really true.  I mean, if
we're doing OLTP, yes.  But not if we're bulk-loading.


Right, but at least we'd know we had a table with a load of unhinted or 
newly dead tuples. That means there's cleanup work we can do without 
needing an index pass.



Not needing to scan everything that's frozen is thanks to state 3. I think
it's OK (at least for now) if only vacuum puts pages into this state, which
means it can actually freeze the tuples when it does it (thanks to 37484ad
we won't lose forensic data doing this). That means there's no extra work
necessary by a foreground process that's dirtying a page.


Did you notice the discussion on the other thread about this
increasing WAL volume by a factor of 113?


Yeah, though I'd forgotten about it. :(

I wonder if there's some way we can reduce that. I just looked at what 
we WAL log for a freeze and it appears to only be xl_heap_freeze_tuple, 
which if my math is correct is 12 bytes (11 ignoring alignment). I don't 
understand how that can be 113 times worse than a plain vacuum.
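For reference, the per-tuple freeze record really is tiny. A stand-alone approximation of the 9.5-era struct (the field names are taken from the real xl_heap_freeze_tuple, but the layout here is only a sketch and may not match exactly):

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t TransactionId;   /* stand-ins for the PostgreSQL typedefs */
typedef uint16_t OffsetNumber;

/* Rough sketch of the per-tuple freeze WAL record. */
typedef struct
{
    TransactionId xmax;           /* 4 bytes */
    OffsetNumber  offset;         /* 2 bytes */
    uint16_t      t_infomask2;    /* 2 bytes */
    uint16_t      t_infomask;     /* 2 bytes */
    uint8_t       frz_flags;      /* 1 byte: 11 bytes of payload total */
} xl_heap_freeze_tuple_sketch;    /* padded to 12 bytes on typical ABIs */
```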



I can't really follow why you've got these states to begin with.  0,
1, and 2 are all pretty much the same.  The useful distinction AFAICS
is between not-all-visible, all-visible, and all-visible-plus-frozen.


Index scanning is probably the most expensive part of cleanup, so it 
seems like it would be useful to be able to track that as 
visible/frozen. (What would probably be more useful is a way to directly 
link a heap tuple to any index tuples pointing at it, but that would 
certainly be a lot harder to do.)


There's also the idea of being proactive about hinting and pruning, 
instead of foisting that onto later foreground processes or hoping that 
vacuum comes along. Certainly the most obvious part is doing that before 
buffers are evicted, but it's not uncommon for the OS cache to be 10x 
larger (or more). Even if we can't hit these pages before they're all 
the way on disk, if we at least know there's a pile of them we can do 
something before a foreground process (or at least let the 

Re: [HACKERS] Feedback on getting rid of VACUUM FULL

2015-04-24 Thread Jim Nasby

On 4/24/15 2:04 PM, Alvaro Herrera wrote:

Heikki Linnakangas wrote:


Hmm, AFAICT pg_reorg is much more complex, writing stuff to a temp table
and swapping relfilenodes afterwards. More like the VACUUM REWRITE
that's been discussed.

For the kicks, I looked at what it would take to write a utility like
that. It turns out to be quite trivial, patch attached. It uses the same
principle as VACUUM FULL, scans from the end, moving tuples to
lower-numbered pages until it can't do it anymore. It requires a small
change to heap_update(), to override the preference to store the new
tuple on the same page as the old one, but other than that, it's all in
the external module.


More than five years have passed since Heikki posted this, and we still
haven't found a solution to the problem -- which neverthless keeps
biting people to the point that multiple user-space implementations of
similar techniques are out there.

I think what we need here is something that does heap_update to tuples
at the end of the table, moving them to earlier pages; then wait for old
snapshots to die (the infrastructure for which we have now, thanks to
CREATE INDEX CONCURRENTLY); then truncate the empty pages.  Of course,
there are lots of details to resolve.  It doesn't really matter that
this runs for long: a process doing this for hours might be better than
AccessExclusiveLock on the table for a much shorter period.

Are there any takers?
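As a toy model of that scan-from-the-end approach, with pages reduced to bare live-tuple counts (the function, the fixed capacity, and the whole setup are invented for illustration; real heap pages and heap_update() are far more involved):

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_CAPACITY 4   /* invented: live tuples that fit on one "page" */

/* Move tuples from tail pages into free space on earlier pages, then
 * return the new page count (everything past it can be truncated once
 * old snapshots have died). */
static size_t
compact(int live[], size_t npages)
{
    size_t tail = npages;

    while (tail > 0)
    {
        size_t src = tail - 1;

        /* "heap_update" each tuple on the tail page to an earlier page */
        for (size_t dst = 0; dst < src && live[src] > 0; dst++)
        {
            while (live[dst] < PAGE_CAPACITY && live[src] > 0)
            {
                live[dst]++;
                live[src]--;   /* old version reclaimable later */
            }
        }
        if (live[src] > 0)
            break;             /* no room earlier; stop truncating here */
        tail = src;
    }
    return tail;
}
```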


Honestly, I'd prefer we exposed some way to influence where a new tuple 
gets put, and perhaps better ways of accessing tuples on a specific 
page. That would make it a lot easier to handle this in userspace, but 
it would also make it easier to do things like concurrent clustering. Or 
just organizing a table however you wanted.


That said, why not just pull what Heikki did into contrib, and add the 
necessary mode to heap_update?

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Replication identifiers, take 4

2015-04-24 Thread Andres Freund
On April 24, 2015 10:26:23 PM GMT+02:00, Peter Eisentraut pete...@gmx.net 
wrote:
On 4/24/15 8:32 AM, Andres Freund wrote:
 On 2015-04-20 11:26:29 +0300, Heikki Linnakangas wrote:
 On 04/17/2015 11:54 AM, Andres Freund wrote:
 I've attached a rebased patch, that adds decision about origin
logging
 to the relevant XLogInsert() callsites for external 2 byte
identifiers
 and removes the pad-reusing version in the interest of moving
forward.

 Putting aside the 2 vs. 4 byte identifier issue, let's discuss
naming:

 I just realized that it talks about replication identifier as the
new
 fundamental concept. The system table is called
pg_replication_identifier.
 But that's like talking about index identifiers, instead of just
indexes,
 and calling the system table pg_index_oid.

 The important concept this patch actually adds is the *origin* of
each
 transaction. That term is already used in some parts of the patch. I
think
 we should roughly do a search-replace of replication identifier ->
 replication origin to the patch. Or even transaction origin.
 
 Attached is a patch that does this, and some more, renaming. That was
 more work than I'd imagined.  I've also made the internal naming in
 origin.c more consistent/simpler and did a bunch of other cleanup.
 
 I'm pretty happy with this state.

Shouldn't this be backed up by pg_dump(all?)?


Given it deals with LSNs and is, quite fundamentally due to concurrency, 
non-transactional, I doubt it's worth it. The other side's slots also aren't going 
to be backed up, as pg_dump obviously can't know about them. So the represented 
data won't make much sense.


Andres

--- 
Please excuse brevity and formatting - I am writing this on my mobile phone.




Re: [HACKERS] Ignoring some binaries generated in src/test

2015-04-24 Thread Alvaro Herrera
Tom Lane wrote:

 The current logic in src/test/Makefile, particularly the way that
 the modules subdirectory is handled, seems pretty ugly/convoluted
 anyway.  I wonder why it was done that way rather than just ensuring
 that modules/ doesn't do anything for make install?

Because we do want to have the Makefile in src/test/modules to install
the modules if make install is invoked there.  That way, you can run
make -C src/test/modules install installcheck, and it works.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Freeze avoidance of very large table.

2015-04-24 Thread Jim Nasby

On 4/24/15 6:52 AM, Robert Haas wrote:

On Thu, Apr 23, 2015 at 9:03 PM, Michael Paquier
michael.paqu...@gmail.com wrote:

On Thu, Apr 23, 2015 at 10:42 PM, Robert Haas wrote:

On Thu, Apr 23, 2015 at 4:19 AM, Simon Riggs  wrote:

We only need a freeze/backup map for larger relations. So if we map 1000
blocks per map page, we skip having a map at all when size < 1000.


Agreed.  We might also want to map multiple blocks per map slot - e.g.
one slot per 32 blocks.  That would keep the map quite small even for
very large relations, and would not compromise efficiency that much
since reading 256kB sequentially probably takes only a little longer
than reading 8kB.
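Back-of-the-envelope numbers for the one-slot-per-32-blocks idea (BLCKSZ is PostgreSQL's default 8kB page size; the helper function below is illustrative only, not a proposed API):

```c
#include <assert.h>
#include <stdint.h>

#define BLCKSZ          8192ULL   /* default PostgreSQL block size */
#define BLOCKS_PER_SLOT 32ULL    /* one map bit covers 32 blocks = 256kB */

/* Size of a one-bit-per-slot map for a table of the given size. */
static uint64_t
map_bytes_for_table(uint64_t table_bytes)
{
    uint64_t blocks = table_bytes / BLCKSZ;
    uint64_t slots  = (blocks + BLOCKS_PER_SLOT - 1) / BLOCKS_PER_SLOT;
    return (slots + 7) / 8;       /* one bit per slot */
}
```

Under these assumptions a 1TB table needs only a 512kB map, which illustrates why the map stays "quite small even for very large relations".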

I think the idea of integrating the freeze map into the VM fork is
also worth considering.  Then, the incremental backup map could be
optional; if you don't want incremental backup, you can shut it off
and have less overhead.


When I read that I think about something configurable at
relation-level. There are cases where you may want more
granularity of this information at block level by having the VM slots
track fewer blocks than 32, and vice-versa.


What are those cases?  To me that sounds like making things
complicated to no obvious benefit.


Tables that get few/no dead tuples, like bulk insert tables. You'll have 
large sections of blocks with the same visibility.


I suspect the added code to allow setting one bit for multiple pages 
without having to lock all those pages simultaneously will probably 
outweigh the benefit of making this a reloption anyway.

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com




Re: [HACKERS] Replication identifiers, take 4

2015-04-24 Thread Peter Eisentraut
On 4/24/15 8:32 AM, Andres Freund wrote:
 On 2015-04-20 11:26:29 +0300, Heikki Linnakangas wrote:
 On 04/17/2015 11:54 AM, Andres Freund wrote:
 I've attached a rebased patch, that adds decision about origin logging
 to the relevant XLogInsert() callsites for external 2 byte identifiers
 and removes the pad-reusing version in the interest of moving forward.

 Putting aside the 2 vs. 4 byte identifier issue, let's discuss naming:

 I just realized that it talks about replication identifier as the new
 fundamental concept. The system table is called pg_replication_identifier.
 But that's like talking about index identifiers, instead of just indexes,
 and calling the system table pg_index_oid.

 The important concept this patch actually adds is the *origin* of each
 transaction. That term is already used in some parts of the patch. I think
 we should roughly do a search-replace of replication identifier ->
 replication origin to the patch. Or even transaction origin.
 
 Attached is a patch that does this, and some more, renaming. That was
 more work than I'd imagined.  I've also made the internal naming in
 origin.c more consistent/simpler and did a bunch of other cleanup.
 
 I'm pretty happy with this state.

Shouldn't this be backed up by pg_dump(all?)?






Re: [HACKERS] Reducing tuple overhead

2015-04-24 Thread Jim Nasby

On 4/23/15 10:40 PM, Amit Kapila wrote:

I agree with you, and I think one of the major reasons for bloat is that
the index segment doesn't have visibility information, due to which index
cleanup needs to be tied to the heap.  Now if we can move transaction
information to the page level, then we can even think of having it in the
index segment as well, and then the index can delete/prune its tuples on
its own, which can reduce index bloat significantly, and there is a benefit
to Vacuum as well.


I don't see how putting visibility at the page level helps indexes at 
all. We could already put XMIN in indexes if we wanted, but it won't 
help, because...



Now this has some downsides as well, like Delete needing to traverse the
index segment to delete-mark the tuples, but I think the upside of
reducing bloat can certainly outweigh the downsides.


... which isn't possible. You cannot go from a heap tuple to an index 
tuple. This has been discussed in the past. If we could do that then 
vacuum would become REALLY cheap compared to today.


BTW, before actually tackling anything we should try and get more data 
from Robert/Jan about where the extra 80% came from. We don't know if 
it's indexes or what.

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com




Re: [HACKERS] Feedback on getting rid of VACUUM FULL

2015-04-24 Thread Alvaro Herrera
Jim Nasby wrote:

 Honestly, I'd prefer we exposed some way to influence where a new tuple gets
 put, and perhaps better ways of accessing tuples on a specific page. That
 would make it a lot easier to handle this in userspace, but it would also
 make it easier to do things like concurrent clustering. Or just organizing a
 table however you wanted.

That's great and all, but it doesn't help people who have already, for
whatever reason, run into severe bloat and cannot take long enough
downtime to run VACUUM FULL.

 That said, why not just pull what Heikki did into contrib, and add the
 necessary mode to heap_update?

Sure, that's what I suggest.  We just need to fix the bugs and (as Tom
puts it) infelicities.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Ignoring some binaries generated in src/test

2015-04-24 Thread Tom Lane
Alvaro Herrera alvhe...@2ndquadrant.com writes:
 Tom Lane wrote:
 The current logic in src/test/Makefile, particularly the way that
 the modules subdirectory is handled, seems pretty ugly/convoluted
 anyway.  I wonder why it was done that way rather than just ensuring
 that modules/ doesn't do anything for make install?

 Because we do want to have the Makefile in src/test/modules to install
 the modules if make install is invoked there.  That way, you can run
 make -C src/test/modules install installcheck, and it works.

OK.  I still wonder if there isn't a better way to get that effect, but
I left it alone for now.  I committed Michael's new .gitignore files and
fixed the Makefiles so that make clean and friends clean up properly.

regards, tom lane




Re: [HACKERS] Feedback on getting rid of VACUUM FULL

2015-04-24 Thread Jim Nasby

On 4/24/15 3:34 PM, Alvaro Herrera wrote:

Jim Nasby wrote:


Honestly, I'd prefer we exposed some way to influence where a new tuple gets
put, and perhaps better ways of accessing tuples on a specific page. That
would make it a lot easier to handle this in userspace, but it would also
make it easier to do things like concurrent clustering. Or just organizing a
table however you wanted.


That's great and all, but it doesn't help people who have already, for
whatever reason, run into severe bloat and cannot take long enough
downtime to run VACUUM FULL.


That said, why not just pull what Heikki did into contrib, and add the
necessary mode to heap_update?


Sure, that's what I suggest.  We just need to fix the bugs and (as Tom
puts it) infelicities.


It looks like the biggest complaint (aside from allowing a limited 
number of tuples to be moved) is in [1] and [2], where Tom is saying 
that you can't simply call heap_update() like this without holding an 
exclusive lock on the table. Is that because we're not actually changing 
the tuple?


Another issue is both HOT and KeyUpdate; I think we need to completely 
ignore/override that stuff for this.


Instead of adding forcefsm, I think it would be more useful to accept a 
target block number. That way we can actually control where the new 
tuple goes. For this particular case we'd presumably go with normal FSM 
page selection logic, but someone could choose to do something more 
sophisticated if they wanted.
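A toy sketch of what caller-controlled placement could look like (heap_update() takes no such parameter today; the function, the INVALID_BLOCK sentinel, and the flat free-space array are all invented for illustration):

```c
#include <assert.h>
#include <stddef.h>

#define INVALID_BLOCK ((size_t) -1)   /* invented sentinel: "no preference" */

/* If the caller supplies a target block with enough room, honor it;
 * otherwise fall back to FSM-style first-page-with-room selection. */
static size_t
choose_block(const int freespace[], size_t nblocks, size_t needed,
             size_t target)
{
    if (target != INVALID_BLOCK && target < nblocks
        && (size_t) freespace[target] >= needed)
        return target;

    for (size_t i = 0; i < nblocks; i++)
        if ((size_t) freespace[i] >= needed)
            return i;

    return INVALID_BLOCK;             /* no room: extend the relation */
}
```

With this shape, the compaction case would pass the normal no-preference sentinel, while a clustering tool could steer tuples to chosen pages.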


[1] http://postgresql.org/message-id/3409.1253147...@sss.pgh.pa.us
[2] http://postgresql.org/message-id/3631.1253149...@sss.pgh.pa.us
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com

