Re: [HACKERS] Logical decoding of sequence advances, part II

2016-08-21 Thread Craig Ringer
On 22 August 2016 at 11:13, Craig Ringer  wrote:


> So yeah. I think extending SeqTableData and xl_seq_rec with xid is the way
> to go. Objections?
>


Prototyping this shows that re-using SeqTableData to store the creator xid
won't work out. We can't just store the toplevel xid because TRUNCATE ...
RESTART IDENTITY is transactional, using a new relfilenode and new sequence
"timeline". If we just store and xlog the toplevel xid as the sequence's
creator/restarter, we'll fail to handle the case where a subxact that did
TRUNCATE ... RESTART IDENTITY rolls back, e.g.:

CREATE TABLE x(id serial not null);
SELECT nextval('x_id_seq');   =>   1
BEGIN;
SELECT nextval('x_id_seq');   =>   2
SAVEPOINT sp1;
SELECT nextval('x_id_seq');   =>   3
TRUNCATE TABLE x RESTART IDENTITY;
SELECT nextval('x_id_seq');   =>   1
ROLLBACK TO SAVEPOINT sp1;
SELECT nextval('x_id_seq');   =>   4

sequence.c:init_sequence() detects this by noticing that the relfilenode
has changed and discarding cached values, resuming at last_value. Knowledge
of whether we created the sequence relfilenode is not retained so we can't
do something similar.

Unless anyone has any clever (or obvious but not to me) solutions to this,
I'll probably need to maintain a separate map of sequence relfilenodes we
created and which xid we created them in, so we can test whether that xid
is still in progress when logging a change. It's still pretty much free
when wal_level < logical or the current xact hasn't created any sequences.

Otherwise I could store a List of xids in the SeqTableData for the sequence
and check that for in-progress xids. It'd usually be NIL. If not, it'll
almost always be a 1-item List, the creating / resetting xid. If subxacts
are involved it'll become a stack. We walk down the stack checking whether
xacts are in progress and popping them if not until we find an in-progress
entry or run out of stack and set it to NIL.
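
A toy model of that stack walk (hypothetical names and a stubbed in-progress
check; the backend version would use a List hung off SeqTableData and
TransactionIdIsInProgress()):

```c
#include <assert.h>

typedef unsigned int TransactionId;
#define InvalidTransactionId 0u

#define MAX_CREATOR_XIDS 8

/*
 * Toy model of the per-sequence xid stack described above.
 * xids[nxids - 1] is the innermost creating/resetting subxact.
 */
typedef struct SeqCreatorStack
{
    int           nxids;
    TransactionId xids[MAX_CREATOR_XIDS];
} SeqCreatorStack;

/* Stand-in for TransactionIdIsInProgress(). */
typedef int (*xid_in_progress_fn)(TransactionId xid);

/*
 * Walk down the stack, popping entries whose xact has ended, until we
 * find an in-progress creator xid or run out of entries.
 */
static TransactionId
seq_creator_xid(SeqCreatorStack *stack, xid_in_progress_fn in_progress)
{
    while (stack->nxids > 0)
    {
        TransactionId top = stack->xids[stack->nxids - 1];

        if (in_progress(top))
            return top;      /* log this xid with the sequence advance */
        stack->nxids--;      /* committed or rolled back: pop it */
    }
    return InvalidTransactionId;   /* advance is non-transactional */
}

/* Example predicate: pretend only xid 100 is still running. */
static int
only_100_running(TransactionId xid)
{
    return xid == 100;
}
```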

Either will produce the same desired result: the correct subxact xid for
the innermost in-progress xact that created or reset this sequence, if any.

(I initially planned to just punt on TRUNCATE and let event triggers handle
it, but the need to roll back sequence advances if a TRUNCATE ... RESTART
IDENTITY is rolled back means sequence decoding must pay attention to it).


I'm also having trouble working out how to get a historical snapshot for
the most recent committed xact in a decoding session so the sequence's name
can be looked up by oid in the relcache during decoding. Advice would be
welcome if anyone can spare a moment.

I'll keep working on this concurrent with some higher priority work.
Suggestions, advice, or screams of horror welcomed. I think we really,
really need logical decoding of sequence advances...

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Should we cacheline align PGXACT?

2016-08-21 Thread Andres Freund
On 2016-08-22 11:25:55 +0800, Craig Ringer wrote:
> On 22 August 2016 at 10:40, Andres Freund  wrote:
> 
> > On 2016-08-19 09:46:12 -0400, Tom Lane wrote:
> > > Alexander Korotkov  writes:
> > > > originally this idea was proposed by Andres Freund while experimenting
> > with
> > > > lockfree Pin/UnpinBuffer [1].
> > > > The patch is attached as well as results of pgbench -S on 72-cores
> > > > machine.  As before it shows huge benefit in this case.
> > >
> > > That's one mighty ugly patch.
> >
> > My version of it was only intended to nail down some variability on the
> > pgpro machine, it wasn't intended for submission.
> >
> >
> > > Can't you do it without needing to introduce the additional layer of
> > > struct nesting?
> >
> > If we required support for anonymous unions, such things would be a lot
> > easier to do.  That aside, the only alternative seems to be hard-coding
> > padding space - which probably isn't all that un-fragile either.
> > 
> >
> 
> Somewhat naïve question from someone with much less clue about low level
> cache behaviour trying to follow along: given that we determine such
> padding at compile time, how do we ensure that the cacheline size we're
> targeting is right at runtime?

There are basically only a few common cacheline sizes; pretty much only
64 and 128 bytes these days. By usually padding to the
larger of those two, we waste a bit of memory, but not actually cache
space on platforms with smaller lines, because the padding is never
accessed.
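
As an illustration, such padding is typically done with a union, the same
trick PostgreSQL's LWLockPadded uses; the struct and line size here are
examples, not backend code:

```c
#include <assert.h>
#include <stddef.h>

#define PG_CACHE_LINE_SIZE 128  /* pad to the larger common line size */

/* A hypothetical hot shared-memory struct, much smaller than a line. */
typedef struct HotCounters
{
    unsigned int head;
    unsigned int tail;
} HotCounters;

/*
 * Union-based padding: each array element occupies a full 128-byte slot,
 * so an entry never shares a line with its neighbours even on
 * 128-byte-line hardware, and on 64-byte-line hardware the extra padding
 * is simply never touched.
 */
typedef union PaddedCounters
{
    HotCounters counters;
    char        pad[PG_CACHE_LINE_SIZE];
} PaddedCounters;
```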

> Or is it safe to assume that using 16 bytes so we don't cross cache
> line boundaries is always helpful, whether we have 4 PGXACT entries
> (64 byte line) or some other number per cacheline?

That's generally a good thing to do, yes.  It's probably not going to
give the full benefits here though.

Regards

Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] UTF-8 docs?

2016-08-21 Thread Tatsuo Ishii
Just out of curiosity, I wonder why we can't make the encoding of the
SGML docs UTF-8, rather than the current ISO-8859-1.

As long as everything is written in ASCII, the size of the docs will be
almost the same even if UTF-8 is used. Plus, if the encoding is changed to
UTF-8, it is very easy to translate the docs to local languages. As far
as I know, all of the local language docs under
https://www.postgresql.org/docs/ are using UTF-8.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp




Re: [HACKERS] Declarative partitioning - another take

2016-08-21 Thread Ashutosh Bapat
The parent-child relationship of multi-level partitioned tables is not
retained when creating the AppendRelInfo nodes. We create RelOptInfo nodes
for all the leaf and intermediate tables. The AppendRelInfo nodes created
for these RelOptInfos set the topmost table as the parent of all the leaf
and child tables. Since partitioning scheme/s at each level is/are
associated with the parent/s at that level, we loose information about the
immediate parents and thus it becomes difficult to identify which leaf node
falls where in the partition hierarchy. This stops us from doing any
lump-sum partition pruning where we can eliminate all the partitions under
a given parent-partition if that parent-partition gets pruned. It also
restricts partition-wise join technique from being applied to partial
partition hierarchy when the whole partitioning scheme of joining tables
does not match. Maintaining a RelOptInfo hierarchy should not create
corresponding Append (all kinds) plan hierarchy since
accumulate_append_subpath() flattens any such hierarchy while creating
paths. Can you please consider this point in your upcoming patch?


On Wed, Aug 10, 2016 at 4:39 PM, Amit Langote  wrote:

> Hi,
>
> Attached is the latest set of patches to implement declarative
> partitioning.  There is already a commitfest entry for the same:
> https://commitfest.postgresql.org/10/611/
>
> The old discussion is here:
> http://www.postgresql.org/message-id/55d3093c.5010...@lab.ntt.co.jp/
>
> Attached patches are described below:
>
> 0001-Catalog-and-DDL-for-partition-key.patch
> 0002-psql-and-pg_dump-support-for-partitioned-tables.patch
>
> These patches create the infrastructure and DDL for partitioned
> tables.
>
> In addition to a catalog for storing the partition key information, this
> adds a new relkind to pg_class.h. PARTITION BY clause is added to CREATE
> TABLE. Tables so created are RELKIND_PARTITIONED_REL relations which are
> to be special in a number of ways, especially with regard to their
> interactions with regular table inheritance features.
>
> PARTITION BY RANGE ({ column_name | ( expression ) } [ opclass ] [, ...])
> PARTITION BY LIST ({ column_name | ( expression ) } [ opclass ])
>
>
> 0003-Catalog-and-DDL-for-partition-bounds.patch
> 0004-psql-and-pg_dump-support-for-partitions.patch
>
> These patches create the infrastructure and DDL for partitions.
>
> Parent-child relationships of a partitioned table and its partitions are
> managed behind-the-scenes with inheritance.  That means there is a
> pg_inherits entry and attributes, constraints, etc. are marked with
> inheritance related information appropriately.  However this case differs
> from a regular inheritance relationship in a number of ways.  While the
> regular inheritance imposes certain restrictions on what elements a
> child's schema is allowed to contain (both at creation time and
> after-the-fact), the partitioning related code imposes further
> restrictions.  For example, while regular inheritance allows a child to
> contain its own columns, the partitioning code disallows that.  Stuff like
> NO INHERIT marking on check constraints and ONLY are ignored by the
> partitioning code.
>
> Partition DDL includes both a way to create a new partition and "attach" an
> existing table as a partition of a parent partitioned table.  An attempt to
> drop a partition using DROP TABLE causes an error. Instead a partition
> needs first to be "detached" from the parent partitioned table.  On the other
> hand, dropping the parent drops all the partitions if CASCADE is specified.
>
> CREATE TABLE partition_name
> PARTITION OF parent_table [ (
>   { column_name WITH OPTIONS [ column_constraint [ ... ] ]
> | table_constraint }
> [, ... ]
> ) ] partition_bound_spec
> [ PARTITION BY {RANGE | LIST} ( { column_name | ( expression ) } [ opclass
> ] [, ...] )
>
> CREATE FOREIGN TABLE [ IF NOT EXISTS ] table_name
>   PARTITION OF parent_table [ (
>   { column_name WITH OPTIONS [ column_constraint [ ... ] ]
> | table_constraint }
> [, ... ]
> ) ] partition_bound_spec
>   SERVER server_name
> [ OPTIONS ( option 'value' [, ... ] ) ]
>
> ALTER TABLE parent ATTACH PARTITION partition_name partition_bound_spec [
> VALIDATE | NO VALIDATE ]
>
> ALTER TABLE parent DETACH PARTITION partition_name
>
> partition_bound_spec is:
>
> FOR VALUES { list_spec | range_spec }
>
> list_spec in FOR VALUES is:
>
> IN ( expression [, ...] )
>
> range_spec in FOR VALUES is:
>
> START lower-bound [ INCLUSIVE | EXCLUSIVE ] END upper-bound [ INCLUSIVE |
> EXCLUSIVE ]
>
> where lower-bound and upper-bound are:
>
> { ( expression [, ...] ) | UNBOUNDED }
>
> expression can be a string literal, a numeric literal or NULL.
>
> Note that one can specify PARTITION BY when creating a partition
> itself. That is to allow creating multi-level partitioned tables.
>
>
> 0005-Teach-a-few-places-to-use-partition-check-constraint.patch
>
> A partition's bound implicitly constrains 

Re: [HACKERS] Should we cacheline align PGXACT?

2016-08-21 Thread Craig Ringer
On 22 August 2016 at 10:40, Andres Freund  wrote:

> On 2016-08-19 09:46:12 -0400, Tom Lane wrote:
> > Alexander Korotkov  writes:
> > > originally this idea was proposed by Andres Freund while experimenting
> with
> > > lockfree Pin/UnpinBuffer [1].
> > > The patch is attached as well as results of pgbench -S on 72-cores
> > > machine.  As before it shows huge benefit in this case.
> >
> > That's one mighty ugly patch.
>
> My version of it was only intended to nail down some variability on the
> pgpro machine, it wasn't intended for submission.
>
>
> > Can't you do it without needing to introduce the additional layer of
> > struct nesting?
>
> If we required support for anonymous unions, such things would be a lot
> easier to do.  That aside, the only alternative seems to be hard-coding
> padding space - which probably isn't all that un-fragile either.
> 
>

Somewhat naïve question from someone with much less clue about low level
cache behaviour trying to follow along: given that we determine such
padding at compile time, how do we ensure that the cacheline size we're
targeting is right at runtime? Or is it safe to assume that using 16 bytes
so we don't cross cache line boundaries is always helpful, whether we have
4 PGXACT entries (64 byte line) or some other number per cacheline?

Also, for anyone else following this discussion who wants to understand it
better, take a look at
http://igoro.com/archive/gallery-of-processor-cache-effects/ .

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] replication slots replicated to standbys?

2016-08-21 Thread Craig Ringer
On 22 August 2016 at 10:31, Michael Paquier 
wrote:

> On Sun, Aug 21, 2016 at 1:24 PM, Craig Ringer
>  wrote:
> > On 21 Aug 2016 12:36 AM, "Bruce Momjian"  wrote:
> >> Seems like another good idea to use pg_basebackup rather than manually
> >> doing base backups;  Magnus has been saying this for a while.
> >
> > The main time that's an issue is when you're rsync'ing to save bandwidth,
> > using CoW volume snapshots, etc. pg_basebackup becomes totally
> impractical
> > on big systems.
>
> Yes, and that's not fun. Particularly when the backup takes so long
> that WAL has already been recycled... Replication slots help here but
> the partitions dedicated to pg_xlog have their limit as well.
>

We can and probably should allow XLogReader to invoke restore_command to
fetch WAL, read it, and discard/recycle it again. This would greatly
alleviate the pain of indefinite xlog retention.

It's a pain to do so while recovery.conf is its own separate magic though,
not part of postgresql.conf.

I have no plans to work on this at this time.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


[HACKERS] Logical decoding of sequence advances, part II

2016-08-21 Thread Craig Ringer
Hi all

My earlier efforts at logical decoding of sequence advances were too
simplistic[1], falling afoul of issues with sequences being both
transactional and not transactional depending on whether the sequence is
created in the current xact or not.

TL;DR of solution:

* extend SeqTableData and xl_seq_rec with xid
* set xid in SeqTableData during DefineSequence()
* if SeqTableData set, write it in xl_seq_rec when writing xlog
* assign sequence update to specified xid's reorder buffer in decoding if
xid set, otherwise decode immediately


The problem:

If the sequence is created in the current xact (i.e. uncommitted) we have
to add the sequence updates to that xact to be replayed only if it commits.
The sequence is visible only to the toplevel xact that created the sequence
so advances of it can only come from that xact and its children. The actual
CREATE SEQUENCE is presumed to be handled separately by an event trigger or
similar.

If the new sequence is committed we must replay sequence advances
immediately and non-transactionally to ensure they're not lost due to xact
rollback or replayed in the wrong order due to xact commit order.

If the sequence is ALTERed in a way that changes pg_class that's event
triggers' job and sequence decoding doesn't care. If it's ALTERed in a way
that changes Form_pg_sequence we replay the change immediately, using the
last committed snapshot to get the sequence details, so the change will
take immediate effect and is retained whether or not any pg_class changes
are committed. This reflects how it happens on the upstream.


Planned solution:

Extend xl_seq_rec with a created_in_xid TransactionId field. If
created_in_xid != InvalidTransactionId, logical decoding associates the
sequence advance with the given toplevel xact and adds it to the reorder
buffer instead of immediately invoking the sequence decoding callback. The
decoding callback then gets invoked during ReorderBufferCommit processing
at the appropriate time, like any other transactional change.

To determine whether to log an xid for the sequence advance we need some
backend local state to determine whether the sequence is new in this xact.
Handily we already have one, the seqhashtab of SeqTableData in sequence.c,
just where it's needed. So all that's needed is to add a TransactionId
field that we set if we created that sequence in this session. If it's set
we test it for TransactionIsInProgress() when xlog'ing a sequence advance;
if it is, log that xid. If not in progress, clear the xid in SeqTableData
entry so we don't check again.
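
A minimal sketch of that decision, with stand-in types and a stubbed
TransactionIdIsInProgress(); the real code would live alongside sequence.c's
WAL logging:

```c
#include <assert.h>

typedef unsigned int TransactionId;
#define InvalidTransactionId 0u

/* Hypothetical subset of sequence.c's per-sequence cache entry. */
typedef struct SeqTableData
{
    unsigned int  relid;
    TransactionId created_in_xid; /* set when we create/reset the sequence */
} SeqTableData;

typedef int (*xid_in_progress_fn)(TransactionId xid);

/*
 * Decide which xid, if any, to put in the xl_seq_rec being written.
 * Once the creating xact is seen to have ended, the cached xid is
 * cleared so later advances skip the check and decode immediately.
 */
static TransactionId
xid_for_seq_xlog(SeqTableData *elm, xid_in_progress_fn in_progress)
{
    if (elm->created_in_xid != InvalidTransactionId)
    {
        if (in_progress(elm->created_in_xid))
            return elm->created_in_xid;
        elm->created_in_xid = InvalidTransactionId; /* don't check again */
    }
    return InvalidTransactionId;
}

/* Example predicate: pretend only xid 42 is still running. */
static int
only_42_running(TransactionId xid)
{
    return xid == 42;
}
```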



Another approach would be, during decoding, to look up the relfilenode of
the sequence to get the sequence oid and do a pg_class lookup. Check to see
whether xmin is part of an in-progress xact. If so, add the sequence
advance to that xact's reorder buffer, otherwise decode it immediately. The
problem is that

(a) I think we lack relfilenode-to-oid mapping information at
decoding-time. RelidByRelfilenode() needs a snapshot and is invoked
during ReorderBufferCommit(). We have to make the transactional vs
nontransactional decision in LogicalDecodingProcessRecord(), when I'm pretty
sure we don't have a snapshot.

(b) It also has issues with ALTER SEQUENCE. We must replay decoded sequence
updates immediately even if some in-flight xact has modified the pg_class
entry for the sequence. So we can't just check whether the xmin is one of
our xact's (sub)xids, we must also check whether some older tuple for the
same sequence oid has a corresponding xmax and keep walking backwards until
we determine whether we originally CREATEd the sequence in this xact or
only ALTERed it.


So yeah. I think extending SeqTableData and xl_seq_rec with xid is the way
to go. Objections?

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] replication slots replicated to standbys?

2016-08-21 Thread Michael Paquier
On Sun, Aug 21, 2016 at 1:35 AM, Bruce Momjian  wrote:
> On Sat, Aug 20, 2016 at 01:43:42PM +0900, Michael Paquier wrote:
>> Note as well that pg_basebackup omits its content and creates an empty
>> directory.
>
> Are there any other error-prone things copied from the master?

The contents of pg_snapshots get copied by pg_basebackup. Those are
useless in a backup, but harmless.
-- 
Michael




Re: [HACKERS] Should we cacheline align PGXACT?

2016-08-21 Thread Andres Freund
On 2016-08-19 09:46:12 -0400, Tom Lane wrote:
> Alexander Korotkov  writes:
> > originally this idea was proposed by Andres Freund while experimenting with
> > lockfree Pin/UnpinBuffer [1].
> > The patch is attached as well as results of pgbench -S on 72-cores
> > machine.  As before it shows huge benefit in this case.
> 
> That's one mighty ugly patch.

My version of it was only intended to nail down some variability on the
pgpro machine, it wasn't intended for submission.


> Can't you do it without needing to introduce the additional layer of
> struct nesting?

If we required support for anonymous unions, such things would be a lot
easier to do.  That aside, the only alternative seems to be hard-coding
padding space - which probably isn't all that un-fragile either.




Re: [HACKERS] replication slots replicated to standbys?

2016-08-21 Thread Michael Paquier
On Sun, Aug 21, 2016 at 1:24 PM, Craig Ringer
 wrote:
> On 21 Aug 2016 12:36 AM, "Bruce Momjian"  wrote:
>> Seems like another good idea to use pg_basebackup rather than manually
>> doing base backups;  Magnus has been saying this for a while.
>
> The main time that's an issue is when you're rsync'ing to save bandwidth,
> using CoW volume snapshots, etc. pg_basebackup becomes totally impractical
> on big systems.

Yes, and that's not fun. Particularly when the backup takes so long
that WAL has already been recycled... Replication slots help here but
the partitions dedicated to pg_xlog have their limit as well.

>> I supposed there is no way we could remove this error-prone behavior
>> because replication slots must survive server restarts.  Is there no way
>> to know if we are starting a standby from a fresh base backup vs.
>> restarting a standby?  In that case we could clear the replication
>> slots.  Are there any other error-prone things copied from the master?
>
> We could remove slots when we enter archive recovery. But I've recently
> implemented support for logical decoding from standbys, which needs slots.
> Physical slot use on standby is also handy. We cannot tell whether a slot
> was created on the replica or created on the master and copied in the base
> backup and don't want to drop slots created on the replica.
>
> I also have use cases for slots being retained in restore from snapshot, for
> re-integrating restored nodes into an MM mesh.
>
> I think a recovery.conf option to remove all slots during archive recovery
> could be handy. But mostly it comes down to tools not copying them.

Yes, I'd personally leave recovery.conf out of that, as well as the
removal of replication slot data when archive recovery begins, to keep
the configuration simple. The decision about what data is included in
any backup will be made by the tool itself anyway.
-- 
Michael




Re: [HACKERS] Should we cacheline align PGXACT?

2016-08-21 Thread Andres Freund
On 2016-08-20 14:33:13 -0400, Robert Haas wrote:
> On Aug 19, 2016, at 2:12 AM, Alexander Korotkov  
> wrote:
> > Hackers,
> > 
> > originally this idea was proposed by Andres Freund while experimenting with 
> > lockfree Pin/UnpinBuffer [1].
> > The patch is attached as well as results of pgbench -S on 72-cores machine. 
> >  As before it shows huge benefit in this case.
> > For sure, we should validate that it doesn't cause performance regression 
> > in other cases.  At least we should test read-write and smaller machines.
> > Any other ideas?
> 
> Wow, nice results.  My intuition on why PGXACT helped in the first
> place was that it minimized the number of cache lines that had to be
> touched to take a snapshot. Padding obviously would somewhat increase
> that again, so I can't quite understand why it seems to be
> helping... any idea?

I don't think it's that surprising: PGXACT->xid is written to in each
transaction, and ->xmin is often written to multiple times per
transaction. That means that if a PGXACT's cacheline is shared between
backends, one write will often first require another CPU to flush its store
buffer / L1 / L2 cache. If there are several hops between two cores, that
can mean quite a bit of added latency.  I previously played around with
*removing* the optimization of resetting ->xmin when not required
anymore - and on a bigger machine it noticeably increased throughput on
higher client counts.

To me it's pretty clear that rounding PGXACT's size up to 16 bytes
(instead of the current 12, with 4 byte alignment) is going to be a win;
the current approach just leads to pointless sharing.  Besides, storing
the database oid in there will allow GetOldestXmin() to only use PGXACT,
and could, with a bit more work, allow it to ignore other databases in
GetSnapshotData().
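
A standalone sketch of what that 16-byte layout might look like (field set
approximated from 9.6's PGXACT; the databaseId field is the hypothetical
addition, and this is not the actual backend definition):

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t TransactionId;
typedef uint32_t Oid;

/*
 * Sketch of a 16-byte PGXACT with the database oid folded in, as
 * suggested above.  Four 4-byte-aligned words, so the compiler adds
 * no hidden padding and entries pack evenly into cache lines.
 */
typedef struct PGXACT
{
    TransactionId xid;        /* currently assigned xid, or invalid */
    TransactionId xmin;       /* snapshot xmin */
    Oid           databaseId; /* new: lets GetOldestXmin() skip PGPROC */
    uint8_t       vacuumFlags;
    uint8_t       overflowed;
    uint8_t       delayChkpt;
    uint8_t       nxids;
} PGXACT;
```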

I'm less sure that going up to a full cacheline is always a win.

Andres




Re: [HACKERS] Patch: initdb: "'" for QUOTE_PATH (non-windows)

2016-08-21 Thread Michael Paquier
On Sun, Aug 21, 2016 at 3:02 AM, Tom Lane  wrote:
> I wrote:
>> A bigger issue here is that it seems fundamentally wrong for initdb to be
>> including libpq, because it surely is never meant to be communicating
>> with a running postmaster.  Not sure what to do about that.  We could
>> consider moving pqexpbuffer out of libpq into fe_utils, but I wonder
>> whether that would break any third-party code.  We've never advertised
>> pqexpbuffer.h as a supported API of libpq, but it's probably handy enough
>> that people use it anyway.  I suppose we could duplicate it in fe_utils
>> and libpq, though that's a tad ugly.  Thoughts?
>
> I looked into this and soon found that fe_utils/string_utils.o has
> dependencies on libpq that are much wider than just pqexpbuffer :-(.
> It might be a project to think about sorting that out sometime, but it
> looks like it would be an awful lot of work just to avoid having initdb
> depend on libpq.so.  So I now think this objection is impractical.
> I'll just annotate the makefile entry to say that initdb itself doesn't
> use libpq.

pqexpbuffer.c is an independent piece of infrastructure, so we could move it
to src/common to loosen the dependency a bit, and have libpq link
to the source file itself at build time. The real problem is the call
to PQmblen in psqlscan.l... and I am not sure how that could be
refactored cleanly.

>> Another perhaps-only-cosmetic issue is that now initdb prints quotes
>> whether they are needed or not.  I find this output pretty ugly:
>> ...
>> That's not really the fault of this patch perhaps.  Maybe we could adjust
>> appendShellString so it doesn't add quotes if they are clearly
>> unnecessary.
>
> I still think this is worth fixing, but it's a simple modification.
> Will take care of it.
>
> This item is listed in the commitfest as a bug fix, but given the lack of
> previous complaints, and the fact that the printed command isn't really
> meant to be blindly copied-and-pasted anyway (you're at least meant to
> replace "logfile" with something), I do not think it needs back-patching.

Agreed on that. Only first-time users do a copy-paste of the command
anyway, so the impact is very narrow.

And actually, I had a look at the build failure that you reported in
3855.1471713...@sss.pgh.pa.us. While that was because of a copy of
libpq.so that I had in my own LD_LIBRARY_PATH, shouldn't all the other
frontend utilities depending on fe_utils also use $(libpq_pgport)
instead of -lpq? I think that you saw the error for initdb because
that's the first one to be built in src/bin/, and I think that Ryan
used -lpq in his patch because the same pattern is used everywhere
else.

It seems to me as well that submake-libpq and submake-libpgfeutils
should be listed in initdb's Makefile when building the binary.

Am I missing something? Please see the patch attached to see what I mean.

Thinking harder about that, it may be better to even add a variable in
Makefile.global.in for utilities in need of fe_utils, like that for
example:
libpg_feutils = -L$(top_builddir)/src/fe_utils -lpgfeutils $(libpq_pgport)
Thoughts?
-- 
Michael
diff --git a/src/bin/initdb/Makefile b/src/bin/initdb/Makefile
index 531cc97..394eae0 100644
--- a/src/bin/initdb/Makefile
+++ b/src/bin/initdb/Makefile
@@ -30,7 +30,7 @@ OBJS=	initdb.o findtimezone.o localtime.o encnames.o $(WIN32RES)
 
 all: initdb
 
-initdb: $(OBJS) | submake-libpgport
+initdb: $(OBJS) | submake-libpq submake-libpgport submake-libpgfeutils
 	$(CC) $(CFLAGS) $(OBJS) $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
 
 # We used to pull in all of libpq to get encnames.c, but that
diff --git a/src/bin/pg_dump/Makefile b/src/bin/pg_dump/Makefile
index 25336a5..08b01d7 100644
--- a/src/bin/pg_dump/Makefile
+++ b/src/bin/pg_dump/Makefile
@@ -17,7 +17,7 @@ top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
 override CPPFLAGS := -I$(libpq_srcdir) $(CPPFLAGS)
-LDFLAGS += -L$(top_builddir)/src/fe_utils -lpgfeutils -lpq
+LDFLAGS += -L$(top_builddir)/src/fe_utils -lpgfeutils $(libpq_pgport)
 
 OBJS=	pg_backup_archiver.o pg_backup_db.o pg_backup_custom.o \
 	pg_backup_null.o pg_backup_tar.o pg_backup_directory.o \
diff --git a/src/bin/pg_upgrade/Makefile b/src/bin/pg_upgrade/Makefile
index 8823288..aa3c45b 100644
--- a/src/bin/pg_upgrade/Makefile
+++ b/src/bin/pg_upgrade/Makefile
@@ -12,7 +12,7 @@ OBJS = check.o controldata.o dump.o exec.o file.o function.o info.o \
tablespace.o util.o version.o $(WIN32RES)
 
 override CPPFLAGS := -DDLSUFFIX=\"$(DLSUFFIX)\" -I$(srcdir) -I$(libpq_srcdir) $(CPPFLAGS)
-LDFLAGS += -L$(top_builddir)/src/fe_utils -lpgfeutils -lpq
+LDFLAGS += -L$(top_builddir)/src/fe_utils -lpgfeutils $(libpq_pgport)
 
 
 all: pg_upgrade
diff --git a/src/bin/pgbench/Makefile b/src/bin/pgbench/Makefile
index 1503d00..228eed5 100644
--- a/src/bin/pgbench/Makefile
+++ b/src/bin/pgbench/Makefile
@@ -10,7 +10,7 @@ include $(top_builddir)/src/Makefile.global
 OBJS = pgbench.o exprparse.o 

Re: [HACKERS] synchronous_commit = remote_flush

2016-08-21 Thread Robert Haas
On Sun, Aug 21, 2016 at 6:08 PM, Thomas Munro
 wrote:
> On Fri, Aug 19, 2016 at 6:30 AM, Jim Nasby  wrote:
>> I'm wondering if we've hit the point where trying to put all of this in a
>> single GUC is a bad idea... changing that probably means a config
>> compatibility break, but I don't think that's necessarily a bad thing at
>> this point...
>
> Aside from the (IMHO) slightly confusing way that "on" works, which is
> the smaller issue I was raising in this thread, I agree that we might
> eventually want to escape from the assumption that "local apply" (=
> off), local flush, remote write, remote flush, remote apply happen in
> that order and therefore a single linear control knob can describe
> which of those to wait for.
>
> Some pie-in-the-sky thoughts: we currently can't reach
> "group-safe"[1], where you wait only for N servers to have the WAL in
> memory (let's say that for us that means write but not flush): the
> closest we can get is "1-safe and group-safe", using remote_write to
> wait for the standbys to write (= "group-safe"), which implies local
> flush (= "1-safe").  Now that'd be a terrible level to use unless your
> recovery procedure included cluster-wide communication to straighten
> things out; without any such clusterware it makes a lot of sense
> to have the master flush before sending. I'm not actually
> proposing we change that, just speculating that someone might
> eventually want it.  We also can't have standbys apply before they
> flush; as far as I know there is no theoretical reason why that
> shouldn't be allowed, except maybe for some special synchronisation
> steps around checkpoint records so that recovery doesn't get too far
> ahead.

Well, in order to remain recoverable, the standby has to obey the
WAL-before-data rule: if it writes a page with a given LSN, that LSN
had better be flushed to disk first.  In practice, this means that if
you want a standby to remain recoverable without needing to contact
the rest of the cluster, you can't let its minimum recovery point pass
the WAL flush point.  In short, this comes up anytime you evict a
buffer, not just around checkpoints.
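
That invariant can be modeled in a few lines (a toy sketch, not backend
code; xlog_flush stands in for XLogFlush()):

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

/* Toy model of the WAL-before-data rule. */
static XLogRecPtr flushed_lsn = 0;  /* WAL durably on disk up to here */

static void
xlog_flush(XLogRecPtr upto)
{
    if (upto > flushed_lsn)
        flushed_lsn = upto;
}

/*
 * A dirty page stamped with page_lsn may only be written out after WAL
 * up to that LSN is flushed; otherwise a crash could leave on-disk data
 * ahead of recoverable WAL.  Returns the flush point after eviction.
 */
static XLogRecPtr
evict_page(XLogRecPtr page_lsn)
{
    xlog_flush(page_lsn);             /* enforce WAL-before-data */
    assert(flushed_lsn >= page_lsn);  /* the invariant described above */
    return flushed_lsn;
}
```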

> That'd mirror what happens on the master more closely.
> Imagine if you wanted to wait for your transaction to become visible
> on certain other servers, but didn't want to wait for any disks:
> that'd be the distributed equivalent of today's "off", but today's
> "remote_apply" implies local flush and remote flush.  Or more likely
> you'd want some combination: 2-safe or group-safe on some subset of
> servers to satisfy your durability requirements, and applied on some
> other perhaps larger subset of servers for consistency.  But this is
> just water cooler handwaving.

Sure, that stuff would be great, and we'll probably have to redesign
synchronous_commit entirely if and when we get there, but I'm not sure
it makes sense to tinker with it now just for that.  The original
reason why I suggested the current design for synchronous_commit is to
avoid forcing people to set yet another GUC in order to use
synchronous replication.  The default of 'on' means that you can just
configure synchronous_standby_names and away you go.  Perhaps a better
design as we added more values would have been to keep
synchronous_commit as on/local/off and use a separate GUC, say,
synchronous_replication to define what "on" means: remote_apply,
remote_flush, remote_write, 2safe+groupsafe, or whatever.  And when
synchronous_standby_names='' then the value of synchronous_replication
is ignored, and synchronous_commit=on means the same as
synchronous_commit=local, just as it does today.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] dsm_unpin_segment

2016-08-21 Thread Robert Haas
On Sun, Aug 21, 2016 at 7:54 PM, Thomas Munro
 wrote:
> Here's the rationale I'm using: if it's helpful to programmers of
> client code, especially client code that might include extensions, and
> nowhere near a hot code path, then why not use elog rather than
> Assert?  I took inspiration for that from the pre-existing "debugging
> cross-check"  in dsm_attach that does elog(ERROR, "can't attach the
> same segment more than once").  On that basis, this new version
> retains the elog you mentioned, and now also uses elog for the
> you-tried-to-unpin-a-handle-I-couldn't-find case.  But I kept Assert
> in places that detect bugs in *this* code, rather than client code.

Hmm, well said.  I've never thought about it in exactly that way, but
I think that is a useful distinction.  I've sometimes phrased it this
way: an Assert() is good if the path is performance-critical, or if
the Assert() is checking something that is "nearby" in the code, but
an elog() is better if we're checking for a condition that could
happen as a result of some code that is far-removed from the place
where the elog() is.  If an assertion fails, you're not necessarily
going to realize right away that the calling code needs to be checked
for errors.  That could be mitigated, of course, by adding a comment
right before the Assert() saying "if this Assertion fails, you
probably did X, and you shouldn't do that".  But an elog() can state
the exact problem right away.

Also, of course, elog() is the right tool if we want to perform the
check even in production builds where asserts are not enabled.  That's
not so relevant here, but it matters in some other cases, like when
checking for a case that shouldn't happen normally but could be the
result of data corruption.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




[HACKERS] Re: [sqlsmith] FailedAssertion("!(XLogCtl->Insert.exclusiveBackup)", File: "xlog.c", Line: 10200)

2016-08-21 Thread Michael Paquier
On Sat, Aug 6, 2016 at 6:35 PM, Andreas Seltenreich  wrote:
> Michael Paquier writes:
>
>> Andreas, with the patch attached is the assertion still triggered?
>> [2. text/x-diff; base-backup-crash-v2.patch]
>
> I didn't observe the crashes since applying this patch.  There should
> have been about five by the amount of fuzzing done.

I have reworked the patch, replacing those two booleans with a single
enum. That's definitely clearer. Also, I have added this patch to the
CF to not lose track of it:
https://commitfest.postgresql.org/10/731/
Horiguchi-san, I have added you as a reviewer as you provided some
input. I hope you don't mind.
-- 
Michael
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f13f9c1..2b8c09c 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -460,6 +460,25 @@ typedef union WALInsertLockPadded
 } WALInsertLockPadded;
 
 /*
+ * State of an exclusive backup
+ *
+ * EXCLUSIVE_BACKUP_NONE means that there is no exclusive backup actually
+ * running; to be more precise, pg_start_backup() is not being executed for
+ * an exclusive backup and there is no exclusive backup in progress.
+ * EXCLUSIVE_BACKUP_STARTING means that pg_start_backup() is running and
+ * that it is not done yet.
+ * EXCLUSIVE_BACKUP_IN_PROGRESS means that pg_start_backup() has finished
+ * running and an exclusive backup is in progress. pg_stop_backup() is
+ * needed to finish it.
+ */
+typedef enum ExclusiveBackupState
+{
+	EXCLUSIVE_BACKUP_NONE = 0,
+	EXCLUSIVE_BACKUP_STARTING,
+	EXCLUSIVE_BACKUP_IN_PROGRESS
+} ExclusiveBackupState;
+
+/*
  * Shared state data for WAL insertion.
  */
 typedef struct XLogCtlInsert
@@ -500,13 +519,14 @@ typedef struct XLogCtlInsert
 	bool		fullPageWrites;
 
 	/*
-	 * exclusiveBackup is true if a backup started with pg_start_backup() is
-	 * in progress, and nonExclusiveBackups is a counter indicating the number
+	 * exclusiveBackupState is changed to EXCLUSIVE_BACKUP_STARTING for the
+	 * duration of pg_start_backup(), then to EXCLUSIVE_BACKUP_IN_PROGRESS once
+	 * it is done, and nonExclusiveBackups is a counter indicating the number
 	 * of streaming base backups currently in progress. forcePageWrites is set
 	 * to true when either of these is non-zero. lastBackupStart is the latest
 	 * checkpoint redo location used as a starting point for an online backup.
 	 */
-	bool		exclusiveBackup;
+	ExclusiveBackupState exclusiveBackupState;
 	int			nonExclusiveBackups;
 	XLogRecPtr	lastBackupStart;
 
@@ -9833,7 +9853,7 @@ do_pg_start_backup(const char *backupidstr, bool fast, TimeLineID *starttli_p,
 	WALInsertLockAcquireExclusive();
 	if (exclusive)
 	{
-		if (XLogCtl->Insert.exclusiveBackup)
+		if (XLogCtl->Insert.exclusiveBackupState != EXCLUSIVE_BACKUP_NONE)
 		{
 			WALInsertLockRelease();
 			ereport(ERROR,
@@ -9841,7 +9861,7 @@ do_pg_start_backup(const char *backupidstr, bool fast, TimeLineID *starttli_p,
 	 errmsg("a backup is already in progress"),
 	 errhint("Run pg_stop_backup() and try again.")));
 		}
-		XLogCtl->Insert.exclusiveBackup = true;
+		XLogCtl->Insert.exclusiveBackupState = EXCLUSIVE_BACKUP_STARTING;
 	}
 	else
 		XLogCtl->Insert.nonExclusiveBackups++;
@@ -10096,7 +10116,7 @@ do_pg_start_backup(const char *backupidstr, bool fast, TimeLineID *starttli_p,
 		{
 			/*
 			 * Check for existing backup label --- implies a backup is already
-			 * running.  (XXX given that we checked exclusiveBackup above,
+			 * running.  (XXX given that we checked exclusiveBackupState above,
 			 * maybe it would be OK to just unlink any such label file?)
 			 */
 			if (stat(BACKUP_LABEL_FILE, &stat_buf) != 0)
@@ -10178,6 +10198,16 @@ do_pg_start_backup(const char *backupidstr, bool fast, TimeLineID *starttli_p,
 	PG_END_ENSURE_ERROR_CLEANUP(pg_start_backup_callback, (Datum) BoolGetDatum(exclusive));
 
 	/*
+	 * Mark that start phase has correctly finished for an exclusive backup.
+	 */
+	if (exclusive)
+	{
+		WALInsertLockAcquireExclusive();
+		XLogCtl->Insert.exclusiveBackupState = EXCLUSIVE_BACKUP_IN_PROGRESS;
+		WALInsertLockRelease();
+	}
+
+	/*
 	 * We're done.  As a convenience, return the starting WAL location.
 	 */
 	if (starttli_p)
@@ -10195,8 +10225,8 @@ pg_start_backup_callback(int code, Datum arg)
 	WALInsertLockAcquireExclusive();
 	if (exclusive)
 	{
-		Assert(XLogCtl->Insert.exclusiveBackup);
-		XLogCtl->Insert.exclusiveBackup = false;
+		Assert(XLogCtl->Insert.exclusiveBackupState == EXCLUSIVE_BACKUP_STARTING);
+		XLogCtl->Insert.exclusiveBackupState = EXCLUSIVE_BACKUP_NONE;
 	}
 	else
 	{
@@ -10204,7 +10234,7 @@ pg_start_backup_callback(int code, Datum arg)
 		XLogCtl->Insert.nonExclusiveBackups--;
 	}
 
-	if (!XLogCtl->Insert.exclusiveBackup &&
+	if (XLogCtl->Insert.exclusiveBackupState == EXCLUSIVE_BACKUP_NONE &&
 		XLogCtl->Insert.nonExclusiveBackups == 0)
 	{
 		XLogCtl->Insert.forcePageWrites = false;
@@ -10280,14 +10310,21 @@ do_pg_stop_backup(char *labelfile, bool 

Re: [HACKERS] New SQL counter statistics view (pg_stat_sql)

2016-08-21 Thread Haribabu Kommi
On Sun, Aug 21, 2016 at 1:17 AM, Tom Lane  wrote:
> Haribabu Kommi  writes:
>> This is a new statistics view that is used to provide the number of
>> SQL operations that happened during a particular interval of time.
>> This view is useful for finding out the pattern of the operations
>> that are happening in the instance during a particular interval of
>> time.
>
> 1. This set of counters seems remarkably random.  Why, for instance,
> count reindexes but not original index creations?

I thought of adding to the view only the columns that are mostly used
(excluding all DDL commands); otherwise the view's columns would fill
up with all the commands that are supported by PostgreSQL.

> 2. What will you do with cases such as SELECTs containing modifying CTEs?
> Or EXPLAIN ANALYZE?  Does CLUSTER count as a REINDEX?  Does REFRESH
> MATERIALIZED VIEW count as a SELECT?  (Or maybe it's an INSERT?)

Cluster commands will be counted as clusters; for EXPLAIN, some logic
is needed to identify exactly which SQL operation it is.  This is
because the nodeTag data from the parseTree structure is used to
identify what type of SQL statement it is.

The counters will not be updated for any SQL type that is not part of the view.

> If you're going to be selective about what you count, you're forever
> going to be fielding complaints from users who are unhappy that you
> didn't count some statement type they care about, or feel that you
> misclassified some complex case.

Yes, I agree that users may complain that the limited set of columns
is not sufficient for their use.  But with the mostly used columns,
this view may be useful for some users to find out the pattern of the
SQL operations that are happening on the instance.

> I'm inclined to suggest you forget this approach and propose a single
> counter for "SQL commands executed", which avoids all of the above
> definitional problems.  People who need more detail than that are
> probably best advised to look to contrib/pg_stat_statements, anyway.

I will add a new column to pg_stat_database to track the number of SQL
operations that happen on that particular database.

Regards,
Hari Babu
Fujitsu Australia




Re: [HACKERS] standalone backend PANICs during recovery

2016-08-21 Thread Michael Paquier
On Sun, Aug 21, 2016 at 1:41 AM, Tom Lane  wrote:
> So at this point I'm pretty baffled as to what the actual use-case is
> here.

Perhaps because it is easier to attach a debugger in this case to analyze the problem?

> I am tempted to say that a standalone backend should refuse to
> start at all if a recovery.conf file is present.  If we do want to
> allow the case, we need some careful thought about what it should do.

+1 for preventing an instance in --single mode from starting if
recovery.conf is present.  It is better to put in safeguards than to
get unwanted behavior.
-- 
Michael




Re: [HACKERS] SP-GiST support for inet datatypes

2016-08-21 Thread Tom Lane
Andrew Gierth  writes:
> "Tom" == Tom Lane  writes:
>  Tom> I started to look at this patch.  The reported speedup is pretty
>  Tom> nice, but ...

> The builtin gist support for inet seems quite surprisingly slow; ip4r
> beats it into the ground without difficulty. (I'd be curious how well
> this sp-gist version stacks up against ip4r.)

Yeah ... what I find interesting about this patch is that the opclass's
data splitting rules don't seem all that much different from what the
existing gist opclass knows how to do.  I suspect that the speed
differential arises because the gist opclass's penalty and picksplit
algorithms are somehow extremely stupid, which hints that maybe they
could be made better.  In particular, IIRC the test cases that we looked
at when making the gist opclass tended to have only one or a few netmask
lengths, whereas in this data set the netmasks are all over the place.
I theorize that that's what's breaking the gist algorithm.  Not sure
exactly how to make it better though.

This example also seems to be showing up some lack of intelligence in
SPGiST itself, in that there are an awful lot of pages with a lot of
empty space on them.  Not sure why.

regards, tom lane




Re: [HACKERS] dsm_unpin_segment

2016-08-21 Thread Thomas Munro
On Sat, Aug 20, 2016 at 11:37 PM, Amit Kapila  wrote:
> On Tue, Aug 9, 2016 at 10:07 AM, Thomas Munro
>  wrote:
>> On Tue, Aug 9, 2016 at 12:53 PM, Tom Lane  wrote:
>>> The larger picture here is that Robert is exhibiting a touching but
>>> unfounded faith that extensions using this feature will contain zero bugs.
>>> IMO there needs to be some positive defense against mistakes in using the
>>> pin/unpin API.  As things stand, multiple pin requests don't have any
>>> fatal consequences (especially not on non-Windows), so I have little
>>> confidence that it's not happening in the field.  I have even less
>>> confidence that there wouldn't be too many unpin requests.
>>
>> Ok, here is a version that defends against invalid sequences of
>> pin/unpin calls.  I had to move dsm_impl_pin_segment into the block
>> protected by DynamicSharedMemoryControlLock, so that it could come
>> after the already-pinned check, but before updating any state, since
>> it makes a Windows syscall that can fail.  That said, I've only tested
>> on Unix and will need to ask someone to test on Windows.
>>
>
> Few review comments:

Thanks for the review!

> 1.
> + /* Note that 1 means no references (0 means unused slot). */
> + if (--dsm_control->item[i].refcnt == 1)
> + destroy = true;
> +
> + /*
> + * Allow implementation-specific code to run.  We have to do this before
> + * releasing the lock, because impl_private_pm_handle may get modified by
> + * dsm_impl_unpin_segment.
> + */
> + if (control_slot >= 0)
> + dsm_impl_unpin_segment(handle,
> + &dsm_control->item[control_slot].impl_private_pm_handle);
>
> If there is an error in dsm_impl_unpin_segment(), then we don't need
> to decrement the reference count.  Isn't it better to do it after the
> dsm_impl_unpin_segment() is successful.  Similarly, I think pinned
> should be set to false after dsm_impl_unpin_segment().

Hmm.  Yeah, OK.  Things are in pretty bad shape if you fail to unpin
despite having run the earlier checks, but you're right, it's better
that way.  New version attached.

> Do you need a check if (control_slot >= 0)?  In the code just above
> you have an Assert to ensure that it is >= 0.

Right.  Furthermore, the code was using "i" in some places and
"control_slot" in others.  We might as well use control_slot
everywhere.

On Sun, Aug 21, 2016 at 5:45 PM, Amit Kapila  wrote:
> On Sat, Aug 20, 2016 at 6:13 PM, Robert Haas  wrote:
>> On Sat, Aug 20, 2016 at 7:37 AM, Amit Kapila  wrote:
>>> 2.
>>> + if (dsm_control->item[seg->control_slot].pinned)
>>> + elog(ERROR, "cannot pin a segment that is already pinned");
>>>
>>> Shouldn't this be a user facing error (which means we should use ereport)?
>>
>> Uh, certainly not.  This can't happen because of SQL the user enters;
>> it can only happen because of a C coding error.  elog() is the
>> appropriate tool for that case.
>>
>
> Okay, your point makes sense to me, but in that case why not use an
> Assert here?  I think elog() could also be used here as well, but it
> seems to me elog() is most appropriate for the cases where it is not
> straightforward to tell the behaviour of API or value of variable like
> when PageAddItem() is not successful.

Here's the rationale I'm using: if it's helpful to programmers of
client code, especially client code that might include extensions, and
nowhere near a hot code path, then why not use elog rather than
Assert?  I took inspiration for that from the pre-existing "debugging
cross-check"  in dsm_attach that does elog(ERROR, "can't attach the
same segment more than once").  On that basis, this new version
retains the elog you mentioned, and now also uses elog for the
you-tried-to-unpin-a-handle-I-couldn't-find case.  But I kept Assert
in places that detect bugs in *this* code, rather than client code.

> void
>  dsm_pin_segment(dsm_segment *seg)
>
> +void
> +dsm_unpin_segment(dsm_handle handle)
>
> Another point, I want to highlight here is that pin/unpin API's have
> different type of arguments.  The comments on top of function
> dsm_unpin_segment() explains the need of using dsm_handle which seems
> reasonable.  I just pointed out to see if anybody else has a different
> view.

Now that I have posted the DSA patch[1], it probably makes sense to
explain the path by which I arrived at the conclusion that unpin
should take a handle.  In an earlier version, dsm_unpin_segment took a
dsm_segment *, mirroring the pin function.  But then Robert complained
privately that my dsa_area destroy code path was sometimes having to
attach segments just to unpin them and then detach, which might fail
due to lack of virtual memory.  He was right to complain: that created
a fairly nasty failure mode where I'd unpinned some but not all of the
DSM segments that belong together.  So I concluded that it should be
possible to unpin a segment even when you haven't got 

Re: [HACKERS] WIP: About CMake v2

2016-08-21 Thread Christian Convey
Hi Yury,

>> I glad to hear it. I suppose you can just try build postgres and send all
>> problems to github tracker.
>> https://github.com/stalkerg/postgres_cmake/issues

FYI, I had success using your "postgres_cmake" repo.  I tested it up
through "make check" and "make install".

Here are the details:

* postgres_cmake commit e7e75160d4533cd8caa9f3f0dd7b485dbd4e7bdf
* compiler = cc (Ubuntu 4.8.4-2ubuntu1~14.04.3) 4.8.4
* cmake version = 3.5.3

Kind regards,
Christian




Re: [HACKERS] synchronous_commit = remote_flush

2016-08-21 Thread Thomas Munro
On Fri, Aug 19, 2016 at 6:30 AM, Jim Nasby  wrote:
> I'm wondering if we've hit the point where trying to put all of this in a
> single GUC is a bad idea... changing that probably means a config
> compatibility break, but I don't think that's necessarily a bad thing at
> this point...

Aside from the (IMHO) slightly confusing way that "on" works, which is
the smaller issue I was raising in this thread, I agree that we might
eventually want to escape from the assumption that "local apply" (=
off), local flush, remote write, remote flush, remote apply happen in
that order and therefore a single linear control knob can describe
which of those to wait for.

Some pie-in-the-sky thoughts: we currently can't reach
"group-safe"[1], where you wait only for N servers to have the WAL in
memory (let's say that for us that means write but not flush): the
closest we can get is "1-safe and group-safe", using remote_write to
wait for the standbys to write (= "group-safe"), which implies local
flush (= "1-safe").  Now that'd be a terrible level to use unless your
recovery procedure included cluster-wide communication to straighten
things out, and without any such clusterware it makes a lot of sense
to have the master flush before sending, and I'm not actually
proposing we change that, I'm just speculating that someone might
eventually want it.  We also can't have standbys apply before they
flush; as far as I know there is no theoretical reason why that
shouldn't be allowed, except maybe for some special synchronisation
steps around checkpoint records so that recovery doesn't get too far
ahead.  That'd mirror what happens on the master more closely.
Imagine if you wanted to wait for your transaction to become visible
on certain other servers, but didn't want to wait for any disks:
that'd be the distributed equivalent of today's "off", but today's
"remote_apply" implies local flush and remote flush.  Or more likely
you'd want some combination: 2-safe or group-safe on some subset of
servers to satisfy your durability requirements, and applied on some
other perhaps larger subset of servers for consistency.  But this is
just water cooler handwaving.

[1] https://infoscience.epfl.ch/record/49936/files/WS03

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [HACKERS] SP-GiST support for inet datatypes

2016-08-21 Thread Andrew Gierth
> "Tom" == Tom Lane  writes:

 > Emre Hasegeli  writes:
 >> Attached patches add SP-GiST support to the inet datatypes.

 Tom> I started to look at this patch.  The reported speedup is pretty
 Tom> nice, but ...

The builtin gist support for inet seems quite surprisingly slow; ip4r
beats it into the ground without difficulty. (I'd be curious how well
this sp-gist version stacks up against ip4r.)

-- 
Andrew (irc:RhodiumToad)




Re: [HACKERS] SP-GiST support for inet datatypes

2016-08-21 Thread Tom Lane
Emre Hasegeli  writes:
>> ... Several of the existing opclasses use fixed numbers of
>> child nodes, so why does this need something they don't?

> Currently, the SP-GiST framework supports a fixed number of child nodes
> only if the index grows by page splits, not by prefix splits.  This is
> not a fair API.

Hm, I see your point: the spgSplitTuple action only supports forming a
new upper tuple with a single, labeled child node.  That is pretty bogus.
You could imagine doing repeated spgAddNode actions to enlarge the new
upper tuple; but that's ridiculously inefficient, and it's also unclear
how you'd avoid confusing searches that descend through a partially-split
upper tuple (either concurrently, or after a crash that prevented
finishing the split).

> SP-GiST is not widely adopted in my experience.  Breaking this part of
> the API would affect prefix tree implementations.  I don't think there
> are many of them.  Supporting the new interface is easy for them.  We
> can try to support the old interface alongside the new one, but this
> wouldn't be very clean either.

We could imagine defining a "spgSplitTupleMulti" action alongside the
existing one, but I agree it's a bit of a wart --- nobody would have
designed it that way in a green field.  On the other hand, index AM-to-
opclass APIs are things we don't like to break.  We went out of our way
to make past GiST and GIN API changes upward-compatible.

Oleg, Teodor, do you have any idea how many people are using custom
SPGiST opclasses that might be affected by an API break here?

regards, tom lane




Re: [HACKERS] [PATCH] Alter or rename enum value

2016-08-21 Thread Tom Lane
Emre Hasegeli  writes:
>> +ReleaseCatCacheList(list);
>> +heap_close(pg_enum, RowExclusiveLock);

> Maybe we better release them before reporting error, too.  I would
> release the list after the loop, close the heap before ereport().

Transaction abort will clean up such resources just fine; if it did
not, then any function you call would have problems if it threw an
error.  I would not contort the logic to free stuff before ereport.

regards, tom lane




Re: [HACKERS] [PATCH] Alter or rename enum value

2016-08-21 Thread Emre Hasegeli
> Here is v4, which changes the command from ALTER VALUE to RENAME VALUE,
> for consistency with RENAME ATTRIBUTE.

It looks like we always use "ALTER ... RENAME ... old_name TO
new_name" syntax, so it is better that way.  I have noticed that all
the other RENAMEs go through ExecRenameStmt().  We better do the same.

> +int nbr_index;
> +Form_pg_enum nbr_en;

What is "nbr"?  Why not calling them something like "i" and "val"?

> +   /* Locate the element to rename and check if the new label is already in

The project style for multi-line comments is to leave the first line
of the comment block empty.

> +if (strcmp(NameStr(nbr_en->enumlabel), oldVal) == 0)

I found namestrcmp() for this.

> +}
> +if (!old_tup)

I would leave a space in between.

> +ReleaseCatCacheList(list);
> +heap_close(pg_enum, RowExclusiveLock);

Maybe we better release them before reporting error, too.  I would
release the list after the loop, close the heap before ereport().




Re: [HACKERS] synchronous_commit = remote_flush

2016-08-21 Thread Christoph Berg
Re: Thomas Munro 2016-08-21 

Re: [HACKERS] synchronous_commit = remote_flush

2016-08-21 Thread Thomas Munro
On Fri, Aug 19, 2016 at 7:32 PM, Masahiko Sawada  wrote:
> On Fri, Aug 19, 2016 at 5:25 AM, Robert Haas  wrote:
>> On Thu, Aug 18, 2016 at 12:22 AM, Thomas Munro
>>  wrote:
>>> To do something about the confusion I keep seeing about what exactly
>>> "on" means, I've often wished we had "remote_flush".  But it's not
>>> obvious how the backwards compatibility could work, ie how to keep the
>>> people happy who use "local" vs "on" to control syncrep, and also the
>>> people who use "off" vs "on" to control asynchronous commit on
>>> single-node systems.  Is there any sensible way to do that, or is it
>>> not broken and I should pipe down, or is it just far too entrenched
>>> and never going to change?
>>
>> I don't see why we can't add "remote_flush" as a synonym for "on".  Do
>> you have something else in mind?
>>
>
> +1 for adding "remote_flush" as a synonym for "on".
> It doesn't break backward compatibility.

Right, we could just add it to guc.c after "on", so that you can "SET
synchronous_commit TO remote_flush", but then "SHOW
synchronous_commit" returns "on".

The problem I was thinking about was this: if you add "remote_flush"
before "on" in guc.c, then "SHOW ..." will return "remote_flush",
which would be really helpful for users trying to understand what
syncrep is actually doing; but it would probably confuse single node
users and async replication users.

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [HACKERS] SP-GiST support for inet datatypes

2016-08-21 Thread Emre Hasegeli
> ... this part of the patch breaks the existing API for SP-GiST opclasses.
> That is a hard sell.  It may only matter for one existing opclass in core,
> but unless we have reason to think nobody is using any custom SP-GiST
> opclasses, that is not a pleasant thing to do.  How important is it really
> for this opclass?  Several of the existing opclasses use fixed numbers of
> child nodes, so why does this need something they don't?

This addition is required to implement the operator class cleanly.  We
could store the id of the child node as a "label", and add nodes when
labels are missing, but this complicates all operator class functions.
Other designs are also possible using the labels, but a simple design
with fixed number of nodes worked best for me.

Currently, the SP-GiST framework supports a fixed number of child nodes
only if the index grows by page splits, not by prefix splits.  This is
not a fair API.  I assume it is designed with only the prefix tree in
mind.  The change makes SP-GiST more reusable.

SP-GiST is not widely adopted in my experience.  Breaking this part of
the API would affect prefix tree implementations.  I don't think there
are many of them.  Supporting the new interface is easy for them.  We
can try to support the old interface alongside the new one, but this
wouldn't be very clean either.

I thought we were breaking the C interface in similar ways on major releases.

