Re: How is the PostgreSQL debuginfo file generated

2017-11-23 Thread 高增琦
A little trick:

1. Configure with --enable-debug and run "make install"; use this build
result as the debuginfo.

2. Then run "make install-strip"; use that result as the release build.

However, this is not a regular debuginfo file; you can still run gdb against
it. Additionally, you can extract the real debuginfo from it later.

Craig Ringer wrote on Friday, 24 November 2017 at 10:04:

> On 23 November 2017 at 18:38, Rui Hai Jiang  wrote:
>
>> Hello hackers,
>>
>>
>>
>> I’m wondering how to build the debuginfo package from  the PostgreSQL
>> source.
>>
>>
>>
>> For example to generate something like this one :
>> postgresql91-debuginfo.x86_64.
>>
>>
>>
>> Is there existing support in the current PostgreSQL Makefiles to generate
>> such debuginfo? I have searched in the source code and could not find anything.
>>   How were the existing debuginfo packages created?
>>
>>
>>
> When you're building from source, no separate debuginfo is produced. If
> you build with --enable-debug, the debuginfo is embedded in the binaries.
> The resulting binaries are very large but not generally any slower.
>
> Packaging systems have helper programs that invoke 'strip' (or sometimes
> objcopy) to split out the debuginfo into a separate file and remove it from
> the original executable.  See
> https://sourceware.org/gdb/onlinedocs/gdb/Separate-Debug-Files.html  and
> https://fedoraproject.org/wiki/Packaging:Debuginfo?rd=Packaging/Debuginfo
> .
>
> You cannot generally re-create debuginfo to match a given binary package.
> You'd have to rebuild in an absolutely identical environment: exact same
> library versions, compiler version, configure flags, etc etc etc. Otherwise
> you'll get broken/misleading debuginfo.
>
>
> --
>  Craig Ringer   http://www.2ndQuadrant.com/
>  PostgreSQL Development, 24x7 Support, Training & Services
>
-- 
GaoZengqi
pgf...@gmail.com
zengqi...@gmail.com


Re: [HACKERS] logical decoding of two-phase transactions

2017-11-23 Thread Craig Ringer
On 24 November 2017 at 13:44, Nikhil Sontakke 
wrote:

>
> > This could create an inconsistent view of the catalogs if our prepared
> txn
> > did any DDL. For example, we might've updated a pg_class row, so we
> created
> > a new row and set xmax on the old one. Vacuum may merrily remove our new
> row
> > so there's no way we can find the correct data anymore, we'd have to find
> > the outdated row or no row. By my reading of HeapTupleSatisfiesMVCC we'll
> > see the old pg_class tuple.
> >
> > Similar issues apply for pg_attribute etc etc. We might try to decode a
> > record according to the wrong table structure because relcache lookups
> > performed by the plugin will report outdated information.
> >
>
> We actually do the decoding in a PG_TRY/CATCH block, so if there are
> any errors we
> can clean those up in the CATCH block. If it's a prepared transaction
> then we can send
> an ABORT to the remote side to clean itself up.
>

Yeah. I suspect it might not always ERROR gracefully though.


> > How practical is adding a lock class?
>
> Am open to suggestions. This looks like it could work decently.


It looks amazingly simple from here. Which probably means there's more to
it that I haven't seen yet. I could use advice from someone who knows the
locking subsystem better.
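To make the idea concrete, here is a rough sketch of what the two sides might
do. The lock tag type is a hypothetical stand-in for the proposed new lock
class (nothing like it exists today) and the lock modes are just one plausible
choice; LockAcquire/LockRelease are the existing lock-manager entry points.

#include "postgres.h"
#include "storage/lock.h"
#include "storage/lmgr.h"

/*
 * Hypothetical lock tag for "the prepared transaction identified by xid".
 * A real patch would add a proper LOCKTAG_* type in lock.h; here we borrow
 * the generic tag fields (and the userlock tag type as a placeholder) so
 * the sketch is self-contained.
 */
#define SET_LOCKTAG_TWOPHASE_XACT(tag, xid) \
    ((tag).locktag_field1 = (xid), \
     (tag).locktag_field2 = 0, \
     (tag).locktag_field3 = 0, \
     (tag).locktag_field4 = 0, \
     (tag).locktag_type = LOCKTAG_USERLOCK,     /* placeholder type */ \
     (tag).locktag_lockmethodid = USER_LOCKMETHOD)

/* Taken when logical decoding starts streaming a prepared transaction. */
static void
LockTwoPhaseXactForDecoding(TransactionId xid)
{
    LOCKTAG     tag;

    SET_LOCKTAG_TWOPHASE_XACT(tag, xid);

    /*
     * Share lock: decoding holds this for the duration of its work, and it
     * is released automatically if the decoding backend errors out or
     * exits, which is where the "automatic cleanup" comes from.
     */
    (void) LockAcquire(&tag, ShareLock, false, false);
}

/* Taken by COMMIT PREPARED / ROLLBACK PREPARED before touching the gxact. */
static void
LockTwoPhaseXactForFinish(TransactionId xid)
{
    LOCKTAG     tag;

    SET_LOCKTAG_TWOPHASE_XACT(tag, xid);

    /* Blocks, with proper wakeups, until any decoder has let go. */
    (void) LockAcquire(&tag, AccessExclusiveLock, false, false);
}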


> Yes, that makes sense in case of ROLLBACK. If we find a missing GID
> for a COMMIT PREPARE we are in for some trouble.
>

I agree. But it's really down to the apply worker / plugin to set policy
there, I think. It's not the 2PC decoding support's problem.

I'd argue that a plugin that wishes to strictly track and match 2PC aborts
with the subsequent ROLLBACK PREPARED could do so by recording the abort
locally. It need not rely on faked-up 2pc xacts from the output plugin.
Though it might choose to create them on the downstream as its method of
tracking aborts.

In other words, we don't need the logical decoding infrastructure's help
here. It doesn't have to fake up 2PC xacts for us. Output plugins/apply
workers that want to can do it themselves, and those that don't can ignore
rollback prepared for non-matched GIDs instead.

> We'd need a race-free way to do that though, so I think we'd have to
> extend
> > FinishPreparedTransaction and LockGXact with some kind of missing_ok
> flag. I
> > doubt that'd be controversial.
> >
>
> Sure.


I reckon that should be in-scope for this patch, and pretty clearly useful.
Also simple.
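Concretely, the change might be as small as this sketch (today's
FinishPreparedTransaction() takes only gid and isCommit; the missing_ok
argument and the NULL-returning LockGXact variant are the proposed additions,
not existing API):

/* Sketch against src/backend/access/transam/twophase.c; not existing API. */
void
FinishPreparedTransaction(const char *gid, bool isCommit, bool missing_ok)
{
    GlobalTransaction gxact;

    /*
     * LockGXact() would grow the same flag and return NULL, instead of
     * raising an error, when no prepared transaction with this GID exists.
     */
    gxact = LockGXact(gid, GetUserId(), missing_ok);
    if (gxact == NULL)
    {
        Assert(missing_ok);
        return;                 /* caller decides what "missing" means */
    }

    /* ... existing commit/rollback processing continues unchanged ... */
}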


>
> > - It's really important that the hook that decides whether to decode an
> xact
> > at prepare or commit prepared time reports the same answer each and every
> > time, including if it's called after a prepared txn has committed. It
> > probably can't look at anything more than the xact's origin replica
> > identity, xid, and gid. This also means we need to know the gid of
> prepared
> > txns when processing their commit record, so we can tell logical decoding
> > whether we already sent the data to the client at prepare-transaction
> time,
> > or if we should send it at commit-prepared time instead.
> >
>
> We already have a filter_prepare_cb hook in place for this. TBH, I
> don't think this patch needs to worry about the internals of that
> hook. Whatever it returns, if it returns the same value every time then
> we should be good from the logical decoding perspective.
>

I agree. I meant that it  should try to pass only info that's accessible at
both PREPARE TRANSACTION and COMMIT PREPARED time, and we should document
the importance of returning a consistent result. In particular, it's always
wrong to examine the current twophase state when deciding what to return.


> I think, if we encode the logic in the GID itself, then it will be
> good and consistent every time. For example, if the hook sees a GID
> with the prefix '_$Logical_', then it knows it has to PREPARE it.
> Others can be decoded at commit time.
>

Yep. We can also safely tell the hook:

* the xid
* whether the xact has made catalog changes (since we know that at prepare
and commit time)

but probably not much else.
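For illustration, with a GID-prefix convention the hook body becomes trivial.
The callback signature below follows my reading of the proposed
filter_prepare_cb and should be treated as an assumption, not settled API:

#include "postgres.h"
#include "replication/logical.h"

/*
 * Sketch of an output plugin's filter callback: decode at PREPARE time only
 * those xacts whose GID carries an agreed prefix; everything else is decoded
 * at COMMIT PREPARED time.  Keying the decision purely on the GID (and never
 * on current twophase state) keeps the answer stable across calls, which is
 * the requirement discussed above.
 */
static bool
my_filter_prepare(LogicalDecodingContext *ctx, TransactionId xid,
                  const char *gid)
{
    /* true => send this transaction to the client at PREPARE TRANSACTION */
    return strncmp(gid, "_$Logical_", strlen("_$Logical_")) == 0;
}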


> > - You need to flush the syscache when you finish decoding a PREPARE
> > TRANSACTION of an xact that made catalog changes, unless it's immediately
> > followed by COMMIT PREPARED of the same xid. Because xacts between the
> two
> > cannot see changes the prepared xact made to the catalogs.
> >
> > - For the same reason we need to ensure that the historic snapshot used
> to
> > decode a 2PC xact that made catalog changes isn't then used for
> subsequent
> > xacts between the prepare and commit. They'd see the uncommitted
> catalogs of
> > the prepared xact.
> >
>
> Yes, we will do TeardownHistoricSnapshot and syscache flush as part of
> the cleanup for 2PC transactions.
>

Great.
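In concrete terms, that cleanup step might look roughly like the following
sketch (exactly where it hangs off the decoding code is up to the patch, and
the function name is made up):

#include "postgres.h"
#include "utils/inval.h"
#include "utils/snapmgr.h"

/*
 * Sketch: run after decoding a PREPARE TRANSACTION, so that xacts decoded
 * between this prepare and its eventual COMMIT PREPARED don't see the
 * prepared xact's uncommitted catalog changes.
 */
static void
CleanupAfterDecodedPrepare(bool made_catalog_changes)
{
    /* Stop using the historic snapshot that was set up for this xact. */
    TeardownHistoricSnapshot(false);    /* false: not the error path */

    /*
     * Catalog caches may have been primed from catalog rows that are only
     * visible to the prepared transaction; drop them wholesale.
     */
    if (made_catalog_changes)
        InvalidateSystemCaches();
}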

Thanks.



-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] logical decoding of two-phase transactions

2017-11-23 Thread Nikhil Sontakke
Hi Craig,

> I didn't find any sort of sensible situation where locking would pose
> issues. Unless you're taking explicit LOCKs on catalog tables, you should be
> fine.
>
> There may be issues with CLUSTER or VACUUM FULL of non-relmapped catalog
> relations I guess. Personally I put that in the "don't do that" box, but if
> we really have to guard against it we could slightly expand the limits on
> which txns you can PREPARE to any txn that has a strong lock on a catalog
> relation.
>

Well, we don't allow VACUUM FULL of regular tables in transaction blocks.

I tried "CLUSTER user_table USING pkey"; it works, and it does not take
strong locks on catalog tables that would halt the decoding process.
As mentioned earlier, ALTER TABLE already works without stalling decoding.

>>
>> The issue is with a concurrent rollback of the prepared transaction.
>> We need a way to ensure that
>> the 2PC does not abort when we are in the midst of a change record
>> apply activity.
>
>
> The *reason* we care about this is that tuples created by aborted txns are
> not considered "recently dead" by vacuum. They can be marked invalid and
> removed immediately due to hint-bit setting and HOT pruning, vacuum runs,
> etc.
>
> This could create an inconsistent view of the catalogs if our prepared txn
> did any DDL. For example, we might've updated a pg_class row, so we created
> a new row and set xmax on the old one. Vacuum may merrily remove our new row
> so there's no way we can find the correct data anymore, we'd have to find
> the outdated row or no row. By my reading of HeapTupleSatisfiesMVCC we'll
> see the old pg_class tuple.
>
> Similar issues apply for pg_attribute etc etc. We might try to decode a
> record according to the wrong table structure because relcache lookups
> performed by the plugin will report outdated information.
>

We actually do the decoding in a PG_TRY/CATCH block, so if there are
any errors we
can clean those up in the CATCH block. If it's a prepared transaction
then we can send
an ABORT to the remote side to clean itself up.
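For reference, the shape of that error handling is the usual PG_TRY pattern,
roughly as sketched below; have_more_changes(), apply_next_change() and
send_abort_to_remote() are invented placeholders for whatever the plugin or
apply worker actually does:

#include "postgres.h"

/* Placeholders for whatever the plugin / apply worker really does. */
extern bool have_more_changes(void);
extern void apply_next_change(void);
extern void send_abort_to_remote(void);

/* Sketch of the decode-and-apply loop guarded by PG_TRY/PG_CATCH. */
static void
apply_prepared_xact_changes(void)
{
    PG_TRY();
    {
        while (have_more_changes())
            apply_next_change();
    }
    PG_CATCH();
    {
        /*
         * Clean up our side, ask the remote to roll back the in-progress
         * (not yet prepared) transaction, then let the error propagate.
         */
        send_abort_to_remote();
        PG_RE_THROW();
    }
    PG_END_TRY();
}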


> The sanest option here seems to be to stop the txn from aborting while we're
> decoding it, hence Nikhil's suggestions.
>

If we do the cleanup above in the CATCH block, then do we really care?
I guess the issue would be in determining why we reached the CATCH
block: whether it was due to a decoding error, a network issue, or
something else.

>>
>> * We introduce two new booleans in the TwoPhaseState
>> GlobalTransactionData structure.
>>   bool beingdecoded;
>>   bool abortpending;
>
>
> I think it's premature to rule out the simple option of doing a LockGXact
> when we start decoding. Improve the error "prepared transaction with
> identifier \"%s\" is busy" to report the locking pid too. It means you
> cannot rollback or commit a prepared xact while it's being decoded, but for
> the intended use of this patch, I think that's absolutely fine anyway.
>
> But I like your suggestion much more than the idea of taking a LWLock on
> TwoPhaseStateLock while decoding a record. Lots of LWLock churn, and LWLocks
> held over arbitrary user plugin code. Not viable.
>
> With your way we just have to take a LWLock once on TwoPhaseState when we
> test abortpending and set beingdecoded. After that, during decoding, we can
> do unlocked tests of abortpending, since a stale read will do nothing worse
> than delay our response a little. The existing 2PC ops already take the
> LWLock and can examine beingdecoded then. I expect they'd need to WaitLatch
> in a loop until beingdecoded was cleared, re-acquiring the LWLock and
> re-checking each time it's woken. We should probably add a field there for a
> waiter proc that wants its latch set, so 2pc ops don't usually have to poll
> for decoding to finish. (Unless condition variables will help us here?)
>

Yes, WaitLatch could do the job here.
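For illustration, the waiting side might look roughly like this sketch;
beingdecoded is the proposed flag (the waiter-latch field mentioned above
would let the decoder wake us explicitly), the timeout is arbitrary, and the
wait-event constant is just a placeholder:

#include "postgres.h"
#include "access/twophase.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "storage/latch.h"
#include "storage/lwlock.h"

/* Sketch, as if inside twophase.c where GlobalTransactionData is visible. */
static void
WaitForDecodingToFinish(GlobalTransaction gxact)
{
    for (;;)
    {
        bool        busy;

        /* Re-check the proposed flag under the lock each time we wake. */
        LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
        busy = gxact->beingdecoded;     /* proposed field, not existing */
        LWLockRelease(TwoPhaseStateLock);

        if (!busy)
            break;

        /*
         * Sleep until the decoder sets our latch, or until the timeout
         * forces a re-check so a missed wakeup only delays us a little.
         */
        (void) WaitLatch(MyLatch,
                         WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
                         1000L, WAIT_EVENT_PG_SLEEP);
        ResetLatch(MyLatch);
        CHECK_FOR_INTERRUPTS();
    }
}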

> However, let me make an entirely alternative suggestion. Should we add a
> heavyweight lock class for 2PC xacts instead, and leverage the existing
> infrastructure? We already use transaction locks widely after all. That way,
> we just take some kind of share lock on the 2PC xact by xid when we start
> logical decoding of the 2pc xact. ROLLBACK PREPARED and COMMIT PREPARED
> would acquire the same heavyweight lock in an exclusive mode before grabbing
> TwoPhaseStateLock and doing their work.
>
> That way we get automatic cleanup when decoding procs exit, we get wakeups
> for waiters, etc, all for "free".
>
> How practical is adding a lock class?

Am open to suggestions. This looks like it could work decently.

>
> Just to be explicit, this means "tell the downstream the xact has aborted".
> Currently logical decoding does not ever start decoding an xact until it's
> committed, so it has never needed an abort callback on the output plugin
> interface.
>
> But we'll need one when we start doing speculative logical decoding of big
> txns before they commit, and we'll need it for this. It's relative

Re: Failed to delete old ReorderBuffer spilled files

2017-11-23 Thread Craig Ringer
On 24 November 2017 at 12:38, atorikoshi  wrote:

>
>
> On 2017/11/24 10:57, Craig Ringer wrote:
> > On 24 November 2017 at 09:20, atorikoshi wrote:
> >
> >>
> >> Thanks for letting me know.
> >> I think I only tested running "make check" at top directory, sorry
> >> for my insufficient test.
> >>
> >> The test failure happened at the beginning of replication(creating
> >> slot), so there are no changes yet and getting the tail of changes
> >> failed.
> >>
> >> Because the bug we are fixing only happens after creating files,
> >> I've added "txn->serialized" to the if statement condition.
> >
> >
> > Thanks.
> >
> > Re-reading the patch I see
> >
> >  * The final_lsn of which transaction that hasn't produced an abort
> >  * record is invalid.
> >
> > which I find very hard to parse. I suggest:
> >
> >  We set final_lsn when we decode the commit or abort record for a
> > transaction,
> >  but transactions implicitly aborted by a crash never write such a
> record.
> >
> > then continue from there with the same text as in the patch.
> >
> > Otherwise I'm happy. It passes make check, test decoding and the recovery
> > TAP tests too.
> >
>
> Thanks for your kind suggestion and running test.
> I've added your comment at the beginning.
>
>
Marked ready for committer.


-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] UPDATE of partition key

2017-11-23 Thread Amit Langote
On 2017/11/23 21:57, Amit Khandekar wrote:
> If we collect the partition keys in expand_partitioned_rtentry(), we
> need to pass the root relation also, so that we can convert the
> partition key attributes to root rel descriptor. And the other thing
> is, may be, we can check beforehand (in expand_inherited_rtentry)
> whether the rootrte's updatedCols is empty, which I think implies that
> it's not an UPDATE operation. If yes, we can just skip collecting the
> partition keys.

Yeah, it seems like a good idea after all to check in
expand_inherited_rtentry() whether the root RTE's updatedCols is non-empty
and if so check if any of the updatedCols are partition keys.  If we find
some, then it will suffice to just set a simple flag in the
PartitionedChildRelInfo that will be created for the root table.  That
should be done *after* we have visited all the tables in the partition
tree including some that might be partitioned and hence will provide their
partition keys.  The following block in expand_inherited_rtentry() looks
like a good spot:

if (rte->inh && partitioned_child_rels != NIL)
{
PartitionedChildRelInfo *pcinfo;

pcinfo = makeNode(PartitionedChildRelInfo);

Thanks,
Amit




Re: [HACKERS] path toward faster partition pruning

2017-11-23 Thread Amit Langote
Thanks Jesper.

On 2017/11/23 3:56, Jesper Pedersen wrote:
> Hi Amit,
> 
> On 11/22/2017 03:59 AM, Amit Langote wrote:
>> Fixed in the attached.  No other changes beside that.
>>
> 
> I have been using the following script to look at the patch
> 
> -- test.sql --

[ ... ]

> EXPLAIN (ANALYZE) SELECT t1.a, t1.b, t2.c, t2.d FROM t1 INNER JOIN t2 ON
> t2.c = t1.b WHERE t2.d = 1;
> 
> I just wanted to highlight that the "JOIN ON" partition isn't pruned - the
> "WHERE" one is.

Did you mean to write ON t2.d = t1.b?  If so, the equivalence class mechanism
will give rise to a t1.b = 1 condition and hence help prune t1's partitions as well:

EXPLAIN (COSTS OFF)
SELECT t1.a, t1.b, t2.c, t2.d
FROM t1 INNER JOIN t2 ON t2.d = t1.b
WHERE t2.d = 1;
                      QUERY PLAN
------------------------------------------------------
 Nested Loop
   ->  Append
         ->  Bitmap Heap Scan on t1_p00
               Recheck Cond: (b = 1)
               ->  Bitmap Index Scan on idx_t1_b_a_p00
                     Index Cond: (b = 1)
   ->  Materialize
         ->  Append
               ->  Bitmap Heap Scan on t2_p00
                     Recheck Cond: (d = 1)
                     ->  Bitmap Index Scan on idx_t2_d_p00
                           Index Cond: (d = 1)

In your original query, you use ON t2.c = t1.b, whereby there is no
"constant" value to perform partition pruning with.  t2.c is unknown until
the join actually executes.

> BEGIN;
> EXPLAIN (ANALYZE) UPDATE t1 SET a = 1 WHERE b = 1;
> ROLLBACK;
>
> BEGIN;
> EXPLAIN (ANALYZE) DELETE FROM t1 WHERE b = 1;
> ROLLBACK;
> 
> Should pruning of partitions for UPDATEs (where the partition key isn't
> updated) and DELETEs be added to the TODO list?

Note that partition pruning *does* work for UPDATE and DELETE, but only if
you use list/range partitioning.  The reason it doesn't work in this case
(t1 is hash partitioned) is that the pruning is still based on constraint
exclusion in the UPDATE/DELETE case and constraint exclusion cannot handle
hash partitioning.

See example below that uses list partitioning:

drop table t1, t2;
create table t1 (a int, b int) partition by list (b);
create table t1_p0 partition of t1 for values in (0);
create table t1_p1 partition of t1 for values in (1);
create table t1_p2 partition of t1 for values in (2);
create table t1_p3 partition of t1 for values in (3);

create table t2 (c int, d int) partition by list (d);
create table t2_p0 partition of t2 for values in (0);
create table t2_p1 partition of t2 for values in (1);
create table t2_p2 partition of t2 for values in (2);
create table t2_p3 partition of t2 for values in (3);

explain (costs off) update t1 set a = 1 where b = 1;
        QUERY PLAN
--------------------------
 Update on t1
   Update on t1_p1
   ->  Seq Scan on t1_p1
         Filter: (b = 1)
(4 rows)

explain (costs off) delete from t1 where b = 1;
        QUERY PLAN
--------------------------
 Delete on t1
   Delete on t1_p1
   ->  Seq Scan on t1_p1
         Filter: (b = 1)
(4 rows)


I can see how that seems a bit odd.  If you use hash partitioning,
UPDATE/DELETE do not benefit from partition-pruning, even though SELECT
does.  That's because SELECT uses the new partition-pruning method (this
patch set) which supports hash partitioning, whereas UPDATE and DELETE use
constraint exclusion which doesn't.  It would be a good idea to make even
UPDATE and DELETE use the new method thus bringing everyone on the same
page, but that requires us to make some pretty non-trivial changes to how
UPDATE/DELETE planning works for inheritance/partitioned tables, which we
should undertake separately, imho.

Thanks,
Amit




Re: [HACKERS] Partition-wise aggregation/grouping

2017-11-23 Thread Ashutosh Bapat
On Thu, Nov 23, 2017 at 6:38 PM, Jeevan Chalke
 wrote:
>
>
> Agree. However, there was an existing comment in create_append_path() saying
> "We don't bother with inventing a cost_append(), but just do it here", which
> implies that at some point in the future we may need it; so why not now,
> where we are explicitly costing an append node? Having a function is good so
> that, if required in the future, we need to update only this function.
> Let me know if you think otherwise; I will make those changes in the next patchset.
>

I don't read that comment as saying we will do it in the future. I
don't think the amount of change this patch introduces just for
adding one more line of code is justified. In any case, there's only one
place where we cost an append, so it's not that the new function
avoids code duplication. That said, I am happy to defer this to the
committer if you think we need a separate function.

>
> As suggested by Robert, I have renamed it to APPEND_CPU_COST_MULTIPLIER in
> v7 patchset.
> Also, retained the #define for just multiplier as suggested by Robert.

Ok.


>
>>
>>
>> Anyway, it doesn't look like a good idea to pass an argument (gd) only to
>> return from that function when it is present. Maybe we should handle it
>> outside this function.
>
>
> Well, I would like to have it inside the function itself. Let the function
> itself do all the necessary checks rather than doing some of them outside.

We will leave this to the committer. I don't like that style, but it's
also good to expect a function to do all related work.


>>
>> +if (!create_child_grouping_paths(root, input_child_rel,
>> agg_costs, gd,
>> + &extra))
>> +{
>> +/* Could not create path for childrel, return */
>> +pfree(appinfos);
>> +return;
>> +}
>>
>> Can we detect this condition and bail out even before planning any of the
>> children? It looks wasteful to try to plan children only to bail out in
>> this
>> case.
>
>
> I don't think so. It is essentially unreachable and added just as a safety
> net in case we are unable to create a child path. The bail-out conditions
> cannot be evaluated at the beginning. Do you think an Assert() would be good
> here? Am I missing something?

An Assert would help. If it's something that should not happen, we
should try catching that rather than silently ignoring it.

>
>>
>> +/* Nothing to do if we have no live children */
>> +if (live_children == NIL)
>> +return;
>>
>> A parent relation with all dummy children will also be dummy. May be we
>> should
>> mark the parent dummy case using mark_dummy_rel() similar to
>> generate_partition_wise_join_paths().
>
>
> If the parent is dummy, then we are not doing PWA at all, so I guess there
> is no need to mark the parent grouped_rel as dummy.
> However, if some of the children are dummy, I am marking the corresponding
> upper rel as dummy too.
> Actually, this condition is never going to be true, since, as you correctly
> said, "A parent relation with all dummy children will also be dummy". Should
> we have an Assert() instead?

Yes.


>
>
> I have a test case for a multi-level partitioned table.
> However, I did not understand what you mean by "children with different
> order of partition key columns". I had a look at the tests in
> partition_join.sql and it seems that I have covered all those scenarios.
> Please have a look at the test cases added for PWA and let me know which
> scenarios are missing; I will add them then.

By children with different order of partition key columns, I meant
something like this

parent(a int, b int, c int) partition by (a), child1(b int, c int, a
int) partition by b, child1_1 (c int, a int, b int);

where the attribute numbers of the partition keys in different
children are different.

>> This looks like a useful piece of general functionality,
>> list_has_intersection(), which would return a boolean instead of the whole
>> intersection. I am not sure whether we should add that function to list.c
>> and use it here.
>
>
> Sounds good.
> But for now, I am keeping it as part of this feature itself.

ok
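For reference, such a helper would be tiny; an integer-list sketch
(hypothetical name, not something list.c has today) could be:

#include "postgres.h"
#include "nodes/pg_list.h"

/*
 * Return true as soon as any member of list1 also appears in list2, without
 * building the intersection list itself.
 */
static bool
list_has_intersection_int(List *list1, List *list2)
{
    ListCell   *lc;

    foreach(lc, list1)
    {
        if (list_member_int(list2, lfirst_int(lc)))
            return true;
    }
    return false;
}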

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] HASH_CHUNK_SIZE vs malloc rounding

2017-11-23 Thread Thomas Munro
On Fri, Nov 24, 2017 at 4:54 PM, Tom Lane  wrote:
>> It doesn't seem that crazy to expose aset.c's overhead size to client
>> code does it?  Most client code wouldn't care, but things that are
>> doing something closer to memory allocator work themselves like
>> dense_alloc could care.  It could deal with its own overhead itself,
>> and subtract aset.c's overhead using a macro.
>
> Seeing that we now have several allocators with different overheads,
> I think that exposing this directly to clients is exactly what not to do.
> I still like the idea I sketched above of a context-type-specific function
> to adjust a request size to something that's efficient.
>
> But there's still the question of how do we know what's an efficient-sized
> malloc request.  Is there good reason to suppose that powers of 2 are OK?

It looks like current glibc versions don't do that sort of thing until
you reach M_MMAP_THRESHOLD (default 128kB), after which they work in
4kB pages.  Before that, its manual says it doesn't do any
rounding beyond alignment[1].  I haven't come across any other
allocator that is so forgiving, but I haven't checked Illumos, AIX or
Windows.  Only the first of those has source code available, but they
confuse you by shipping a bunch of different allocators in their tree.
I didn't check the other BSDs.  tcmalloc[2], jemalloc, and macOS malloc
round up to (respectively) a 4kB page, an 8kB size class, and a 512-byte
size class when you allocate 32kB + 1 byte.

[1] https://www.gnu.org/software/libc/manual/html_node/The-GNU-Allocator.html
[2] http://goog-perftools.sourceforge.net/doc/tcmalloc.html

-- 
Thomas Munro
http://www.enterprisedb.com



Re: Failed to delete old ReorderBuffer spilled files

2017-11-23 Thread atorikoshi



On 2017/11/24 10:57, Craig Ringer wrote:
> On 24 November 2017 at 09:20, atorikoshi wrote:
>
>>
>> Thanks for letting me know.
>> I think I only tested running "make check" at top directory, sorry
>> for my insufficient test.
>>
>> The test failure happened at the beginning of replication(creating
>> slot), so there are no changes yet and getting the tail of changes
>> failed.
>>
>> Because the bug we are fixing only happens after creating files,
>> I've added "txn->serialized" to the if statement condition.
>
>
> Thanks.
>
> Re-reading the patch I see
>
>  * The final_lsn of which transaction that hasn't produced an abort
>  * record is invalid.
>
> which I find very hard to parse. I suggest:
>
>  We set final_lsn when we decode the commit or abort record for a
> transaction,
>  but transactions implicitly aborted by a crash never write such a 
record.

>
> then continue from there with the same text as in the patch.
>
> Otherwise I'm happy. It passes make check, test decoding and the recovery
> TAP tests too.
>

Thanks for your kind suggestion and running test.
I've added your comment at the beginning.

--
Atsushi Torikoshi
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 0f607ba..ee28d26 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1724,6 +1724,22 @@ ReorderBufferAbortOld(ReorderBuffer *rb, TransactionId oldestRunningXid)
 
         if (TransactionIdPrecedes(txn->xid, oldestRunningXid))
         {
+            /*
+             * We set final_lsn when we decode the commit or abort record for
+             * a transaction, but transactions implicitly aborted by a crash
+             * never write such a record.
+             * The final_lsn of a transaction that hasn't produced an abort
+             * record is therefore invalid.  To ensure cleanup of the
+             * serialized changes of such a transaction, we set the LSN of
+             * its last change action as final_lsn.
+             */
+            if (txn->serialized && txn->final_lsn == 0)
+            {
+                ReorderBufferChange *last_change =
+                    dlist_tail_element(ReorderBufferChange, node, &txn->changes);
+                txn->final_lsn = last_change->lsn;
+            }
+
             elog(DEBUG2, "aborting old transaction %u", txn->xid);
 
             /* remove potential on-disk data, and deallocate this tx */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 86effe1..7931757 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -168,6 +168,8 @@ typedef struct ReorderBufferTXN
      * * plain abort record
      * * prepared transaction abort
      * * error during decoding
+     * Note that this can also be the LSN of the last change action of this
+     * xact if it is an implicitly aborted transaction.
      *
      */
     XLogRecPtr  final_lsn;


Re: [HACKERS] SERIALIZABLE with parallel query

2017-11-23 Thread Haribabu Kommi
On Tue, Sep 26, 2017 at 4:41 PM, Haribabu Kommi 
wrote:

>
>
> On Mon, Sep 25, 2017 at 6:57 PM, Thomas Munro <
> thomas.mu...@enterprisedb.com> wrote:
>
>> On Mon, Sep 25, 2017 at 8:37 PM, Haribabu Kommi
>>  wrote:
>> > After I tuned the GUC to go with a sequential scan, I still am not
>> > getting the error in session 2 for the update operation the way it used
>> > to generate an error for a parallel sequential scan, and it also takes
>> > quite a few more commands unless/until S1 commits.
>>
>> Hmm.  Then this requires more explanation because I don't expect a
>> difference.  I did some digging and realised that the error detail
>> message "Reason code: Canceled on identification as a pivot, during
>> write." was reached in a code path that requires
>> SxactIsPrepared(writer) and also MySerializableXact == writer, which
>> means that the process believes it is committing.  Clearly something
>> is wrong.  After some more digging I realised that
>> ParallelWorkerMain() calls EndParallelWorkerTransaction() which calls
>> CommitTransaction() which calls
>> PreCommit_CheckForSerializationFailure().  Since the worker is
>> connected to the leader's SERIALIZABLEXACT, that finishes up being
>> marked as preparing to commit (not true!), and then the leader get
>> confused during that write, causing a serialization failure to be
>> raised sooner (though I can't explain why it should be raised then
>> anyway, but that's another topic).  Oops.  I think the fix here is
>> just not to do that in a worker (the worker's CommitTransaction()
>> doesn't really mean what it says).
>>
>> Here's a version with a change that makes that conditional.  This way
>> your test case behaves the same as non-parallel mode.
>>
>
> The patch looks good, and I don't have any comments on the code.
> The test that is going to be added by the patch does not generate a true
> parallelism scenario; I feel it is better to change it to a test that
> generates a parallel sequential/index/bitmap scan.
>


The latest patch is good. It lacks a test that verifies the serializable
support with actual parallel workers, though, so if it breaks it will be
difficult to know.

Regards,
Hari Babu
Fujitsu Australia


Re: [HACKERS] HASH_CHUNK_SIZE vs malloc rounding

2017-11-23 Thread Tom Lane
Thomas Munro  writes:
> On Tue, Nov 29, 2016 at 6:27 AM, Tom Lane  wrote:
>> We could imagine providing an mmgr API function along the lines of "adjust
>> this request size to the nearest thing that can be allocated efficiently".
>> That would avoid the need for callers to know about aset.c overhead
>> explicitly.  I'm not sure how it could deal with platform-specific malloc
>> vagaries though :-(

> Someone pointed out to me off-list that jemalloc's next size class
> after 32KB is in fact 40KB by default[1].  So PostgreSQL uses 25% more
> memory for hash joins than it thinks it does on some platforms.  Ouch.

> It doesn't seem that crazy to expose aset.c's overhead size to client
> code does it?  Most client code wouldn't care, but things that are
> doing something closer to memory allocator work themselves like
> dense_alloc could care.  It could deal with its own overhead itself,
> and subtract aset.c's overhead using a macro.

Seeing that we now have several allocators with different overheads,
I think that exposing this directly to clients is exactly what not to do.
I still like the idea I sketched above of a context-type-specific function
to adjust a request size to something that's efficient.

But there's still the question of how do we know what's an efficient-sized
malloc request.  Is there good reason to suppose that powers of 2 are OK?

regards, tom lane



Re: [HACKERS] logical decoding of two-phase transactions

2017-11-23 Thread Craig Ringer
On 23 November 2017 at 20:27, Nikhil Sontakke 
wrote:


>
> Is there any other locking scenario that we need to consider?
> Otherwise, are we all ok on this point being a non-issue for 2PC
> logical decoding?
>

Yeah.

I didn't find any sort of sensible situation where locking would pose
issues. Unless you're taking explicit LOCKs on catalog tables, you should
be fine.

There may be issues with CLUSTER or VACUUM FULL of non-relmapped catalog
relations I guess. Personally I put that in the "don't do that" box, but if
we really have to guard against it we could slightly expand the limits on
which txns you can PREPARE to any txn that has a strong lock on a catalog
relation.


> The issue is with a concurrent rollback of the prepared transaction.
> We need a way to ensure that
> the 2PC does not abort when we are in the midst of a change record
> apply activity.
>

The *reason* we care about this is that tuples created by aborted txns are
not considered "recently dead" by vacuum. They can be marked invalid and
removed immediately due to hint-bit setting and HOT pruning, vacuum runs,
etc.

This could create an inconsistent view of the catalogs if our prepared txn
did any DDL. For example, we might've updated a pg_class row, so we created
a new row and set xmax on the old one. Vacuum may merrily remove our new
row so there's no way we can find the correct data anymore, we'd have to
find the outdated row or no row. By my reading of HeapTupleSatisfiesMVCC
we'll see the old pg_class tuple.

Similar issues apply for pg_attribute etc etc. We might try to decode a
record according to the wrong table structure because relcache lookups
performed by the plugin will report outdated information.

The sanest option here seems to be to stop the txn from aborting while
we're decoding it, hence Nikhil's suggestions.


> * We introduce two new booleans in the TwoPhaseState
> GlobalTransactionData structure.
>   bool beingdecoded;
>   bool abortpending;
>

I think it's premature to rule out the simple option of doing a LockGXact
when we start decoding. Improve the error "prepared transaction with
identifier \"%s\" is busy" to report the locking pid too. It means you
cannot rollback or commit a prepared xact while it's being decoded, but for
the intended use of this patch, I think that's absolutely fine anyway.

But I like your suggestion much more than the idea of taking a LWLock on
TwoPhaseStateLock while decoding a record. Lots of LWLock churn, and
LWLocks held over arbitrary user plugin code. Not viable.

With your way we just have to take a LWLock once on TwoPhaseState when we
test abortpending and set beingdecoded. After that, during decoding, we can
do unlocked tests of abortpending, since a stale read will do nothing worse
than delay our response a little. The existing 2PC ops already take the
LWLock and can examine beingdecoded then. I expect they'd need to WaitLatch
in a loop until beingdecoded was cleared, re-acquiring the LWLock and
re-checking each time it's woken. We should probably add a field there for
a waiter proc that wants its latch set, so 2pc ops don't usually have to
poll for decoding to finish. (Unless condition variables will help us here?)

However, let me make an entirely alternative suggestion. Should we add a
heavyweight lock class for 2PC xacts instead, and leverage the existing
infrastructure? We already use transaction locks widely after all. That
way, we just take some kind of share lock on the 2PC xact by xid when we
start logical decoding of the 2pc xact. ROLLBACK PREPARED and COMMIT
PREPARED would acquire the same heavyweight lock in an exclusive mode
before grabbing TwoPhaseStateLock and doing their work.

That way we get automatic cleanup when decoding procs exit, we get wakeups
for waiters, etc, all for "free".

How practical is adding a lock class?

(Frankly I've often wished I could add new heavyweight lock classes when
working on complex extensions like BDR, too, and in an ideal world we'd be
able to register lock classes for use by extensions...)


3) Before starting decode of the next change record, we re-check if
> "abortpending" is set. If "abortpending"
> is set, we do not decode the next change record. Thus the abort is
> delay-bounded to a maximum of one change record decoding/apply cycle
> after we signal our intent to abort it. Then, we need to send ABORT
> (regular, not rollback prepared, since we have not sent "PREPARE" yet.
>

Just to be explicit, this means "tell the downstream the xact has aborted".
Currently logical decoding does not ever start decoding an xact until it's
committed, so it has never needed an abort callback on the output plugin
interface.

But we'll need one when we start doing speculative logical decoding of big
txns before they commit, and we'll need it for this. It's relatively
trivial.


> This abort_prepared callback will abort the dummy PREPARED query from
>
step (3) above. Instead of doing this, we could actually check if the
> 'GID' entry exists and 

Re: [HACKERS] INSERT ON CONFLICT and partitioned tables

2017-11-23 Thread Amit Langote
On 2017/11/24 11:45, Amit Langote wrote:
> Meanwhile, rebased patch is attached.

Oops, forgot to attach in the last email.  Attached now.

Thanks,
Amit
From ee7ef9d35f810c8c38c0ec40205f7b8c5d1f696d Mon Sep 17 00:00:00 2001
From: amit 
Date: Mon, 3 Apr 2017 19:13:38 +0900
Subject: [PATCH] Allow ON CONFLICT DO NOTHING on partitioned tables

ON CONFLICT .. DO UPDATE still doesn't work, because it requires
specifying the conflict target.  DO NOTHING doesn't require it,
but the executor will check for conflicts within only a given
leaf partitions, if relevant constraints exist.

Specifying the conflict target makes the planner look for the
required indexes on the parent table, which are not allowed, so an
error will always be reported in that case.
---
 doc/src/sgml/ddl.sgml | 12 +---
 src/backend/executor/execPartition.c  | 10 ++
 src/backend/parser/analyze.c  |  8 
 src/test/regress/expected/insert_conflict.out | 10 ++
 src/test/regress/sql/insert_conflict.sql  | 10 ++
 5 files changed, 35 insertions(+), 15 deletions(-)

diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml
index daba66c187..724cca0f8d 100644
--- a/doc/src/sgml/ddl.sgml
+++ b/doc/src/sgml/ddl.sgml
@@ -3288,9 +3288,15 @@ ALTER TABLE measurement ATTACH PARTITION measurement_y2008m02
  
   
Using the ON CONFLICT clause with partitioned tables
-   will cause an error, because unique or exclusion constraints can only be
-   created on individual partitions.  There is no support for enforcing
-   uniqueness (or an exclusion constraint) across an entire partitioning
+   will cause an error if the conflict target is specified (see
+for more details on how the clause
+   works).  That means it's not possible to specify
+   DO UPDATE as the alternative action, because
+   specifying the conflict target is mandatory in that case.  On the other
+   hand, specifying DO NOTHING as the alternative action
+   works fine provided the conflict target is not specified.  In that case,
+   unique constraints (or exclusion constraints) of the individual leaf
+   partitions are considered, not those across the whole partitioning
hierarchy.
   
  
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index d275cefe1d..1b3bbfdb94 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -133,13 +133,15 @@ ExecSetupPartitionTupleRouting(Relation rel,
         CheckValidResultRel(leaf_part_rri, CMD_INSERT);
 
         /*
-         * Open partition indices (remember we do not support ON CONFLICT in
-         * case of partitioned tables, so we do not need support information
-         * for speculative insertion)
+         * Open partition indices.  The user may have asked to check for
+         * conflicts within this leaf partition and do "nothing" instead of
+         * throwing an error.  Be prepared in that case by initializing the
+         * index information needed by ExecInsert() to perform speculative
+         * insertions.
          */
         if (leaf_part_rri->ri_RelationDesc->rd_rel->relhasindex &&
             leaf_part_rri->ri_IndexRelationDescs == NULL)
-            ExecOpenIndices(leaf_part_rri, false);
+            ExecOpenIndices(leaf_part_rri, true);
 
         estate->es_leaf_result_relations =
             lappend(estate->es_leaf_result_relations, leaf_part_rri);
diff --git a/src/backend/parser/analyze.c b/src/backend/parser/analyze.c
index 757a4a8fd1..d680d2285c 100644
--- a/src/backend/parser/analyze.c
+++ b/src/backend/parser/analyze.c
@@ -847,16 +847,8 @@ transformInsertStmt(ParseState *pstate, InsertStmt *stmt)
 
     /* Process ON CONFLICT, if any. */
     if (stmt->onConflictClause)
-    {
-        /* Bail out if target relation is partitioned table */
-        if (pstate->p_target_rangetblentry->relkind == RELKIND_PARTITIONED_TABLE)
-            ereport(ERROR,
-                    (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-                     errmsg("ON CONFLICT clause is not supported with partitioned tables")));
-
         qry->onConflict = transformOnConflictClause(pstate,
                                                     stmt->onConflictClause);
-    }
 
/*
 * If we have a RETURNING clause, we need to add the target relation to
diff --git a/src/test/regress/expected/insert_conflict.out b/src/test/regress/expected/insert_conflict.out
index 8d005fddd4..617b0c0ed2 100644
--- a/src/test/regress/expected/insert_conflict.out
+++ b/src/test/regress/expected/insert_conflict.out
@@ -786,3 +786,13 @@

Re: [HACKERS] INSERT ON CONFLICT and partitioned tables

2017-11-23 Thread Amit Langote
Thank you Simon for the review.

On 2017/11/20 7:33, Simon Riggs wrote:
> On 2 August 2017 at 00:56, Amit Langote  wrote:
> 
>> The patch's job is simple:
> 
> Patch no longer applies and has some strange behaviours.
> 
>> - Remove the check in the parser that causes an error the moment the
>>   ON CONFLICT clause is found.
> 
> Where is the check and test that blocks ON CONFLICT UPDATE?

That check is in transformInsertStmt() and following is the diff from the
patch that removes it:

diff --git a/src/backend/parser/analyze.c b/src/backend/parser/analyze.c
index 757a4a8fd1..d680d2285c 100644
--- a/src/backend/parser/analyze.c
+++ b/src/backend/parser/analyze.c
@@ -847,16 +847,8 @@ transformInsertStmt(ParseState *pstate, InsertStmt *stmt)

 /* Process ON CONFLICT, if any. */
 if (stmt->onConflictClause)
-{
-/* Bail out if target relation is partitioned table */
-if (pstate->p_target_rangetblentry->relkind == RELKIND_PARTITIONED_TABLE)
-ereport(ERROR,
-(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("ON CONFLICT clause is not supported with partitioned tables")));
-

ON CONFLICT UPDATE will result in an error because it requires specifying
a conflict_target.  Specifying a conflict_target will always result in an
error as things stand today, because the planner looks for a constraint or
index covering it, and partitioned tables cannot have one.

> My understanding was the patch was intended to only remove the error
> in case of DO NOTHING.

To be precise, the patch is intended to remove the error for the cases
which don't require specifying conflict_target.  In INSERT's
documentation, conflict_target is described as follows:

"Specifies which conflicts ON CONFLICT takes the alternative action on by
choosing arbiter indexes. Either performs unique index inference, or names
a constraint explicitly. For ON CONFLICT DO NOTHING, it is optional to
specify a conflict_target; when omitted, conflicts with all usable
constraints (and unique indexes) are handled. For ON CONFLICT DO UPDATE, a
conflict_target must be provided."

Note the "For ON CONFLICT DO NOTHING, it is optional to specify a
conflict_target".  That's the case that this patch intends to prevent
error for, that is, the error won't occur if no conflict_target is specified.

>> - Fix leaf partition ResultRelInfo initialization code so that the call
>>   ExecOpenIndices() specifies 'true' for speculative, so that the
>>   information necessary for conflict checking will be initialized in the
>>   leaf partition's ResultRelInfo
> 
> Why? There is no caller that needs information.

It is to be used if and when ExecInsert() calls
ExecCheckIndexConstraints() in the code path to handle ON CONFLICT DO
NOTHING that we're intending to support in some cases.  Note that it will
only check conflicts for the individual leaf partitions using whatever
constraint-enforcing indexes they might have.

Example:

create table foo (a int) partition by list (a);
create table foo123 partition of foo (a unique) for values in (1, 2, 3);
\d foo123
   Table "public.foo123"
 Column |  Type   | Collation | Nullable | Default
=---+-+---+--+-
 a  | integer |   |  |
Partition of: foo FOR VALUES IN (1, 2, 3)
Indexes:
"foo123_a_key" UNIQUE CONSTRAINT, btree (a)


Without the patch, parser itself will throw an error:

insert into foo values (1)
on conflict do nothing returning tableoid::regclass, *;
ERROR:  ON CONFLICT clause is not supported with partitioned tables


After applying the patch:

insert into foo values (1)
on conflict do nothing returning tableoid::regclass, *;
 tableoid | a
----------+---
 foo123   | 1
(1 row)

INSERT 0 1

insert into foo values (1) on conflict do nothing returning
tableoid::regclass, *;
 tableoid | a
----------+---
(0 rows)

INSERT 0 0

That worked because no explicit conflict target was specified.  If it is
specified, planner (plancat.c: infer_arbiter_indexes()) will throw an
error, because it cannot find a constraint on foo (which is a partitioned
table).

insert into foo values (1)
on conflict (a) do nothing returning tableoid::regclass, *;
ERROR:  there is no unique or exclusion constraint matching the ON
CONFLICT specification

We'll hopefully able to support specifying conflict_target after the
Alvaro's work at [1] is finished.

Meanwhile, rebased patch is attached.

Thanks,
Amit

[1] https://commitfest.postgresql.org/15/1365/




Re: How is the PostgreSQL debuginfo file generated

2017-11-23 Thread Craig Ringer
On 23 November 2017 at 18:38, Rui Hai Jiang  wrote:

> Hello hackers,
>
>
>
> I’m wondering how to build the debuginfo package from  the PostgreSQL
> source.
>
>
>
> For example to generate something like this one :
> postgresql91-debuginfo.x86_64.
>
>
>
> Is there existing support in the current PostgreSQL Makefiles to generate
> such debuginfo? I have searched in the source code and could not find anything.
>   How were the existing debuginfo packages created?
>
>
>
When you're building from source, no separate debuginfo is produced. If you
build with --enable-debug, the debuginfo is embedded in the binaries. The
resulting binaries are very large but not generally any slower.

Packaging systems have helper programs that invoke 'strip' (or sometimes
objcopy) to split out the debuginfo into a separate file and remove it from
the original executable.  See
https://sourceware.org/gdb/onlinedocs/gdb/Separate-Debug-Files.html  and
https://fedoraproject.org/wiki/Packaging:Debuginfo?rd=Packaging/Debuginfo .

You cannot generally re-create debuginfo to match a given binary package.
You'd have to rebuild in an absolutely identical environment: exact same
library versions, compiler version, configure flags, etc etc etc. Otherwise
you'll get broken/misleading debuginfo.


-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Failed to delete old ReorderBuffer spilled files

2017-11-23 Thread Craig Ringer
On 24 November 2017 at 09:20, atorikoshi  wrote:

>
> Thanks for letting me know.
> I think I only tested running "make check" at top directory, sorry
> for my insufficient test.
>
> The test failure happened at the beginning of replication(creating
> slot), so there are no changes yet and getting the tail of changes
> failed.
>
> Because the bug we are fixing only happens after creating files,
> I've added "txn->serialized" to the if statement condition.


Thanks.

Re-reading the patch I see

 * The final_lsn of which transaction that hasn't produced an abort
 * record is invalid.

which I find very hard to parse. I suggest:

 We set final_lsn when we decode the commit or abort record for a
transaction,
 but transactions implicitly aborted by a crash never write such a record.

then continue from there with the same text as in the patch.

Otherwise I'm happy. It passes make check, test decoding and the recovery
TAP tests too.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] HASH_CHUNK_SIZE vs malloc rounding

2017-11-23 Thread Thomas Munro
On Tue, Nov 29, 2016 at 6:27 AM, Tom Lane  wrote:
> Thomas Munro  writes:
>> I bet other allocators also do badly with "32KB plus a smidgen".  To
>> minimise overhead we'd probably need to try to arrange for exactly
>> 32KB (or some other power of 2 or at least factor of common page/chunk
>> size?) to arrive into malloc, which means accounting for both
>> nodeHash.c's header and aset.c's headers in nodeHash.c, which seems a
>> bit horrible.  It may not be worth doing anything about.
>
> Yeah, the other problem is that without a lot more knowledge of the
> specific allocator, we shouldn't really assume that it's good or bad with
> an exact-power-of-2 request --- it might well have its own overhead.
> It is an issue though, and not only in nodeHash.c.  I'm pretty sure that
> StringInfo also makes exact-power-of-2 requests for no essential reason,
> and there are probably many other places.
>
> We could imagine providing an mmgr API function along the lines of "adjust
> this request size to the nearest thing that can be allocated efficiently".
> That would avoid the need for callers to know about aset.c overhead
> explicitly.  I'm not sure how it could deal with platform-specific malloc
> vagaries though :-(

Someone pointed out to me off-list that jemalloc's next size class
after 32KB is in fact 40KB by default[1].  So PostgreSQL uses 25% more
memory for hash joins than it thinks it does on some platforms.  Ouch.

It doesn't seem that crazy to expose aset.c's overhead size to client
code does it?  Most client code wouldn't care, but things that are
doing something closer to memory allocator work themselves like
dense_alloc could care.  It could deal with its own overhead itself,
and subtract aset.c's overhead using a macro.
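To illustrate the arithmetic, a minimal sketch of such a helper follows. The
overhead constant is a placeholder (the real aset.c header sizes are
precisely what isn't exposed today), and it assumes powers of two are a good
malloc target, which, as discussed, isn't guaranteed on every platform:

#include <stddef.h>

/* Placeholder for the per-chunk/per-block overhead added before malloc(). */
#define ALLOC_OVERHEAD_GUESS    32

/*
 * Given the usable size a caller wants, return a (possibly smaller) usable
 * size such that usable + overhead lands exactly on a power of two, so that
 * e.g. a 32KB-ish hash chunk doesn't spill into a 40KB jemalloc size class
 * (the "25% more memory than we think" case above).
 */
static size_t
usable_size_for_power_of_two(size_t wanted_usable)
{
    size_t      target = 1;

    while (target < wanted_usable + ALLOC_OVERHEAD_GUESS)
        target <<= 1;

    /* Drop to the next-lower power of two, then give back the overhead. */
    if (target > wanted_usable + ALLOC_OVERHEAD_GUESS)
        target >>= 1;

    return (target > ALLOC_OVERHEAD_GUESS) ?
        target - ALLOC_OVERHEAD_GUESS : wanted_usable;
}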

[1] https://www.freebsd.org/cgi/man.cgi?jemalloc(3)

-- 
Thomas Munro
http://www.enterprisedb.com



Re: Failed to delete old ReorderBuffer spilled files

2017-11-23 Thread atorikoshi

On 2017/11/22 16:49, Masahiko Sawada wrote:

On Wed, Nov 22, 2017 at 2:56 PM, Craig Ringer  wrote:

On 22 November 2017 at 12:15, Kyotaro HORIGUCHI
 wrote:


At Wed, 22 Nov 2017 12:57:34 +0900, Michael Paquier
 wrote in


On Wed, Nov 22, 2017 at 11:49 AM, Craig Ringer 
wrote:

On 20 November 2017 at 18:35, atorikoshi
 wrote:

I put many queries into one transaction and made ReorderBuffer spill
data to disk, and sent SIGKILL to postgres before the end of the
transaction.

After starting up postgres again, I observed that the files spilled to
disk weren't deleted.


Since this can only happen  on crash exits, and the reorderbuffer data
is
useless after a decoding backend exits, why don't we just recursively
delete
the tree contents on Pg startup?


+1. You would just need an extra step after say
DeleteAllExportedSnapshotFiles() in startup.c. Looks saner to me do so
so as well.


The old files are being removed at startup by
StartupReorderBuffer.



That seems to contradict the statement above, that "after starting up
postgres again, I observed the files spilled to disk weren't deleted".



At the time of assertion failure, the files are not of the
previous run, but they are created after reconnection from the
subscriber.



Are you saying the problem files do not exist when we start up, but are
created and then leaked after logical decoding resumes after an unclean
startup?

... Yes, that does appear to be the case, per the original report:

"7. After a while, we can see the files remaining.
  (Immediately after starting publiser, we can not see these files.)"

I was confused by "remaining". They're not remaining, they've been
re-created.


Sorry for my incorrect description, I should have said re-created.



But if they're re-created, why are they not recreated correctly after an
unclean shutdown? What persistent state is causing that? We should be
clobbering saved reorder buffer temp files, snapshots, etc at startup. The
slot state is pretty simple, it'd just be a bit behind.

The key difference seems to be that we hard-kill the server so it can't
write anything to clog. The xact is implicitly aborted, we never wrote any
xlog record for a commit or abort. The problem is presumably with decoding
xacts that were implicitly aborted by server crash, after we restart the
server and resume decoding.

The assertion failure reported is in ReorderBufferRestoreCleanup, which
makes sense.

Because there's no xlog record of the crash, we never set the buffer's
final_lsn in ReorderBufferCommit or ReorderBufferAbort .


Yeah, that's the same as my analysis.


Note the comment there:

 * NB: Transactions handled here have to have actively aborted (i.e. have
 * produced an abort record). Implicitly aborted transactions are handled
via
 * ReorderBufferAbortOld(); transactions we're just not interested in, but
 * which have committed are handled in ReorderBufferForget().

That's only called from DecodeStandbyOp in response to an xl_running_xacts.

Here's the backtrace.

Core was generated by `postgres: wal sender process postgres [local'.
Program terminated with signal SIGABRT, Aborted.
...
#2  0x008537b7 in ExceptionalCondition
(conditionName=conditionName@entry=0x9fcdf6 "!(txn->final_lsn != 0)",
errorType=errorType@entry=0x89bcb4 "FailedAssertion",
fileName=fileName@entry=0x9fcd04 "reorderbuffer.c",
lineNumber=lineNumber@entry=2576) at assert.c:54
#3  0x006fec02 in ReorderBufferRestoreCleanup
(rb=rb@entry=0x1b4c370, txn=txn@entry=0x1b5c3b8) at reorderbuffer.c:2576
#4  0x00700693 in ReorderBufferCleanupTXN (rb=rb@entry=0x1b4c370,
txn=txn@entry=0x1b5c3b8) at reorderbuffer.c:1145
#5  0x00701516 in ReorderBufferAbortOld (rb=0x1b4c370,
oldestRunningXid=558) at reorderbuffer.c:1730
#6  0x006f5a47 in DecodeStandbyOp (ctx=0x1af9ce0,
buf=buf@entry=0x7ffd11761200) at decode.c:325
#7  0x006f65bf in LogicalDecodingProcessRecord (ctx=,
record=) at decode.c:117
#8  0x007098ab in XLogSendLogical () at walsender.c:2766
#9  0x0070a875 in WalSndLoop (send_data=send_data@entry=0x709857
) at walsender.c:2134
#10 0x0070b011 in StartLogicalReplication (cmd=cmd@entry=0x1a9cd68)
at walsender.c:1101
#11 0x0070b46f in exec_replication_command
(cmd_string=cmd_string@entry=0x1afec30 "START_REPLICATION SLOT \"sub\"
LOGICAL 0/0 (proto_version '1', publication_names '\"pub\"')") at
walsender.c:1527
#12 0x00758809 in PostgresMain (argc=,
argv=argv@entry=0x1aab870, dbname=, username=)
at postgres.c:4086
#13 0x006e178d in BackendRun (port=port@entry=0x1aa3430) at
postmaster.c:4357
#14 0x006e35e9 in BackendStartup (port=port@entry=0x1aa3430) at
postmaster.c:4029
#15 0x006e39e3 in ServerLoop () at postmaster.c:1753
#16 0x006e4b36 in PostmasterMain (argc=argc@entry=3,
argv=argv@entry=0x1a7c5d0) at postmaster.c:1361
#17 0x0065d093 in main (argc=3, argv=0x1a7c5d0) at main.c:228

So it's clear why we can call ReorderBufferCleanupOld wit

Re: Combine function returning NULL unhandled?

2017-11-23 Thread Andres Freund
On 2017-11-21 15:51:59 -0500, Robert Haas wrote:
> On Mon, Nov 20, 2017 at 10:36 PM, Andres Freund  wrote:
> > The plain transition case contains:
> > if (pergroupstate->transValueIsNull)
> > {
> > /*
> >  * Don't call a strict function with NULL inputs.  
> > Note it is
> >  * possible to get here despite the above tests, if 
> > the transfn is
> >  * strict *and* returned a NULL on a prior cycle. 
> > If that happens
> >  * we will propagate the NULL all the way to the 
> > end.
> >  */
> > return;
> > }
> >
> > how come similar logic is not present for combine functions? I don't see
> > any checks preventing a combinefunc from returning NULL, nor do I see
> > https://www.postgresql.org/docs/devel/static/sql-createaggregate.html
> > spell out a requirement that that not be the case.
> 
> I don't know of a reason why that logic shouldn't be present for the
> combine-function case as well.  It seems like it should be pretty
> straightforward to write a test that hits that case and watch it blow
> up ... assuming it does, then I guess we should back-patch the
> addition of that logic.

Found it surprisingly not that straightforward ;)

Pushed a fix to the relevant branches, including tests of the cases where
the trans/combine functions return NULL.
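For context, the guard presumably mirrors the plain-transition check quoted
above; in rough outline (a sketch, not the committed code):

/* Sketch of the analogous guard in the combine-function path. */
if (pergroupstate->transValueIsNull)
{
    /*
     * Don't call a strict combine function with a NULL transition value.
     * A strict combinefn that returned NULL on a prior cycle leaves the
     * value NULL, and that NULL must propagate all the way to the end.
     */
    return;
}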

Greetings,

Andres Freund



RE: [HACKERS] Supporting huge pages on Windows

2017-11-23 Thread Tsunakawa, Takayuki
From: Thomas Munro [mailto:thomas.mu...@enterprisedb.com]
> I hope Tsunakawa-san doesn't mind me posting another rebased version of
> his patch.  The last version conflicted with the change from SGML to XML
> that just landed in commit 3c49c6fa.

Thank you very much for keeping it fresh.  I hope the committer can find
some time for it.

Regards
Takayuki Tsunakawa



Re: [HACKERS] Supporting huge pages on Windows

2017-11-23 Thread Thomas Munro
On Thu, Sep 14, 2017 at 12:32 AM, Magnus Hagander  wrote:
> It's my plan to get to this patch during this commitfest. I've been
> travelling for open and some 24/7 work so far, but hope to get CFing soon.

I hope Tsunakawa-san doesn't mind me posting another rebased version
of his patch.  The last version conflicted with the change from SGML
to XML that just landed in commit 3c49c6fa.

-- 
Thomas Munro
http://www.enterprisedb.com


win_large_pages_v15.patch
Description: Binary data


Re: documentation is now XML

2017-11-23 Thread Michael Paquier
On Fri, Nov 24, 2017 at 12:21 AM, Oleg Bartunov  wrote:
> On Thu, Nov 23, 2017 at 6:01 PM, Peter Eisentraut
>  wrote:
>> The documentation sources are now DocBook XML, not SGML.  (The files are
>> still named *.sgml.  That's something to think about separately.)
>
> Congratulations to you and Alexander !

+1.
-- 
Michael



Re: [HACKERS] Custom compression methods

2017-11-23 Thread Tomas Vondra
Hi,

On 11/23/2017 10:38 AM, Ildus Kurbangaliev wrote:
> On Tue, 21 Nov 2017 18:47:49 +0100
> Tomas Vondra  wrote:
> 
>>>   
>>
>> Hmmm, it still doesn't work for me. See this:
>>
>> test=# create extension pg_lz4 ;
>> CREATE EXTENSION
>> test=# create table t_lz4 (v text compressed lz4);
>> CREATE TABLE
>> test=# create table t_pglz (v text);
>> CREATE TABLE
>> test=# insert into t_lz4 select repeat(md5(1::text),300);
>> INSERT 0 1
>> test=# insert into t_pglz select * from t_lz4;
>> INSERT 0 1
>> test=# drop extension pg_lz4 cascade;
>> NOTICE:  drop cascades to 2 other objects
>> DETAIL:  drop cascades to compression options for lz4
>> drop cascades to table t_lz4 column v
>> DROP EXTENSION
>> test=# \c test
>> You are now connected to database "test" as user "user".
>> test=# insert into t_lz4 select repeat(md5(1::text),300);^C
>> test=# select * from t_pglz ;
>> ERROR:  cache lookup failed for compression options 16419
>>
>> That suggests no recompression happened.
> 
> Should be fixed in the attached patch. I've changed your extension a
> little bit according changes in the new patch (also in attachments).
> 

Hmm, this seems to have fixed it, but only in one direction. Consider this:

create table t_pglz (v text);
create table t_lz4 (v text compressed lz4);

insert into t_pglz select repeat(md5(i::text),300)
from generate_series(1,10) s(i);

insert into t_lz4 select repeat(md5(i::text),300)
from generate_series(1,10) s(i);

\d+

 Schema |  Name  | Type  | Owner | Size  | Description
++---+---+---+-
 public | t_lz4  | table | user  | 12 MB |
 public | t_pglz | table | user  | 18 MB |
(2 rows)

truncate t_pglz;
insert into t_pglz select * from t_lz4;

\d+

 Schema |  Name  | Type  | Owner | Size  | Description
++---+---+---+-
 public | t_lz4  | table | user  | 12 MB |
 public | t_pglz | table | user  | 18 MB |
(2 rows)

which is fine. But in the other direction, this happens

truncate t_lz4;
insert into t_lz4 select * from t_pglz;

 \d+
   List of relations
 Schema |  Name  | Type  | Owner | Size  | Description
++---+---+---+-
 public | t_lz4  | table | user  | 18 MB |
 public | t_pglz | table | user  | 18 MB |
(2 rows)

which means the data is still pglz-compressed. That's rather strange, I
guess, and it should compress the data using the compression method set
for the target table instead.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: documentation is now XML

2017-11-23 Thread Tom Lane
Peter Eisentraut  writes:
> The documentation sources are now DocBook XML, not SGML.  (The files are
> still named *.sgml.  That's something to think about separately.)

I think we should have a discussion about whether it'd be smart
to convert the back branches' documentation to XML as well.

The main reason that I want to consider this is that back-patching
documentation fixes is going to be a huge problem if we don't.

Using the same doc-building toolchain across all branches seems like a win
as well.  You could argue that switching build requirements in a minor
release is unfriendly to packagers; but those who build their own docs
have already had to adapt to the xsltproc/fop toolchain for v10, so
standardizing on that for 9.3-9.6 as well doesn't seem like it complicates
their lives.  (Possibly we should canvass opinions on pgsql-packagers to
be sure of that.)

Also, we're way overdue for getting out from under the creaky TeX-based
toolchain for producing PDFs.  Every time we make releases, I worry
whether we're going to get blindsided by its bugs with hotlinks that get
split across pages, since page breaks tend to vary in position depending
on exactly whose version of the toolchain and style sheets you build with.
I've also been living in fear of the day we hit some hardwired TeX limit
that we can't increase or work around.  We've had to hack around such
limits repeatedly in the past (eg commit 944b41fc0).  Now, it's probably
unlikely that growth of the release notes would be enough to put us over
the top in any back branch --- but if it did happen, we'd find out about
it at a point in the release cycle where there's very little margin for
error.

As against all that, there's our traditional conservatism about making
not-strictly-necessary changes in the back branches.  Which is a strong
factor, for sure, but I think maybe here's a case for overriding it.

regards, tom lane



Re: How is the PostgreSQL debuginfo file generated

2017-11-23 Thread Tom Lane
Rui Hai Jiang  writes:
> I’m wondering how to build the debuginfo package from  the PostgreSQL source.
> For example to generate something like this one :   
> postgresql91-debuginfo.x86_64.

On platforms where such things exist, that's handled by the packaging
system, not by PG proper.  You should proceed by making a new SRPM
and building RPMs from that.

regards, tom lane



Re: documentation is now XML

2017-11-23 Thread Oleg Bartunov
On Thu, Nov 23, 2017 at 6:01 PM, Peter Eisentraut
 wrote:
> The documentation sources are now DocBook XML, not SGML.  (The files are
> still named *.sgml.  That's something to think about separately.)

Congratulations to you and Alexander ! That is what I waited for a long time.
Now we could think about diagrams :)

>
> Besides the recent change to require full end tags, which is compatible
> between SGML and XML, being XML now requires empty-element tags to use
> the XML syntax, with an explicit trailing slash.  When backpatching
> documentation changes, the extra slash needs to be removed, however.
> (The documentation build in backbranches will error if that is not done.)
>
> For the required tools, there are no new requirements, but you can drop
> docbook-sgml and opensp.  The instructions at
> 
> will be up-to-date once built.
>
> --
> Peter Eisentraut  http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>



Re: has_sequence_privilege() never got the memo

2017-11-23 Thread Peter Eisentraut
On 11/22/17 22:58, Tom Lane wrote:
> Joe Conway  writes:
>> I just noticed that has_sequence_privilege() never got the memo about
>> "WITH GRANT OPTION". Any objections to the attached going back to all
>> supported versions?
> 
> That looks odd.  Patch certainly makes this case consistent with the
> rest of acl.c, but maybe there's some deeper reason?  Peter?

No I think it was just forgotten.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
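
For readers not familiar with acl.c: the functions in that family map a
privilege string to a set of mode bits via a small lookup table, and the
inconsistency is that the sequence variant never listed the
"... WITH GRANT OPTION" spellings that the other has_*_privilege functions
accept.  The following is a toy, self-contained model of that kind of table;
the macros and names are simplified stand-ins, not the actual acl.c code or
the actual patch.

#include <stdio.h>
#include <stdint.h>
#include <string.h>

#define ACL_USAGE                   (1u << 0)
#define ACL_SELECT                  (1u << 1)
#define ACL_UPDATE                  (1u << 2)
#define ACL_GRANT_OPTION_FOR(mode)  ((mode) << 16)   /* illustrative encoding */

typedef struct priv_map
{
    const char *name;
    uint32_t    value;
} priv_map;

static const priv_map sequence_priv_map[] = {
    {"USAGE", ACL_USAGE},
    {"USAGE WITH GRANT OPTION", ACL_GRANT_OPTION_FOR(ACL_USAGE)},
    {"SELECT", ACL_SELECT},
    {"SELECT WITH GRANT OPTION", ACL_GRANT_OPTION_FOR(ACL_SELECT)},
    {"UPDATE", ACL_UPDATE},
    {"UPDATE WITH GRANT OPTION", ACL_GRANT_OPTION_FOR(ACL_UPDATE)},
    {NULL, 0}
};

/* look up a privilege string; returns 1 on success, 0 if unrecognized */
static int
convert_priv_string(const char *priv, uint32_t *mode)
{
    for (const priv_map *p = sequence_priv_map; p->name != NULL; p++)
    {
        if (strcmp(priv, p->name) == 0)
        {
            *mode = p->value;
            return 1;
        }
    }
    return 0;
}

int
main(void)
{
    uint32_t mode;

    if (convert_priv_string("USAGE WITH GRANT OPTION", &mode))
        printf("mode bits: 0x%x\n", (unsigned) mode);
    else
        printf("unrecognized privilege type\n");
    return 0;
}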



Re: [HACKERS] SQL procedures

2017-11-23 Thread Peter Eisentraut
On 11/23/17 00:59, Craig Ringer wrote:
> Exactly. If we want to handle OUT params this way they really need to be
> the first resultset for just this reason. That could possibly be done by
> the glue code reserving a spot in the resultset list and filling it in
> at the end of the procedure.
> 
> I fail to understand how that can work though. Wouldn't we have to
> buffer all the resultset contents on the server in tuplestores or
> similar, so we can send the parameters and then the result sets?

The current PoC in the other thread puts the extra result sets in
cursors.  That's essentially the buffer you are referring to.  So it
seems possible, but there are some details to be worked out.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



documentation is now XML

2017-11-23 Thread Peter Eisentraut
The documentation sources are now DocBook XML, not SGML.  (The files are
still named *.sgml.  That's something to think about separately.)

Besides the recent change to require full end tags, which is compatible
between SGML and XML, being XML now requires empty-element tags to use
the XML syntax, with an explicit trailing slash.  When backpatching
documentation changes, the extra slash needs to be removed, however.
(The documentation build in backbranches will error if that is not done.)

For the required tools, there are no new requirements, but you can drop
docbook-sgml and opensp.  The instructions at

will be up-to-date once built.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Query became very slow after 9.6 -> 10 upgrade

2017-11-23 Thread Dmitry Shalashov
We tried to apply the patch on 10.1 source, but something is wrong it seems:

patch -p1 < ../1.patch
(Stripping trailing CRs from patch; use --binary to disable.)
patching file src/backend/optimizer/plan/analyzejoins.c
(Stripping trailing CRs from patch; use --binary to disable.)
patching file src/backend/utils/adt/selfuncs.c
Hunk #1 succeeded at 3270 (offset -91 lines).
Hunk #2 succeeded at 3304 (offset -91 lines).
Hunk #3 succeeded at 3313 (offset -91 lines).
Hunk #4 succeeded at 3393 (offset -91 lines).
patch unexpectedly ends in middle of line
Hunk #5 succeeded at 3570 with fuzz 1 (offset -91 lines).


Dmitry Shalashov, relap.io & surfingbird.ru

2017-11-23 2:07 GMT+03:00 Tom Lane :

> Dmitry Shalashov  writes:
> > Turns out we had not 9.6 but 9.5.
>
> I'd managed to reproduce the weird planner behavior locally in the
> regression database:
>
> regression=# create table foo (f1 int[], f2 int);
> CREATE TABLE
> regression=# explain select * from tenk1 where unique2 in (select distinct
> unnest(f1) from foo where f2=1);
> QUERY PLAN
> 
> ---
>  Nested Loop  (cost=30.85..80.50 rows=6 width=244)
>->  HashAggregate  (cost=30.57..30.63 rows=6 width=4)
>  Group Key: (unnest(foo.f1))
>  ->  HashAggregate  (cost=30.42..30.49 rows=6 width=4)
>Group Key: unnest(foo.f1)
>->  ProjectSet  (cost=0.00..28.92 rows=600 width=4)
>  ->  Seq Scan on foo  (cost=0.00..25.88 rows=6
> width=32)
>Filter: (f2 = 1)
>->  Index Scan using tenk1_unique2 on tenk1  (cost=0.29..8.30 rows=1
> width=244)
>  Index Cond: (unique2 = (unnest(foo.f1)))
> (10 rows)
>
> Digging into it, the reason for the duplicate HashAggregate step was that
> query_supports_distinctness() punted on SRFs-in-the-targetlist, basically
> on the argument that it wasn't worth extra work to handle that case.
> Thinking a bit harder, it seems to me that the correct analysis is:
> 1. If we are proving distinctness on the grounds of a DISTINCT clause,
> then it doesn't matter whether there are any SRFs, because DISTINCT
> removes duplicates after tlist SRF expansion.
> 2. But tlist SRFs break the ability to prove distinctness on the grounds
> of GROUP BY, unless all of them are within grouping columns.
> It still seems like detecting the second case is harder than it's worth,
> but we can trivially handle the first case, with little more than some
> code rearrangement.
>
> The other problem is that the output rowcount of the sub-select (ie, of
> the HashAggregate) is being estimated as though the SRF weren't there.
> This turns out to be because estimate_num_groups() doesn't consider the
> possibility of SRFs in the grouping columns.  It never has, but in 9.6 and
> before the problem was masked by the fact that grouping_planner scaled up
> the result rowcount by tlist_returns_set_rows() *after* performing
> grouping.  Now we're effectively doing that in the other order, which is
> more correct, but that means estimate_num_groups() has to apply some sort
> of adjustment.  I suggest that it just multiply its old estimate by the
> maximum of the SRF expansion counts.  That's likely to be an overestimate,
> but it's really hard to do better without specific knowledge of the
> individual SRF's behavior.
>
> In short, I propose the attached fixes.  I've checked this and it seems
> to fix Dmitry's original problem according to the test case he sent
> off-list.
>
> regards, tom lane
>
>
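
The second part of the fix Tom describes boils down to a simple scaling rule:
multiply the old group-count estimate by the largest expected SRF expansion
among the grouping expressions.  A toy standalone illustration of that
arithmetic follows; the numbers and the helper are made up for illustration,
this is not the actual estimate_num_groups() change.

#include <stdio.h>

/*
 * Scale a group-count estimate by the largest expected SRF expansion among
 * the grouping expressions.  In the planner the per-expression row estimates
 * would come from something like tlist_returns_set_rows(); here they are
 * just made-up numbers.
 */
static double
adjusted_num_groups(double base_groups, const double *srf_rows, int nexprs)
{
    double maxrows = 1.0;

    for (int i = 0; i < nexprs; i++)
        if (srf_rows[i] > maxrows)
            maxrows = srf_rows[i];

    return base_groups * maxrows;   /* likely an overestimate, as noted above */
}

int
main(void)
{
    /* e.g. one plain grouping column plus unnest(f1) expected to expand ~100x */
    double srf_rows[] = {1.0, 100.0};

    printf("adjusted groups: %.0f\n", adjusted_num_groups(6.0, srf_rows, 2));
    return 0;
}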


RE:PostgreSLQ v10.1 and xlC compiler on AIX

2017-11-23 Thread REIX, Tony
Hi Michael,

Thanks for the information ! I'm not a PostgreSQL expert.

I've found the file:  ./32bit/src/test/regress/regression.diffs

Since I'm rebuilding right now, I do not currently have the file for v10.1 with 
the issue.
However, I have it for 10.0 and 9.6.6 which show the same issues. I've attached 
them to this email.

About the suggestion of adding an AIX machine with xlc v13.1 to the buildfarm, 
I'll see how this could be done.
Also, I'll try to get a more recent version of xlC v13.


Cordialement,

Tony Reix

Bull - ATOS
IBM Coop Architect & Technical Leader
Office : +33 (0) 4 76 29 72 67
1 rue de Provence - 38432 Échirolles - France
www.atos.net


De : Michael Paquier [michael.paqu...@gmail.com]
Envoyé : jeudi 23 novembre 2017 14:08
À : REIX, Tony
Cc : PostgreSQL Hackers; OLIVA, PASCAL
Objet : Re: PostgreSLQ v10.1 and xlC compiler on AIX

On Thu, Nov 23, 2017 at 7:57 PM, REIX, Tony  wrote:
> We are porting PostgreSQL v10.1 on AIX (7.2 for now).
> And we have several tests failures, in 32bit and 64bit.
> We are using xlc 13.01.0003.0003 with -O2.
> Tests were 100% OK with version 9.6.2 .
>
> About 32 bit failures within v10.1, we have 3 failures including:
>  create_aggregate ... FAILED
>  aggregates   ... FAILED
>
> I've found that these 2 failures disappear when building :
>./src/backend/parser/gram.c
> without -O2 !
> However, this is a 44,498 lines file and it may take a while for finding the
> root issue.

When the regression tests fail, a file called
regression.diffs gets generated in src/test/regress showing the
difference between the results generated and the results expected when
running the SQL queries which are part of the regression tests.
Attaching this file to this thread would help in determining what's
wrong with the regression tests.

> I invite anyone involved in porting/using PostgreSQL 10.1 on AIX to take
> part of a discussion about how to make v10.1 work perfectly on AIX.

Note that xlc 12.1 is tested with AIX machines on the buildfarms, but
there is no coverage for 13.1 visibly. It would be nice if you could
set up a buildfarm machine to catch problems way earlier.
--
Michael


regression.diffs-10.0.xlcV13
Description: regression.diffs-10.0.xlcV13


regression.diffs-9.6.6.xlcV13
Description: regression.diffs-9.6.6.xlcV13


Re: PostgreSLQ v10.1 and xlC compiler on AIX

2017-11-23 Thread Alvaro Herrera
Michael Paquier wrote:
> On Thu, Nov 23, 2017 at 7:57 PM, REIX, Tony  wrote:

> > I invite anyone involved in porting/using PostgreSQL 10.1 on AIX to take
> > part of a discussion about how to make v10.1 work perfectly on AIX.
> 
> Note that xlc 12.1 is tested with AIX machines on the buildfarms, but
> there is no coverage for 13.1 visibly. It would be nice if you could
> set up a buildfarm machine to catch problems way earlier.

Other problems have been reported, too; see
https://www.postgresql.org/message-id/2fc764c7--83e3-45c2-33c17853e...@atos.net
and others in the thread.

The fact that a problem goes away when the compile doesn't have -O2
suggests a compiler bug.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Challenges preventing us moving to 64 bit transaction id (XID)?

2017-11-23 Thread Amit Kapila
On Thu, Jun 22, 2017 at 9:06 PM, Alexander Korotkov
 wrote:
> On Wed, Jun 7, 2017 at 11:33 AM, Alexander Korotkov
>  wrote:
>>
>> On Tue, Jun 6, 2017 at 4:05 PM, Peter Eisentraut
>>  wrote:
>>>
>>> On 6/6/17 08:29, Bruce Momjian wrote:
>>> > On Tue, Jun  6, 2017 at 06:00:54PM +0800, Craig Ringer wrote:
>>> >> Tom's point is, I think, that we'll want to stay pg_upgrade
>>> >> compatible. So when we see a pg10 tuple and want to add a new page
>>> >> with a new page header that has an epoch, but the whole page is full
>>> >> so there isn't 32 bits left to move tuples "down" the page, what do we
>>> >> do?
>>> >
>>> > I guess I am missing something.  If you see an old page version number,
>>> > you know none of the tuples are from running transactions so you can
>>> > just freeze them all, after consulting the pg_clog.  What am I missing?
>>> > If the page is full, why are you trying to add to the page?
>>>
>>> The problem is if you want to delete from such a page.  Then you need to
>>> update the tuple's xmax and stick the new xid epoch somewhere.
>>>
>>> We had an unconference session at PGCon about this.  These issues were
>>> all discussed and some ideas were thrown around.  We can expect a patch
>>> to appear soon, I think.
>>
>>
>> Right.  I'm now working on splitting my large patch for 64-bit xids into
>> patchset.
>> I'm planning to post patchset in the beginning of next week.
>
>
> Work on this patch took longer than I expected.  It is still in not so good
> shape, but I decided to publish it anyway in order to not stop progress in
> this area.
> I also tried to split this patch into several.  But actually I manage to
> separate few small pieces, while most of changes are remaining in the single
> big diff.
> Long story short, patchset is attached.
>
> 0001-64bit-guc-relopt-1.patch
> This patch implements 64 bit GUCs and relation options which are used in
> further patches.
>
> 0002-heap-page-special-1.patch
> Putting xid and multixact bases into PageHeaderData would take extra 16
> bytes on index pages too.  That would be waste of space for indexes.  This
> is why I decided to put bases into special area of heap pages.
> This patch adds a special area for heap pages containing the prune xid and a magic
> number.  The magic number is different for regular heap pages and sequence pages.
>

uint16 pd_pagesize_version;
- TransactionId pd_prune_xid; /* oldest prunable XID, or zero if none */
  ItemIdData pd_linp[FLEXIBLE_ARRAY_MEMBER]; /* line pointer array */
  } PageHeaderData;

Why have you moved pd_prune_xid from page header?

> 0003-64bit-xid-1.patch
> It's the major patch.  It redefines TransactionId as a 64-bit integer and
> defines 32-bit ShortTransactionID which is used for t_xmin and t_xmax.
> Transaction id comparison becomes straight instead of circular. Base values
> for xids and multixact ids are stored in heap page special.  SLRUs also
> became 64-bit and non-circular.   To be able to calculate xmin/xmax without
> accessing heap page, base values are copied into HeapTuple.  Correspondingly
> HeapTupleHeader(Get|Set)(Xmin|Xmax) becomes just
> HeapTuple(Get|Set)(Xmin|Xmax) whose require HeapTuple not just
> HeapTupleHeader.  heap_page_prepare_for_xid() is used to ensure that given
> xid fits particular page base.  If it doesn't fit then base of page is
> shifted, that could require single-page freeze.  Format for wal is changed
> in order to prevent unaligned access to TransactionId.  *_age GUCs and
> relation options are changed to 64-bit.  Forced "autovacuum to prevent
> wraparound" is removed, but there is still freeze to truncate SLRUs.
>

It seems there is no README or other detailed explanation of how all
this works, e.g. how the value of pd_xid_base is maintained.  I don't
think there are enough comments in the patch to explain things.  I
think it will be easier to understand and review the patch if you
provide some more details, either in email or in the patch.
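
As an aside for readers following the design: the central trick quoted above
is that a tuple keeps only a 32-bit short xid, and the page-wide 64-bit base
stored in the special area is added back when the xid is read; if a new xid
does not fit the current base, the base has to be shifted first.  Below is a
standalone sketch of that arithmetic with stub types and hypothetical helper
names; it is not the patch's actual heap_page_prepare_for_xid() code.

#include <stdint.h>
#include <stdio.h>
#include <stdbool.h>

typedef uint64_t TransactionId;
typedef uint32_t ShortTransactionId;

typedef struct PageXidBase
{
    TransactionId pd_xid_base;  /* base for all short xids on this page */
} PageXidBase;

/* widen a tuple's stored short xid back to a full 64-bit xid */
static TransactionId
xid_from_short(const PageXidBase *page, ShortTransactionId shortxid)
{
    return page->pd_xid_base + shortxid;
}

/* does this full xid fit the page's current base? */
static bool
xid_fits_page(const PageXidBase *page, TransactionId xid)
{
    return xid >= page->pd_xid_base &&
           xid - page->pd_xid_base <= (TransactionId) UINT32_MAX;
}

int
main(void)
{
    PageXidBase page = { .pd_xid_base = UINT64_C(5000000000) };
    TransactionId xid = UINT64_C(5000001234);

    if (xid_fits_page(&page, xid))
    {
        ShortTransactionId s = (ShortTransactionId) (xid - page.pd_xid_base);
        printf("stored %u, read back %llu\n",
               (unsigned) s, (unsigned long long) xid_from_short(&page, s));
    }
    else
        printf("base would have to be shifted (single-page freeze)\n");
    return 0;
}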

> 0004-base-values-for-testing-1.patch
> This patch is used for testing that calculations using 64-bit bases and
> short 32-bit xid values are correct.  It provides initdb options for initial
> xid, multixact id and multixact offset values.  Regression tests initialize
> cluster with large (more than 2^32) values.
>
> There are a lot of open items, but I would like to notice some of them:
>  * WAL becomes significantly larger due to storage 8 byte xids instead of 4
> byte xids.  Probably, its needed to use base approach in WAL too.
>

Yeah and I think it can impact performance as well.  By any chance
have you run pgbench read-write to see the performance impact of this
patch?


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Partition-wise aggregation/grouping

2017-11-23 Thread Jeevan Chalke
On Fri, Nov 17, 2017 at 5:54 PM, Ashutosh Bapat <
ashutosh.ba...@enterprisedb.com> wrote:

> On Wed, Nov 15, 2017 at 5:31 PM, Jeevan Chalke
>  wrote:
> >
> > OK. Done in the attached patch set.
> >
> > I have rebased all my patches on latest HEAD which is at
> > 7518049980be1d90264addab003476ae105f70d4
> >
> > Thanks
>
> These are review comments for the last set and I think most of them
> apply to the new set as well.
>
> Patches 0001 - 0005 refactoring existing code. I haven't
> reviewed them in detail, checking whether we have missed anything in
> moving the
> code, but they mostly look fine.
>

Thanks.


>
> Comments on 0006
>  /*
> + * cost_append
> + *  Determines and returns the cost of an Append node.
> + *
> ... clipped portion
> +
> +/* Add Append node overhead. */
> +run_cost += cpu_tuple_cost * DEFAULT_APPEND_COST_FACTOR * tuples;
> +
>
> I am wondering whether it's really worth creating a new function for a
> single
> line addition to create_append_path(). I think all we need to change in
> create_append_path() is add (cpu_tuple_cost * DEFAULT_APPEND_COST_FACTOR *
> tuples) to path->total_cost.
>

Agreed. However, there was an existing comment in create_append_path() saying
"We don't bother with inventing a cost_append(), but just do it here", which
implies that at some point in the future we may need it; so why not now, when
we are explicitly costing an append node. Having a function is good so that,
if required in the future, we need to update only this function.
Let me know if you think otherwise and I will make those changes in the next patchset.


>
> +/* Add MergeAppend node overhead like we do it for the Append node */
> +run_cost += cpu_tuple_cost * DEFAULT_APPEND_COST_FACTOR * tuples;
> +
>
> With this change the following comment is no more true. Please remove it.
>  * extracted tuple.  We don't charge cpu_tuple_cost because a
> MergeAppend
>  * node doesn't do qual-checking or projection, so it has less overhead
>  * than most plan nodes.
>  */
>
>
This was already fixed in v7.

+/*
> + * Arbitrarily use 50% of the cpu_tuple_cost to cost append node. Note
> that
>
> May be reword it as " ... to cost per tuple processing by an append node
> ..."
>

Done.


>
> + * this value should be multiplied with cpu_tuple_cost wherever
> applicable.
> + */
> +#define DEFAULT_APPEND_COST_FACTOR 0.5
>
> I am wondering whether we should just define
> #define APPEND_TUPLE_COST (cpu_tuple_cost * 0.5)
> and use this macro everywhere. What else use DEFAULT_APPEND_COST_FACTOR
> would
> have other than multiplying with cpu_tuple_cost?
>

As suggested by Robert, I have renamed it to APPEND_CPU_COST_MULTIPLIER in
v7 patchset.
Also, retained the #define for just multiplier as suggested by Robert.


>  -- test partition matching with N-way join
>  EXPLAIN (COSTS OFF)
>  SELECT avg(t1.a), avg(t2.b), avg(t3.a + t3.b), t1.c, t2.c, t3.c FROM
> plt1 t1, plt2 t2, plt1_e t3 WHERE t1.c = t2.c AND ltrim(t3.c, 'A') =
> t1.c GROUP BY t1.c, t2.c, t3.c ORDER BY t1.c, t2.c, t3.c;
> -  QUERY PLAN
> 
> ---
> +   QUERY PLAN
> +---
> -
>   Sort
> Sort Key: t1.c, t3.c
> ->  HashAggregate
>   Group Key: t1.c, t2.c, t3.c
> - ->  Result
> + ->  Hash Join
> +   Hash Cond: (t1.c = t2.c)
> ->  Append
> - ->  Hash Join
> -   Hash Cond: (t1.c = t2.c)
>
> That's sad. Interestingly this query has an aggregate, so the plan will use
> partition-wise join again when partition-wise aggregation patch will be
> applied. So may be fine.
>

Yep. I have modified this testcase and enabled partition-wise aggregation
before this test, so that we will see the desired plan.


> - Append  (cost=0.00..0.04 rows=2 width=32)
> + Append  (cost=0.00..0.05 rows=2 width=32)
>
> - Append  (cost=0.00..0.04 rows=2 width=4)
> + Append  (cost=0.00..0.05 rows=2 width=4)
>
> We do have some testcases which print costs. Interesting :). I don't have
> any objection to this change.
>

OK. Thanks.


> Comments on 0007
>
> +   
> +Enables or disables the query planner's use of partition-wise
> grouping
> +or aggregation, which allows  If partition-wise aggregation
> does not result in the
> +cheapest path, it will still spend time in creating these paths
> and
> +consume memory which increase linearly with the number of
> partitions.
> +The default is off.
> +   
> +  
> + 
> +
> May be we should word this in the same manner as partition-wise join like
>
> Enables or disables the query planner's use of partition-wise
> grouping
> or aggregation, which allows aggregation or grouping on a
> partitioned
> tables to be spread across the partitions. If GROUP

Re: PostgreSLQ v10.1 and xlC compiler on AIX

2017-11-23 Thread Michael Paquier
On Thu, Nov 23, 2017 at 7:57 PM, REIX, Tony  wrote:
> We are porting PostgreSQL v10.1 on AIX (7.2 for now).
> And we have several tests failures, in 32bit and 64bit.
> We are using xlc 13.01.0003.0003 with -O2.
> Tests were 100% OK with version 9.6.2 .
>
> About 32 bit failures within v10.1, we have 3 failures including:
>  create_aggregate ... FAILED
>  aggregates   ... FAILED
>
> I've found that these 2 failures disappear when building :
>./src/backend/parser/gram.c
> without -O2 !
> However, this is a 44,498 lines file and it may take a while for finding the
> root issue.

When the regression tests fail, a file called
regression.diffs gets generated in src/test/regress showing the
difference between the results generated and the results expected when
running the SQL queries which are part of the regression tests.
Attaching this file to this thread would help in determining what's
wrong with the regression tests.

> I invite anyone involved in porting/using PostgreSQL 10.1 on AIX to take
> part of a discussion about how to make v10.1 work perfectly on AIX.

Note that xlc 12.1 is tested with AIX machines on the buildfarms, but
there is no coverage for 13.1 visibly. It would be nice if you could
set up a buildfarm machine to catch problems way earlier.
-- 
Michael



Re: [HACKERS] UPDATE of partition key

2017-11-23 Thread Amit Khandekar
On 7 November 2017 at 00:33, Robert Haas  wrote:
> +   /* The caller must have already locked all the partitioned tables. */
> +   root_rel = heap_open(root_relid, NoLock);
> +   *all_part_cols = NULL;
> +   foreach(lc, partitioned_rels)
> +   {
> +   Index   rti = lfirst_int(lc);
> +   Oid relid = getrelid(rti, rtables);
> +   Relationpart_rel = heap_open(relid, NoLock);
> +
> +   pull_child_partition_columns(part_rel, root_rel, 
> all_part_cols);
> +   heap_close(part_rel, NoLock);
>
> I don't like the fact that we're opening and closing the relation here
> just to get information on the partitioning columns.  I think it would
> be better to do this someplace that already has the relation open and
> store the details in the RelOptInfo.  set_relation_partition_info()
> looks like the right spot.

It seems, for UPDATE, baserel RelOptInfos are created only for the
subplan relations, not for the partitioned tables. I verified that
build_simple_rel() does not get called for partitioned tables for
UPDATE.

In earlier versions of the patch, we used to collect the partition
keys while expanding the partition tree so that we could get them
while the relations are open. After some reviews, I was inclined to
think that the collection logic better be moved out into the
inheritance_planner(), because it involved pulling the attributes from
partition key expressions, and the bitmap operation, which would be
unnecessarily done for SELECTs as well.

On the other hand, if we collect the partition keys separately in
inheritance_planner(), then as you say, we need to open the relations.
Opening the relation which is already in relcache and which is already
locked, involves only a hash lookup. Do you think this is expensive?
I am open for either of these approaches.

If we collect the partition keys in expand_partitioned_rtentry(), we
need to pass the root relation also, so that we can convert the
partition key attributes to root rel descriptor. And the other thing
is, may be, we can check beforehand (in expand_inherited_rtentry)
whether the rootrte's updatedCols is empty, which I think implies that
it's not an UPDATE operation. If yes, we can just skip collecting the
partition keys.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] logical decoding of two-phase transactions

2017-11-23 Thread Nikhil Sontakke
Hi all,

>
> I think Nikhils has done some significant work on this patch.
> Hopefully he'll be able to share it.
>

PFA, latest patch. This builds on top of the last patch submitted by
Sokolov Yura and adds the actual logical replication interfaces to
allow PREPARE or COMMIT/ROLLBACK PREPARED on a logical subscriber.

I tested with latest PG head by setting up PUBLICATION/SUBSCRIPTION
for some tables. I tried DML on these tables via 2PC and it seems to
work with subscribers honoring COMMIT|ROLLBACK PREPARED commands.

Now getting back to the two main issues that we have been discussing:

Logical decoding deadlocking/hanging due to locks on catalog tables


When we are decoding, we do not hold long term locks on the table. We
do RelationIdGetRelation() and RelationClose() which
increments/decrements ref counts. Also this ref count is held/released
per ReorderBuffer change record. The call to RelationIdGetRelation()
holds an AccessShareLock on pg_class, pg_attribute etc. while building
the relation descriptor. The plugin itself can access rel/syscache but
none of it holds a lock stronger than AccessShareLock on the catalog
tables.

Even activities like:

ALTER user_table;
CLUSTER user_table;

Do not hold locks that will allow decoding to stall.

The only issue could be with locks on catalog objects itself in the
prepared transaction.

Now if the 2PC transaction is taking an AccessExclusiveLock on catalog
objects via "LOCK pg_class"
for example, then pretty much nothing else will progress ahead in
other sessions in the database
till this active session COMMIT PREPAREs or aborts this 2PC transaction.

Also, in some cases like CLUSTER on catalog objects, the code
explicitly denies preparation of a 2PC transaction.

postgres=# BEGIN;
postgres=# CLUSTER pg_class using pg_class_oid_index ;
postgres=# PREPARE TRANSACTION 'test_prepared_lock';
ERROR:  cannot PREPARE a transaction that modified relation mapping

This makes sense because we do not want to get into a state where the
DB is unable to progress meaningfully at all.

Is there any other locking scenario that we need to consider?
Otherwise, are we all ok on this point being a non-issue for 2PC
logical decoding?

Now on to the second issue:

2PC Logical decoding with concurrent "ABORT PREPARED" of the same transaction
==============================================================================

Before 2PC, we always decoded regular committed transaction records.
Now with prepared
transactions, we run the risk of running decoding when some other
backend could come in and
COMMIT PREPARE or ROLLBACK PREPARE simultaneously. If the other backend commits,
that's not an issue at all.

The issue is with a concurrent rollback of the prepared transaction.
We need a way to ensure that
the 2PC does not abort when we are in the midst of a change record
apply activity.

One way to handle this is to ensure that we interlock the abort
prepared with an ongoing logical decoding operation for a bounded
period of maximum one change record apply cycle.

I am outlining one solution but am all ears for better, elegant solutions.

* We introduce two new booleans in the TwoPhaseState
GlobalTransactionData structure.
  bool beingdecoded;
  bool abortpending;

1) Before we start iterating through the change records, if it happens
to be a prepared transaction, we
check "abortpending" in the corresponding TwoPhaseState entry. If it's
not set, then we set "beingdecoded".
If abortpending is set, we know that this transaction is going to go
away and we treat it like a regular abort and do
not do any decoding at all.

2) With "beingdecoded" set, we start with the first change record from
the iteration, decode it and apply it.

3) Before starting decode of the next change record, we re-check if
"abortpending" is set. If "abortpending"
is set, we do not decode the next change record. Thus the abort is
delay-bounded to a maximum of one change record decoding/apply cycle
after we signal our intent to abort it. Then, we need to send ABORT
(regular, not rollback prepared, since we have not sent "PREPARE" yet.
We cannot send PREPARE midways because the transaction block on the
whole might not be consistent) to the subscriber. We will have to add
an ABORT callback in pgoutput for this. There's only a COMMIT callback
as of now. The subscribers will ABORT this transaction midways due to
this. We can then follow this up with a DUMMY prepared txn. E.g.
"BEGIN; PREPARE TRANSACTION 'gid'"; The reasoning for the DUMMY 2PC is
mentioned below in (6).

4) Keep decoding change records as long as "abortpending" is not set.

5) At end of the change set, send "PREPARE" to the subscribers and
then remove the "beingdecoded" flag from the TwoPhaseState entry. We
are now free to commit/rollback the prepared transaction anytime.

6) We will still decode the "ROLLBACK PREPARED" wal entry when it
comes to us on the provider. This will call the abort_prepared
callback on the subscriber. I have alread
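
A standalone sketch of the interlock described in steps 1-5 above follows,
with plain booleans standing in for the TwoPhaseState flags.  A real
implementation would of course need locking or memory barriers around the
flag updates and proper output-plugin callbacks instead of printf; this only
models the control flow.

#include <stdbool.h>
#include <stdio.h>

typedef struct GlobalTransaction
{
    bool beingdecoded;   /* a walsender is between PREPARE decode start/end */
    bool abortpending;   /* a backend wants to ROLLBACK PREPARED */
} GlobalTransaction;

/* decoder side: iterate change records, re-checking abortpending each time */
static void
decode_prepared_txn(GlobalTransaction *gxact, int nchanges)
{
    if (gxact->abortpending)
    {
        printf("txn already being aborted: treat like a regular abort\n");
        return;
    }
    gxact->beingdecoded = true;

    for (int i = 0; i < nchanges; i++)
    {
        if (gxact->abortpending)
        {
            printf("abort detected after %d change(s): send ABORT downstream\n", i);
            gxact->beingdecoded = false;
            return;
        }
        printf("decode+apply change %d\n", i);
        if (i == 1)
            gxact->abortpending = true;   /* simulate a concurrent ROLLBACK PREPARED */
    }

    printf("send PREPARE downstream\n");
    gxact->beingdecoded = false;
}

int
main(void)
{
    GlobalTransaction gxact = {false, false};

    decode_prepared_txn(&gxact, 5);
    return 0;
}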

Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key

2017-11-23 Thread amul sul
On Sat, Nov 11, 2017 at 1:05 AM, Robert Haas  wrote:
> On Wed, Sep 27, 2017 at 7:07 AM, amul sul  wrote:
>> Attaching POC patch that throws an error in the case of a concurrent update
>> to an already deleted tuple due to UPDATE of partition key[1].
>>
>> In a normal update new tuple is linked to the old one via ctid forming
>> a chain of tuple versions but UPDATE of partition key[1] move tuple
>> from one partition to an another partition table which breaks that chain.
>
> This patch needs a rebase.  It has one whitespace-only hunk that
> should possibly be excluded.
>
Thanks for looking at the patch.

> I think the basic idea of this is sound.  Either you or Amit need to
> document the behavior in the user-facing documentation, and it needs
> tests that hit every single one of the new error checks you've added
> (obviously, the tests will only work in combination with Amit's
> patch).  The isolation should be sufficient to write such tests.
>
> It needs some more extensive comments as well.  The fact that we're
> assigning a meaning to ip_blkid -> InvalidBlockNumber is a big deal,
> and should at least be documented in itemptr.h in the comments for the
> ItemPointerData structure.
>
> I suspect ExecOnConflictUpdate needs an update for the
> HeapTupleUpdated case similar to what you've done elsewhere.
>

UPDATE of partition key v25 [1] has documented this concurrent scenario;
I need to rework that documentation paragraph to include this behaviour and
will discuss it with Amit.

The attached 0001 patch includes error checks for 8 functions; out of those 8
I am able to build isolation tests for 4 functions -- ExecUpdate, ExecDelete,
GetTupleForTrigger & ExecLockRows.

The remaining ones are EvalPlanQualFetch, ExecOnConflictUpdate,
RelationFindReplTupleByIndex & RelationFindReplTupleSeq.  Note that the check in
RelationFindReplTupleByIndex & RelationFindReplTupleSeq will emit a LOG, not an
ERROR.

Any help/suggestion to build test for these function would be much appreciated.


1] 
http://postgr.es/m/CAJ3gD9f4Um99sOJmVSEPj783VWw5GbXkgU3OWcYZJv=ipjt...@mail.gmail.com


Regards,
Amul


0001-POC-Invalidate-ip_blkid-v2.patch
Description: Binary data


0002-isolation-tests-v1.patch
Description: Binary data
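
A standalone sketch of the marker idea from this thread: the old tuple version
left behind by a cross-partition UPDATE gets InvalidBlockNumber in its ctid,
and anything that later tries to follow the update chain detects the marker
and raises an error instead of silently operating on a row that has moved.
Stub types only; not the actual heapam/executor code from the POC patch.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define InvalidBlockNumber 0xFFFFFFFFu

typedef struct ItemPointer
{
    uint32_t ip_blkid;   /* block number, or InvalidBlockNumber marker */
    uint16_t ip_posid;   /* line pointer offset */
} ItemPointer;

typedef struct TupleVersion
{
    ItemPointer t_ctid;  /* points at the newer version, or at itself */
} TupleVersion;

/* what the UPDATE that moves a row to another partition would leave behind */
static void
mark_moved_to_other_partition(TupleVersion *oldtup)
{
    oldtup->t_ctid.ip_blkid = InvalidBlockNumber;
    oldtup->t_ctid.ip_posid = 0;
}

/* what a concurrent updater chasing the chain would have to check */
static bool
follow_update_chain(const TupleVersion *tup)
{
    if (tup->t_ctid.ip_blkid == InvalidBlockNumber)
    {
        fprintf(stderr,
                "ERROR: tuple was moved to another partition by a concurrent update\n");
        return false;
    }
    printf("follow ctid to (%u,%u)\n",
           (unsigned) tup->t_ctid.ip_blkid, (unsigned) tup->t_ctid.ip_posid);
    return true;
}

int
main(void)
{
    TupleVersion old = {{42, 7}};

    follow_update_chain(&old);            /* normal update chain */
    mark_moved_to_other_partition(&old);
    follow_update_chain(&old);            /* detects the marker */
    return 0;
}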


PostgreSLQ v10.1 and xlC compiler on AIX

2017-11-23 Thread REIX, Tony
Hi,

We are porting PostgreSQL v10.1 on AIX (7.2 for now).
And we have several tests failures, in 32bit and 64bit.
We are using xlc 13.01.0003.0003 with -O2.
Tests were 100% OK with version 9.6.2 .

About 32 bit failures within v10.1, we have 3 failures including:
 create_aggregate ... FAILED
 aggregates   ... FAILED

I've found that these 2 failures disappear when building :
   ./src/backend/parser/gram.c
without -O2 !
However, this is a 44,498-line file and it may take a while to find the 
root issue.

I invite anyone involved in porting/using PostgreSQL 10.1 on AIX to take part 
of a discussion about how to make v10.1 work perfectly on AIX.

Regards,

Cordialement,

Tony Reix

Bull - ATOS
IBM Coop Architect & Technical Leader
Office : +33 (0) 4 76 29 72 67
1 rue de Provence - 38432 Échirolles - France
www.atos.net


Re: With commit 4e5fe9ad19, range partition missing handling for the NULL partition key

2017-11-23 Thread Rushabh Lathia
On Wed, Nov 22, 2017 at 2:36 PM, Amit Langote  wrote:

> On 2017/11/22 17:42, amul sul wrote:
> > On Wed, Nov 22, 2017 at 11:38 AM, Amit Langote wrote:
> >> On 2017/11/22 13:45, Rushabh Lathia wrote:
> >>> Attaching patch to fix as well as regression test.
> >>
> >> Thanks for the patch.  About the code, how about do it like the attached
> >> instead?
> >>
> >
> > Looks good to me, even we can skip the following change in v2 patch:
> >
> > 19 @@ -2560,6 +2559,8 @@ get_partition_for_tuple(Relation relation,
> > Datum *values, bool *isnull)
> >  20  */
> >  21 part_index =
> > partdesc->boundinfo->indexes[bound_offset + 1];
> >  22 }
> >  23 +   else
> >  24 +   part_index = partdesc->boundinfo->default_index;
> >  25 }
> >  26 break;
> >  27
> >
> > default_index will get assign by following code in
> get_partition_for_tuple() :
> >
> >/*
> >  * part_index < 0 means we failed to find a partition of this parent.
> >  * Use the default partition, if there is one.
> >  */
> > if (part_index < 0)
> > part_index = partdesc->boundinfo->default_index;
>
> Good point.  Updated patch attached.
>

Thanks Amit.

Patch looks good to me.


-- 
Rushabh Lathia
www.EnterpriseDB.com


How is the PostgreSQL debuginfo file generated

2017-11-23 Thread Rui Hai Jiang
Hello hackers,

I’m wondering how to build the debuginfo package from  the PostgreSQL source.

For example to generate something like this one :   
postgresql91-debuginfo.x86_64.

Is there existing support in the current PostgreSQL Makefiles to generate such 
debuginfo? I have searched in the source code and could not find anything.  How 
were the existing debuginfo packages created?

Thank you!

Ruihai



Re: [HACKERS] [POC] Faster processing at Gather node

2017-11-23 Thread Rafia Sabih
On Thu, Nov 16, 2017 at 12:24 AM, Andres Freund  wrote:
> Hi,
>
> On 2017-11-15 13:48:18 -0500, Robert Haas wrote:
>> I think that we need a little bit deeper analysis here to draw any
>> firm conclusions.
>
> Indeed.
>
>
>> I suspect that one factor is that many of the queries actually send
>> very few rows through the Gather.
>
> Yep. I kinda wonder if the same result would present if the benchmarks
> were run with parallel_leader_participation. The theory being what were
> seing is just that the leader doesn't accept any tuples, and the large
> queue size just helps because workers can run for longer.
>
I ran Q12 with parallel_leader_participation = off and could not get
any performance improvement with the patches given by Robert.  The
result was the same for head as well.  The query plan also remains
unaffected by the value of this parameter.

Here are the details of the experiment,
TPC-H scale factor = 20,
work_mem = 1GB
random_page_cost = seq_page_cost = 0.1
max_parallel_workers_per_gather = 4

PG commit: 745948422c799c1b9f976ee30f21a7aac050e0f3

Please find the attached file for the explain analyse output for
either values of parallel_leader_participation and patches.
-- 
Regards,
Rafia Sabih
EnterpriseDB: http://www.enterprisedb.com/
with parallel_leader_participation = 1;

 QUERY PLAN 
 
 
--
-
 Limit  (cost=1001.19..504469.79 rows=1 width=27) (actual 
time=21833.206..21833.207 rows=1 loops=1)
   ->  GroupAggregate  (cost=1001.19..3525281.42 rows=7 width=27) (actual 
time=21833.203..21833.203 rows=1 loops=1)
 Group Key: lineitem.l_shipmode
 ->  Gather Merge  (cost=1001.19..3515185.81 rows=576888 width=27) 
(actual time=4.388..21590.757 rows=311095 loops=1)
   Workers Planned: 4
   Workers Launched: 4
   ->  Nested Loop  (cost=1.13..3445472.82 rows=144222 width=27) 
(actual time=0.150..4399.384 rows=62247 loops=5)
 ->  Parallel Index Scan using l_shipmode on lineitem  
(cost=0.57..3337659.69 rows=144222 width=19) (actual time=0.111..3772.865 
rows=62247 loops=5)
   Index Cond: (l_shipmode = ANY ('{"REG 
AIR",RAIL}'::bpchar[]))
   Filter: ((l_commitdate < l_receiptdate) AND 
(l_shipdate < l_commitdate) AND (l_receiptdate >= '1995-01-01'::date) AND 
(l_receiptdate < '1996-01-01 00:00:00'::timestamp without
 time zone))
   Rows Removed by Filter: 3367603
 ->  Index Scan using orders_pkey on orders  
(cost=0.56..0.75 rows=1 width=20) (actual time=0.009..0.009 rows=1 loops=311236)
   Index Cond: (o_orderkey = lineitem.l_orderkey)
 Planning time: 0.526 ms
 Execution time: 21835.922 ms
(15 rows)

postgres=# set parallel_leader_participation =0;
SET
postgres=# \i /data/rafia.sabih/dss/queries/12.sql 

  QUERY PLAN
  
 
--
-
 Limit  (cost=1001.19..504469.79 rows=1 width=27) (actual 
time=21179.065..21179.066 rows=1 loops=1)
   ->  GroupAggregate  (cost=1001.19..3525281.42 rows=7 width=27) (actual 
time=21179.064..21179.064 rows=1 loops=1)
 Group Key: lineitem.l_shipmode
 ->  Gather Merge  (cost=1001.19..3515185.81 rows=576888 width=27) 
(actual time=4.201..20941.385 rows=311095 loops=1)
   Workers Planned: 4
   Workers Launched: 4
   ->  Nested Loop  (cost=1.13..3445472.82 rows=144222 width=27) 
(actual time=0.187..5105.780 rows=77797 loops=4)
 ->  Parallel Index Scan using l_shipmode on lineitem  
(cost=0.57..3337659.69 rows=144222 width=19) (actual time=0.149..4362.235 
rows=77797 loops=4)
   Index Cond: (l_shipmode = ANY ('{"REG 
AIR",RAIL}'::bpchar[]))
   Filter: ((l_commitdate < l_receiptdate) AND 
(l_shipdate < l_commitdate) AND (l_receiptdate >= '1995-01-01'::date) AND 
(l_receiptdate < '1996-01-01 00:00:00'::timestamp without
 time zone))
   Rows Removed by Filter: 4208802
 ->  Index Scan using orders_pkey on orders  
(cost=0.56..0.75 rows=1 width=20) (actual time=0.008..0.008 rows=1 loops=311187)
   Index Cond: (o_orderkey = l

Re: [HACKERS] Parallel Append implementation

2017-11-23 Thread amul sul
Looks like it is the same crash that v20 claimed to fix; indeed, I
missed adding fix [1] in the v20 patch, sorry about that. The attached updated
patch includes the aforementioned fix.


1] 
http://postgr.es/m/CAAJ_b97kLNW8Z9nvc_JUUG5wVQUXvG=f37WsX8ALF0A=kah...@mail.gmail.com


Regards,
Amul

On Thu, Nov 23, 2017 at 1:50 PM, Rajkumar Raghuwanshi
 wrote:
> On Thu, Nov 23, 2017 at 9:45 AM, amul sul  wrote:
>>
>> Attaching updated version of "ParallelAppend_v19_rebased" includes this
>> fix.
>
>
> Hi,
>
> I have applied attached patch and got a crash with below query. please take
> a look.
>
> CREATE TABLE tbl (a int, b int, c text, d int) PARTITION BY LIST(c);
> CREATE TABLE tbl_p1 PARTITION OF tbl FOR VALUES IN ('', '0001', '0002',
> '0003');
> CREATE TABLE tbl_p2 PARTITION OF tbl FOR VALUES IN ('0004', '0005', '0006',
> '0007');
> CREATE TABLE tbl_p3 PARTITION OF tbl FOR VALUES IN ('0008', '0009', '0010',
> '0011');
> INSERT INTO tbl SELECT i % 20, i % 30, to_char(i % 12, 'FM'), i % 30
> FROM generate_series(0, 999) i;
> ANALYZE tbl;
>
> EXPLAIN ANALYZE SELECT c, sum(a), avg(b), COUNT(*) FROM tbl GROUP BY c
> HAVING avg(d) < 15 ORDER BY 1, 2, 3;
> WARNING:  terminating connection because of crash of another server process
> DETAIL:  The postmaster has commanded this server process to roll back the
> current transaction and exit, because another server process exited
> abnormally and possibly corrupted shared memory.
> HINT:  In a moment you should be able to reconnect to the database and
> repeat your command.
> server closed the connection unexpectedly
> This probably means the server terminated abnormally
> before or while processing the request.
> The connection to the server was lost. Attempting reset: Failed.
> !>
>
>
> stack-trace is given below.
>
> Reading symbols from /lib64/libnss_files.so.2...Reading symbols from
> /usr/lib/debug/lib64/libnss_files-2.12.so.debug...done.
> done.
> Loaded symbols for /lib64/libnss_files.so.2
> Core was generated by `postgres: parallel worker for PID 104999
> '.
> Program terminated with signal 11, Segmentation fault.
> #0  0x006dc4b3 in ExecProcNode (node=0x7f7f7f7f7f7f7f7e) at
> ../../../src/include/executor/executor.h:238
> 238if (node->chgParam != NULL) /* something changed? */
> Missing separate debuginfos, use: debuginfo-install
> keyutils-libs-1.4-5.el6.x86_64 krb5-libs-1.10.3-65.el6.x86_64
> libcom_err-1.41.12-23.el6.x86_64 libselinux-2.0.94-7.el6.x86_64
> openssl-1.0.1e-57.el6.x86_64 zlib-1.2.3-29.el6.x86_64
> (gdb) bt
> #0  0x006dc4b3 in ExecProcNode (node=0x7f7f7f7f7f7f7f7e) at
> ../../../src/include/executor/executor.h:238
> #1  0x006dc72e in ExecAppend (pstate=0x1947ed0) at nodeAppend.c:207
> #2  0x006d1e7c in ExecProcNodeInstr (node=0x1947ed0) at
> execProcnode.c:446
> #3  0x006dcef1 in ExecProcNode (node=0x1947ed0) at
> ../../../src/include/executor/executor.h:241
> #4  0x006dd398 in fetch_input_tuple (aggstate=0x1947fe8) at
> nodeAgg.c:699
> #5  0x006e02f7 in agg_fill_hash_table (aggstate=0x1947fe8) at
> nodeAgg.c:2536
> #6  0x006dfb37 in ExecAgg (pstate=0x1947fe8) at nodeAgg.c:2148
> #7  0x006d1e7c in ExecProcNodeInstr (node=0x1947fe8) at
> execProcnode.c:446
> #8  0x006d1e4d in ExecProcNodeFirst (node=0x1947fe8) at
> execProcnode.c:430
> #9  0x006c9439 in ExecProcNode (node=0x1947fe8) at
> ../../../src/include/executor/executor.h:241
> #10 0x006cbd73 in ExecutePlan (estate=0x1947590,
> planstate=0x1947fe8, use_parallel_mode=0 '\000', operation=CMD_SELECT,
> sendTuples=1 '\001', numberTuples=0,
> direction=ForwardScanDirection, dest=0x192acb0, execute_once=1 '\001')
> at execMain.c:1718
> #11 0x006c9a12 in standard_ExecutorRun (queryDesc=0x194ffc0,
> direction=ForwardScanDirection, count=0, execute_once=1 '\001') at
> execMain.c:361
> #12 0x006c982e in ExecutorRun (queryDesc=0x194ffc0,
> direction=ForwardScanDirection, count=0, execute_once=1 '\001') at
> execMain.c:304
> #13 0x006d096c in ParallelQueryMain (seg=0x18aa2a8,
> toc=0x7f899a227000) at execParallel.c:1271
> #14 0x0053272d in ParallelWorkerMain (main_arg=1218206688) at
> parallel.c:1149
> #15 0x007e8ca5 in StartBackgroundWorker () at bgworker.c:841
> #16 0x007fc035 in do_start_bgworker (rw=0x18ced00) at
> postmaster.c:5741
> #17 0x007fc377 in maybe_start_bgworkers () at postmaster.c:5945
> #18 0x007fb406 in sigusr1_handler (postgres_signal_arg=10) at
> postmaster.c:5134
> #19 
> #20 0x003dd26e1603 in __select_nocancel () at
> ../sysdeps/unix/syscall-template.S:82
> #21 0x007f6bfa in ServerLoop () at postmaster.c:1721
> #22 0x007f63e9 in PostmasterMain (argc=3, argv=0x18a8180) at
> postmaster.c:1365
> #23 0x0072cb4c in main (argc=3, argv=0x18a8180) at main.c:228
> (gdb)
>
>
> Thanks & Regards,
> Rajkumar Raghuwanshi
> QMG, EnterpriseDB Corporation


ParallelAppend_v21.patch
Descriptio

Re: [HACKERS] Parallel Append implementation

2017-11-23 Thread Rajkumar Raghuwanshi
On Thu, Nov 23, 2017 at 9:45 AM, amul sul  wrote:

> Attaching updated version of "ParallelAppend_v19_rebased" includes this
> fix.
>

Hi,

I have applied the attached patch and got a crash with the below query. Please
take a look.

CREATE TABLE tbl (a int, b int, c text, d int) PARTITION BY LIST(c);
CREATE TABLE tbl_p1 PARTITION OF tbl FOR VALUES IN ('', '0001', '0002',
'0003');
CREATE TABLE tbl_p2 PARTITION OF tbl FOR VALUES IN ('0004', '0005', '0006',
'0007');
CREATE TABLE tbl_p3 PARTITION OF tbl FOR VALUES IN ('0008', '0009', '0010',
'0011');
INSERT INTO tbl SELECT i % 20, i % 30, to_char(i % 12, 'FM'), i % 30
FROM generate_series(0, 999) i;
ANALYZE tbl;

EXPLAIN ANALYZE SELECT c, sum(a), avg(b), COUNT(*) FROM tbl GROUP BY c
HAVING avg(d) < 15 ORDER BY 1, 2, 3;
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and
repeat your command.
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!>


stack-trace is given below.

Reading symbols from /lib64/libnss_files.so.2...Reading symbols from
/usr/lib/debug/lib64/libnss_files-2.12.so.debug...done.
done.
Loaded symbols for /lib64/libnss_files.so.2
Core was generated by `postgres: parallel worker for PID
104999 '.
Program terminated with signal 11, Segmentation fault.
#0  0x006dc4b3 in ExecProcNode (node=0x7f7f7f7f7f7f7f7e) at
../../../src/include/executor/executor.h:238
238if (node->chgParam != NULL) /* something changed? */
Missing separate debuginfos, use: debuginfo-install
keyutils-libs-1.4-5.el6.x86_64 krb5-libs-1.10.3-65.el6.x86_64
libcom_err-1.41.12-23.el6.x86_64 libselinux-2.0.94-7.el6.x86_64
openssl-1.0.1e-57.el6.x86_64 zlib-1.2.3-29.el6.x86_64
(gdb) bt
#0  0x006dc4b3 in ExecProcNode (node=0x7f7f7f7f7f7f7f7e) at
../../../src/include/executor/executor.h:238
#1  0x006dc72e in ExecAppend (pstate=0x1947ed0) at nodeAppend.c:207
#2  0x006d1e7c in ExecProcNodeInstr (node=0x1947ed0) at
execProcnode.c:446
#3  0x006dcef1 in ExecProcNode (node=0x1947ed0) at
../../../src/include/executor/executor.h:241
#4  0x006dd398 in fetch_input_tuple (aggstate=0x1947fe8) at
nodeAgg.c:699
#5  0x006e02f7 in agg_fill_hash_table (aggstate=0x1947fe8) at
nodeAgg.c:2536
#6  0x006dfb37 in ExecAgg (pstate=0x1947fe8) at nodeAgg.c:2148
#7  0x006d1e7c in ExecProcNodeInstr (node=0x1947fe8) at
execProcnode.c:446
#8  0x006d1e4d in ExecProcNodeFirst (node=0x1947fe8) at
execProcnode.c:430
#9  0x006c9439 in ExecProcNode (node=0x1947fe8) at
../../../src/include/executor/executor.h:241
#10 0x006cbd73 in ExecutePlan (estate=0x1947590,
planstate=0x1947fe8, use_parallel_mode=0 '\000', operation=CMD_SELECT,
sendTuples=1 '\001', numberTuples=0,
direction=ForwardScanDirection, dest=0x192acb0, execute_once=1 '\001')
at execMain.c:1718
#11 0x006c9a12 in standard_ExecutorRun (queryDesc=0x194ffc0,
direction=ForwardScanDirection, count=0, execute_once=1 '\001') at
execMain.c:361
#12 0x006c982e in ExecutorRun (queryDesc=0x194ffc0,
direction=ForwardScanDirection, count=0, execute_once=1 '\001') at
execMain.c:304
#13 0x006d096c in ParallelQueryMain (seg=0x18aa2a8,
toc=0x7f899a227000) at execParallel.c:1271
#14 0x0053272d in ParallelWorkerMain (main_arg=1218206688) at
parallel.c:1149
#15 0x007e8ca5 in StartBackgroundWorker () at bgworker.c:841
#16 0x007fc035 in do_start_bgworker (rw=0x18ced00) at
postmaster.c:5741
#17 0x007fc377 in maybe_start_bgworkers () at postmaster.c:5945
#18 0x007fb406 in sigusr1_handler (postgres_signal_arg=10) at
postmaster.c:5134
#19 
#20 0x003dd26e1603 in __select_nocancel () at
../sysdeps/unix/syscall-template.S:82
#21 0x007f6bfa in ServerLoop () at postmaster.c:1721
#22 0x007f63e9 in PostmasterMain (argc=3, argv=0x18a8180) at
postmaster.c:1365
#23 0x0072cb4c in main (argc=3, argv=0x18a8180) at main.c:228
(gdb)


Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation