Re: [HACKERS] ICU integration
On Wed, Aug 31, 2016 at 2:46 PM, Peter Eisentraut wrote:
> Here is a patch I've been working on to allow the use of ICU for sorting
> and other locale things.

This is very interesting work, and it's great to see some development in this area. I've been peripherally involved in more than one collation-change-broke-my-data incident over the years. I took the patch for a quick spin today. Here are a couple of initial observations.

This patch adds pkg-config support to our configure script in order to produce the build options for ICU. That's cool, and I'm a fan of pkg-config, but it's an extra dependency that I just wanted to highlight. For example, MacOSX appears to ship with ICU but has no pkg-config or ICU .pc files out of the box (erm, I can't even find the headers, so maybe that copy of ICU is useless to everyone except Apple). The MacPorts ICU port ships with .pc files, so that's easy to deal with, and I don't expect it to be difficult to get a working pkg-config and metadata installed alongside ICU on any platform that doesn't already ship with them. It may well be useful for configuring other packages. (There is also other cool stuff that would become possible once ICU is optionally around, but that's off topic.)

There is something missing from the configure script, however: the output of pkg-config is not making it into CFLAGS or LDFLAGS, so building fails on FreeBSD and MacOSX, where for example the ICU headers and libraries don't live in the default search path. I tried very briefly to work out why, but autoconfuse defeated me.

You call the built-in strcoll/strxfrm collation provider "posix", and although POSIX does define those functions, it only does so because it inherits them from ISO C90. As far as I can tell Windows provides those functions because it conforms to the C spec, not the POSIX spec, though I suppose you could argue that in that respect it DOES conform to the POSIX spec... Also, we already have a collation called "POSIX".
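For what it's worth, the flags configure needs can be pulled out by hand; this is only a sketch of what would have to end up in CFLAGS/LDFLAGS, not the patch's actual autoconf macro. "icu-uc" and "icu-i18n" are the .pc module names an ICU installation provides:

```shell
# Sketch: what configure ultimately needs to propagate into the build.
# "icu-uc" and "icu-i18n" are the .pc modules shipped by ICU.
ICU_CFLAGS=$(pkg-config --cflags icu-uc icu-i18n)
ICU_LIBS=$(pkg-config --libs icu-uc icu-i18n)
echo "ICU_CFLAGS=$ICU_CFLAGS"
echo "ICU_LIBS=$ICU_LIBS"
```

On FreeBSD or with MacPorts, this is where the non-default -I/-L paths would come from.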
Maybe we should avoid confusing people with a "posix" provider and a "POSIX" collation? But then I'm not sure what else to call it; perhaps "system" as Petr Jelinek suggested[1], or "libc".

postgres=# select * from pg_collation where collname like 'en_NZ%';
┌──────────────────┬───────────────┬───────────┬──────────────┬──────────────┬──────────────────┬──────────────────┬─────────────┐
│     collname     │ collnamespace │ collowner │ collprovider │ collencoding │   collcollate    │    collctype     │ collversion │
├──────────────────┼───────────────┼───────────┼──────────────┼──────────────┼──────────────────┼──────────────────┼─────────────┤
│ en_NZ            │            11 │        10 │ p            │            6 │ en_NZ            │ en_NZ            │           0 │
│ en_NZ            │            11 │        10 │ p            │            8 │ en_NZ.ISO8859-1  │ en_NZ.ISO8859-1  │           0 │
│ en_NZ            │            11 │        10 │ p            │           16 │ en_NZ.ISO8859-15 │ en_NZ.ISO8859-15 │           0 │
│ en_NZ.ISO8859-1  │            11 │        10 │ p            │            8 │ en_NZ.ISO8859-1  │ en_NZ.ISO8859-1  │           0 │
│ en_NZ.ISO8859-15 │            11 │        10 │ p            │           16 │ en_NZ.ISO8859-15 │ en_NZ.ISO8859-15 │           0 │
│ en_NZ.UTF-8      │            11 │        10 │ p            │            6 │ en_NZ.UTF-8      │ en_NZ.UTF-8      │           0 │
│ en_NZ%icu        │            11 │        10 │ i            │           -1 │ en_NZ            │ en_NZ            │ -1724383232 │
└──────────────────┴───────────────┴───────────┴──────────────┴──────────────┴──────────────────┴──────────────────┴─────────────┘
(7 rows)

I notice that you use encoding -1, meaning "all". I haven't fully grokked what that really means, but it appears that we won't be able to create any new such collations after initdb using CREATE COLLATION, if for example you update your ICU installation and now have a new collation available. When I tried dropping some and recreating them they finished up with collencoding = 6. Should the initdb-created rows use 6 too?

+   ucol_getVersion(collator, versioninfo);
+   numversion = ntohl(*((uint32 *) versioninfo));
+
+   if (numversion != collform->collversion)
+       ereport(WARNING,
+               (errmsg("ICU collator version mismatch"),
+                errdetail("The database was created using version 0x%08X, the library provides version 0x%08X.",
+                          (uint32) collform->collversion, (uint32) numversion),
+                errhint("Rebuild affected indexes, or build PostgreSQL with the right version of ICU.")));

I played around with bumping version numbers artificially.
That gives each session that accesses the collation one warning:

postgres=# select * from foo order by id;
WARNING:  ICU collator version mismatch
DETAIL:  The database was created using version 0x99380001, the library provides version 0x9938.
HINT:  Rebuild affected indexes, or build PostgreSQL with the right version of ICU.
┌────┐
│ id │
├────┤
└────┘
(0 rows)

postgres=# select * from foo order by id;
┌────┐
│ id │
├────┤
└
Re: [HACKERS] Showing parallel status in \df+
2016-09-23 7:22 GMT+02:00 Rushabh Lathia:
>
> On Thu, Sep 22, 2016 at 10:04 PM, Tom Lane wrote:
>
>> Rushabh Lathia writes:
>> > I agree with the argument in this thread, having "Source code" as part
>> > of \df+ is bit annoying, specifically when output involve some really
>> > big PL language functions. Having is separate does make \df+ output more
>> > readable. So I would vote for \df++ rather then adding the source code
>> > as part of footer for \df+.
>>
>> If it's unreadable in \df+, how would \df++ make that any better?
>>
> Eventhough source code as part of \df+ is bit annoying (specifically for
> PL functions), I noticed the argument in this thread that it's useful
> information for some of. So \df++ is just alternate option for the those
> who want the source code.

++ is a little bit obscure. So better to remove src everywhere.

Regards

Pavel

>> regards, tom lane

> --
> Rushabh Lathia
[HACKERS] Re: [HACKERS] Re: [HACKERS] Re: [HACKERS] Re: [HACKERS] Windows service is not starting so there’s message in log: FATAL: "could not create shared memory segment “Global/PostgreSQL.851401618
On Fri, Sep 23, 2016 at 1:21 AM, Tom Lane wrote:
> Robert Haas writes:
>> On Tue, Sep 20, 2016 at 12:53 PM, Tom Lane wrote:
>>> I'd be the first to agree that this point is inadequately documented
>>> in the code, but PostmasterRandom should be reserved for its existing
>>> security-related uses, not exposed to the world for (ahem) random other
>>> uses.
>
>> So, we could have dsm_postmaster_startup() seed the random number
>> generator itself, and then let PostmasterRandom() override the seed
>> later.  Like maybe:
>
> Works for me, at least as a temporary solution.  The disturbing thing
> here is that this still only does what we want if dsm_postmaster_startup
> happens before the first PostmasterRandom call --- which is OK today but
> seems pretty fragile.  Still, redesigning PostmasterRandom's seeding
> technique is not something to do right before 9.6 release.  Let's revert
> the prior patch and do it as you have here:
>
>>     struct timeval tv;
>>     gettimeofday(&tv, NULL);
>>     srandom(tv.tv_sec);
>>     ...
>>     dsm_control_handle = random();
>
> for the time being.

Isn't it better if we use the same technique in dsm_create() as well, which uses random() for the handle?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] File system operations.
On 22 September 2016 at 20:02, Yury Zhuravlev wrote:
> On Thursday, 15 September 2016 19:09:16 MSK, Tom Lane wrote:
>>
>> Robert Haas writes:
>>>
>>> On Thu, Sep 15, 2016 at 11:01 AM, Anastasia Lubennikova wrote:
>>>> What do you think about moving stuff from pg_upgrade/file.c to
>>>> storage/file/ to allow reuse of this code? I think it'll be really
>>>> helpful for developers of contrib modules like backup managers.
>>
>>> Well, storage/file is backend and pg_upgrade is frontend. If you want
>>> to reuse the same code for both it's got to go someplace else.
>>
>> Also, to the extent that it's important to use those wrappers rather
>> than libc directly, it's because the wrappers are tied into backend
>> resource management and error handling conventions.  I don't see how
>> you convert that into code that also works in a frontend program,
>> or what the point would be exactly.
>
> Maybe we should move towards a framework ecosystem. I mean, developers of
> third-party tools must use the maximum of the same API as the backend.

Well, there's a very gradual shift in that direction with libpgcommon etc. But it's minimalist. Most of what's done in the backend makes no sense in frontend code.

It'd make much more sense to adopt an existing portable runtime (nspr, apr, whatever) than write our own if we wanted to go full framework. But I don't think we're likely to. Much more likely to cause pain than prevent it, esp since we're multiprocessing and shmem based.

There are a few areas we could use abstractions for though, like fork()/exec() vs CreateProcessEx(...).

--
Craig Ringer  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] pageinspect: Hash index support
On Fri, Sep 23, 2016 at 9:40 AM, Peter Eisentraut wrote:
> On 9/21/16 9:30 AM, Jesper Pedersen wrote:
>> Attached is v5, which add basic page verification.
>
> There are still some things that I found that appear to follow the old
> style (btree) conventions instead of the new style (brin, gin) conventions.
>
> - Rename hash_metap to hash_metapage_info.
>
> - Don't use BuildTupleFromCStrings().  (There is nothing inherently
> wrong with it, but why convert numbers to strings and back again when it
> can be done more directly.)
>
> - Documentation for hash_page_stats still has blkno argument.
>
> - In hash_metap, the magic field could be type bytea, so that it
> displays in hex.  Or text like the brin functions.
>
> Some other comments:
>
> - hash_metap result fields spares and mapp should be arrays of integer.

How would you like to see both those arrays in the tuple? Right now, I think Jesper's code is showing them as strings.

> (Incidentally, the comment in hash.h talks about bitmaps[] but I think
> it means hashm_mapp[].)

Which comment are you referring to here? hashm_mapp contains block numbers of bitmap pages.

While looking at the patch, I noticed below code which seems somewhat problematic:

+   stat->max_avail = BLCKSZ - (BLCKSZ - phdr->pd_special + SizeOfPageHeaderData);
+
+   /* page type (flags) */
+   if (opaque->hasho_flag & LH_META_PAGE)
+       stat->type = 'm';
+   else if (opaque->hasho_flag & LH_OVERFLOW_PAGE)
+       stat->type = 'v';
+   else if (opaque->hasho_flag & LH_BUCKET_PAGE)
+       stat->type = 'b';
+   else if (opaque->hasho_flag & LH_BITMAP_PAGE)
+       stat->type = 'i';

In the above code, it appears that you are trying to calculate max_avail space for all pages in the same way. Don't you need to calculate it differently for a bitmap page or meta page, as they don't share the same format as the other types of pages?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Showing parallel status in \df+
On Thu, Sep 22, 2016 at 10:04 PM, Tom Lane wrote:
> Rushabh Lathia writes:
> > I agree with the argument in this thread, having "Source code" as part
> > of \df+ is bit annoying, specifically when output involve some really
> > big PL language functions. Having is separate does make \df+ output more
> > readable. So I would vote for \df++ rather then adding the source code
> > as part of footer for \df+.
>
> If it's unreadable in \df+, how would \df++ make that any better?

Even though source code as part of \df+ is a bit annoying (specifically for PL functions), I noticed the argument in this thread that it's useful information for some. So \df++ is just an alternate option for those who want the source code.

> regards, tom lane

--
Rushabh Lathia
Re: [HACKERS] Typo in libpq-int.h
On 09/22/2016 04:35 PM, Daniel Gustafsson wrote:
> Ran into a typo in libpq-int.h while reading/hacking:
>
> -    char    *gsslib;    /* What GSS librart to use ("gssapi" or
> +    char    *gsslib;    /* What GSS library to use ("gssapi" or
>
> Patch attached.

Thanks, fixed.

- Heikki
Re: [HACKERS] [RFC] Should we fix postmaster to avoid slow shutdown?
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Robert Haas
> On Tue, Sep 20, 2016 at 2:20 AM, Tsunakawa, Takayuki wrote:
> > There's no apparent evidence to indicate the cause, but I could guess
> > a few reasons.  What do you think these are correct and should fix
> > PostgreSQL?  (I think so)
>
> I think that we shouldn't start changing things based on guesses about what
> the problem is, even if they're fairly smart guesses.  The thing to do would
> be to construct a test rig, crash the server repeatedly, and add debugging
> instrumentation to figure out where the time is actually going.

We have tried to reproduce the problem in the past several days with much more stress on our environment than on the customer's one -- 1,000 tables aiming for a stats file dozens of times larger, and repeated reconnection requests from hundreds of clients -- but we could not succeed.

> I do think your theory about the stats collector might be worth pursuing.
> It seems that the stats collector only responds to SIGQUIT, ignoring SIGTERM.
> Making it do a clean shutdown on SIGTERM and a fast exit on SIGQUIT seems
> possibly worthwhile.

Thank you for giving me confidence to proceed. And I also believe that postmaster should close the listening ports earlier. Regardless of whether these two changes will solve this particular problem (I'm not confident they will), I think it'd be better to fix these two points so that postmaster doesn't take longer than necessary. I think I'll create a patch after giving it a bit more thought.

Regards
Takayuki Tsunakawa
Re: [HACKERS] Stopping logical replication protocol
On 19 September 2016 at 07:12, Vladimir Gordiychuk wrote:
> New patch in attach. 0001 and 0002 without changes.
> 0003 - patch contain improvements for pg_recvloginca, now pg_recvloginca
> after receive SIGINT will send CopyDone package to postgresql and wait from
> server CopyDone. For backward compatible after repeat send SIGINT
> pg_recvloginca will continue immediately without wait CopyDone from server.
> 0004 patch contain regression tests on scenarios that fix 0001 and 0002
> patches.

Great. Thanks for this.

First observation (which I'll fix in updated patch): it looks like you didn't import my updated patches, so I've rebased your new patches on top of them.

Whitespace issues:

$ git am ~/Downloads/0003-Add-ability-for-pg_recvlogical-safe-stop-replication.patch
Applying: Add ability for pg_recvlogical safe stop replication
/home/craig/projects/2Q/postgres/.git/rebase-apply/patch:140: indent with spaces.
            msgBuf + hdr_len + bytes_written,
/home/craig/projects/2Q/postgres/.git/rebase-apply/patch:550: indent with spaces.
    /* Backward compatible, allow force interrupt logical replication
/home/craig/projects/2Q/postgres/.git/rebase-apply/patch:551: indent with spaces.
     * after second SIGINT without wait CopyDone from server
/home/craig/projects/2Q/postgres/.git/rebase-apply/patch:552: indent with spaces.
     */
warning: 4 lines add whitespace errors.

Remember to use "git log --check" before sending patches.
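As an aside, the check is easy to run locally before formatting patches. A sketch (done here in a throwaway repo so it's self-contained; in real use you'd run the commands from your PostgreSQL checkout):

```shell
# Demo in a throwaway repository.
repo=$(mktemp -d) && cd "$repo" && git init -q .
git config user.email you@example.com && git config user.name you

printf 'int x;\n' > demo.c
git add demo.c && git commit -qm 'clean commit'

# Introduce a whitespace error: a trailing space.
printf 'int y; \n' >> demo.c

# --check makes git flag whitespace errors (trailing space,
# space-before-tab) and exit non-zero when it finds any:
git diff --check || echo "whitespace errors found"

# The same diff option works over commits: git log --check -p <range>
# And when applying a patch that already has such problems, git am can
# repair them during apply:  git am --whitespace=fix <patch>
```

This catches exactly the "indent with spaces" / trailing-whitespace complaints git am printed above.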
Also, comment style: please do

/* this */

or

/*
 * this
 */

not

/* this
 */

I didn't see you explain why this was removed:

-        /* fast path */
-        /* Try to flush pending output to the client */
-        if (pq_flush_if_writable() != 0)
-            WalSndShutdown();
-
-        if (!pq_is_send_pending())
-            return;

I fixed a warning introduced here:

pg_recvlogical.c: In function ‘ProcessXLogData’:
pg_recvlogical.c:289:2: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
  int bytes_left = msgLength - hdr_len;
  ^

Some of the changes to pg_recvlogical look to be copied from receivelog.c, most notably the functions ProcessKeepalive and ProcessXLogData. I thought that rather than copying them, shouldn't the existing ones be modified to meet your needs? But it looks like the issue is that a fair chunk of the rest of pg_recvlogical doesn't re-use code from receivelog.c either; for example, pg_recvlogical's sendFeedback differs from receivelog.c's sendFeedback. So pg_recvlogical doesn't share the globals that receivelog.c assumes are used. Also, what I thought were copied from receivelog.c were actually extracted from code previously inline in StreamLogicalLog(...) in pg_recvlogical.c.

I'm reluctant to try to untangle this in the same patch, for risk that it'll get stalled by issues with refactoring. The change you've already made is a useful cleanup without messing with unnecessary code.

Your patch 3 does something
odd:

 src/test/recovery/t/001_stream_rep.pl                |  63 --
 src/test/recovery/t/002_archiving.pl                 |  53 ---
 src/test/recovery/t/003_recovery_targets.pl          | 146 ---
 src/test/recovery/t/004_timeline_switch.pl           |  75 --
 src/test/recovery/t/005_replay_delay.pl              |  69
 src/test/recovery/t/006_logical_decoding.pl          |  40 --
 src/test/recovery/t/007_sync_rep.pl                  | 174
 src/test/recovery/t/previous/001_stream_rep.pl       |  63 ++
 src/test/recovery/t/previous/002_archiving.pl        |  53 +++
 src/test/recovery/t/previous/003_recovery_targets.pl | 146 +++
 src/test/recovery/t/previous/004_timeline_switch.pl  |  75 ++
 src/test/recovery/t/previous/005_replay_delay.pl     |  69
 src/test/recovery/t/previous/006_logical_decoding.pl |  40 ++
 src/test/recovery/t/previous/007_sync_rep.pl         | 174

so I've revised it to remove that whole change, which I think you probably did not mean to commit.

Reworded commit message.

Ensured that we send feedback just before a client-initiated CopyDone, so the server knows what position we saved up to. We don't discard already buffered data.

I've modified patch 3 so that it also flushes to disk and sends feedback before sending CopyDone, then discards any xlog data received after it sends CopyDone. This is helpful in ensuring that we get predictable behaviour when cancelling pg_recvxlog and restarting it, because the client and server agree on the point at which replay stopped.

I was evaluating the tests and I don't think they actually demonstrate that the patch works. There's nothing done to check that pg_recvl
Re: [HACKERS] pg_ctl promote wait
On Fri, Sep 23, 2016 at 12:16 AM, Masahiko Sawada wrote:
> On Thu, Sep 22, 2016 at 1:25 AM, Peter Eisentraut wrote:
>> On 8/11/16 9:28 AM, Michael Paquier wrote:
>> Committed with that adjustment.

Thanks...

> Commit e7010ce seems to forget to change help message of pg_ctl.
> Attached patch.

And right, we missed that bit.

--
Michael
Re: [HACKERS] Tracking wait event for latches
On Fri, Sep 23, 2016 at 7:02 AM, Robert Haas wrote:
> On Thu, Sep 22, 2016 at 7:10 PM, Thomas Munro wrote:
>> I was thinking about suggesting a category "Replication" to cover the
>> waits for client IO relating to replication, as opposed to client IO
>> waits relating to regular user connections.  Then you could put sync
>> rep into that category instead of IPC, even though technically it is
>> waiting for IPC from walsender process(es), on the basis that it's
>> more newsworthy to a DBA that it's really waiting for a remote replica
>> to respond.  But it's probably pretty clear what's going on from the
>> wait point names, so maybe it's not worth a category.  Thoughts?
>
> I thought about a replication category but either it will only have
> SyncRep in it, which is odd, or it will pull in other things that
> otherwise fit nicely into the Activity category, and then the
> boundaries of all the categories become mushy: is it the subsystem that
> causes the wait that we are trying to document, or the kind of thing
> for which we are waiting?

I also think that it can add some confusion in defining boundaries and might not be consistent with the current way we have defined the waits; however, there is some value in using the subsystem, which is that users can identify bottlenecks with ease. I think this applies to both Replication and Parallel Query.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Tracking wait event for latches
On Thu, Sep 22, 2016 at 7:19 PM, Robert Haas wrote:
> On Wed, Sep 21, 2016 at 5:49 PM, Thomas Munro wrote:
>
> So, I tried to classify these.  Here's what I came up with.
>
> Activity: ArchiverMain, AutoVacuumMain, BgWriterMain,
> BgWriterHibernate, CheckpointerMain, PgStatMain, RecoveryWalAll,
> RecoveryWalStream, SysLoggerMain, WalReceiverMain, WalWriterMain
>
> Client: SecureRead, SecureWrite, SSLOpenServer, WalSenderMain,
> WalSenderWaitForWAL, WalSenderWriteData, WalReceiverWaitStart
>
> Timeout: BaseBackupThrottle, PgSleep, RecoveryApplyDelay
>
> IPC: BgWorkerShutdown, BgWorkerStartup, ExecuteGather,
> MessageQueueInternal, MessageQueuePutMessage, MessageQueueReceive,
> MessageQueueSend, ParallelFinish, ProcSignal, ProcSleep, SyncRep
>
> Extension: Extension

We already call lwlock waits from an extension "extension", so I think just naming this Extension might create some confusion.

> I classified all of the main loop waits as waiting for activity; all
> of those are background processes that are waiting for something to
> happen and are more or less happy to sleep forever until it does.  I
> also included the RecoveryWalAll and RecoveryWalStream events in
> there; those don't have the sort of "main loop" flavor of the others
> but they are happy to wait more or less indefinitely for something to
> occur.  Likewise, it was pretty easy to find all of the events that
> were waiting for client I/O, and I grouped those all under "Client".
> A few of the remaining wait events seemed like they were clearly
> waiting for a particular timeout to expire, so I gave those their own
> "Timeout" category.
>
> I believe these categorizations are actually useful for users.  For
> example, you might want to see all of the waits in the system but
> exclude the "Client", "Activity", and "Timeout" categories because
> those are things that aren't signs of a problem.  A "Timeout" wait is
> one that you explicitly requested, a "Client" wait isn't the server's
> fault, and an "Activity" wait just means nothing is happening.  In
> contrast, a "Lock" or "LWLock" or "IPC" wait shows that something is
> actually delaying work that we'd ideally prefer to have get done
> sooner.
>
> I grouped the rest of this stuff as "IPC" because all of these events
> are cases where one server process is waiting for another server
> process.  That could be further subdivided, of course: most of
> those events are only going to occur in relation to parallel query,
> but I didn't want to group it that way explicitly because both
> background workers and shm_mq have other potential uses.  ProcSignal
> and ProcSleep are related to heavyweight locks and SyncRep is of
> course related to synchronous replication.  But they're all related
> in that one server process is waiting for another server process to
> tell it that a certain state has been reached, so IPC seems like a
> good categorization.
>
> Finally, extensions got their own category in this taxonomy, though I
> wonder if it would be better to instead have
> Activity/ExtensionActivity, Client/ExtensionClient,
> Timeout/ExtensionTimeout, and IPC/ExtensionIPC instead of making it a
> separate toplevel category.

+1.  It can avoid confusion.

> To me, this seems like a pretty solid toplevel categorization and a
> lot more useful than just throwing all of these in one bucket and
> saying "good luck".

Agreed. This categorisation is very good and can help the patch author to proceed.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] pageinspect: Hash index support
On 9/21/16 9:30 AM, Jesper Pedersen wrote:
> Attached is v5, which add basic page verification.

There are still some things that I found that appear to follow the old style (btree) conventions instead of the new style (brin, gin) conventions.

- Rename hash_metap to hash_metapage_info.

- Don't use BuildTupleFromCStrings().  (There is nothing inherently wrong with it, but why convert numbers to strings and back again when it can be done more directly.)

- Documentation for hash_page_stats still has blkno argument.

- In hash_metap, the magic field could be type bytea, so that it displays in hex.  Or text like the brin functions.

Some other comments:

- hash_metap result fields spares and mapp should be arrays of integer. (Incidentally, the comment in hash.h talks about bitmaps[] but I think it means hashm_mapp[].)

- As of very recently, we don't need to move pageinspect--1.5.sql to pageinspect--1.6.sql anymore.  Just add pageinspect--1.5--1.6.sql.

- In the documentation for hash_page_stats(), the "+" sign is misaligned.

- In hash_page_items, the fields itemlen, nulls, vars are always 16, false, false.  So perhaps there is no need for them.  Similarly, the itemlen in hash_page_stats is always 16.

- The data field could be of type bytea.

- Add a pointer in the documentation to the relevant header file.

Bonus:

- Add subsections in the documentation (general functions, B-tree functions, etc.)

- Add tests.

--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Parallel sec scan in plpgsql
On Thu, Sep 22, 2016 at 7:32 PM, Robert Haas wrote:
> On Thu, Sep 22, 2016 at 8:36 AM, Amit Kapila wrote:
>> I think for certain cases like into clause, the rows passed will be
>> equal to actual number of rows, otherwise it will generate error.  So
>> we can pass that information in executor layer.  Another kind of cases
>> which are worth considering are when from plpgsql we fetch limited
>> rows at-a-time, but we fetch till end like the case of
>> exec_stmt_return_query().
>
> Yes, I think that those cases can be considered.  Some careful code
> inspection will be needed to make sure the cases we want to enable are
> safe, and some testing will be needed to make sure they behave
> properly.  But it doesn't sound like an unsolvable problem.  I hope
> someone who isn't me will decide to work on it.  :-)

Makes sense. I think along with that we can also evaluate whether we can enable parallel query from other PL languages. If we can enable parallelism from all PL languages in 10.0, that will be a good step forward.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
On Fri, Sep 23, 2016 at 5:14 AM, Tomas Vondra wrote:
> On 09/21/2016 08:04 AM, Amit Kapila wrote:
>
> (c) Although it's not visible in the results, 4.5.5 almost perfectly
> eliminated the fluctuations in the results. For example when 3.2.80
> produced this results (10 runs with the same parameters):
>
> 12118 11610 27939 11771 18065
> 12152 14375 10983 13614 11077
>
> we get this on 4.5.5
>
> 37354 37650 37371 37190 37233
> 38498 37166 36862 37928 38509
>
> Notice how much more even the 4.5.5 results are, compared to 3.2.80.

How long was each run? Generally, I do a half-hour run to get stable results.

> (d) There's no sign of any benefit from any of the patches (it was only
> helpful >= 128 clients, but that's where the tps actually dropped on 3.2.80
> - apparently 4.5.5 fixes that and the benefit is gone).
>
> It's a bit annoying that after upgrading from 3.2.80 to 4.5.5, the
> performance with 32 and 64 clients dropped quite noticeably (by more than
> 10%). I believe that might be a kernel regression, but perhaps it's a price
> for improved scalability for higher client counts.
>
> It of course begs the question what kernel version is running on the machine
> used by Dilip (i.e. cthulhu)? Although it's a Power machine, so I'm not sure
> how much the kernel matters on it.

cthulhu is an x86 machine and the kernel version is 3.10. Seeing the above results, I think the kernel version does matter, but does that mean we ignore the benefits we are seeing on a somewhat older kernel version? I think the right answer here is to do some experiments which can show the actual contention, as suggested by Robert and you.

> I'll ask someone else with access to this particular machine to repeat the
> tests, as I have a nagging suspicion that I've missed something important
> when compiling / running the benchmarks. I'll also retry the benchmarks on
> 3.2.80 to see if I get the same numbers.
>
>> Okay, but I think it is better to see the results between 64~128
>> client count and may be greater than 128 client counts, because it is
>> clear that patch won't improve performance below that.
>
> There are results for 64, 128 and 192 clients. Why should we care about
> numbers in between? How likely (and useful) would it be to get improvement
> with 96 clients, but no improvement for 64 or 128 clients?

The only point was to see where we start seeing improvement; saying that the TPS has improved from >=72 clients is different from saying that it has improved from >=128.

>> No issues, I have already explained why I think it is important to
>> reduce the remaining CLOGControlLock contention in yesterday's and
>> this mail.  If none of you is convinced, then I think we have no
>> choice but to drop this patch.
>
> I agree it's useful to reduce lock contention in general, but considering
> the last set of benchmarks shows no benefit with recent kernel, I think we
> really need a better understanding of what's going on, what workloads /
> systems it's supposed to improve, etc.
>
> I don't dare to suggest rejecting the patch, but I don't see how we could
> commit any of the patches at this point. So perhaps "returned with feedback"
> and resubmitting in the next CF (along with analysis of improved workloads)
> would be appropriate.

Agreed with your conclusion, and I have changed the status of the patch in the CF accordingly. Many thanks for doing the tests.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
On Fri, Sep 23, 2016 at 7:17 AM, Tomas Vondra wrote: > On 09/23/2016 03:20 AM, Robert Haas wrote: >> >> On Thu, Sep 22, 2016 at 7:44 PM, Tomas Vondra >> wrote: >>> >>> I don't dare to suggest rejecting the patch, but I don't see how >>> we could commit any of the patches at this point. So perhaps >>> "returned with feedback" and resubmitting in the next CF (along >>> with analysis of improvedworkloads) would be appropriate. >> >> >> I think it would be useful to have some kind of theoretical analysis >> of how much time we're spending waiting for various locks. So, for >> example, suppose we one run of these tests with various client >> counts - say, 1, 8, 16, 32, 64, 96, 128, 192, 256 - and we run >> "select wait_event from pg_stat_activity" once per second throughout >> the test. Then we see how many times we get each wait event, >> including NULL (no wait event). Now, from this, we can compute the >> approximate percentage of time we're spending waiting on >> CLogControlLock and every other lock, too, as well as the percentage >> of time we're not waiting for lock. That, it seems to me, would give >> us a pretty clear idea what the maximum benefit we could hope for >> from reducing contention on any given lock might be. >> > > Yeah, I think that might be a good way to analyze the locks in general, not > just got these patches. 24h run with per-second samples should give us about > 86400 samples (well, multiplied by number of clients), which is probably > good enough. > > We also have LWLOCK_STATS, that might be interesting too, but I'm not sure > how much it affects the behavior (and AFAIK it also only dumps the data to > the server log). > Right, I think LWLOCK_STATS give us the count of how many time we have blocked due to particular lock like below where *blk* gives that number. 
PID 164692 lwlock main 11: shacq 2734189 exacq 146304 blk 73808 spindelay 73 dequeue self 57241

I think doing some experiments with both techniques can help us to take a call on these patches. Do we want these experiments on different kernel versions, are we okay with the current version on cthulhu (3.10), or do we want to only consider the results with the latest kernel?

>> >> >> Now, we could also try that experiment with various patches. If we >> can show that some patch reduces CLogControlLock contention without >> increasing TPS, they might still be worth committing for that >> reason. Otherwise, you could have a chicken-and-egg problem. If >> reducing contention on A doesn't help TPS because of lock B and >> vice versa, then does that mean we can never commit any patch to >> reduce contention on either lock? Hopefully not. But I agree with you >> that there's certainly not enough evidence to commit any of these >> patches now. To my mind, these numbers aren't convincing. >> > > Yes, the chicken-and-egg problem is why the tests were done with unlogged > tables (to work around the WAL lock). >

Yeah, but I suspect there was still an impact due to ProcArrayLock. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
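For anyone wanting to tally the LWLOCK_STATS dumps across backends, a line like the one above is straightforward to parse. A rough sketch (the field layout is taken solely from the sample line above; adjust the pattern to match the output of your build):

```python
import re
from collections import Counter

# Matches LWLOCK_STATS dump lines of the shape shown above, e.g.
# "PID 164692 lwlock main 11: shacq 2734189 exacq 146304 blk 73808 ..."
LINE_RE = re.compile(
    r"PID (?P<pid>\d+) lwlock (?P<tranche>\w+) (?P<id>\d+): "
    r"shacq (?P<shacq>\d+) exacq (?P<exacq>\d+) blk (?P<blk>\d+)"
)

def total_blocked(log_lines):
    """Sum the 'blk' counter per (tranche, lock id) across all backends."""
    blocked = Counter()
    for line in log_lines:
        m = LINE_RE.search(line)
        if m:
            key = (m.group("tranche"), int(m.group("id")))
            blocked[key] += int(m.group("blk"))
    return blocked
```

Feeding it the server log then gives one blocked-count per lock, which can be compared before and after applying each patch.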
Re: [HACKERS] 9.6 TAP tests and extensions
On 23 September 2016 at 00:32, Tom Lane wrote: > Craig Ringer writes: >> On 13 September 2016 at 22:02, Tom Lane wrote: >>> Without taking a position on the merits of this patch per se, I'd like >>> to say that I find the argument for back-patching into 9.6 and not >>> further than that to be pretty dubious. $(prove_check) has been there >>> since 9.4, and in the past we've often regretted it when we failed >>> to back-patch TAP infrastructure fixes all the way back to 9.4. > >> No objection to backpatching, I just thought it'd be more intrusive to >> do that than just 9.6. > >> Since 9.5 and older have more limited versions of PostgresNode which >> lack safe_psql, etc, I'm not sure it's very practical for extensions >> to bother running TAP tests on 9.4 and 9.5 anyway. > > Certainly there are restrictions, but I'd imagine that every new release > will be adding features to the TAP test infrastructure for some time to > come. I think it's silly to claim that 9.6 is the first branch where > TAP testing is usable at all.

OK. I didn't intend to claim 9.6 is the first branch where it's usable, just that the differences between 9.4/9.5 and 9.6 mean that supporting both in extensions will likely be more pain than it is worth. Mainly due to the safe_psql change in 2c83f435. It might've been good to backport that, but it was pretty wide reaching and at the time I wasn't thinking in terms of using TAP tests in extensions, so it didn't occur to me. Extension Perl code can always use some adapter/glue code to handle 9.4 and 9.5 if they want, or error if they don't, so it's not a fatal barrier.

Also, despite what I said upthread, there's no need for normal PGXS-using extensions to define their own $(prove_installcheck) replacement. It works as-is. The problems I was having stemmed from the fact that I was working with a pretty nonstandard Makefile without realising that the changes were going to affect prove.
$(prove_check) isn't useful from extensions of course since it expects a temp install and PGXS doesn't know how to make one, but $(prove_installcheck) does the job fine. It's thus sufficient to apply the patch to install the perl modules to 9.4, 9.5 and 9.6. Nothing else is needed. I've attached backports for 9.4 and 9.5. If you were really keen, we could actually backport the new TAP code to 9.4 and 9.5 pretty much wholesale. 9.4 and 9.5 don't have the psql method, PostgresNode, etc, so there's nothing to break. But that's for separate consideration. >> Extension authors can just use: >> ifeq ($(MAJORVERSION),9.6) >> endif >> when defining their prove rules. > > That will break as soon as 10 comes out. And numerical >= tests aren't > all that convenient in Make. It'd be much better if a test on whether > $(prove_check) is defined would be sufficient. Fair enough. I forgot how much numeric range tests sucked in Make. In that case we should backpatch the installation of the Perl modules right back to 9.4. There's not even a need to test if $(prove_installcheck) is defined then, just need a semicolon so it evaluates to empty e.g. prove_installcheck: $(prove_installcheck) ; check: prove_check So ... patches attached. The 9.6 patch applies to head too. 
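As a concrete illustration of Tom's suggestion, an extension Makefile could guard its TAP rule on whether the variable is defined at all, rather than testing version numbers. This is only a hedged sketch (the `tap-installcheck` target name is made up, and it assumes `prove_installcheck` reaches the extension via the usual PGXS include):

```make
# After "include $(PGXS)": prove_installcheck is only defined by
# Makefile.global on servers whose TAP infrastructure is installed,
# so this degrades to a no-op on older branches.
ifdef prove_installcheck
tap-installcheck:
	$(prove_installcheck)
else
tap-installcheck: ;
endif
```

This avoids the `ifeq ($(MAJORVERSION),9.6)` trap that breaks as soon as 10 comes out.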
-- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services From 75c5c0fe98618a6e59af82c4314832df47e6268e Mon Sep 17 00:00:00 2001 From: Craig Ringer Date: Fri, 23 Sep 2016 10:15:00 +0800 Subject: [PATCH] Install Perl modules for TAP tests for use from PGXS --- src/Makefile | 1 + src/test/Makefile | 2 +- src/test/perl/GNUmakefile | 35 +++ 3 files changed, 37 insertions(+), 1 deletion(-) create mode 100644 src/test/perl/GNUmakefile diff --git a/src/Makefile b/src/Makefile index e859826..750c3a9 100644 --- a/src/Makefile +++ b/src/Makefile @@ -25,6 +25,7 @@ SUBDIRS = \ bin \ pl \ makefiles \ + test/perl \ test/regress # There are too many interdependencies between the subdirectories, so diff --git a/src/test/Makefile b/src/test/Makefile index 0fd7eab..ef1dcb5 100644 --- a/src/test/Makefile +++ b/src/test/Makefile @@ -12,6 +12,6 @@ subdir = src/test top_builddir = ../.. include $(top_builddir)/src/Makefile.global -SUBDIRS = regress isolation +SUBDIRS = regress isolation perl $(recurse) diff --git a/src/test/perl/GNUmakefile b/src/test/perl/GNUmakefile new file mode 100644 index 000..09945f7 --- /dev/null +++ b/src/test/perl/GNUmakefile @@ -0,0 +1,35 @@ +#- +# +# Makefile for src/test/perl +# +# Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group +# Portions Copyright (c) 1994, Regents of the University of California +# +# src/test/perl/GNUmakefile +# +#- + +subdir = src/test/perl +top_builddir = ../../.. +include $(top_builddir)/src/Makefile.globa
Re: [HACKERS] pg_basebackup, pg_receivexlog and data durability (was: silent data loss with ext4 / all current versions)
This is looking pretty good. More comments on this patch set:

0001:
- Keep the file order alphabetical in Mkvcbuild.pm.
- Needs nls.mk updates for the file move (in the initdb and pg_basebackup directories).

0002:
- durable_rename needs close(fd) in the non-error path (compare the backend code).
- Changing from fsync() to fsync_name() in some cases means that inaccessible files are now ignored. I guess this would only happen in very obscure circumstances, but it's worth considering whether that is OK.
- You added this comment:

  * XXX: This means that we might not restart if a crash occurs before the
  * fsync below. We probably should create the file in a temporary path
  * like the backend does...

  pg_receivexlog uses the .partial suffix for this reason. Why doesn't pg_basebackup do that?
- In open_walfile, could we move the fsync calls before the fstat or somewhere around there, so we don't have to make the same calls in two different branches?

0003:
- There was a discussion about renaming the --nosync option in initdb to --no-sync, but until that is done, the option in pg_basebackup should stay with what initdb has right now.
- There is a whitespace alignment error in usage(). The -N option should be listed after -n.
- Some fsync calls are not covered by a do_sync conditional. I see some in close_walfile(), HandleCopyStream(), ProcessKeepaliveMsg().

-- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
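The durable-rename pattern under review (fsync the file, rename it, then fsync the containing directory so the new directory entry itself survives a crash) can be sketched stand-alone. This is an illustration of the general POSIX pattern, not the backend's actual durable_rename(); the close-in-the-non-error-path point from the review is marked in a comment:

```python
import os

def durable_rename(oldfile, newfile, dirpath):
    """Sketch of the fsync -> rename -> fsync-directory pattern:
    flush the file contents, rename it, then fsync the parent
    directory so the rename itself is on stable storage."""
    fd = os.open(oldfile, os.O_RDWR)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)  # close the fd in the non-error path too

    os.rename(oldfile, newfile)

    # fsync the directory so the new name is durable (POSIX allows
    # opening a directory read-only just to fsync it).
    dfd = os.open(dirpath, os.O_RDONLY)
    try:
        os.fsync(dfd)
    finally:
        os.close(dfd)
```

Without the final directory fsync, a crash after the rename can leave the old name in place even though the file contents were flushed.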
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
On 09/23/2016 03:20 AM, Robert Haas wrote: On Thu, Sep 22, 2016 at 7:44 PM, Tomas Vondra wrote: I don't dare to suggest rejecting the patch, but I don't see how we could commit any of the patches at this point. So perhaps "returned with feedback" and resubmitting in the next CF (along with analysis of improved workloads) would be appropriate. I think it would be useful to have some kind of theoretical analysis of how much time we're spending waiting for various locks. So, for example, suppose we do one run of these tests with various client counts - say, 1, 8, 16, 32, 64, 96, 128, 192, 256 - and we run "select wait_event from pg_stat_activity" once per second throughout the test. Then we see how many times we get each wait event, including NULL (no wait event). Now, from this, we can compute the approximate percentage of time we're spending waiting on CLogControlLock and every other lock, too, as well as the percentage of time we're not waiting for a lock. That, it seems to me, would give us a pretty clear idea what the maximum benefit we could hope for from reducing contention on any given lock might be.

Yeah, I think that might be a good way to analyze the locks in general, not just for these patches. A 24h run with per-second samples should give us about 86400 samples (well, multiplied by number of clients), which is probably good enough.

We also have LWLOCK_STATS, that might be interesting too, but I'm not sure how much it affects the behavior (and AFAIK it also only dumps the data to the server log).

> Now, we could also try that experiment with various patches. If we can show that some patch reduces CLogControlLock contention without increasing TPS, they might still be worth committing for that reason. Otherwise, you could have a chicken-and-egg problem. If reducing contention on A doesn't help TPS because of lock B and vice versa, then does that mean we can never commit any patch to reduce contention on either lock? Hopefully not.
But I agree with you that there's certainly not enough evidence to commit any of these patches now. To my mind, these numbers aren't convincing. Yes, the chicken-and-egg problem is why the tests were done with unlogged tables (to work around the WAL lock). regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Tracking wait event for latches
On Thu, Sep 22, 2016 at 7:10 PM, Thomas Munro wrote: > Interesting. OK, I agree that it'd be useful to show that we're > waiting because there's nothing happening, or waiting because the user > asked us to sleep, or waiting on IO, or waiting for an IPC response > because something is happening, and that higher level information is > difficult/impossible to extract automatically from the WaitEventSet. Cool. :-) > I understand that "Activity" is the category of wait points that are > waiting for activity, but I wonder if it might be clearer to users if > that were called "Idle", because it's the category of idle waits > caused by non-activity. I thought about that but figured it would be better to consistently state the thing *for which* we were waiting. We wait FOR a client or a timeout or activity. We do not wait FOR idle; we wait to be NOT idle. > Why is WalSenderMain not in that category alongside WalReceiverMain? > Hmm, actually it's kind of a tricky one: whether it's really idle or > waiting for IO depends. It's always ready to wait for clients to send > messages, but I'd say that's part of its "idle" behaviour. But it's > sometimes waiting for the socket to be writable: if > (pq_is_send_pending()) wakeEvents |= WL_SOCKET_WRITABLE, and that's > when it's definitely not idle, it's actively trying to feed WAL down > the pipe. Do we want to get into dynamic categories depending on > conditions like that? I suspect that's overkill. I don't want wait-point-naming to make programming the system noticeably more difficult, so I think it's fine to pick a categorization of what we think the typical case will be and call it good. If we try that and people find it's a nuisance, we can fix it then. In the case of WAL sender, I assume it will normally be waiting for more WAL to be generated; whereas in the case of WAL receiver, I assume it will normally be waiting for more WAL to be received from the remote side. 
The reverse cases are possible: the sender could be waiting for the socket buffer to drain so it can push more WAL onto the wire, and the receiver could likewise be waiting for buffer space to push out feedback messages. But probably mostly not. At least for a first cut, I'd be inclined to handle this fuzziness by putting weasel-words in the documentation rather than by trying to make the reporting 100% perfectly accurate.

> I was thinking about suggesting a category "Replication" to cover the > waits for client IO relating to replication, as opposed to client IO > waits relating to regular user connections. Then you could put sync > rep into that category instead of IPC, even though technically it is > waiting for IPC from walsender process(es), on the basis that it's > more newsworthy to a DBA that it's really waiting for a remote replica > to respond. But it's probably pretty clear what's going on from the > wait point names, so maybe it's not worth a category. Thoughts?

I thought about a replication category, but either it will only have SyncRep in it, which is odd, or it will pull in other things that otherwise fit nicely into the Activity category, and then the boundaries of all the categories become mushy: is it the subsystem that causes the wait that we are trying to document, or the kind of thing for which we are waiting?

> I do suspect that the set of wait points will grow quite a bit as we > develop more parallel stuff though. For example, I have been working > on a patch that adds several more wait points, indirectly via > condition variables (using your patch). Actually in my case it's > BarrierWait -> ConditionVariableWait -> WaitEventSetWait. I propose > that these higher level wait primitives should support passing a wait > point identifier through to WaitEventSetWait. +1.
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
On Thu, Sep 22, 2016 at 7:44 PM, Tomas Vondra wrote: > I don't dare to suggest rejecting the patch, but I don't see how we could > commit any of the patches at this point. So perhaps "returned with feedback" > and resubmitting in the next CF (along with analysis of improved workloads) > would be appropriate.

I think it would be useful to have some kind of theoretical analysis of how much time we're spending waiting for various locks. So, for example, suppose we do one run of these tests with various client counts - say, 1, 8, 16, 32, 64, 96, 128, 192, 256 - and we run "select wait_event from pg_stat_activity" once per second throughout the test. Then we see how many times we get each wait event, including NULL (no wait event). Now, from this, we can compute the approximate percentage of time we're spending waiting on CLogControlLock and every other lock, too, as well as the percentage of time we're not waiting for a lock. That, it seems to me, would give us a pretty clear idea what the maximum benefit we could hope for from reducing contention on any given lock might be.

Now, we could also try that experiment with various patches. If we can show that some patch reduces CLogControlLock contention without increasing TPS, they might still be worth committing for that reason. Otherwise, you could have a chicken-and-egg problem. If reducing contention on A doesn't help TPS because of lock B and vice versa, then does that mean we can never commit any patch to reduce contention on either lock? Hopefully not. But I agree with you that there's certainly not enough evidence to commit any of these patches now. To my mind, these numbers aren't convincing.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
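The sampling analysis described above is easy to post-process. Assuming one wait_event value per backend per second has been collected (e.g. via repeated `SELECT wait_event FROM pg_stat_activity`, with NULL meaning "not waiting"), the percentages fall out of a simple tally; this is just an illustration of the bookkeeping, not anything from the patches:

```python
from collections import Counter

def wait_event_percentages(samples):
    """samples: iterable of wait_event values sampled once per second,
    with None standing in for NULL (backend not waiting on anything).
    Returns a dict mapping each event (or None) to its percentage of
    samples - an estimate of the fraction of time spent in that wait."""
    counts = Counter(samples)
    total = sum(counts.values())
    return {event: 100.0 * n / total for event, n in counts.items()}
```

The percentage attributed to, say, CLogControlLock then bounds the maximum benefit any contention-reduction patch for that lock could deliver.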
Re: [HACKERS] Why postgres take RowExclusiveLock on all partition
On Fri, Sep 16, 2016 at 09:56:39PM +0530, Sachin Kotwal wrote: > Hi Tom, > > What I understood from this https://www.postgresql.org/docs/9.5/static/ > explicit-locking.html#TABLE-LOCK-COMPATIBILITY > is: > > RowExclusiveLock conflicts with queries that want SHARE, SHARE ROW EXCLUSIVE, > EXCLUSIVE, and ACCESS EXCLUSIVE locks. > > In one of our customer environments we want to do some DDL operations every day > through cron jobs. These cron jobs get blocked by the RowExclusiveLock taken by an > UPDATE query. And then a lot more queries are waiting on this cron job, as the SQL > under the cron job holds ACCESS EXCLUSIVE on related tables involved in other select queries. > > > If we cannot reduce locking in the partition scenario, then it is fine. We can > consider this a limitation of PostgreSQL or any other RDBMS system.

We can't have DDL happening while a table is being accessed. I guess we could drop the lock once we are done with the partition, but we don't currently do that, and it would be complicated.

-- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
On 09/21/2016 08:04 AM, Amit Kapila wrote: On Wed, Sep 21, 2016 at 3:48 AM, Tomas Vondra wrote: ... I'll repeat the test on the 4-socket machine with a newer kernel, but that's probably the last benchmark I'll do for this patch for now.

Attached are results from benchmarks running on kernel 4.5 (instead of the old 3.2.80). I've only done synchronous_commit=on, and I've added a few client counts (mostly at the lower end). I've pushed the data to the git repository. The summary looks like this (showing both the 3.2.80 and 4.5.5 results):

1) Dilip's workload

    3.2.80                16     32     64    128    192
    ----------------------------------------------------
    master             26138  37790  38492  13653   8337
    granular-locking   25661  38586  40692  14535   8311
    no-content-lock    25653  39059  41169  14370   8373
    group-update       26472  39170  42126  18923   8366

    4.5.5                  1      8     16     32     64    128    192
    ------------------------------------------------------------------
    granular-locking    4050  23048  27969  32076  34874  36555  37710
    no-content-lock     4025  23166  28430  33032  35214  37576  39191
    group-update        4002  23037  28008  32492  35161  36836  38850
    master              3968  22883  27437  32217  34823  36668  38073

2) pgbench

    3.2.80                16     32     64    128    192
    ----------------------------------------------------
    master             22904  36077  41295  35574   8297
    granular-locking   23323  36254  42446  43909   8959
    no-content-lock    23304  36670  42606  48440   8813
    group-update       23127  36696  41859  46693   8345

    4.5.5                  1      8     16     32     64    128    192
    ------------------------------------------------------------------
    granular-locking    3116  19235  27388  29150  31905  34105  36359
    no-content-lock     3206  19071  27492  29178  32009  34140  36321
    group-update        3195  19104  26888  29236  32140  33953  35901
    master              3136  18650  26249  28731  31515  33328  35243

The 4.5 kernel clearly changed the results significantly:

(a) Compared to the results from the 3.2.80 kernel, some numbers improved, some got worse. For example, on 3.2.80 pgbench did ~23k tps with 16 clients, on 4.5.5 it does 27k tps. With 64 clients the performance dropped from 41k tps to ~34k (on master).

(b) The drop above 64 clients is gone - on 3.2.80 it dropped very quickly to only ~8k with 192 clients. On 4.5 the tps actually continues to increase, and we get ~35k with 192 clients.
(c) Although it's not visible in the results, 4.5.5 almost perfectly eliminated the fluctuations in the results. For example, where 3.2.80 produced these results (10 runs with the same parameters): 12118 11610 27939 11771 18065 12152 14375 10983 13614 11077 we get this on 4.5.5: 37354 37650 37371 37190 37233 38498 37166 36862 37928 38509 Notice how much more even the 4.5.5 results are, compared to 3.2.80.

(d) There's no sign of any benefit from any of the patches (it was only helpful >= 128 clients, but that's where the tps actually dropped on 3.2.80 - apparently 4.5.5 fixes that and the benefit is gone).

It's a bit annoying that after upgrading from 3.2.80 to 4.5.5, the performance with 32 and 64 clients dropped quite noticeably (by more than 10%). I believe that might be a kernel regression, but perhaps it's a price for improved scalability for higher client counts. It of course begs the question what kernel version is running on the machine used by Dilip (i.e. cthulhu)? Although it's a Power machine, so I'm not sure how much the kernel matters on it. I'll ask someone else with access to this particular machine to repeat the tests, as I have a nagging suspicion that I've missed something important when compiling / running the benchmarks. I'll also retry the benchmarks on 3.2.80 to see if I get the same numbers.

Okay, but I think it is better to see the results between 64~128 client count and may be greater than 128 client counts, because it is clear that patch won't improve performance below that. There are results for 64, 128 and 192 clients. Why should we care about numbers in between? How likely (and useful) would it be to get improvement with 96 clients, but no improvement for 64 or 128 clients? >> I agree with Robert that the cases the patch is supposed to improve are a bit contrived because of the very high client counts.
No issues, I have already explained why I think it is important to reduce the remaining CLOGControlLock contention in yesterday's and this mail. If none of you is convinced, then I think we have no choice but to drop this patch. I agree it's useful to reduce lock contention in general, but considering the last set of
Re: [HACKERS] pg_upgrade vs user created range type extension
On 09/22/2016 07:33 PM, Tom Lane wrote: Andrew Dunstan writes: I have just encountered an apparent bug in pg_upgrade (or possibly pg_dump). Hmm, it sort of looks like pg_dump believes it should dump the range's constructor function in binary-upgrade mode, while the backend is creating the constructor function during CREATE TYPE anyway. But if that's the case, upgrade of user-defined range types would never have worked ... seems like we should have noticed before now. If that diagnosis is correct, we should either change pg_dump to not dump that function, or change CREATE TYPE AS RANGE to not auto-create the constructor functions in binary-upgrade mode. The latter might be more flexible in the long run. Yeah, I think your diagnosis is correct. I'm not sure I see the point of the flexibility given that you can't specify a constructor function for range types (if that feature had been available I would probably have used it in this extension). cheers andrew
Re: [HACKERS] pg_upgrade vs user created range type extension
Andrew Dunstan writes: > I have just encountered an apparent bug in pg_upgrade (or possibly pg_dump). Hmm, it sort of looks like pg_dump believes it should dump the range's constructor function in binary-upgrade mode, while the backend is creating the constructor function during CREATE TYPE anyway. But if that's the case, upgrade of user-defined range types would never have worked ... seems like we should have noticed before now. If that diagnosis is correct, we should either change pg_dump to not dump that function, or change CREATE TYPE AS RANGE to not auto-create the constructor functions in binary-upgrade mode. The latter might be more flexible in the long run. regards, tom lane
Re: [HACKERS] Tracking wait event for latches
On Fri, Sep 23, 2016 at 1:49 AM, Robert Haas wrote: > On Wed, Sep 21, 2016 at 5:49 PM, Thomas Munro > wrote: >>> Moreover, it's pretty confusing that we have this general concept of >>> wait events in pg_stat_activity, and then here the specific type of >>> wait event we're waiting for is the ... wait event kind. Uh, what? >> >> Yeah, that is confusing. It comes about because of the coincidence >> that pg_stat_activity finished up with a wait_event column, and our IO >> multiplexing abstraction finished up with the name WaitEventSet. >> We could instead call these new things "wait >> points", because, well, erm, they name points in the code at which we >> wait. They don't name events (they just happen to use the >> WaitEventSet mechanism, which itself does involve events: but those >> events are things like "hey, this socket is now ready for >> writing"). > > Sure, we could do that, but it means breaking backward compatibility > for pg_stat_activity *again*. I'd be willing to do it but I don't > think I'd win any popularity contests. I didn't mean changing the column headings in pg_stat_activity. I meant that the enum called WaitEventIdentifier in Michael's v5 patch should be called WaitPointIdentifier, and if we go with a single name to appear in the wait_event_type column then it could be "WaitPoint". But I would also prefer to see something more informative in that column, as discussed below (and upthread). >> Well we already discussed a couple of different ways to get "Socket", >> "Latch", "Socket|Latch", ... or something like that into the >> wait_event_type column or new columns. Wouldn't that be better, and >> automatically fall out of the code rather than needing human curated >> categories? Although Michael suggested that that should be done as a >> separate patch on top of the basic feature. > > I think making that a separate patch is just punting the decision down > the field to a day that may never come. 
Let's try to agree on > something that we can all live with and implement it now. In terms of > avoiding human-curated categories, I basically see two options: > > 1. Find a name that is sufficiently generic that it covers everything > that might reach WaitEventSetWait either now or in the future when it > might wait for even more kinds of things than it does now. For > example, we could call it "Stuff" or "Thing". Less tongue-in-cheek > suggestions are welcome, but it's hard to come up with something that > is sufficiently-generic without being tautological. "Event" is an > example of a name which is general enough to encompass everything but > also stupid: the column is called "wait_event" so everything that > appears in it is an event by definition. > > 2. Classify the events that fall into this category by some rigid > principle based on the types of things being awaited. For example, we > could decide that if any sockets are awaited, the event class will be > "Client" if it is connected to a user and "IPC" for auxiliary > processes. > > [...] > occur. Likewise, it was pretty easy to find all of the events that > were waiting for client I/O, and I grouped those all under "Client". > A few of the remaining wait events seemed like they were clearly > waiting for a particular timeout to expire, so I gave those their own > "Timeout" category. Interesting. OK, I agree that it'd be useful to show that we're waiting because there's nothing happening, or waiting because the user asked us to sleep, or waiting on IO, or waiting for an IPC response because something is happening, and that higher level information is difficult/impossible to extract automatically from the WaitEventSet. I understand that "Activity" is the category of wait points that are waiting for activity, but I wonder if it might be clearer to users if that were called "Idle", because it's the category of idle waits caused by non-activity. 
Why is WalSenderMain not in that category alongside WalReceiverMain? Hmm, actually it's kind of a tricky one: whether it's really idle or waiting for IO depends. It's always ready to wait for clients to send messages, but I'd say that's part of its "idle" behaviour. But it's sometimes waiting for the socket to be writable: if (pq_is_send_pending()) wakeEvents |= WL_SOCKET_WRITABLE, and that's when it's definitely not idle, it's actively trying to feed WAL down the pipe. Do we want to get into dynamic categories depending on conditions like that? I was thinking about suggesting a category "Replication" to cover the waits for client IO relating to replication, as opposed to client IO waits relating to regular user connections. Then you could put sync rep into that category instead of IPC, even though technically it is waiting for IPC from walsender process(es), on the basis that it's more newsworthy to a DBA that it's really waiting for a remote replica to respond. But it's probably pretty clear what's going on from the the wait point names, so maybe it's not worth a categ
[HACKERS] pg_upgrade vs user created range type extension
I have just encountered an apparent bug in pg_upgrade (or possibly pg_dump). To recreate, install the cranges extension - which can be obtained via "git clone https://bitbucket.org/adunstan/pg-closed-ranges.git"; - for both 9.4 and 9.5. Create a fresh 9.4 instance, create a database and in it run "create extension cranges schema pg_catalog". Then create a fresh 9.5 instance and try to pg_upgrade from the 9.4 instance to the 9.5 instance. Here's the tail of the log: pg_restore: creating SCHEMA "public" pg_restore: creating COMMENT "SCHEMA "public"" pg_restore: creating EXTENSION "cranges" pg_restore: creating COMMENT "EXTENSION "cranges"" pg_restore: creating SHELL TYPE "pg_catalog.cdaterange" pg_restore: creating FUNCTION "pg_catalog.cdaterange_canonical("cdaterange")" pg_restore: creating TYPE "pg_catalog.cdaterange" pg_restore: creating SHELL TYPE "pg_catalog.cint4range" pg_restore: creating FUNCTION "pg_catalog.cint4range_canonical("cint4range")" pg_restore: creating TYPE "pg_catalog.cint4range" pg_restore: creating SHELL TYPE "pg_catalog.cint8range" pg_restore: creating FUNCTION "pg_catalog.cint8range_canonical("cint8range")" pg_restore: creating TYPE "pg_catalog.cint8range" pg_restore: creating FUNCTION "pg_catalog.cdaterange("date", "date")" pg_restore: [archiver (db)] Error while PROCESSING TOC: pg_restore: [archiver (db)] Error from TOC entry 191; 1255 16389 FUNCTION cdaterange("date", "date") andrew pg_restore: [archiver (db)] could not execute query: ERROR: function "cdaterange" already exists with same argument types Command was: CREATE FUNCTION "cdaterange"("date", "date") RETURNS "cdaterange" LANGUAGE "internal" IMMUTABLE AS $$range_construct... The end result is that I currently can't upgrade a database using this extension, which is rather ugly. Similar things happen if I put the extension in public instead of pg_catalog. 
cheers andrew
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Thu, Sep 22, 2016 at 3:51 PM, Heikki Linnakangas wrote: > It'd be good if you could overlap the final merges in the workers with the > merge in the leader. ISTM it would be quite straightforward to replace the > final tape of each worker with a shared memory queue, so that the leader > could start merging and returning tuples as soon as it gets the first tuple > from each worker. Instead of having to wait for all the workers to complete > first. If you do that, make sure to have the leader read multiple tuples at a time from each worker whenever possible. It makes a huge difference to performance. See bc7fcab5e36b9597857fa7e3fa6d9ba54aaea167. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
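The combination of the two suggestions - merge on the fly as soon as each worker has produced a tuple, but refill from each worker in batches rather than one tuple at a time - amounts to a K-way merge over per-worker buffers. A toy sketch in plain Python (nothing to do with the actual tuplesort code; sorted iterators stand in for the workers' shared memory queues):

```python
import heapq
from itertools import islice

def batched_merge(streams, batch_size=64):
    """K-way merge over per-worker sorted streams, pulling batch_size
    items at a time from each stream instead of one-by-one, so that
    per-fetch overhead (queue round-trips, in the real case) is
    amortized across many tuples."""
    # Prime one buffer per worker.
    bufs = [list(islice(s, batch_size)) for s in streams]
    idxs = [0] * len(streams)
    heap = [(buf[0], w) for w, buf in enumerate(bufs) if buf]
    heapq.heapify(heap)
    while heap:
        val, w = heapq.heappop(heap)
        yield val
        idxs[w] += 1
        if idxs[w] == len(bufs[w]):
            # Buffer exhausted: refill a whole batch from this worker.
            bufs[w] = list(islice(streams[w], batch_size))
            idxs[w] = 0
        if bufs[w]:
            heapq.heappush(heap, (bufs[w][idxs[w]], w))
```

The leader can start yielding merged output as soon as every non-empty stream has delivered its first batch, without waiting for any worker to finish its run.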
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On 08/02/2016 01:18 AM, Peter Geoghegan wrote: No merging in parallel -- Currently, merging worker *output* runs may only occur in the leader process. In other words, we always keep n worker processes busy with scanning-and-sorting (and maybe some merging), but then all processes but the leader process grind to a halt (note that the leader process can participate as a scan-and-sort tuplesort worker, just as it will everywhere else, which is why I specified "parallel_workers = 7" but talked about 8 workers). One leader process is kept busy with merging these n output runs on the fly, so things will bottleneck on that, which you saw in the example above. As already described, workers will sometimes merge in parallel, but only their own runs -- never another worker's runs. I did attempt to address the leader merge bottleneck by implementing cross-worker run merging in workers. I got as far as implementing a very rough version of this, but initial results were disappointing, and so that was not pursued further than the experimentation stage. Parallel merging is a possible future improvement that could be added to what I've come up with, but I don't think that it will move the needle in a really noticeable way. It'd be good if you could overlap the final merges in the workers with the merge in the leader. ISTM it would be quite straightforward to replace the final tape of each worker with a shared memory queue, so that the leader could start merging and returning tuples as soon as it gets the first tuple from each worker. Instead of having to wait for all the workers to complete first. - Heikki
Re: [HACKERS] Re: [HACKERS] Re: [HACKERS] Re: [HACKERS] Windows service is not starting so there’s message in log: FATAL: "could not create shared memory segment “Global/PostgreSQL.851401618”: Permiss
Robert Haas writes: > On Tue, Sep 20, 2016 at 12:53 PM, Tom Lane wrote: >> I'd be the first to agree that this point is inadequately documented >> in the code, but PostmasterRandom should be reserved for its existing >> security-related uses, not exposed to the world for (ahem) random other >> uses. > So, we could have dsm_postmaster_startup() seed the random number > generator itself, and then let PostmasterRandom() override the seed > later. Like maybe: Works for me, at least as a temporary solution. The disturbing thing here is that this still only does what we want if dsm_postmaster_startup happens before the first PostmasterRandom call --- which is OK today but seems pretty fragile. Still, redesigning PostmasterRandom's seeding technique is not something to do right before 9.6 release. Let's revert the prior patch and do it as you have here: > struct timeval tv; > gettimeofday(&tv, NULL); > srandom(tv.tv_sec); > ... > dsm_control_handle = random(); for the time being. regards, tom lane
Re: [HACKERS] Possibly too stringent Assert() in b-tree code
Robert Haas writes: > On Mon, Sep 19, 2016 at 7:07 AM, Amit Kapila wrote: >> I think you have a valid point. It seems we don't need to write WAL >> for reuse page (aka don't call _bt_log_reuse_page()), if the page is >> new, as the only purpose of that log is to handle conflict based on >> transaction id stored in special area which will be anyway zero. > +1. This is clearly an oversight in Simon's patch fafa374f2, which introduced this code without any consideration for the possibility that the page doesn't have a valid special area. We could prevent the crash by doing nothing if PageIsNew, a la

    if (_bt_page_recyclable(page))
    {
        /*
         * If we are generating WAL for Hot Standby then create a
         * WAL record that will allow us to conflict with queries
         * running on standby.
         */
-       if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
+       if (XLogStandbyInfoActive() && RelationNeedsWAL(rel) &&
+           !PageIsNew(page))
        {
            BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);

            _bt_log_reuse_page(rel, blkno, opaque->btpo.xact);
        }

        /* Okay to use page. Re-initialize and return it */

but I'm not very clear on whether this is a safe fix, mainly because I don't understand what the reuse WAL record really accomplishes. Maybe we need to instead generate a reuse record with some special transaction ID indicating worst-case assumptions? regards, tom lane
Re: [HACKERS] Use of SizeOfIptrData - is that obsolete?
Pavan Deolasee writes: > On Tue, Sep 20, 2016 at 8:34 PM, Tom Lane wrote: >> Realistically, because struct HeapTupleHeaderData contains a field of >> type ItemPointerData, it's probably silly to imagine that we can save >> anything if the compiler can't be persuaded to believe that >> sizeof(ItemPointerData) is 6. It may well be that the structure pragmas >> work on everything that wouldn't natively believe that anyway. > Yeah, that's what I thought and rest of the code seems to make that > assumption as well. Attached patch removes the last remaining reference to > SizeOfIptrData and also removes the macro and the associated comment. I thought removing the comment altogether was not appropriate, because it remains true that you want to work really hard to ensure that sizeof(ItemPointerData) is 6. We're just giving up on pretense of support for compilers that don't believe that. I'm half tempted to introduce a StaticAssert about it, but refrained for the moment. > While htup.h refactoring happened in 9.5, I don't see any point in back > patching this. Agreed. Pushed to HEAD only. regards, tom lane
Re: [HACKERS] gratuitous casting away const
> On Sep 22, 2016, at 9:14 AM, Tom Lane wrote: > > I'd call this kind of a wash, I guess. I'd be more excited about it if > the change allowed removal of an actual cast-away-of-constness somewhere. > > I suppose it's a bit of a chicken and egg situation, in that the lack > of const markings on leaf subroutines discourages use of "const" in > callers, and you have to start somewhere if you want to make it better. > But I don't really want to just plaster "const" onto individual functions > without some larger vision of where we're going and which code is going > to benefit. Otherwise it seems like mostly just code churn. > > regards, tom lane I have two purposes in doing this. First, I find the code more self-documenting this way. Second, I can get whole directories to compile cleanly without warnings using the -Wcast-qual flag, where currently that flag results in warnings. That makes it possible to add -Wcast-qual to more individual source directories' Makefiles than I can currently do while still using -Werror in Makefile.global. Now, I'm not proposing that everybody else needs to have -Wcast-qual. I'm just saying that I'd like to be able to have that in my copy of the project. mark
Re: [HACKERS] Exclude schema during pg_restore
Hi, On Tue, Sep 20, 2016 at 08:59:37PM -0400, Peter Eisentraut wrote: > On 9/19/16 3:23 PM, Michael Banck wrote: > > Version 2 attached. > > Committed, thanks. Thanks! > I added the new option to the help output in pg_restore. Oh, sorry I missed that. Michael -- Michael Banck Projektleiter / Senior Berater Tel.: +49 2166 9901-171 Fax: +49 2166 9901-100 Email: michael.ba...@credativ.de credativ GmbH, HRB Mönchengladbach 12080 USt-ID-Nummer: DE204566209 Trompeterallee 108, 41189 Mönchengladbach Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer
Re: [HACKERS] Better tracking of free space during SP-GiST index build
Tomas Vondra writes: >> On 08/25/2016 01:45 AM, Tom Lane wrote: >>> I'll put this in the commitfest queue. It could use review from >>> someone with the time and motivation to do performance >>> testing/tuning. > I've been toying with this patch a bit today, particularly looking at > (1) sizing of the cache, and (2) how beneficial it'd be to choose pages > from the cache in a smarter way. Thanks for reviewing! > I wonder why the patch only sets the cache size to 100 items, when we > might fit many more entries into the ~8kB limit. I chose that because it would still work with the minimum allowed page size of 1K. We could make the cache size variable depending on BLCKSZ, but I'm not sure it's worth the trouble. There is some cost to a larger lastUsedPages array, in that you spend more time memcpy'ing it back and forth between the metapage buffer and backend local memory; and I was afraid of that outweighing the incremental gains from a larger cache. > ... I've tried increasing the cache size to 768 > entries, with vast majority of them (~600) allocated to leaf pages. > Sadly, this seems to only increase the CREATE INDEX duration a bit, > without making the index significantly smaller (still ~120MB). Yeah, that's in line with my results: not much further gain from a larger cache. Though if you were testing with the same IRRExplorer data, it's not surprising that our results would match. Would be good to try some other cases... > I do think selecting the page with the least free space would save > additional space. Instead of making SpGistGetBuffer() more complicated, > I've simply shoved a pg_qsort() calls on a few places, sorting the cache > by free space, with unused slots at the end. Clearly, this is quite > expensive and proper patch would have to do that differently (e.g. 
> maintaining the order incrementally), but OTOH it'd allow some > optimizations in SpGistGetBuffer() - the first page with enough free > space would have the smallest amount of free space (more or less). > This actually helped a bit, and the index size dropped by ~2MB. So not > bad, but nowhere close to the initial 132MB -> 120MB improvement. OK, I'll think about how to do that more efficiently. The smaller incremental improvement isn't surprising, because in this example the index would still be 90-something MB if it had no free space at all, so there's going to be decreasing returns from any additional work to avoid wasted free space. But if we can do it cheaply, this does suggest that using pages in order by free space is of value. > It's worth mentioning the spgist fillfactor is a bit crude, most likely > thanks to splits - e.g. the 109MB index still contains ~10MB of free > space on the pages (measures using pageinspect as upper-lower), so > almost 10%. Perhaps it really is time to increase the spgist default > fillfactor? Maybe; I don't think anyone's put much effort into tuning that. > It seems the patch keeps new/empty/deleted pages in the cache, and thus > segregated by type. Is that intentional, or should > SpGistSetLastUsedPage() remove such pages from the cache? Or maybe move > them into a special category? It's true we'll reuse those pages, as > allocNewBuffer() allocates the buffer directly, but those pages are > unlikely to get evicted from the cache due to high freeSpace value > (despite possibly already reused). My thought was that once we've found a new/empty/deleted page, putting it in the cache is a cheap way of making sure it gets used soon. Sure, we could record it in the FSM instead, but that sounds relatively more expensive. That intuition could be wrong of course. > Similarly for completely full pages (with freeSpace==0) - does it make > sense to keep them in the cache? 
Although it's probably harmless, as > those pages will get evicted first if needed. IIRC, we don't have anything to replace the cache entry with at that point, so there's no particular value in marking the entry unused rather than used-but-with-zero-space. It will be first priority for replacement either way, so there seemed little point in writing more code to make it unused. > One thing I'd change is making the SpGistLUPCache dynamic, i.e. storing > the size and lastUsedPagesMap on the meta page. That should allow us > resizing the cache and tweak lastUsedPagesMap in the future. Yeah, probably a good idea. I had thought of bumping SPGIST_MAGIC_NUMBER again if we want to revisit the cache size; but keeping it as a separate field won't add noticeable cost, and it might save some trouble. regards, tom lane
Re: [HACKERS] Re: [HACKERS] Re: [HACKERS] Re: [HACKERS] Re: [HACKERS] Re: [HACKERS] Windows service is not starting so there’s message in log: FATAL: "could not create shared memory segment “Global/Po
Amit Kapila writes: > On Tue, Sep 20, 2016 at 10:33 PM, Tom Lane wrote: >> ISTM both the previous coding and this version can fail for no good >> reason, that is what if GetLastError() happens to return one of these >> error codes as a result of some unrelated failure from before this >> subroutine is entered? That is, wouldn't it be a good idea to >> do SetLastError(0) before calling CreateFileMapping? > Yes, that seems like a good idea. Do you need a patch with some > testing on windows environment? Please; I can't test it. regards, tom lane
Re: [HACKERS] Hash Indexes
On Wed, Sep 21, 2016 at 08:44:15PM +0100, Geoff Winkless wrote: > On 21 September 2016 at 13:29, Robert Haas wrote: > > I'd be curious what benefits people expect to get. > > An edge case I came across the other day was a unique index on a large > string: postgresql popped up and told me that I couldn't insert a > value into the field because the BTREE-index-based constraint wouldn't > support the size of string, and that I should use a HASH index > instead. Which, of course, I can't, because it's fairly clearly > deprecated in the documentation... Thanks for that. Forgot about that bit of nastiness. I came across the above migrating a MySQL app to PostgreSQL. MySQL, I believe, handles this by silently truncating the string on index. PostgreSQL by telling you it can't index. :( So, as a result, AFAIK, I had a choice between a trigger that did a left() on the string and inserts it into a new column on the table that I can then index, or do an index on left(). Either way you wind up re-writing a whole bunch of queries. If I wanted to avoid the re-writes I had the option of making the DB susceptible to poor recovery from crashes, et al. No matter which option I chose, the end result was going to be ugly. It would be good not to have to go ugly in such situations. Sometimes one size does not fit all. For me this would be a second major case where I'd use usable hashed indexes the moment they showed up. Andrew
Re: [HACKERS] Showing parallel status in \df+
Rushabh Lathia writes: > I agree with the argument in this thread, having "Source code" as part > of \df+ is a bit annoying, specifically when output involves some really > big PL language functions. Having it separate does make \df+ output more > readable. So I would vote for \df++ rather than adding the source code > as part of footer for \df+. If it's unreadable in \df+, how would \df++ make that any better? regards, tom lane
Re: [HACKERS] 9.6 TAP tests and extensions
Craig Ringer writes: > On 13 September 2016 at 22:02, Tom Lane wrote: >> Without taking a position on the merits of this patch per se, I'd like >> to say that I find the argument for back-patching into 9.6 and not >> further than that to be pretty dubious. $(prove_check) has been there >> since 9.4, and in the past we've often regretted it when we failed >> to back-patch TAP infrastructure fixes all the way back to 9.4. > No objection to backpatching, I just thought it'd be more intrusive to > do that than just 9.6. > Since 9.5 and older have more limited versions of PostgresNode which > lack safe_psql, etc, I'm not sure it's very practical for extensions > to bother running TAP tests on 9.4 and 9.5 anyway. Certainly there are restrictions, but I'd imagine that every new release will be adding features to the TAP test infrastructure for some time to come. I think it's silly to claim that 9.6 is the first branch where TAP testing is usable at all. > Extension authors can just use: > ifeq ($(MAJORVERSION),9.6) > endif > when defining their prove rules. That will break as soon as 10 comes out. And numerical >= tests aren't all that convenient in Make. It'd be much better if a test on whether $(prove_check) is defined would be sufficient. regards, tom lane
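Testing for the infrastructure rather than the version, as Tom suggests, might look like the following in an extension Makefile. This is only a sketch under the assumption that `$(prove_check)` is defined by the included pgxs machinery when the server branch provides TAP support; the `check-tap` target name is made up for illustration.

```
# Guard TAP rules on $(prove_check) being defined rather than on
# MAJORVERSION, so the rule keeps working when version 10 arrives.
ifdef prove_check
check-tap:
	$(prove_check)
else
check-tap:
	@echo "TAP infrastructure not available; skipping"
endif
```

Compare this with the quoted `ifeq ($(MAJORVERSION),9.6)` approach, which has to be updated for every new release.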
Re: [HACKERS] gratuitous casting away const
Mark Dilger writes: >> On Sep 20, 2016, at 1:06 PM, Tom Lane wrote: >> ... that seems to be discarding type information in order to add >> "const"; does not seem like a net benefit from here. > The following seems somewhere in between, with ItemPointer > changing to const ItemPointerData *. I expect you would not care > for this change, but thought I'd check to see where you draw the line: I'd call this kind of a wash, I guess. I'd be more excited about it if the change allowed removal of an actual cast-away-of-constness somewhere. I suppose it's a bit of a chicken and egg situation, in that the lack of const markings on leaf subroutines discourages use of "const" in callers, and you have to start somewhere if you want to make it better. But I don't really want to just plaster "const" onto individual functions without some larger vision of where we're going and which code is going to benefit. Otherwise it seems like mostly just code churn. regards, tom lane
Re: [HACKERS] ICU integration
Alvaro Herrera writes: > Petr Jelinek wrote: >>> I'm not sure how well it will work to replace all the bits of LIKE and >>> regular expressions with ICU API calls. One problem is that ICU likes >>> to do case folding as a whole string, not by character. I need to do >>> more research about that. >> Can't we use the regular expression api provided by ICU, or is that too >> incompatible? > That probably won't fly very well -- it would be rather absurd to have > different regex behavior when ICU is compiled compared to when it's > not. Perhaps down the road we could have an extension that provides > ICU-based regex operators, but most likely not in core. The regex stuff does not handle case-insensitivity by case-folding the data string, anyway. Rather, it does that by converting 'a' or 'A' in the pattern to the equivalent of '[aA]'. So performance for that step is not terribly critical. regards, tom lane
Re: [HACKERS] [PATCH] pgpassfile connection option
I haven't really thought about this, as I had been asked to make this work as an additional option to the connection parameters... Now that I've looked at it - there is really only the benefit of saving the step of setting the PGPASSFILE environment variable. However, there might be cases in which setting an environment variable might not be the easiest option. On 22.09.2016 at 17:15, Andrew Dunstan wrote: I'm not necessarily opposed to this, but what is the advantage over the existing PGPASSFILE environment setting mechanism?
Re: [HACKERS] ICU integration
Petr Jelinek wrote: > > I'm not sure how well it will work to replace all the bits of LIKE and > > regular expressions with ICU API calls. One problem is that ICU likes > > to do case folding as a whole string, not by character. I need to do > > more research about that. > > Can't we use the regular expression api provided by ICU, or is that too > incompatible? That probably won't fly very well -- it would be rather absurd to have different regex behavior when ICU is compiled compared to when it's not. Perhaps down the road we could have an extension that provides ICU-based regex operators, but most likely not in core. -- Álvaro Herrera  https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] [PATCH] pgpassfile connection option
On 09/22/2016 10:44 AM, Julian Markwort wrote: Hello psql-hackers! We thought it would be advantageous to be able to specify a 'custom' pgpassfile within the connection string along the lines of the existing parameters sslkey and sslcert. Which is exactly what this very compact patch does. The patch is minimally invasive - when no pgpassfile attribute is provided in the connection string, the regular pgpassfile is used. The security-measures (which are limited to checking the permissions for 0600) are kept, however we could loosen that restriction to allow group access as well along the lines of the ssl key file, if this is preferred. (in case multiple users belonging to the same group would like to connect using the same file). The patch applies cleanly to master and compiles and runs as expected (as there are no critical alterations). I've not written any documentation as of now, but I'll follow up closely if there is any interest for this patch. notes: - using ~ to denote the user's home directory in the path does not work, however $HOME works (as this is translated by bash beforehand). - the notation in the custom pgpassfile should follow the notation of the 'default' pgpass files: hostname:port:database:username:password - this has only been tested on linux so far, however due to the nature of the changes I suspect that there is nothing that could go wrong in other environments, although I could test that as well, if deemed necessary. I'm not necessarily opposed to this, but what is the advantage over the existing PGPASSFILE environment setting mechanism? cheers andrew
Re: [HACKERS] pg_ctl promote wait
On Thu, Sep 22, 2016 at 1:25 AM, Peter Eisentraut wrote: > On 8/11/16 9:28 AM, Michael Paquier wrote: >> I have looked at them and the changes are looking fine for me. So I >> have switched the patch as ready for committer, aka you. >> >> Just a nit: >> + if (wait_seconds > 0) >> + { >> + sleep(1); >> + wait_seconds--; >> + continue; >> + } >> This may be better with pg_usleep() instead of sleep(). > > Committed with that adjustment. > Commit e7010ce seems to have forgotten to change the help message of pg_ctl. Patch attached. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center [Attachment: fix_pg_ctl_help.patch]
[HACKERS] [PATCH] pgpassfile connection option
Hello psql-hackers! We thought it would be advantageous to be able to specify a 'custom' pgpassfile within the connection string along the lines of the existing parameters sslkey and sslcert. Which is exactly what this very compact patch does. The patch is minimally invasive - when no pgpassfile attribute is provided in the connection string, the regular pgpassfile is used. The security-measures (which are limited to checking the permissions for 0600) are kept, however we could loosen that restriction to allow group access as well along the lines of the ssl key file, if this is preferred. (in case multiple users belonging to the same group would like to connect using the same file). The patch applies cleanly to master and compiles and runs as expected (as there are no critical alterations). I've not written any documentation as of now, but I'll follow up closely if there is any interest for this patch. notes: - using ~ to denote the user's home directory in the path does not work, however $HOME works (as this is translated by bash beforehand). - the notation in the custom pgpassfile should follow the notation of the 'default' pgpass files: hostname:port:database:username:password - this has only been tested on linux so far, however due to the nature of the changes I suspect that there is nothing that could go wrong in other environments, although I could test that as well, if deemed necessary. 
I'm looking forward to any feedback, Julian -- Julian Markwort Westphalian Wilhelms-University in Münster julian.markw...@uni-muenster.de diff --git a/src/interfaces/libpq/fe-connect.c b/src/interfaces/libpq/fe-connect.c index 9b2839b..5c0d88a 100644 --- a/src/interfaces/libpq/fe-connect.c +++ b/src/interfaces/libpq/fe-connect.c @@ -184,6 +184,10 @@ static const internalPQconninfoOption PQconninfoOptions[] = { "Database-Password", "*", 20, offsetof(struct pg_conn, pgpass)}, + {"pgpassfile", "PGPASSFILE", NULL, NULL, + "Database-Password-File", "", 64, + offsetof(struct pg_conn, pgpassfile)}, + {"connect_timeout", "PGCONNECT_TIMEOUT", NULL, NULL, "Connect-timeout", "", 10, /* strlen(INT32_MAX) == 10 */ offsetof(struct pg_conn, connect_timeout)}, @@ -375,7 +379,7 @@ static int parseServiceFile(const char *serviceFile, bool *group_found); static char *pwdfMatchesString(char *buf, char *token); static char *PasswordFromFile(char *hostname, char *port, char *dbname, - char *username); + char *username, char *pgpassfile); static bool getPgPassFilename(char *pgpassfile); static void dot_pg_pass_warning(PGconn *conn); static void default_threadlock(int acquire); @@ -806,8 +810,9 @@ connectOptions2(PGconn *conn) { if (conn->pgpass) free(conn->pgpass); + /* We'll pass conn->pgpassfile regardless of it's contents - checks happen in PasswordFromFile() */ conn->pgpass = PasswordFromFile(conn->pghost, conn->pgport, - conn->dbName, conn->pguser); + conn->dbName, conn->pguser, conn->pgpassfile); if (conn->pgpass == NULL) { conn->pgpass = strdup(DefaultPassword); @@ -5699,14 +5704,15 @@ pwdfMatchesString(char *buf, char *token) return NULL; } -/* Get a password from the password file. Return value is malloc'd. */ +/* Get a password from the password file or the user-specified pgpassfile. Return value is malloc'd. 
*/ static char * -PasswordFromFile(char *hostname, char *port, char *dbname, char *username) +PasswordFromFile(char *hostname, char *port, char *dbname, char *username, char *pgpassfile) { FILE *fp; - char pgpassfile[MAXPGPATH]; + char temp_path[MAXPGPATH]; struct stat stat_buf; + #define LINELEN NAMEDATALEN*5 char buf[LINELEN]; @@ -5731,8 +5737,13 @@ PasswordFromFile(char *hostname, char *port, char *dbname, char *username) if (port == NULL) port = DEF_PGPORT_STR; - if (!getPgPassFilename(pgpassfile)) - return NULL; + if(!pgpassfile || pgpassfile[0]=='\0') + { + /* If no pgpassfile was supplied by the user or it is empty, we try to get a global pgpassfile */ + if (!getPgPassFilename(temp_path)) + return NULL; + pgpassfile = temp_path; + } /* If password file cannot be opened, ignore it. */ if (stat(pgpassfile, &stat_buf) != 0) diff --git a/src/interfaces/libpq/libpq-int.h b/src/interfaces/libpq/libpq-int.h index 1183323..ae9317f 100644 --- a/src/interfaces/libpq/libpq-int.h +++ b/src/interfaces/libpq/libpq-int.h @@ -317,6 +317,7 @@ struct pg_conn char *replication; /* connect as the replication standby? */ char *pguser; /* Postgres username and password, if any */ char *pgpass; + char *pgpassfile; /* path to a file containing the password */ char *keepalives; /* use TCP keepalives? */ char *keepalives_idle; /* time between TCP keepalives */ char *keepalives_interval; /* time between TCP keepalive -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mai
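With the proposed option, a client could point each connection at its own password file directly in the connection string. The host, database, and file paths below are purely hypothetical examples; the `pgpassfile` keyword is the one added by the patch above, and the file itself keeps the usual `hostname:port:database:username:password` line format.

```
host=db.example.com port=5432 dbname=app user=alice pgpassfile=/home/alice/.pgpass_app
```

When `pgpassfile` is absent or empty, the patched PasswordFromFile() falls back to the standard lookup (PGPASSFILE or ~/.pgpass), so existing connection strings are unaffected.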
Re: [HACKERS] Executor's internal ParamExecData Params versus EvalPlanQual
I wrote: > The reason this happens is that EvalPlanQualBegin copies down all the > es_param_exec_vals values from the outer query into the EPQ execution > state. That's OK for most uses of es_param_exec_vals values, but not > at all for cteParam Params, which are used as internal communication > variables within a plan tree. After further study and caffeine-consumption, I concluded that that's wrong. There's nothing wrong with reusing the result of a CTE that has already been scanned by the original query invocation (its output can't be affected by the new candidate-for-update row). The problem is that ExecInitCteScan is assuming that when it initializes as a follower, the tuplestore read pointer it's given by tuplestore_alloc_read_pointer will be pointing to the start of the tuplestore. That's wrong; the new read pointer is defined as being a clone of read pointer 0, which is already at EOF in this scenario. So a simple and localized fix is to forcibly rewind the new read pointer, as in the attached patch. (This should have negligible cost as long as the tuplestore hasn't yet spilled to disk.) I also considered redefining tuplestore_alloc_read_pointer as creating a read pointer that points to start-of-tuplestore. The other existing callers wouldn't care, because they are working with a just-created, known-empty tuplestore. But there is a risk of breaking third-party code, so this didn't seem like a back-patchable solution. Also, I think the reason why tuplestore_alloc_read_pointer was given this behavior is so that it could succeed even if the tuplestore has already been moved forward and perhaps had old data truncated off it. With the other behavior, it would have to have the same failure cases as tuplestore_rescan. BTW, I no longer think the recursive-union case is broken; rather, failure to copy its communication Param would break it, in scenarios where a WorkTableScan is underneath a SELECT FOR UPDATE that's underneath the RecursiveUnion. 
So that's another reason not to mess with the Param propagation logic.

			regards, tom lane

diff --git a/src/backend/executor/nodeCtescan.c b/src/backend/executor/nodeCtescan.c
index 3c2f684..162650a 100644
*** a/src/backend/executor/nodeCtescan.c
--- b/src/backend/executor/nodeCtescan.c
*** ExecInitCteScan(CteScan *node, EState *e
*** 224,232 ****
--- 224,236 ----
  	{
  		/* Not the leader */
  		Assert(IsA(scanstate->leader, CteScanState));
+ 		/* Create my own read pointer, and ensure it is at start */
  		scanstate->readptr =
  			tuplestore_alloc_read_pointer(scanstate->leader->cte_table,
  										  scanstate->eflags);
+ 		tuplestore_select_read_pointer(scanstate->leader->cte_table,
+ 									   scanstate->readptr);
+ 		tuplestore_rescan(scanstate->leader->cte_table);
  	}

  /*
diff --git a/src/test/isolation/expected/eval-plan-qual.out b/src/test/isolation/expected/eval-plan-qual.out
index 5898d94..10c784a 100644
*** a/src/test/isolation/expected/eval-plan-qual.out
--- b/src/test/isolation/expected/eval-plan-qual.out
*** ta_id ta_value tb_row
*** 163,165 ****
--- 163,186 ----
  1 newTableAValue (1,tableBValue)
  step c2: COMMIT;
+ 
+ starting permutation: wrtwcte readwcte c1 c2
+ step wrtwcte: UPDATE table_a SET value = 'tableAValue2' WHERE id = 1;
+ step readwcte: 
+ 	WITH
+ 	    cte1 AS (
+ 	      SELECT id FROM table_b WHERE value = 'tableBValue'
+ 	    ),
+ 	    cte2 AS (
+ 	      SELECT * FROM table_a
+ 	      WHERE id = (SELECT id FROM cte1)
+ 	      FOR UPDATE
+ 	    )
+ 	SELECT * FROM cte2;
+ 
+ step c1: COMMIT;
+ step c2: COMMIT;
+ step readwcte: <... completed>
+ id             value
+ 
+ 1              tableAValue2
diff --git a/src/test/isolation/specs/eval-plan-qual.spec b/src/test/isolation/specs/eval-plan-qual.spec
index de481a3..7ff6f6b 100644
*** a/src/test/isolation/specs/eval-plan-qual.spec
--- b/src/test/isolation/specs/eval-plan-qual.spec
*** step "readforss" {
*** 103,113 ****
--- 103,129 ----
  	FROM table_a ta
  	WHERE ta.id = 1 FOR UPDATE OF ta;
  }
+ step "wrtwcte"	{ UPDATE table_a SET value = 'tableAValue2' WHERE id = 1; }
  step "c2"	{ COMMIT; }

  session "s3"
  setup		{ BEGIN ISOLATION LEVEL READ COMMITTED; }
  step "read"	{ SELECT * FROM accounts ORDER BY accountid; }
+ 
+ # this test exercises EvalPlanQual with a CTE, cf bug #14328
+ step "readwcte"	{
+ 	WITH
+ 	    cte1 AS (
+ 	      SELECT id FROM table_b WHERE value = 'tableBValue'
+ 	    ),
+ 	    cte2 AS (
+ 	      SELECT * FROM table_a
+ 	      WHERE id = (SELECT id FROM cte1)
+ 	      FOR UPDATE
+ 	    )
+ 	SELECT * FROM cte2;
+ }
+ 
  teardown	{ COMMIT; }

  permutation "wx1" "wx2" "c1" "c2" "read"
*** permutation "writep2" "returningp1" "c1"
*** 118,120 ****
--- 134,137 ----
  permutation "wx2" "partiallock" "c2" "c1" "read"
  permutation "wx2" "lockwithvalues" "c2" "c1" "read"
  permutation "updateforss" "readforss" "c1" "c2"
Re: [HACKERS] Parallel sec scan in plpgsql
On Thu, Sep 22, 2016 at 8:36 AM, Amit Kapila wrote: > I think for certain cases like into clause, the rows passed will be > equal to actual number of rows, otherwise it will generate error. So > we can pass that information in executor layer. Another kind of cases > which are worth considering are when from plpgsql we fetch limited > rows at-a-time, but we fetch till end like the case of > exec_stmt_return_query(). Yes, I think that those cases can be considered. Some careful code inspection will be needed to make sure the cases we want to enable are safe, and some testing will be needed to make sure they behave properly. But it doesn't sound like an unsolvable problem. I hope someone who isn't me will decide to work on it. :-) -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Tracking wait event for latches
On Thu, Sep 22, 2016 at 1:52 AM, Michael Paquier wrote: > Then comes the interesting bits: I don't think that it is a good idea > to associate only one event for a waiting point in the system views. > Say that at a waiting point WaitLatch is called with > WL_POSTMASTER_DEATH and WL_TIMEOUT, I'd rather let the user know that > both of them have been set up and not choose just one of them using > some hierarchy. But I rather think that we should list all of them > depending on the WL_* flags set up by the caller. Here comes a second > patch: let's use the last byte of the wait events and add this new > text[] column in pg_stat_activity, say wait_details. So for a waiting > point it would be possible to tell to the user that it is waiting on > '{socket,timeout,postmaster_death}' for example. I think this is focusing on the wrong thing. What we need here is not infinite detail on exactly what is going on under the hood but useful classification rules. For example, as I said in my email to Thomas, being able to exclude all cases of waiting for client I/O is useful because those aren't signs of a problem in the way that LWLock or Lock waits are. It's better for us to provide that classification using the existing system than for users to have to work out exactly which things ought to be excluded on the basis of specific event names. On the other hand, I submit that knowing which waits will be interrupted by the death of the postmaster and which will not doesn't add much. In fact, I think that's almost an anti-feature, because it will encourage users to care about details that are very rarely relevant. > Then comes a third patch: addition of a system view to list all the > activity happening for processes that are not dependent on > max_connections. pg_stat_activity has the advantage, as Tom mentioned > a couple of days ago, that one can just run count(*) on it to estimate > the number of connections left. I think this is quite helpful. 
Or we > could just add a column in pg_stat_activity to mark a process type: is > it a system process or not? This deserves its own discussion for sure. Sure. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Tracking wait event for latches
On Wed, Sep 21, 2016 at 5:49 PM, Thomas Munro wrote: >> Moreover, it's pretty confusing that we have this general concept of >> wait events in pg_stat_activity, and then here the specific type of >> wait event we're waiting for is the ... wait event kind. Uh, what? > > Yeah, that is confusing. It comes about because of the coincidence > that pg_stat_activity finished up with a wait_event column, and our IO > multiplexing abstraction finished up with the name WaitEventSet. > We could instead call these new things "wait > points", because, well, erm, they name points in the code at which we > wait. They don't name events (they just happen to use the > WaitEventSet mechanism, which itself does involve events: but those > events are things like "hey, this socket is now ready for > writing"). Sure, we could do that, but it means breaking backward compatibility for pg_stat_activity *again*. I'd be willing to do it but I don't think I'd win any popularity contests. >> I have to admit that I like the individual event names quite a bit, >> and I think the detail will be useful to users. But I wonder if >> there's a better way to describe the class of events that we're >> talking about that's not so dependent on internal data structures. >> Maybe we could divide these waits into a couple of categories - e.g. >> "Socket", "Timeout", "Process" - and then divide these detailed wait >> events among those classes. > > Well we already discussed a couple of different ways to get "Socket", > "Latch", "Socket|Latch", ... or something like that into the > wait_event_type column or new columns. Wouldn't that be better, and > automatically fall out of the code rather than needing human curated > categories? Although Michael suggested that that should be done as a > separate patch on top of the basic feature. I think making that a separate patch is just punting the decision down the field to a day that may never come. 
Let's try to agree on something that we can all live with and implement it now. In terms of avoiding human-curated categories, I basically see two options: 1. Find a name that is sufficiently generic that it covers everything that might reach WaitEventSetWait either now or in the future when it might wait for even more kinds of things than it does now. For example, we could call it "Stuff" or "Thing". Less tongue-in-cheek suggestions are welcome, but it's hard to come up with something that is sufficiently-generic without being tautological. "Event" is an example of a name which is general enough to encompass everything but also stupid: the column is called "wait_event" so everything that appears in it is an event by definition. 2. Classify the events that fall into this category by some rigid principle based on the types of things being awaited. For example, we could decide that if any sockets are awaited, the event class will be "Client" if it is connected to a user and "IPC" for auxiliary processes. For myself, I don't see any real problem with using humans to classify things; that's pretty much the one thing humans are much better at than computers, so we might as well take advantage of it. I think that it's useful to try to group together types of waits which the user will see as logically related to each other, even if that involves applying some human judgement that might someday lead to some discussion about what the best categorization for some new thing is. PostgreSQL is intended to be used by humans, and avoiding discussions (or even arguments) on pgsql-hackers shouldn't outrank usability on the list of concerns. So, I tried to classify these. Here's what I came up with. 
Activity: ArchiverMain, AutoVacuumMain, BgWriterMain, BgWriterHibernate, CheckpointerMain, PgStatMain, RecoveryWalAll, RecoveryWalStream, SysLoggerMain, WalReceiverMain, WalWriterMain Client: SecureRead, SecureWrite, SSLOpenServer, WalSenderMain, WalSenderWaitForWAL, WalSenderWriteData, WalReceiverWaitStart Timeout: BaseBackupThrottle, PgSleep, RecoveryApplyDelay IPC: BgWorkerShutdown, BgWorkerStartup, ExecuteGather, MessageQueueInternal, MessageQueuePutMessage, MessageQueueReceive, MessageQueueSend, ParallelFinish, ProcSignal, ProcSleep, SyncRep Extension: Extension I classified all of the main loop waits as waiting for activity; all of those are background processes that are waiting for something to happen and are more or less happy to sleep forever until it does. I also included the RecoveryWalAll and RecoveryWalStream events in there; those don't have the sort of "main loop" flavor of the others but they are happy to wait more or less indefinitely for something to occur. Likewise, it was pretty easy to find all of the events that were waiting for client I/O, and I grouped those all under "Client". A few of the remaining wait events seemed like they were clearly waiting for a particular timeout to expire, so I gave those their own "Timeout" category. I believe these categorizations are actually useful f
[HACKERS] Typo in libpq-int.h
Ran into a typo in libpq-int.h while reading/hacking:

- char *gsslib; /* What GSS librart to use ("gssapi" or
+ char *gsslib; /* What GSS library to use ("gssapi" or

Patch attached. cheers ./daniel typo-libpq-int.diff Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Executor's internal ParamExecData Params versus EvalPlanQual
I looked into the misbehavior reported in bug #14328, https://www.postgresql.org/message-id/20160919160136.1348.55...@wrigleys.postgresql.org What is happening is that during the EvalPlanQual recheck to see whether the updated row still satisfies the SELECT's quals, we run ExecInitCteScan and then CteScanNext in the EPQ-created copy of the plan tree. This should result in a fresh scan of the underlying CTE ... but it does not, because ExecInitCteScan sees the communication value estate->es_param_exec_vals[node->cteParam] already set and decides it is not a leader node, but rather a follower of the outer-query CTE Scan node that it's a doppelganger of. That one's already at EOF so the node->leader->eof_cte test fails in CteScanNext and it returns nothing. The reason this happens is that EvalPlanQualBegin copies down all the es_param_exec_vals values from the outer query into the EPQ execution state. That's OK for most uses of es_param_exec_vals values, but not at all for cteParam Params, which are used as internal communication variables within a plan tree. I believe that recursive unions are probably broken in EPQ rechecks in the same way, since they use a Param for similar intra-plan communication purposes, but haven't bothered to devise a test case to prove it. I think the most robust way to fix this is to explicitly mark es_param_exec_vals entries as to whether they should be copied down to the EPQ execution state. There is room in struct ParamExecData to fit a uint8 or uint16 field into existing pad space, so we could add a flags field and then use one bit of it for this purpose without causing any ABI breaks in the back branches. A simpler fix would be to decide that we never need to copy any es_param_exec_vals entries into the EPQ execution state. I think that that would work, but it would provoke extra evaluations of InitPlan subplans, and I'm hesitant to make such a significant behavioral change without a lot more analysis. Comments? 
regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] get_home_path: use HOME
Tom Lane wrote: > If we take this patch, what's to stop someone from complaining that we > broke *their* badly-designed system that abuses the HOME variable? POSIX warns against doing that, listing HOME among the variables that should be left to their intended usage: http://pubs.opengroup.org/onlinepubs/9699919799/ If the variables in the following two sections are present in the environment during the execution of an application or utility, they shall be given the meaning described below [...] HOME The system shall initialize this variable at the time of login to be a pathname of the user's home directory. psql is indirectly using $HOME already for readline and terminfo:

$ HOME=/tmp/home2 strace psql 2>tr ; grep home2 tr
...
stat("/tmp/home2/.terminfo", 0x7ff985bf4730) = -1 ENOENT (No such file or directory)
stat("/tmp/home2/.inputrc", 0x7fff3f641d70) = -1 ENOENT (No such file or directory)

Also, when using Debian's psql, the wrapper looks for its own config file in $HOME:

open("/tmp/home2/.postgresqlrc", O_RDONLY) = -1 ENOENT (No such file or directory)

Being written in Perl, it could use getpwuid(), but it doesn't, and I believe the same is true of the majority of programs that just want the home directory. +1 on using HOME, for consistency with other pieces of code around Postgres, and for the ease of locally overriding it when troubleshooting problems with dot files. Best regards, -- Daniel Vérité PostgreSQL-powered mailer: http://www.manitou-mail.org Twitter: @DanielVerite -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Parallel sec scan in plpgsql
On Tue, Sep 20, 2016 at 8:31 PM, Robert Haas wrote: > On Tue, Sep 20, 2016 at 9:24 AM, Amit Kapila wrote: >> I think here point is that for any case where there is count of rows >> to be selected, we disable parallelism. There are many genuine cases >> like select count(*) into cnt ... which will run to completion, but as >> plpgsql passes row count to be 1 or 2, it doesn't enter parallel mode. >> There are couple other cases like that. Do you see a reason for not >> enabling parallelism for such cases? > > If we can somehow know that the rowcount which is passed is greater > than or equal to the actual number of rows which will be generated, > then it's fine to enable parallelism. > I think for certain cases like into clause, the rows passed will be equal to actual number of rows, otherwise it will generate error. So we can pass that information in executor layer. Another kind of cases which are worth considering are when from plpgsql we fetch limited rows at-a-time, but we fetch till end like the case of exec_stmt_return_query(). -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] File system operations.
On Thursday, September 15, 2016 at 19:09:16 MSK, Tom Lane wrote: Robert Haas writes: On Thu, Sep 15, 2016 at 11:01 AM, Anastasia Lubennikova wrote: What do you think about moving stuff from pg_upgrade/file.c to storage/file/ to allow reuse of this code? I think it'll be really helpful for developers of contrib modules like backup managers. Well, storage/file is backend and pg_upgrade is frontend. If you want to reuse the same code for both it's got to go someplace else. Also, to the extent that it's important to use those wrappers rather than libc directly, it's because the wrappers are tied into backend resource management and error handling conventions. I don't see how you convert that into code that also works in a frontend program, or what the point would be exactly. Maybe we should move toward a framework ecosystem; I mean, developers of third-party tools should be able to use as much of the same API as the backend does. -- Yury Zhuravlev Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pgbench - compute & show latency consistently
On Wed, Sep 21, 2016 at 6:53 PM, Fabien COELHO wrote: > In front of the tps line. Well, the performance displayed could also be > improved... On my dual core SSD laptop I just got:
>
> sh> ./pgbench -c 10 -t 1000
> starting vacuum...end.
> transaction type:
> scaling factor: 100
> query mode: simple
> number of clients: 10
> number of threads: 1
> number of transactions per client: 1000
> number of transactions actually processed: 1/1
> latency average = 9.527 ms
> tps = 1049.665115 (including connections establishing)
> tps = 1049.890194 (excluding connections establishing)
>
> Which is about 10 times better.

Yes, you are right. The example in the documentation has not been updated across Postgres versions. Although it is just an example, it should reflect current performance. -- Thanks & Regards, Kuntal Ghosh EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Confusing docs about GetForeignUpperPaths in fdwhandler.sgml
On Fri, Sep 2, 2016 at 5:00 PM, Etsuro Fujita wrote: > Hi Robert, > > Thanks for the comments! > > On 2016/09/02 11:55, Robert Haas wrote: > >> On Mon, Aug 1, 2016 at 5:44 PM, Etsuro Fujita >> wrote: >> >>> I noticed that the following note about direct modification via >>> GetForeignUpperPaths in fdwhandler.sgml is a bit confusing. We have >>> another >>> approach using PlanDirectModify, so that should be reflected in the note >>> as >>> well. Please find attached a patch. >>> >>> PlanForeignModify and the other callbacks described in >>> are designed around the >>> assumption >>> that the foreign relation will be scanned in the usual way and then >>> individual row updates will be driven by a local >>> ModifyTable >>> plan node. This approach is necessary for the general case where an >>> update requires reading local tables as well as foreign tables. >>> However, if the operation could be executed entirely by the foreign >>> server, the FDW could generate a path representing that and insert >>> it >>> into the UPPERREL_FINAL upper relation, where it would >>> compete against the ModifyTable approach. >>> >> > I suppose this is factually correct, but I don't think it's very >> illuminating. I think that if we're going to document both >> approaches, there should be some discussion of the pros and cons of >> PlanDirectModify vs. PlanForeignModify. >> > > PlanDirectModify vs. GetForeignUpperPaths for an UPPERREL_FINAL upper > relation? > > Of course either should be >> better than an iterative ModifyTable, but how should the FDW author >> decide between the two of them? >> > > That would apply to row locking. We have two approaches for that too: > GetForeignRowMarkType and GetForeignUpperPaths, which is documented in the > same paragraph following the above documentation: > > This approach > could also be used to implement remote SELECT FOR UPDATE, > rather than using the row locking callbacks described in > . 
Keep in mind that a path > > The point of the patch is just to let the FDW author know that there is > another approach for implementing direct modification (ie, > PlanDirectModify) just as for implementing row locking. > > Considering that the primary objective of this patch is just to let the FDW author know that there is another approach for implementing direct modification, I like the idea of modifying the documentation. I agree that documentation about how the FDW author should decide > between the two would be helpful, but I'd like to leave that for future > work. > I performed a basic test with the patch: a) the patch applies cleanly on the latest source, and b) the documentation builds cleanly. Marking this as ready for committer. > > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers > -- Rushabh Lathia
Re: [HACKERS] Declarative partitioning - another take
On Thu, Sep 22, 2016 at 1:02 PM, Ashutosh Bapat wrote: > For list partitions, the ListInfo stores the index maps for values > i.e. the index of the partition to which the value belongs. Those > indexes are same as the indexes in partition OIDs array and come from > the catalogs. In case a user creates two partitioned tables with > exactly same lists for partitions but specifies them in a different > order, the OIDs are stored in the order specified. This means that > index array for these tables come out different. equal_list_info() > works around that by creating an array of mappings and checks whether > that mapping is consistent for all values. This means we will create > the mapping as many times as equal_list_info() is called, which is > expected to be more than the number of time > RelationBuildPartitionDescriptor() is called. Instead, if we > "canonicalise" the indexes so that they come out exactly same for > similarly partitioned tables, we build the mapping only once and > arrange OIDs accordingly. > > Here's patch to do that. I have ran make check with this and it didn't > show any failure. Please consider this to be included in your next set > of patches.

The patch has an if statement with an empty body:

+if (l1->null_index != l2->null_index);
    return false;

There shouldn't be a ';' at the end of the if line. It looks like, in the tests you have added, the function always bails out before it reaches this statement. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On Tue, Sep 20, 2016 at 4:26 PM, Ashutosh Bapat < ashutosh.ba...@enterprisedb.com> wrote: > PFA patch which takes care of some of the TODOs mentioned in my > previous mail. The patch is based on the set of patches supporting > declarative partitioning by Amit Langote posted on 26th August.

I have applied the declarative partitioning patches posted by Amit Langote on 26 Aug 2016 and then the latest partition-wise-join patch, and I am getting the error below during make install.

../../../../src/include/catalog/partition.h:37: error: redefinition of typedef ‘PartitionScheme’
../../../../src/include/nodes/relation.h:492: note: previous declaration of ‘PartitionScheme’ was here
make[4]: *** [commit_ts.o] Error 1
make[4]: Leaving directory `/home/edb/Desktop/edb_work/WORKDB/PG_PWJ/postgresql/src/backend/access/transam'
make[3]: *** [transam-recursive] Error 2
make[3]: Leaving directory `/home/edb/Desktop/edb_work/WORKDB/PG_PWJ/postgresql/src/backend/access'
make[2]: *** [access-recursive] Error 2
make[2]: Leaving directory `/home/edb/Desktop/edb_work/WORKDB/PG_PWJ/postgresql/src/backend'
make[1]: *** [all-backend-recurse] Error 2
make[1]: Leaving directory `/home/edb/Desktop/edb_work/WORKDB/PG_PWJ/postgresql/src'
make: *** [all-src-recurse] Error 2

PS: I am using gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-17).

I commented out the statement below in the src/include/catalog/partition.h file and then tried to install; it worked fine.

/* typedef struct PartitionSchemeData *PartitionScheme; */

Thanks & Regards, Rajkumar Raghuwanshi
Re: [HACKERS] README of hash index
On Fri, Sep 16, 2016 at 9:56 PM, Jeff Janes wrote: > On Fri, Sep 16, 2016 at 4:20 AM, Amit Kapila > wrote: >> >> Currently README of hash module contain algorithms written in below form. >> >> The insertion algorithm is rather similar: >> >> pin meta page and take buffer content lock in shared mode >> loop: >> compute bucket number for target hash key >> release meta page buffer content lock >> if (correct bucket page is already locked) >> break >> release any existing bucket page lock (if a concurrent split happened) >> take heavyweight bucket lock in shared mode >> retake meta page buffer content lock in shared mode >> -- (so far same as reader) >> release pin on metapage >> .. >> .. >> >> I have mostly updated them in the patches I have proposed to improve >> hash index. However, each time I try to update them, I find that it >> is easy to follow the code than to read and understand the existing >> algorithm written in above form from README. >> >> Do others find it useful to maintain the algorithms in above form? > > > I think that having them all condensed into one place makes it easier to > think through the locking models to decide if there might be races or > deadlocks. If you only care about the algorithm for inserting in isolation, > I agree reading the code might be better. > Thanks Jeff and Ken for the inputs. I will keep the algorithms updated in README of hash index. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Tuplesort merge pre-reading
On Thu, Sep 22, 2016 at 4:17 AM, Heikki Linnakangas wrote: > On 09/22/2016 03:40 AM, Claudio Freire wrote: >> >> On Tue, Sep 20, 2016 at 3:34 PM, Claudio Freire >> wrote: >>> >>> The results seem all over the map. Some regressions seem significant >>> (both in the amount of performance lost and their significance, since >>> all 4 runs show a similar regression). The worst being "CREATE INDEX >>> ix_lotsofitext_zz2ijw ON lotsofitext (z, z2, i, j, w);" with 4GB >>> work_mem, which should be an in-memory sort, which makes it odd. >>> >>> I will re-run it overnight just in case to confirm the outcome. >> >> >> A new run for "patched" gives better results, it seems it was some >> kind of glitch in the run (maybe some cron decided to do something >> while running those queries). >> >> Attached >> >> In essence, it doesn't look like it's harmfully affecting CPU >> efficiency. Results seem neutral on the CPU front. > > > Looking at the spreadsheet, there is a 40% slowdown in the "slow" "CREATE > INDEX ix_lotsofitext_zz2ijw ON lotsofitext (z, z2, i, j, w);" test with 4GB > of work_mem. I can't reproduce that on my laptop, though. Got any clue > what's going on there? It's not present in other runs, so I think it's a fluke the spreadsheet isn't filtering out. Especially considering that one should be a fully in-memory fast sort and thus unaffected by the current patch (z and z2 being integers, IIRC, most comparisons should be about comparing the first columns and thus rarely involve the big strings). I'll try to confirm that's the case though. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Declarative partitioning - another take
For list partitions, the ListInfo stores the index maps for values, i.e. the index of the partition to which the value belongs. Those indexes are the same as the indexes in the partition OIDs array and come from the catalogs. In case a user creates two partitioned tables with exactly the same lists for partitions but specifies them in a different order, the OIDs are stored in the order specified. This means that the index arrays for these tables come out different. equal_list_info() works around that by creating an array of mappings and checking whether that mapping is consistent for all values. This means we will create the mapping as many times as equal_list_info() is called, which is expected to be more than the number of times RelationBuildPartitionDescriptor() is called. Instead, if we "canonicalise" the indexes so that they come out exactly the same for similarly partitioned tables, we build the mapping only once and arrange the OIDs accordingly. Here's a patch to do that. I have run make check with this and it didn't show any failures. Please consider including this in your next set of patches. That helps partition-wise join as well. For partition-wise join (and further optimizations for partitioned tables), we create a list of canonical partition schemes. In this list, two similarly partitioned tables share a partition scheme pointer, and relations with the same partition scheme pointer can be joined partition-wise. It's important that the indexes in the partition scheme match the OIDs array, in order to find matching RelOptInfos for partition-wise join.
On Thu, Sep 22, 2016 at 11:12 AM, Ashutosh Bapat wrote: > Hi Amit, > Following sequence of DDLs gets an error > -- > -- multi-leveled partitions > -- > CREATE TABLE prt1_l (a int, b int, c varchar) PARTITION BY RANGE(a); > CREATE TABLE prt1_l_p1 PARTITION OF prt1_l FOR VALUES START (0) END > (250) PARTITION BY RANGE (b); > CREATE TABLE prt1_l_p1_p1 PARTITION OF prt1_l_p1 FOR VALUES START (0) END > (100); > CREATE TABLE prt1_l_p1_p2 PARTITION OF prt1_l_p1 FOR VALUES START > (100) END (250); > CREATE TABLE prt1_l_p2 PARTITION OF prt1_l FOR VALUES START (250) END > (500) PARTITION BY RANGE (c); > CREATE TABLE prt1_l_p2_p1 PARTITION OF prt1_l_p2 FOR VALUES START > ('0250') END ('0400'); > CREATE TABLE prt1_l_p2_p2 PARTITION OF prt1_l_p2 FOR VALUES START > ('0400') END ('0500'); > CREATE TABLE prt1_l_p3 PARTITION OF prt1_l FOR VALUES START (500) END > (600) PARTITION BY RANGE ((b + a)); > ERROR: cannot use column or expression from ancestor partition key > > The last statement is trying create subpartitions by range (b + a), > which contains a partition key from ancestor partition key but is not > exactly same as that. In fact it contains some extra columns other > than the ancestor partition key columns. Why do we want to prohibit > such cases? > > On Thu, Sep 15, 2016 at 2:23 PM, Amit Langote > wrote: >> >> Hi >> >> On 2016/09/09 17:55, Amit Langote wrote: >>> On 2016/09/06 22:04, Amit Langote wrote: Will fix. >>> >>> Here is an updated set of patches. >> >> An email from Rajkumar somehow managed to break out of this thread. >> Quoting his message below so that I don't end up replying with patches on >> two different threads. >> >> On 2016/09/14 16:58, Rajkumar Raghuwanshi wrote: >>> I have Continued with testing declarative partitioning with the latest >>> patch. Got some more observation, given below >> >> Thanks a lot for testing. >> >>> -- Observation 1 : Getting overlap error with START with EXCLUSIVE in range >>> partition. 
>>>
>>> create table test_range_bound ( a int) partition by range(a);
>>> --creating a partition to contain records {1,2,3,4}, by default 1 is
>>> inclusive and 5 is exclusive
>>> create table test_range_bound_p1 partition of test_range_bound for values
>>> start (1) end (5);
>>> --now trying to create a partition by explicitly mentioning start is
>>> exclusive to contain records {5,6,7}, here trying to create with START with
>>> 4 as exclusive so range should be 5 to 8, but getting partition overlap
>>> error.
>>> create table test_range_bound_p2 partition of test_range_bound for values
>>> start (4) EXCLUSIVE end (8);
>>> ERROR: partition "test_range_bound_p2" would overlap partition
>>> "test_range_bound_p1"
>>
>> Wow, this is bad. What is needed in this case is "canonicalization" of
>> the range partition bounds specified in the command. Range types do this
>> and hence an equivalent test done with range type values would disagree
>> with the result given by the patch for range partition bounds.
>>
>> select '[1,5)'::int4range && '(4,8]'::int4range as cmp;
>>  cmp
>> -----
>>  f
>> (1 row)
>>
>> In this case, the second range is converted into its equivalent canonical
>> form viz. '[5, 9)'. Then comparison of bounds 5) and [5 can tell that the
>> ranges do not overlap after all. Range type operators can do this because
>> their code can rely on the availability of a canonicalization function for
>> a given range type. From the range type
Re: [HACKERS] PL/Python adding support for multi-dimensional arrays
Hi

2016-09-21 19:53 GMT+02:00 Dave Cramer :
>
> On 18 September 2016 at 09:27, Dave Cramer wrote:
>
>>
>> On 10 August 2016 at 01:53, Pavel Stehule
>> wrote:
>>
>>> Hi
>>>
>>> 2016-08-03 13:54 GMT+02:00 Alexey Grishchenko :
>>> On Wed, Aug 3, 2016 at 12:49 PM, Alexey Grishchenko
>>> <agrishche...@pivotal.io> wrote:
> Hi
>
> The current implementation of PL/Python does not allow the use of
> multi-dimensional arrays, for both input and output parameters. This
> forces end users to introduce workarounds like casting arrays to text
> before passing them to the functions and parsing them after, which is
> an error-prone approach.
>
> This patch adds support for multi-dimensional arrays as both input and
> output parameters for PL/Python functions. The number of dimensions
> supported is limited by the Postgres MAXDIM macro, by default equal
> to 6. Both input and output multi-dimensional arrays must have fixed
> dimension sizes, i.e. 2-d arrays should represent an MxN matrix, 3-d
> arrays an MxNxK cube, etc.
>
> This patch does not support multi-dimensional arrays of composite
> types, as composite types in Python might be represented as iterators
> and there is no obvious way to find out where the nested array stops
> and the composite type structure starts. For example, if we have a
> composite type of (int, text), we can try to return
> "[ [ [1,'a'], [2,'b'] ], [ [3,'c'], [4,'d'] ] ]", and it is hard to
> find out that the first two lists are lists, and the third one
> represents a structure. Things get even more complex when you have
> arrays as members of a composite type. This is why I think this
> limitation is reasonable.
>
> Given the function:
>
> CREATE FUNCTION test_type_conversion_array_int4(x int4[]) RETURNS
> int4[] AS $$
> plpy.info(x, type(x))
> return x
> $$ LANGUAGE plpythonu;
>
> Before patch:
>
> # SELECT * FROM test_type_conversion_array_int4(ARRAY[[1,2,3],[4,5,6]]);
> ERROR: cannot convert multidimensional array to Python list
> DETAIL: PL/Python only supports one-dimensional arrays.
> CONTEXT: PL/Python function "test_type_conversion_array_int4"
>
> After patch:
>
> # SELECT * FROM test_type_conversion_array_int4(ARRAY[[1,2,3],[4,5,6]]);
> INFO: ([[1, 2, 3], [4, 5, 6]], )
>  test_type_conversion_array_int4
> ---------------------------------
>  {{1,2,3},{4,5,6}}
> (1 row)
>
> --
> Best regards,
> Alexey Grishchenko
>
> Also this patch incorporates the fix for
> https://www.postgresql.org/message-id/CAH38_tkwA5qgLV8zPN1OpPzhtkNKQb30n3xq-2NR9jUfv3qwHA%40mail.gmail.com,
> as they touch the same piece of code: array manipulation in PL/Python.

>>> I am sending a review of this patch:
>>>
>>> 1. The implemented functionality is a clear benefit: passing MD arrays,
>>> and considerably faster passing of bigger arrays
>>> 2. I was able to apply this patch cleanly without any errors or warnings
>>> 3. There is no error or warning
>>> 4. All tests passed; I tested Python 2.7 and Python 3.5
>>> 5. The code is well commented and clean
>>> 6. For this new functionality, documentation is not necessary
>>>
>>> 7. I would like more regression tests for both directions
>>> (Python <-> Postgres) for more than two dimensions
>>>
>>> My only objection is not enough regression tests; after fixing this,
>>> the patch will be ready for committers.
>>
>> Now the tests are enough, so I'll mark this patch as ready for
>> committers. I had to fix the tests: there was a lot of whitespace, and
>> the result for Python 3 was missing.

Regards
Pavel

>>> Good work, Alexey
>>>
>>> Thank you
>>>
>>> Regards
>>>
>>> Pavel
>>>
>>> --
>>> Best regards,
>>> Alexey Grishchenko

>>
>> Pavel,
>>
>> I will pick this up.
>>
>> Pavel,
>
> Please see attached patch which provides more test cases.
>
> I just realized this patch contains the original patch as well. What is
> the protocol for sending in subsequent patches?
>
>> Dave Cramer
>>
>> da...@postgresintl.com
>> www.postgresintl.com

diff --git a/src/pl/plpython/expected/plpython_types.out b/src/pl/plpython/expected/plpython_types.out
index f0b6abd..296ef8b 100644
--- a/src/pl/plpython/expected/plpython_types.out
+++ b/src/pl/plpython/expected/plpython_types.out
@@ -537,9 +537,126 @@ INFO: (None, )
 (1 row)
 SELECT * FROM test_type_conversion_array_int4(ARRAY[[1,2,3],[4,5,6]]);
-ERROR:  cannot convert multidimensional array to Python list
-DETAIL:  PL/Python only supports one-dimensional arrays.
-CONTEXT:  PL/Python function "test_type_conversion_array_int4"
+INFO:  ([[1, 2, 3], [4, 5, 6]], )
+ test_type_conversion_array_int4
+---------------------------------
+ {{1,2,3},{4,5,6}}
+(1 row)
+
+SELECT * FROM test_type_conversion_array_int4(ARRAY[[[1,2,NULL],[NULL,
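As an aside for readers, the "fixed dimension sizes" rule from Alexey's description can be sketched in plain Python. This is a hypothetical helper for illustration, not the patch's C implementation: a nested list is accepted only if it is rectangular at every depth (every sublist at the same level has the same length) and no deeper than MAXDIM, which defaults to 6 in stock PostgreSQL.

```python
# Sketch of the shape check a multi-dimensional array conversion needs:
# ragged nested lists have no valid Postgres array representation.
# Hypothetical helper, not PL/Python source code.

MAXDIM = 6  # PostgreSQL's default maximum array dimensionality

def array_dims(value):
    """Return the dimension sizes of a rectangular nested list,
    raising ValueError for ragged input or too many dimensions."""
    dims = []
    while isinstance(value, list):
        if len(dims) >= MAXDIM:
            raise ValueError("number of dimensions exceeds MAXDIM")
        dims.append(len(value))
        # every sibling at this level must match the first one's length
        if value and isinstance(value[0], list):
            width = len(value[0])
            if any(not isinstance(v, list) or len(v) != width
                   for v in value):
                raise ValueError("multidimensional arrays must have "
                                 "sub-arrays with matching dimensions")
        value = value[0] if value else None
    return dims

print(array_dims([[1, 2, 3], [4, 5, 6]]))  # [2, 3], i.e. a 2x3 matrix
```

The same ambiguity Alexey mentions for composite types shows up here: a list like [1, 'a'] could be either a one-dimensional array of mixed values or a row of a composite type, and a shape check alone cannot tell them apart.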
Re: [HACKERS] Tuplesort merge pre-reading
On 09/22/2016 03:40 AM, Claudio Freire wrote:
> On Tue, Sep 20, 2016 at 3:34 PM, Claudio Freire wrote:
>> The results seem all over the map. Some regressions seem significant
>> (both in the amount of performance lost and their significance, since
>> all 4 runs show a similar regression). The worst being "CREATE INDEX
>> ix_lotsofitext_zz2ijw ON lotsofitext (z, z2, i, j, w);" with 4GB
>> work_mem, which should be an in-memory sort, which makes it odd. I
>> will re-run it overnight just in case to confirm the outcome.
>
> A new run for "patched" gives better results, it seems it was some kind
> of glitch in the run (maybe some cron decided to do something while
> running those queries). Attached.
>
> In essence, it doesn't look like it's harmfully affecting CPU
> efficiency. Results seem neutral on the CPU front.

Looking at the spreadsheet, there is a 40% slowdown in the "slow"
"CREATE INDEX ix_lotsofitext_zz2ijw ON lotsofitext (z, z2, i, j, w);"
test with 4GB of work_mem. I can't reproduce that on my laptop, though.
Got any clue what's going on there?

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers