Re: New function pg_stat_statements_reset_query() to reset statistics of a specific query

2018-12-18 Thread Amit Kapila
On Sat, Dec 15, 2018 at 12:14 AM Robert Haas  wrote:
>
> On Wed, Nov 28, 2018 at 1:43 PM Alvaro Herrera  
> wrote:
> > Right, I think option 4 is a clear improvement over option 1.  I can get
> > behind that one.  Since not many people care to vote, I think this tips
> > the scales enough to that side.
>
> I'm showing up very late to the party here, but I like option 1 best.
> I feel like the SQL standard has a pretty clear idea that NULL is how
> you represent a value is unknown, which shows up in a lot of places.
> Deciding that we're going to use a different sentinel value in this
> one case because NULL is a confusing concept in general seems pretty
> strange to me.
>

The proposed patch uses option-4.  I am not sure whether you are
completely convinced, but I hope that you won't mind too much
if we pursue option-4.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Tab completion for ALTER INDEX|TABLE ALTER COLUMN SET STATISTICS

2018-12-18 Thread Tatsuro Yamada

On 2018/12/19 14:27, Michael Paquier wrote:

On Tue, Dec 04, 2018 at 02:31:51PM +0900, Tatsuro Yamada wrote:

  - For now, there is no column_number column in the "\d index_name" output.
    So, if tab completion suggested column_numbers, users couldn't match
    them easily.


Well, it depends how many columns an index definition has.  If that's
only a few then it is not really a problem.  However I agree that we
could add that in the output of \d for indexes just before the
definition to help with an easy match.  The columns showing up are
relkind-dependent so that's not an issue.  This would create some noise
in some regression tests.


I see.
Hopefully, I'll create a new patch adding column_number to \d for indexes.



So, we had better vote to decide the implementation of the tab completion.

Which is better to use: column_names or column_numbers?
I'd like to hear opinions from others. :)


There has been a discussion in this area this week where the conclusion
is that we had better use column numbers rather than names arbitrarily
generated by the server for pg_dump:
https://postgr.es/m/CAARsnT3UQ4V=ydnw468w8rqhfyiy9mpn2r_c5ukbj97naap...@mail.gmail.com

So my take on the matter is to use column numbers, and show only those
with an expression, as index attributes with no expressions cannot have
statistics.


Agreed.
I'll revise the patch to replace column_name with column_number.



Looking at the patches, fix_manual_of_alter_index.patch does not matter
much as the documentation of ALTER INDEX already mentions some
compatibilities with ALTER TABLE.


Hmm... I was confused because these parameters are not the same. Please see below:


  https://www.postgresql.org/docs/11/sql-altertable.html

ALTER [ COLUMN ] column_name SET STATISTICS integer

  https://www.postgresql.org/docs/current/sql-alterindex.html

ALTER [ COLUMN ] column_number SET STATISTICS integer


If we can use both column_name and column_number as the parameter, it would be
better to describe both in the documentation.



+   /* ALTER TABLE ALTER [COLUMN]  SET STATISTICS */
+   else if (HeadMatches("ALTER", "TABLE") && TailMatches("SET", 
"STATISTICS")){
+   /* We don't complete after "SET STATISTICS" */
+   }
Okay, this one makes sense taken independently as the current completion
is confusing.  Could you also do the same for ALTER INDEX?


Yep, I already did that for ALTER INDEX in 
tab_completion_alter_index_set_statistics.patch like this:

+   /* ALTER INDEX  ALTER COLUMN  SET STATISTICS */
+   else if (HeadMatches("ALTER", "INDEX") && TailMatches("SET", "STATISTICS")){
+   /* We don't complete after "SET STATISTICS" */
+   }


Thanks,
Tatsuro Yamada




Re: [HACKERS] CLUSTER command progress monitor

2018-12-18 Thread Tatsuro Yamada

Hi Alvaro,

On 2018/12/19 2:23, Alvaro Herrera wrote:

Hello Yamada-san,

On 2018-Dec-03, Tatsuro Yamada wrote:


Thank you for managing the CF, and sorry for the late reply.
I'll rebase it for the next CF, and I'll also clear my head, because the patch
needs a design change to address the feedback, I guess. Therefore, the status is
"reconsidering the design of the patch". :)


Do you have a new version of this patch?  If not, do you think you'll
have something in time for the upcoming commitfest?


Not yet; I'll only be able to send a rebased patch by the end of this month.
I mean it has no design change, because I haven't yet worked out how to get progress
information from the sort and the index scan. However, I'm going to register the patch in the next CF.
I'm happy if you are interested in the patch. :)

Thanks,
Tatsuro Yamada







Re: ToDo: show size of partitioned table

2018-12-18 Thread Pavel Stehule
On Tue, Dec 18, 2018 at 8:49 AM Amit Langote <
langote_amit...@lab.ntt.co.jp> wrote:

> Hi,
>
> Thank you for updating the patch.
>
> On 2018/12/17 17:48, Pavel Stehule wrote:
> > new update of this patch
>
> Documentation portion of this patch still contains some typos that I
> mentioned before here:
>
>
> https://www.postgresql.org/message-id/1c83bb5c-47cd-d796-226c-e95795b05551%40lab.ntt.co.jp
>
> + .. If the form \dP+
> +is used, the sum of size of related partitions (including the
> +table and indexes, if any) and a description
> +are also displayed.
>
> + ... If the form \dPi+
> +is used, the sum of size of related indexes and a description
> +are also displayed.
>
> + ... If the form \dPt+
> +is used, the sum of size of related tables and a description
> +are also displayed.
>
> In all of the three hunks:
>
> the sum of size of -> the sum of "sizes" of
>
> and a description -> and associated description
>

fixed


>
> > changes:
> >
> > 1. only root partitioned tables are displayed - you can quickly see the total
> > allocated space. It is not duplicated due to nested partitions.
>
> +1
>
> If one wants to see a non-root partitioned table's details, they can use
> \dP+ .
>
> > I can imagine new additional flag - line "n" nested - and then we can
> > display nested partitioned tables with parent table info. Some like
> >
> > \dPt - show only root partition tables
> > \dPnt or \dPtn - show root and nested partitioned tables
>
> Too much complication maybe?
>

I wrote it - the setup query is more complex, but not too much. I also fixed the
size calculation when nested partitioned tables are visible - it counts
partitions only from the level-1 group, so the displayed size is the same as the
total size.

postgres=# \dP+
            List of partitioned relations
┌────────┬────────────┬───────┬────────┬─────────────┐
│ Schema │    Name    │ Owner │  Size  │ Description │
╞════════╪════════════╪═══════╪════════╪═════════════╡
│ public │ parent_tab │ pavel │ 120 kB │             │
└────────┴────────────┴───────┴────────┴─────────────┘
(1 row)

postgres=# \dPn+
                 List of partitioned relations
┌────────┬─────────────┬───────┬─────────────┬───────┬─────────────┐
│ Schema │    Name     │ Owner │ Parent name │ Size  │ Description │
╞════════╪═════════════╪═══════╪═════════════╪═══════╪═════════════╡
│ public │ child_30_40 │ pavel │ parent_tab  │ 48 kB │             │
│ public │ parent_tab  │ pavel │             │ 72 kB │             │
└────────┴─────────────┴───────┴─────────────┴───────┴─────────────┘
(2 rows)

postgres=# \dPn+ *
                            List of partitioned relations or indexes
┌────────┬────────────────────┬───────┬───────────────────┬──────────────┬─────────────┬───────┬─────────────┐
│ Schema │        Name        │ Owner │       Type        │ Parent name  │  On table   │ Size  │ Description │
╞════════╪════════════════════╪═══════╪═══════════════════╪══════════════╪═════════════╪═══════╪═════════════╡
│ public │ child_30_40        │ pavel │ partitioned table │ parent_tab   │             │ 16 kB │             │
│ public │ child_30_40_id_idx │ pavel │ partitioned index │ parent_index │ child_30_40 │ 32 kB │             │
│ public │ parent_index       │ pavel │ partitioned index │              │ parent_tab  │ 48 kB │             │
│ public │ parent_tab         │ pavel │ partitioned table │              │             │ 24 kB │             │
└────────┴────────────────────┴───────┴───────────────────┴──────────────┴─────────────┴───────┴─────────────┘
(4 rows)




> > 2. \dP without a pattern shows root partitioned tables + total relation size.
> > When a pattern is given, it shows tables and indexes + table size.
> >
> > postgres=# \dP+
> >             List of partitioned relations
> > ┌────────┬────────────┬───────┬────────┬─────────────┐
> > │ Schema │    Name    │ Owner │  Size  │ Description │
> > ╞════════╪════════════╪═══════╪════════╪═════════════╡
> > │ public │ parent_tab │ pavel │ 120 kB │             │
> > └────────┴────────────┴───────┴────────┴─────────────┘
> > (1 row)
> >
> > postgres=# \dP+ *
> >                  List of partitioned relations or indexes
> > ┌────────┬──────────────┬───────┬───────────────────┬────────────┬───────┬─────────────┐
> > │ Schema │     Name     │ Owner │       Type        │   Table    │ Size  │ Description │
> > ╞════════╪══════════════╪═══════╪═══════════════════╪════════════╪═══════╪═════════════╡
> > │ public │ parent_index │ pavel │ partitioned index │ parent_tab │ 80 kB │             │
> > │ public │ parent_tab   │ pavel │ partitioned table │            │ 40 kB │             │
> > └────────┴──────────────┴───────┴───────────────────┴────────────┴───────┴─────────────┘
> > (2 rows)
> >
> > postgres=# \dP+ *index
> >                  List of partitioned relations or indexes
> > ┌────────┬──────────────┬───────┬───────────────────┬────────────┬───────┬─────────────┐
> 

Re: plpgsql plugin - stmt_beg/end is not called for top level block of statements

2018-12-18 Thread Pavel Stehule
On Wed, Dec 19, 2018 at 6:45 AM Michael Paquier 
wrote:

> On Sun, Dec 16, 2018 at 10:33:38AM +0100, Pavel Stehule wrote:
> > Now, the statement hook is not called for every plpgsql_stmt_block
> > statement. It is not a big issue, but it is not consistent - and this
> > inconsistency has to be worked around inside the extension. Better to be
> > consistent and have every plpgsql statement handled identically.
> >
> > Patch attached - all regression tests pass. This patch has an effect only
> > on plpgsql extensions.
>
> I can see the inconsistency in the code; still, do you have a simple
> plpgsql extension where it is possible to see the difference in
> behavior?  This involves execution of functions, triggers and event
> triggers, and exec_stmt_block has been used since the beginning of time.
>

As far as I know, there are only a few plpgsql extensions - 3 variations on
debugger support, 1 profiler, and plpgsql_check. This was not an issue before,
because those extensions don't hit this problem.

I found it when I wrote a coverage check for plpgsql_check. The outer BEGIN
has the properties of a usual plpgsql statement - a positive lineno - but its
hook is never executed.

I can imagine some tracking extension that does some initialization in the
plpgsql_stmt_block statement hook - but for the most important block it would
never be called.
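
To make this concrete, such a tracking extension would look roughly like the
following (a minimal, untested sketch; the function and variable names here are
invented). With the current behaviour its stmt_beg hook fires for every
statement except the function's outermost BEGIN block:

#include "postgres.h"
#include "fmgr.h"
#include "plpgsql.h"

PG_MODULE_MAGIC;

void _PG_init(void);

/* log every statement we are told about, with its line number */
static void
trace_stmt_beg(PLpgSQL_execstate *estate, PLpgSQL_stmt *stmt)
{
    elog(LOG, "entering plpgsql statement at line %d", stmt->lineno);
}

static PLpgSQL_plugin plugin_funcs = {NULL, NULL, NULL, trace_stmt_beg, NULL};

void
_PG_init(void)
{
    PLpgSQL_plugin **plugin_ptr;

    /* register with plpgsql via its rendezvous variable */
    plugin_ptr = (PLpgSQL_plugin **) find_rendezvous_variable("PLpgSQL_plugin");
    *plugin_ptr = &plugin_funcs;
}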

Regards

Pavel







--
> Michael
>


Re: plpgsql plugin - stmt_beg/end is not called for top level block of statements

2018-12-18 Thread Michael Paquier
On Sun, Dec 16, 2018 at 10:33:38AM +0100, Pavel Stehule wrote:
> Now, the statement hook is not called for every plpgsql_stmt_block
> statement. It is not a big issue, but it is not consistent - and this
> inconsistency has to be worked around inside the extension. Better to be consistent
> and have every plpgsql statement handled identically.
> 
> Patch attached - all regression tests pass. This patch has an effect only on
> plpgsql extensions.

I can see the inconsistency in the code; still, do you have a simple
plpgsql extension where it is possible to see the difference in
behavior?  This involves execution of functions, triggers and event
triggers, and exec_stmt_block has been used since the beginning of time.
--
Michael




Re: Tab completion for ALTER INDEX|TABLE ALTER COLUMN SET STATISTICS

2018-12-18 Thread Michael Paquier
On Tue, Dec 04, 2018 at 02:31:51PM +0900, Tatsuro Yamada wrote:
>  - For now, there is no column_number column in the "\d index_name" output.
>    So, if tab completion suggested column_numbers, users couldn't match
>    them easily.

Well, it depends how many columns an index definition has.  If that's
only a few then it is not really a problem.  However I agree that we
could add that in the output of \d for indexes just before the
definition to help with an easy match.  The columns showing up are
relkind-dependent so that's not an issue.  This would create some noise
in some regression tests.

> So, we had better vote to decide the implementation of the tab completion.
> 
> Which is better to use: column_names or column_numbers?
> I'd like to hear opinions from others. :)

There has been a discussion in this area this week where the conclusion
is that we had better use column numbers rather than names arbitrarily
generated by the server for pg_dump:
https://postgr.es/m/CAARsnT3UQ4V=ydnw468w8rqhfyiy9mpn2r_c5ukbj97naap...@mail.gmail.com

So my take on the matter is to use column numbers, and show only those
with an expression, as index attributes with no expressions cannot have
statistics.

Looking at the patches, fix_manual_of_alter_index.patch does not matter
much as the documentation of ALTER INDEX already mentions some
compatibilities with ALTER TABLE.

+   /* ALTER TABLE ALTER [COLUMN]  SET STATISTICS */
+   else if (HeadMatches("ALTER", "TABLE") && TailMatches("SET", 
"STATISTICS")){
+   /* We don't complete after "SET STATISTICS" */
+   }
Okay, this one makes sense taken independently as the current completion
is confusing.  Could you also do the same for ALTER INDEX?
--
Michael




Re: [PATCH] Improve tab completion for CREATE TABLE

2018-12-18 Thread Michael Paquier
On Fri, Nov 30, 2018 at 03:44:38PM +, Dagfinn Ilmari Mannsåker wrote:
> ilm...@ilmari.org (Dagfinn Ilmari Mannsåker) writes:
>> Please find attached a patch that adds the following tab completions for
>> CREATE TABLE:
> 
> Added to the 2019-01 commitfest: https://commitfest.postgresql.org/21/1895/

+   else if (TailMatches("CREATE", "TABLE", MatchAny) ||
+TailMatches("CREATE", "TEMP|TEMPORARY|UNLOGGED", "TABLE", 
MatchAny))
+   COMPLETE_WITH("(", "PARTITION OF");
This is missing the completion of "OF TYPE" (no need to print the type
matches).
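
For instance, the branch could be extended roughly along these lines (just a
sketch of the idea, untested):

    else if (TailMatches("CREATE", "TABLE", MatchAny) ||
             TailMatches("CREATE", "TEMP|TEMPORARY|UNLOGGED", "TABLE", MatchAny))
        COMPLETE_WITH("(", "OF", "PARTITION OF");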

+   else if (TailMatches("CREATE", "TABLE", MatchAny, "(*)") ||
+TailMatches("CREATE", "UNLOGGED", "TABLE", MatchAny, "(*)"))
+   COMPLETE_WITH("INHERITS (", "PARTITION BY", "WITH (", "TABLESPACE")
Temporary tables can be part of a partition tree, with the parent being
temporary or permanent.

+   /* Complete ON COMMIT actions for temp tables */
+   else if (TailMatches("ON", "COMMIT"))
+   COMPLETE_WITH("PRESERVE ROWS", "DELETE ROWS", "DROP");
This causes ON COMMIT to show up for permanent and unlogged tables.
--
Michael




Re: pg_dumpall --exclude-database option

2018-12-18 Thread Michael Paquier
On Fri, Nov 30, 2018 at 04:26:41PM -0500, Andrew Dunstan wrote:
> On 11/18/18 1:41 PM, Andrew Dunstan wrote:
>> On 11/17/18 9:55 AM, Alvaro Herrera wrote:
>>> In the long run, I think we should add an option to processSQLNamePattern
>>> to use OR instead of AND, which would fix both this problem as well as
>>> pg_dump's.  I don't think that's important enough to stall this patch.

Agreed.  This patch is useful in itself.  This option would be nice to
have, and this routine interface would begin to grow too many boolean
switches to my taste so I'd rather use some flags instead.

The patch does its job; however, I have spotted an issue in the
format of the generated dumps.  Each time an excluded database is
processed, its set of SET queries (from _doSetFixedOutputState) as well
as the header "PostgreSQL database dump" still gets generated.  I think that
this data should not show up.
--
Michael




Re: Use an enum for RELKIND_*?

2018-12-18 Thread Andres Freund
Hi,

On 2018-12-18 23:17:54 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > It'd be nice if there were an easy way to write a switch() where the
> > compiler enforces that all enum values are checked, but still had the
> > possibility to have a 'default:' block for error checking... I can't
> > quite come up with a good approach to emulate that though.
> 
> Yeah, that would sure make things better.  I was considering
> something like
> 
>   switch (enumvalue)
>   {
>   case A: ...
>   case B: ...
>   ...
> 
> #ifndef USE_ASSERT_CHECKING
>   default:
>   elog(ERROR, ...);
> #endif
>   }
> 
> so that you get the runtime protection in production builds but not
> debug builds ... except that yes, you really want that protection in
> debug builds too.  Maybe the #if could be on some other symbol so that
> the default: is normally enabled in all builds, but we have some
> lonely buildfarm animal that disables it and builds with -Werror to
> get our attention for omitted cases?

Yea, that's the best I can come up with too.  I think we'd probably want
to have the warning available during development mainly, so I'm not sure
the "lonely buildfarm animal" approach buys us enough.


There's -Wswitch-enum, which causes a warning to be emitted for missing enum
cases even if there's a default:.  Obviously that's not generally usable,
because there are a lot of cases where we intentionally do not want to be
exhaustive. We could just cast to int in those, or locally disable the
warning, but neither sounds particularly enticing...
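
For reference, the cast-to-int escape hatch would look something like this
(hypothetical snippet; do_index_stuff() is just a placeholder):

    /* intentionally non-exhaustive; the cast keeps -Wswitch-enum quiet */
    switch ((int) rel->rd_rel->relkind)
    {
        case RELKIND_INDEX:
        case RELKIND_PARTITIONED_INDEX:
            do_index_stuff();
            break;
        default:
            break;
    }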

Greetings,

Andres Freund



Re: Use an enum for RELKIND_*?

2018-12-18 Thread Tom Lane
Andres Freund  writes:
> It'd be nice if there were an easy way to write a switch() where the
> compiler enforces that all enum values are checked, but still had the
> possibility to have a 'default:' block for error checking... I can't
> quite come up with a good approach to emulate that though.

Yeah, that would sure make things better.  I was considering
something like

switch (enumvalue)
{
case A: ...
case B: ...
...

#ifndef USE_ASSERT_CHECKING
default:
elog(ERROR, ...);
#endif
}

so that you get the runtime protection in production builds but not
debug builds ... except that yes, you really want that protection in
debug builds too.  Maybe the #if could be on some other symbol so that
the default: is normally enabled in all builds, but we have some
lonely buildfarm animal that disables it and builds with -Werror to
get our attention for omitted cases?

regards, tom lane



What to name the current heap after pluggable storage / what to rename?

2018-12-18 Thread Andres Freund
Hi,

The current pluggable table storage patchset [1] introduces the ability
to specify the access method of a table (CREATE TABLE ... USING
"ident"). The patchset currently names the current storage method
(i.e. heapam.c et al) "heap" (whereas e.g. zheap's is named, drumroll,
zheap).

I'm concerned that naming it heap, and the corresponding functions
fitting with that name, will be confusing. There's a lot of functions
that have a heap_ prefix (or similar) that aren't really dependent on
how the storage works, but much more general infrastructure (consider
e.g. heap_create_with_catalog()).

One solution is to just live with the confusion, add a few header
comments to files like src/backend/catalog/heap.c, explaining that
they're more general than heapam.c (and in the patch heapam_handler.c,
which mostly dispatches to heapam.c).

Another approach would be to come up with a different, potentially
witty, name for the current postgres table access method. Then either
rename a few of the heapam.c et al functions, or live with the slightly
more confusing names.

Another would be to be aggressive in renaming, and deconflict by
renaming functions like heap_create[_with_catalog] etc to sound more
accurate. I think that has some appeal, because a lot of those names
aren't describing their tasks particularly well.

Greetings,

Andres Freund

[1] https://postgr.es/m/20180703070645.wchpu5muyto5n647%40alap3.anarazel.de



Re: Use an enum for RELKIND_*?

2018-12-18 Thread Andres Freund
Hi,

On 2018-12-18 22:50:57 -0500, Tom Lane wrote:
> Kyotaro HORIGUCHI  writes:
> > At Tue, 18 Dec 2018 17:13:08 -0800, Andres Freund  
> > wrote in <20181219011308.mopzyvc73nwjz...@alap3.anarazel.de>
> >> Right now there's no easy way to use the compiler to ensure that all
> >> places that need to deal with all kinds of relkinds check a new
> >> relkind.  I think we should make that easier by moving RELKIND_* to an
> >> enum, with the existing letters as the value.
> 
> > I think we cannot use enums with a specified base type, so it will work
> > only as long as we don't forget the cast within switch(). However, I don't think
> > that is a reason not to do so.
> >
> > switch ((RelationRelkind) rel->rd_rel->relkind)
> 
> Hm.  This example doesn't seem very compelling.  If we put a
> "default: elog(ERROR...)" into the switch, compilers will not warn
> us about omitted values.  On the other hand, if we remove the default:
> then we lose robustness against corrupted relkind values coming from disk,
> which I think is going to be a net negative.

I don't quite see the from-disk bit being particularly critical, that
seems like a fairly minor safety net. Most of those switches are at
places considerably later than where on-disk corruption normally would
be apparent.  If we really are concerned with that, it'd not be too hard
to add a relkind_as_enum() wrapper function that errored out on an
impermissible value.
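
Sketching what I mean (hypothetical code, with RelationRelkind being the enum
proposed upthread):

static inline RelationRelkind
relkind_as_enum(char relkind)
{
    switch (relkind)
    {
        case RELKIND_RELATION:
        case RELKIND_INDEX:
        case RELKIND_SEQUENCE:
        case RELKIND_TOASTVALUE:
        case RELKIND_VIEW:
        case RELKIND_MATVIEW:
        case RELKIND_COMPOSITE_TYPE:
        case RELKIND_FOREIGN_TABLE:
        case RELKIND_PARTITIONED_TABLE:
        case RELKIND_PARTITIONED_INDEX:
            return (RelationRelkind) relkind;
    }
    elog(ERROR, "unexpected relkind: %d", (int) relkind);
    return (RelationRelkind) 0; /* keep compiler quiet */
}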


> Not to mention that we get no help at all unless we remember to add
> the cast to every such switch.

That doesn't seem too bad. A manual search and replace would probably
find the majority within an hour or two of effort.


> So, while I understand Andres' desire to make this more bulletproof,
> I'm not quite sure how this proposal helps in practice.

We've repeatedly forgotten to add new relkinds for checks that we
already have. Manually adding the cast won't be bulletproof, but it sure
would be more robust than what we have now.  There are plenty of switches
where we do know that we want the exhaustive list; adding the cast there
would already make things more robust, even if we miss other places.

It'd be nice if there were an easy way to write a switch() where the
compiler enforces that all enum values are checked, but still had the
possibility to have a 'default:' block for error checking... I can't
quite come up with a good approach to emulate that though.

Greetings,

Andres Freund



Re: Use an enum for RELKIND_*?

2018-12-18 Thread Tom Lane
Kyotaro HORIGUCHI  writes:
> At Tue, 18 Dec 2018 17:13:08 -0800, Andres Freund  wrote 
> in <20181219011308.mopzyvc73nwjz...@alap3.anarazel.de>
>> Right now there's no easy way to use the compiler to ensure that all
>> places that need to deal with all kinds of relkinds check a new
>> relkind.  I think we should make that easier by moving RELKIND_* to an
>> enum, with the existing letters as the value.

> I think we cannot use enums with a specified base type, so it will work
> only as long as we don't forget the cast within switch(). However, I don't think
> that is a reason not to do so.
>
> switch ((RelationRelkind) rel->rd_rel->relkind)

Hm.  This example doesn't seem very compelling.  If we put a
"default: elog(ERROR...)" into the switch, compilers will not warn
us about omitted values.  On the other hand, if we remove the default:
then we lose robustness against corrupted relkind values coming from disk,
which I think is going to be a net negative.  Not to mention that we get
no help at all unless we remember to add the cast to every such switch.

So, while I understand Andres' desire to make this more bulletproof,
I'm not quite sure how this proposal helps in practice.

regards, tom lane



Re: Use an enum for RELKIND_*?

2018-12-18 Thread Kyotaro HORIGUCHI
Hello.

At Tue, 18 Dec 2018 17:13:08 -0800, Andres Freund  wrote in 
<20181219011308.mopzyvc73nwjz...@alap3.anarazel.de>
> Hi,
> 
> Right now there's no easy way to use the compiler to ensure that all
> places that need to deal with all kinds of relkinds check a new
> relkind.  I think we should make that easier by moving RELKIND_* to an
> enum, with the existing letters as the value.

I feel the same pain, and I had thought about something like that before.

> Obviously we cannot really do that for FormData_pg_class.relkind, but
> switch() statements can easily cast that to RelationRelkind (or whatever
> we name it).
> 
> Does anybody see a reason not to do so?

I think we cannot use enums with a specified base type, so it will work
only as long as we don't forget the cast within switch(). However, I don't think
that is a reason not to do so.

switch ((RelationRelkind) rel->rd_rel->relkind)
{
  ...
}

char is compatible with integer under our usage there. FWIW I
don't mind explicit assignments in the enum definition since we
already do a similar thing there.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Use an enum for RELKIND_*?

2018-12-18 Thread Andres Freund
Hi,

Right now there's no easy way to use the compiler to ensure that all
places that need to deal with all kinds of relkinds check a new
relkind.  I think we should make that easier by moving RELKIND_* to an
enum, with the existing letters as the value.

Obviously we cannot really do that for FormData_pg_class.relkind, but
switch() statements can easily cast that to RelationRelkind (or whatever
we name it).
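
Concretely, something like this (just a sketch, details to be bikeshedded):

typedef enum RelationRelkind
{
    RELKIND_RELATION = 'r',
    RELKIND_INDEX = 'i',
    RELKIND_SEQUENCE = 'S',
    RELKIND_TOASTVALUE = 't',
    RELKIND_VIEW = 'v',
    RELKIND_MATVIEW = 'm',
    RELKIND_COMPOSITE_TYPE = 'c',
    RELKIND_FOREIGN_TABLE = 'f',
    RELKIND_PARTITIONED_TABLE = 'p',
    RELKIND_PARTITIONED_INDEX = 'I'
} RelationRelkind;

and then, at call sites that want exhaustiveness checking:

    switch ((RelationRelkind) rel->rd_rel->relkind)
    {
        case RELKIND_INDEX:
        case RELKIND_PARTITIONED_INDEX:
            /* index-specific handling here */
            break;
        case RELKIND_RELATION:
        case RELKIND_SEQUENCE:
        case RELKIND_TOASTVALUE:
        case RELKIND_VIEW:
        case RELKIND_MATVIEW:
        case RELKIND_COMPOSITE_TYPE:
        case RELKIND_FOREIGN_TABLE:
        case RELKIND_PARTITIONED_TABLE:
            break;
    }

so that the compiler warns when a new relkind is added and a case is missing.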

Does anybody see a reason not to do so?

Greetings,

Andres Freund



Re: cfbot run pgindent?

2018-12-18 Thread Thomas Munro
On Wed, Dec 19, 2018 at 8:08 AM Peter Eisentraut
 wrote:
> On 18/12/2018 19:50, Robbie Harwood wrote:
> > Second, since the project does have a defined style checker, do you
> > think it would be possible to run it as part of cfbot and report errors?
>
> pgindent is a code formatter, not a style checker.  Using it to check
> whether just the newly added code is correctly formatted would probably
> require that all existing code is correctly formatted at all times,
> which isn't the case.
>
> What cfbot could do is spit out a diff at the end between the current
> code and the pgindent'ed version.  That might be useful in some cases.

pgindent has the problem that its typename list is maintained by
mysterious manual means, so it might be noisy.

I think that in general any extra checks (for example perl scripts
that check that nodes have all the right boilerplate, read/write
functions etc, wait points have names and documentation, ... etc as
discussed a while back) might as well be in-tree and runnable from the
makefile so that everyone can run them directly if they want, and then
cfbot could too.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: cfbot run pgindent?

2018-12-18 Thread Michael Paquier
On Tue, Dec 18, 2018 at 10:08:33PM +0100, Peter Eisentraut wrote:
> What cfbot could do is spit out a diff at the end between the current
> code and the pgindent'ed version.  That might be useful in some cases.

Yes, that part would be useful.  Another thing is that some structures
need to be added to typedefs.list, which can also create noise diffs in
some patches.
--
Michael




Re: Should new partitions inherit their tablespace from their parent?

2018-12-18 Thread Michael Paquier
On Tue, Dec 18, 2018 at 11:40:49AM -0500, Tom Lane wrote:
> Agreed, done.

Thanks Tom for stepping in!
--
Michael




Re: ATTACH/DETACH PARTITION CONCURRENTLY

2018-12-18 Thread Michael Paquier
On Tue, Dec 18, 2018 at 01:41:06PM -0500, Robert Haas wrote:
> OK.  I'll post what I have by the end of the week.

Thanks, Robert.
--
Michael




Re: Doc typo?

2018-12-18 Thread Tatsuo Ishii
>> I'd move the comma not remove it; and I think "the pg_read_file()" is
>> pretty bad English too.  So perhaps
>> 
>>  Note that granting users the EXECUTE privilege on
>>  pg_read_file(), or related functions, allows them 
>> the
>>  ability to read any file on the server which the database can read and
>>  that those reads bypass all in-database privilege checks.
> 
> Thanks. I will commit this.

Done.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: Doc typo?

2018-12-18 Thread Tatsuo Ishii
>> It seems there's an extra comma between "related" and "functions". Am I 
>> correct?
> 
> I'd move the comma not remove it; and I think "the pg_read_file()" is
> pretty bad English too.  So perhaps
> 
>  Note that granting users the EXECUTE privilege on
>  pg_read_file(), or related functions, allows them 
> the
>  ability to read any file on the server which the database can read and
>  that those reads bypass all in-database privilege checks.

Thanks. I will commit this.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2018-12-18 Thread Tomas Vondra
Hi Alexey,

Attached is an updated version of the patches, with all the fixes I've
done in the past. I believe it should fix at least some of the issues
you reported - certainly the problem with stream_cleanup_files, but
perhaps some of the other issues too.

I'm a bit confused by the changes to TAP tests. Per the patch summary,
some .pl files get renamed (not sure why), a new one is added, etc. So
I've instead enabled streaming subscriptions in all tests, which with
this patch produces two failures:

Test Summary Report
---
t/004_sync.pl(Wstat: 7424 Tests: 1 Failed: 0)
  Non-zero exit status: 29
  Parse errors: Bad plan.  You planned 7 tests but ran 1.
t/011_stream_ddl.pl  (Wstat: 256 Tests: 2 Failed: 1)
  Failed test:  2
  Non-zero exit status: 1

So yeah, there's more stuff to fix. But I can't directly apply your
fixes because the updated patches are somewhat different.


On 12/18/18 3:07 PM, Alexey Kondratov wrote:
> On 18.12.2018 1:28, Tomas Vondra wrote:
>>> 4) There was a problem with marking top-level transaction as having
>>> catalog changes if one of its subtransactions has. It was causing a
>>> problem with DDL statements just after subtransaction start (savepoint),
>>> so data from new columns is not replicated.
>>>
>>> 5) Similar issue with schema send. You send schema only once per each
>>> sub/transaction (IIRC), while we have to update schema on each catalog
>>> change: invalidation execution, snapshot rebuild, adding new tuple cids.
>>> So I ended up with adding is_schema_send flag to ReorderBufferTXN, since
>>> it is easy to set it inside RB and read in the output plugin. Probably,
>>> we have to choose a better place for this flag.
>>>
>> Hmm. Can you share an example how to trigger these issues?
> 
> Test cases inside 014_stream_tough_ddl.pl and old ones (with
> streaming=true option added) should reproduce all these issues. In
> general, it happens in a txn like:
> 
> INSERT
> SAVEPOINT
> ALTER TABLE ... ADD COLUMN
> INSERT
> 
> then the second insert may discover old version of catalog.
> 

Yeah, that's the issue I've discovered before and thought it got fixed.

>> Interesting. Any idea where does the extra overhead in this particular
>> case come from? It's hard to deduce that from the single flame graph,
>> when I don't have anything to compare it with (i.e. the flame graph for
>> the "normal" case).
> 
> I guess that bottleneck is in disk operations. You can check
> logical_repl_worker_new_perf.svg flame graph: disk reads (~9%) and
> writes (~26%) take around 35% of CPU time in summary. To compare,
> please, see attached flame graph for the following transaction:
> 
> INSERT INTO large_text
> SELECT (SELECT string_agg('x', ',')
> FROM generate_series(1, 2000)) FROM generate_series(1, 100);
> 
> Execution Time: 44519.816 ms
> Time: 98333,642 ms (01:38,334)
> 
> where disk IO is only ~7-8% in total. So we get very roughly the same
> ~x4-5 performance drop here. JFYI, I am using a machine with SSD for tests.
> 
> Therefore, you could probably write changes on the receiver in bigger chunks,
> rather than each change separately.
> 

Possibly, I/O is certainly a possible culprit, although we should be
using buffered I/O and there certainly are not any fsyncs here. So I'm
not sure why would it be cheaper to do the writes in batches.

BTW does this mean you see the overhead on the apply side? Or are you
running this on a single machine, and it's difficult to decide?

>> So I'm not particularly worried, but I'll look into that. I'd be much
>> more worried if there was measurable overhead in cases when there's no
>> streaming happening (either because it's disabled or the memory limit
>> was not hit).
> 
> What I have also just found, is that if a table row is large enough to
> be TOASTed, e.g.:
> 
> INSERT INTO large_text
> SELECT (SELECT string_agg('x', ',')
> FROM generate_series(1, 100)) FROM generate_series(1, 1000);
> 
> then the logical_work_mem limit is not hit and we neither stream nor spill
> this transaction to disk, although it is still large. In contrast, the
> transaction above (with 100 smaller rows), being comparable in size,
> is streamed. I'm not sure that it is easy to add proper accounting of
> TOAST-able columns, but it is worth it.
> 

That's certainly strange and possibly a bug in the memory accounting
code. I'm not sure why would that happen, though, because TOAST data
look just like regular INSERT changes. Interesting. I wonder if it's
already fixed in this updated version, but it's a bit too late to
investigate that today.


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


0001-Add-logical_work_mem-to-limit-ReorderBuffer-20181219.patch.gz

0002-Immediately-WAL-log-assignments-20181219.patch.gz

0003-Issue-individual-invalidations-with-wal_lev-20181219.patch.gz

Re: Doc typo?

2018-12-18 Thread Tom Lane
David Fetter  writes:
> Is there a useful distinction to be drawn between the files readable
> by the system user who owns the database and those the database itself
> can read?

Probably not.  It's possible to create such a distinction with SELinux
or other security tools, but not in plain Unix, and I don't think we
want to wade into non-standard stuff.

regards, tom lane



Re: Doc typo?

2018-12-18 Thread David Fetter
On Tue, Dec 18, 2018 at 06:16:14PM -0500, Tom Lane wrote:
> Tatsuo Ishii  writes:
> > While translating the manual into Japanese, I had a hard time parsing
> > the following sentence in func.sgml:
> 
> > Note that granting users the EXECUTE privilege on the
> > pg_read_file(), or related, functions allows them 
> > the
> > ability to read any file on the server which the database can read and
> > that those reads bypass all in-database privilege checks.
> 
> > It seems there's an extra comma between "related" and "functions". Am I 
> > correct?
> 
> I'd move the comma not remove it; and I think "the pg_read_file()" is
> pretty bad English too.  So perhaps
> 
>  Note that granting users the EXECUTE privilege on
>  pg_read_file(), or related functions, allows them 
> the
>  ability to read any file on the server which the database can read and
>  that those reads bypass all in-database privilege checks.

Maintaining parallelism:

Note that granting users the EXECUTE privilege on
pg_read_file(), or on related functions, allows them 
the
ability to read any file on the server which the database can read and
that those reads bypass all in-database privilege checks.

Is there a useful distinction to be drawn between the files readable
by the system user who owns the database and those the database itself
can read?

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



Re: [HACKERS] CLUSTER command progress monitor

2018-12-18 Thread Peter Geoghegan
On Tue, Dec 18, 2018 at 2:47 PM Alvaro Herrera  wrote:
> Well, if you think about individual blocks in terms of storage space,
> maybe that's true, but I meant it in a Heraclitean way, of men never
> stepping into the same river -- the second time you write the block,
> it's not the same block you wrote before, so you count it twice.  It's
> not the actual disk space utilization that matters, but how much I/O
> you have done (even if it is just to the kernel cache, I suppose).

Right.

> I suppose mapping such numbers to actual progress is a bit of an art (or
> intuition as you say), but it seems to be the best we can do, if we do
> anything at all.

I think that it's fairly useful. I suspect that you don't have to have
my theoretical grounding in sorting to be able to do almost as well.
All you need is a little bit of experience.

> How good are those predictions?  The feeling I get from this thread is
> that if the estimation of the number of passes is unreliable, it's
> better not to report it at all; just return how many we've done thus
> far.  It's undesirable to report that we're about 150% done (or take
> hours to get to 40% done, then suddenly be over).

Maybe it isn't that reliable. But on second thought I think that it
might not matter, and maybe we should just not do that.

"How slow can I make this sort go by subtracting work_mem?" is a game
that I like to play sometimes. This blogpost plays that game, and
reaches some pretty amusing conclusions:

https://www.cybertec-postgresql.com/en/postgresql-improving-sort-performance/

It says that sorting numeric is 60% slower when you do an external
sort. But it's an apples to asteroids comparison, because the
comparison is made between 4MB of work_mem, and 1GB. I think that it's
pretty damn impressive that it's only 60% slower! Besides, even that
difference is probably on the high side of average, because numeric
abbreviated keys work particularly well, and you won't get the same
benefit with unique numeric values when you happen to be doing a lot
of merging. If you tried the same experiment with integers, or even
text + abbreviated keys, I bet the difference would be a lot smaller.
Despite the huge gap in the amount of memory used.

On modern hardware, where doing some amount of random I/O is not that
noticeable, you'll have a very hard time finding a case where even a
paltry amount of memory with many passes does all that much worse than
an internal sort (OTOH, it's not hard to find cases where an external
sort is *faster*). Even if you make a generic estimate, it's still
probably going to be pretty good, because there just isn't that much
variation in how long the sort will take as you vary the amount of
memory it can use. Some people will be surprised at this, but it's a
pretty robust effect. (This is why I think that a hash_mem GUC might
be a good medium term solution that improves upon work_mem -- the
situation is dramatically different when it comes to hashing.)

My point is that you could offer users the kind of insight they'd find
very useful with only a very crude estimate of the amount of merging.
Even if it was 60% slower than initially projected, that's still not
an awful estimate to most users. That just leaves initial run
generation, but it's relatively easy to accurately estimate the amount
of initial runs. I rarely see a case where merging takes more than 40%
of the total, barring parallel CREATE INDEX.

> I wonder if internal sorts are really all that interesting from the PoV
> of progress reporting.  Also, I have the impression that quicksort isn't
> very amenable to letting you know how much work is left.

It is hard to predict the duration of one massive quicksort, but it
seems fairly easy to recognize a kind of cadence across multiple
quicksorts/runs that each take seconds to a couple of minutes. That's
going to be the vast, vast majority of cases we care about.

-- 
Peter Geoghegan



Re: Doc typo?

2018-12-18 Thread Tom Lane
Tatsuo Ishii  writes:
> While translating the manual into Japanese, I had a hard time parsing
> the following sentence in func.sgml:

> Note that granting users the EXECUTE privilege on the
> pg_read_file(), or related, functions allows them the
> ability to read any file on the server which the database can read and
> that those reads bypass all in-database privilege checks.

> It seems there's an extra comma between "related" and "functions". Am I 
> correct?

I'd move the comma not remove it; and I think "the pg_read_file()" is
pretty bad English too.  So perhaps

 Note that granting users the EXECUTE privilege on
 pg_read_file(), or related functions, allows them the
 ability to read any file on the server which the database can read and
 that those reads bypass all in-database privilege checks.

regards, tom lane



Doc typo?

2018-12-18 Thread Tatsuo Ishii
While translating the manual into Japanese, I had a hard time parsing
the following sentence in func.sgml:

Note that granting users the EXECUTE privilege on the
pg_read_file(), or related, functions allows them the
ability to read any file on the server which the database can read and
that those reads bypass all in-database privilege checks.

It seems there's an extra comma between "related" and "functions". Am I correct?

Patch attached.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index b3336ea9be..aa0c4cc89d 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -20399,7 +20399,7 @@ postgres=# SELECT * FROM pg_walfile_name_offset(pg_stop_backup());
 

 Note that granting users the EXECUTE privilege on the
-pg_read_file(), or related, functions allows them the
+pg_read_file(), or related functions allows them the
 ability to read any file on the server which the database can read and
 that those reads bypass all in-database privilege checks.  This means that,
 among other things, a user with this access is able to read the contents of the


Re: [HACKERS] CLUSTER command progress monitor

2018-12-18 Thread Alvaro Herrera
On 2018-Dec-18, Peter Geoghegan wrote:

> On Tue, Dec 18, 2018 at 1:02 PM Alvaro Herrera  
> wrote:
> > If we see this in terms of tapes and merges, we can report the total
> > number of each of those that we have completed.  As far as I understand,
> > we write one tape to completion, and only then start another one, right?
> > Since there's no way to know how many tapes/merges are needed in total,
> > it's not possible to compute a percentage of completion.  That's seems
> > okay -- we're just telling the user that progress is being made, and we
> > only report facts not theory.  Perhaps we can (also?) indicate disk I/O
> > utilization, in terms of the number of blocks written by tuplesort.
> 
> The number of blocks tuplesort uses is constant from the end of
> initial run generation, since logtape.c will recycle blocks.

Well, if you think about individual blocks in terms of storage space,
maybe that's true, but I meant it in a Heraclitean way, of men never
stepping into the same river -- the second time you write the block,
it's not the same block you wrote before, so you count it twice.  It's
not the actual disk space utilization that matters, but how much I/O
you have done (even if it is just to the kernel cache, I suppose).

> > I suppose that in order to have tuplesort.c report progress, we would
> > have to have some kind of API that tuplesort would invoke internally to
> > indicate events such as "tape started/completed", "merge started/completed".
> > One idea is to use a callback system; each tuplesort caller could
> > optionally pass a callback to the "begin" function, for progress
> > reporting purposes.  Initially only cluster.c would use it, but I
> > suppose eventually every tuplesort caller would want that.
> 
> I think that you could have a callback that did something with the
> information currently reported by trace_sort. That's not a bad way of
> scoping the problem. That's how I myself monitor the progress of a
> sort, and it works pretty well (whether or not that means other people
> can do it is not exactly clear to me).

Thanks, that looks useful.

I suppose mapping such numbers to actual progress is a bit of an art (or
intuition as you say), but it seems to be the best we can do, if we do
anything at all.

> We predict the number of merge passes within cost_sort() already. That
> doesn't seem all that hard to generalize, so that you report the
> expected number of passes against the current pass. Some passes are
> much quicker than others, but you generally don't have that many with
> realistic cases. I don't expect that it will work very well with an
> internal sort, but in the case of CLUSTER that almost seems
> irrelevant. And maybe even in all cases.

How good are those predictions?  The feeling I get from this thread is
that if the estimation of the number of passes is unreliable, it's
better not to report it at all; just return how many we've done thus
far.  It's undesirable to report that we're about 150% done (or take
hours to get to 40% done, then suddenly be over).

I wonder if internal sorts are really all that interesting from the PoV
of progress reporting.  Also, I have the impression that quicksort isn't
very amenable to letting you know how much work is left.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Fixing findDependentObjects()'s dependency on scan order (regressions in DROP diagnostic messages)

2018-12-18 Thread Peter Geoghegan
On Tue, Dec 18, 2018 at 2:26 PM Tom Lane  wrote:
> Hm, that definitely leads me to feel that we've got bug(s) in
> dependency.c.  I'll take a look sometime soon.

Thanks!

I'm greatly relieved that I probably won't have to become an expert on
dependency.c after all.   :-)
-- 
Peter Geoghegan



Re: Fixing findDependentObjects()'s dependency on scan order (regressions in DROP diagnostic messages)

2018-12-18 Thread Tom Lane
Peter Geoghegan  writes:
> On Tue, Dec 18, 2018 at 2:11 PM Tom Lane  wrote:
>> Do you mean "same" output, or "sane" output?  I'd certainly expect
>> the latter.

> I meant sane.

> --ignore-system-indexes leads to slightly wrong answers in a number of
> the diagnostic messages run by the regression tests (I recall that the
> number of objects affected by CASCADE sometimes differed, and I think
> that there was also a certain amount of this DEPENDENCY_INTERNAL_AUTO
> business that Alvaro looked into). I think that this must have always
> been true.

Hm, that definitely leads me to feel that we've got bug(s) in
dependency.c.  I'll take a look sometime soon.

regards, tom lane



Re: [HACKERS] CLUSTER command progress monitor

2018-12-18 Thread Peter Geoghegan
On Tue, Dec 18, 2018 at 1:02 PM Alvaro Herrera  wrote:
> If we see this in terms of tapes and merges, we can report the total
> number of each of those that we have completed.  As far as I understand,
> we write one tape to completion, and only then start another one, right?
> Since there's no way to know how many tapes/merges are needed in total,
> it's not possible to compute a percentage of completion.  That's seems
> okay -- we're just telling the user that progress is being made, and we
> only report facts not theory.  Perhaps we can (also?) indicate disk I/O
> utilization, in terms of the number of blocks written by tuplesort.

The number of blocks tuplesort uses is constant from the end of
initial run generation, since logtape.c will recycle blocks.

> I suppose that in order to have tuplesort.c report progress, we would
> have to have some kind of API that tuplesort would invoke internally to
> indicate events such as "tape started/completed", "merge started/completed".
> One idea is to use a callback system; each tuplesort caller could
> optionally pass a callback to the "begin" function, for progress
> reporting purposes.  Initially only cluster.c would use it, but I
> suppose eventually every tuplesort caller would want that.

I think that you could have a callback that did something with the
information currently reported by trace_sort. That's not a bad way of
scoping the problem. That's how I myself monitor the progress of a
sort, and it works pretty well (whether or not that means other people
can do it is not exactly clear to me).
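
To sketch the shape such an API could take (everything below is hypothetical,
including the names; pgstat_progress_update_param() is the existing
progress-reporting primitive):

typedef enum TuplesortProgressEvent
{
    TUPLESORT_RUN_DONE,         /* finished writing one initial run */
    TUPLESORT_MERGE_PASS_DONE   /* finished one merge pass */
} TuplesortProgressEvent;

typedef void (*TuplesortProgressCallback) (TuplesortProgressEvent event,
                                           int64 counter, void *arg);

/* cluster.c could then hand tuplesort something like this: */
static void
cluster_sort_progress(TuplesortProgressEvent event, int64 counter, void *arg)
{
    /* PROGRESS_CLUSTER_SORT_PASSES_DONE is a made-up progress slot */
    if (event == TUPLESORT_MERGE_PASS_DONE)
        pgstat_progress_update_param(PROGRESS_CLUSTER_SORT_PASSES_DONE, counter);
}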

We predict the number of merge passes within cost_sort() already. That
doesn't seem all that hard to generalize, so that you report the
expected number of passes against the current pass. Some passes are
much quicker than others, but you generally don't have that many with
realistic cases. I don't expect that it will work very well with an
internal sort, but in the case of CLUSTER that almost seems
irrelevant. And maybe even in all cases.

I think that the user is going to have to be willing to develop some
intuition about the progress for it to be all that useful. They're
really looking for something that gives a clue if they'll have to wait
an hour, a day, or a week, which it seems like trace_sort-like
information gives you some idea of. (BTW, dtrace probes can already
give the user much the same information -- I think that more people
should use those, since tracing technology on Linux has improved
drastically in the last few years.)

-- 
Peter Geoghegan



Re: Fixing findDependentObjects()'s dependency on scan order (regressions in DROP diagnostic messages)

2018-12-18 Thread Peter Geoghegan
On Tue, Dec 18, 2018 at 2:11 PM Tom Lane  wrote:
> > Interesting.
> > Note that if the standard that we're going to hold a solution to here
> > is "must produce sane output with  --ignore-system-indexes", then my
> > solution will not meet that standard.
>
> Do you mean "same" output, or "sane" output?  I'd certainly expect
> the latter.

I meant sane.

--ignore-system-indexes leads to slightly wrong answers in a number of
the diagnostic messages run by the regression tests (I recall that the
number of objects affected by CASCADE sometimes differed, and I think
that there was also a certain amount of this DEPENDENCY_INTERNAL_AUTO
business that Alvaro looked into). I think that this must have always
been true.

My patch will not improve matters with --ignore-system-indexes. It
merely makes the currently expected output when scanning pg_depend
indexes occur reliably. For better or for worse.

-- 
Peter Geoghegan



Re: Fixing findDependentObjects()'s dependency on scan order (regressions in DROP diagnostic messages)

2018-12-18 Thread Tom Lane
Peter Geoghegan  writes:
> On Tue, Dec 18, 2018 at 1:20 PM Alvaro Herrera  
> wrote:
>> I can reproduce the
>> reported problem without your patch by using that flag.  Here's a
>> recipe:

> Interesting.
> Note that if the standard that we're going to hold a solution to here
> is "must produce sane output with  --ignore-system-indexes", then my
> solution will not meet that standard.

Do you mean "same" output, or "sane" output?  I'd certainly expect
the latter.

I think though that Alvaro was just offering this as a way to poke
at the posited bug in dependency.c without having to install your
whole patch.

regards, tom lane



Re: still use IndexIsValid() etc. macros?

2018-12-18 Thread Tom Lane
Peter Eisentraut  writes:
> In another patch discussion it was brought up why the patch doesn't use
> the IndexIsValid() etc. macros.

> They are defined thus:

> /*
>  * Use of these macros is recommended over direct examination of the state
>  * flag columns where possible; this allows source code compatibility with
>  * the hacky representation used in 9.2.
>  */
> #define IndexIsValid(indexForm) ((indexForm)->indisvalid)
> #define IndexIsReady(indexForm) ((indexForm)->indisready)
> #define IndexIsLive(indexForm)  ((indexForm)->indislive)

> I don't see them used consistently.  Obviously, some low-level code
> needs to bypass them, but it's hard to see where to draw the line.  Is
> it worth keeping these?  Is there still code that maintains
> compatibility with 9.2 from a single source?

We have a general problem with people ignoring accessor macros;
it's hardly limited to these.  The relcache accessor macros such as
RelationGetDescr, for instance, tend to get bypassed.  I'm not sure
how worthwhile it is to be picky about that.
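
(For illustration, the two equivalent spellings -- RelationGetDescr() simply
expands to the rd_att field access:)

    TupleDesc   tupdesc = RelationGetDescr(rel);    /* via the accessor macro */
    TupleDesc   tupdesc2 = rel->rd_att;             /* the same thing, bypassing it */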

I agree that now that 9.2 is a year out of support, there's not
too much value in the original cross-branch-compatibility argument
for these particular macros.

regards, tom lane



Re: Fixing findDependentObjects()'s dependency on scan order (regressions in DROP diagnostic messages)

2018-12-18 Thread Peter Geoghegan
On Tue, Dec 18, 2018 at 1:20 PM Alvaro Herrera  wrote:
>
> On 2018-Dec-18, Peter Geoghegan wrote:
> Hmm, interesting.  I wonder if this is just a case of never testing this
> code under "postgres --ignore-system-indexes".

I suppose that you could say that. The regression tests will fail at
many points with --ignore-system-indexes, almost all of which are due
to well-understood issues. For example, you'll get loads of WARNINGs
about needing to use a system index despite the server being run under
--ignore-system-indexes.

> I can reproduce the
> reported problem without your patch by using that flag.  Here's a
> recipe:
>
> create extension cube;
> create table dep as select ctid as tid,* from pg_depend;
> create extension earthdistance;
> select tid, deptype, (dep).type, (dep).identity, (ref).type, (ref).identity
>   from (select tid, deptype, pg_identify_object(classid, objid, objsubid) as 
> dep,
>pg_identify_object(refclassid, refobjid, refobjsubid) as ref
>   from (select ctid as tid, * from pg_depend except select * from 
> dep) a
>) b;

Interesting.

Note that if the standard that we're going to hold a solution to here
is "must produce sane output with  --ignore-system-indexes", then my
solution will not meet that standard. However, I fear that it's going
to be really hard to accomplish that goal some other way (besides
which, as I said, the tests will still fail with
--ignore-system-indexes for reasons that have nothing to do with
dependency management).

-- 
Peter Geoghegan



still use IndexIsValid() etc. macros?

2018-12-18 Thread Peter Eisentraut
In another patch discussion it was brought up why the patch doesn't use
the IndexIsValid() etc. macros.

They are defined thus:

/*
 * Use of these macros is recommended over direct examination of the state
 * flag columns where possible; this allows source code compatibility with
 * the hacky representation used in 9.2.
 */
#define IndexIsValid(indexForm) ((indexForm)->indisvalid)
#define IndexIsReady(indexForm) ((indexForm)->indisready)
#define IndexIsLive(indexForm)  ((indexForm)->indislive)

I don't see them used consistently.  Obviously, some low-level code
needs to bypass them, but it's hard to see where to draw the line.  Is
it worth keeping these?  Is there still code that maintains
compatibility with 9.2 from a single source?

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



insensitive collations

2018-12-18 Thread Peter Eisentraut
With various patches and discussions around collations going on, I
figured I'd send in my in-progress patch for insensitive collations.

This adds a flag "insensitive" to collations.  Such a collation disables
various optimizations that assume that strings are equal only if they
are byte-wise equal.  That then allows use cases such as
case-insensitive or accent-insensitive comparisons or handling of
strings with different Unicode normal forms.

So this doesn't actually make the collation case-insensitive or
anything, it just allows a library-provided collation that is, say,
case-insensitive to actually work that way.  So maybe "insensitive"
isn't the right name for this flag, but we can think about that.

The jobs of this patch, aside from some DDL extensions, are to track
collation assignment in plan types where it has so far been
ignored, and then make the various collation-aware functions take the
insensitive flag into account.  In comparison functions this just means
skipping past the memcmp() optimizations.  In hashing functions, this
means converting the string to a sort key (think strxfrm()) before hashing.
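
For the ICU case, the hashing side amounts to something like the following (a
simplified sketch of the idea with fixed-size buffers and a made-up function
name, not the patch's actual code):

static uint32
hash_text_by_sort_key(const char *str, int len, UCollator *collator)
{
    UChar       ubuf[1024];
    uint8_t     sortkey[1024];
    int32_t     ulen;
    int32_t     sklen;
    UErrorCode  status = U_ZERO_ERROR;

    /* convert the UTF-8 input to UTF-16 for ICU */
    u_strFromUTF8(ubuf, lengthof(ubuf), &ulen, str, len, &status);
    if (U_FAILURE(status))
        elog(ERROR, "ICU conversion failed");

    /* strings the collation considers equal get identical sort keys */
    sklen = ucol_getSortKey(collator, ubuf, ulen, sortkey, sizeof(sortkey));

    return DatumGetUInt32(hash_any(sortkey, sklen));
}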

Various pieces are incomplete, but the idea should be clear from this.
I have only implemented the ICU implementation in hashtext(); the libc
provider branch needs to be added (or maybe we won't want to).  All the
changes around the "name" type haven't been taken into account.  Foreign
key support (see ri_GenerateQualCollation()) needs to be addressed.
More tests for all the different plans need to be added.  But in
principle it works quite well, as you can see in the tests added so far.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From d63df8170ec8e9afedcb894f87f73a409aa27e6d Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Tue, 18 Dec 2018 22:17:53 +0100
Subject: [PATCH v1] Insensitive collations

This adds a flag "insensitive" to collations.  Such a collation
disables various optimizations that assume that strings are equal only
if they are byte-wise equal.  That then allows use cases such as
case-insensitive or accent-insensitive comparisons or handling of
strings with different Unicode normal forms.
---
 contrib/bloom/bloom.h |   1 +
 contrib/bloom/blutils.c   |   3 +-
 doc/src/sgml/catalogs.sgml|   7 +
 doc/src/sgml/charset.sgml |  11 +-
 doc/src/sgml/ref/create_collation.sgml|  18 +++
 src/backend/access/hash/hashfunc.c|  45 ++
 src/backend/catalog/pg_collation.c|   2 +
 src/backend/commands/collationcmds.c  |  15 +-
 src/backend/executor/execExpr.c   |   4 +-
 src/backend/executor/execGrouping.c   |  12 +-
 src/backend/executor/execPartition.c  |   1 +
 src/backend/executor/nodeAgg.c|   4 +
 src/backend/executor/nodeGroup.c  |   1 +
 src/backend/executor/nodeHash.c   |  14 +-
 src/backend/executor/nodeHashjoin.c   |   5 +
 src/backend/executor/nodeRecursiveunion.c |   1 +
 src/backend/executor/nodeSetOp.c  |   2 +
 src/backend/executor/nodeSubplan.c|   8 +
 src/backend/executor/nodeUnique.c |   1 +
 src/backend/executor/nodeWindowAgg.c  |   2 +
 src/backend/nodes/copyfuncs.c |   7 +
 src/backend/nodes/outfuncs.c  |  29 
 src/backend/nodes/readfuncs.c |   7 +
 src/backend/optimizer/plan/createplan.c   |  54 ++-
 src/backend/optimizer/util/tlist.c|  25 
 src/backend/partitioning/partbounds.c |   4 +-
 src/backend/partitioning/partprune.c  |   3 +-
 src/backend/utils/adt/arrayfuncs.c|   2 +-
 src/backend/utils/adt/orderedsetaggs.c|   1 +
 src/backend/utils/adt/pg_locale.c |   1 +
 src/backend/utils/adt/varchar.c   |  14 ++
 src/backend/utils/adt/varlena.c   |  45 +-
 src/backend/utils/cache/catcache.c|   2 +-
 src/bin/initdb/initdb.c   |   4 +-
 src/include/catalog/pg_collation.h|   2 +
 src/include/executor/executor.h   |   3 +
 src/include/executor/hashjoin.h   |   1 +
 src/include/executor/nodeHash.h   |   2 +-
 src/include/nodes/execnodes.h |   3 +
 src/include/nodes/plannodes.h |   7 +
 src/include/optimizer/planmain.h  |   2 +-
 src/include/optimizer/tlist.h |   1 +
 src/include/partitioning/partbounds.h |   1 +
 src/include/utils/pg_locale.h |   1 +
 .../regress/expected/collate.icu.utf8.out | 138 +-
 src/test/regress/sql/collate.icu.utf8.sql |  39 +
 46 files changed, 516 insertions(+), 39 deletions(-)

diff --git a/contrib/bloom/bloom.h b/contrib/bloom/bloom.h
index 3973ac75e8..0b05f9796d 100644
--- 

Re: Fixing findDependentObjects()'s dependency on scan order (regressions in DROP diagnostic messages)

2018-12-18 Thread Alvaro Herrera
On 2018-Dec-18, Peter Geoghegan wrote:

> Well, you also have cases like this:
> 
> --- a/contrib/earthdistance/expected/earthdistance.out
> +++ b/contrib/earthdistance/expected/earthdistance.out
> @@ -972,7 +972,7 @@ SELECT abs(cube_distance(ll_to_earth(-30,-90),
> '(0)'::cube) / earth() - 1) <
> 
>  drop extension cube;  -- fail, earthdistance requires it
>  ERROR:  cannot drop extension cube because other objects depend on it
> -DETAIL:  extension earthdistance depends on extension cube
> +DETAIL:  extension earthdistance depends on function cube_out(cube)
> 
> This is a further example of "wrong, not just annoying". Technically
> this is a broader problem than DEPENDENCY_INTERNAL_AUTO, I think,
> though perhaps not too much broader.

Hmm, interesting.  I wonder if this is just a case of never testing this
code under "postgres --ignore-system-indexes".  I can reproduce the
reported problem without your patch by using that flag.  Here's a
recipe:

create extension cube;
create table dep as select ctid as tid,* from pg_depend;
create extension earthdistance;
select tid, deptype, (dep).type, (dep).identity, (ref).type, (ref).identity
  from (select tid, deptype, pg_identify_object(classid, objid, objsubid) as dep,
               pg_identify_object(refclassid, refobjid, refobjsubid) as ref
          from (select ctid as tid, * from pg_depend except select * from dep) a
       ) b;

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



WRITE_*_ARRAY macros for outfuncs.c

2018-12-18 Thread Peter Eisentraut
In readfuncs.c, we have READ_ATTRNUMBER_ARRAY, READ_OID_ARRAY,
READ_INT_ARRAY, READ_BOOL_ARRAY, but the writing side in outfuncs.c is
coded by hand in each case.  Any reason for this?

Here is a patch that adds WRITE_ATTRNUMBER_ARRAY, WRITE_OID_ARRAY,
WRITE_INT_ARRAY, WRITE_BOOL_ARRAY.  That seems much nicer.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 62239b0d488a00bcca1229332caad98960240761 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Thu, 13 Dec 2018 21:18:18 +0100
Subject: [PATCH] Add WRITE_*_ARRAY macros

---
 src/backend/nodes/outfuncs.c | 248 ++-
 1 file changed, 67 insertions(+), 181 deletions(-)

diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 6edc7f2359..be6b4ca2f4 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -110,6 +110,34 @@ static void outChar(StringInfo str, char c);
(appendStringInfoString(str, " :" CppAsString(fldname) " "), \
 outBitmapset(str, node->fldname))
 
+#define WRITE_ATTRNUMBER_ARRAY(fldname, len) \
+   do { \
+   appendStringInfoString(str, " :" CppAsString(fldname) " "); \
+   for (int i = 0; i < len; i++) \
+   appendStringInfo(str, " %d", node->fldname[i]); \
+   } while(0)
+
+#define WRITE_OID_ARRAY(fldname, len) \
+   do { \
+   appendStringInfoString(str, " :" CppAsString(fldname) " "); \
+   for (int i = 0; i < len; i++) \
+   appendStringInfo(str, " %u", node->fldname[i]); \
+   } while(0)
+
+#define WRITE_INT_ARRAY(fldname, len) \
+   do { \
+   appendStringInfoString(str, " :" CppAsString(fldname) " "); \
+   for (int i = 0; i < len; i++) \
+   appendStringInfo(str, " %d", node->fldname[i]); \
+   } while(0)
+
+#define WRITE_BOOL_ARRAY(fldname, len) \
+   do { \
+   appendStringInfoString(str, " :" CppAsString(fldname) " "); \
+   for (int i = 0; i < len; i++) \
+   appendStringInfo(str, " %s", booltostr(node->fldname[i])); \
+   } while(0)
+
 
 #define booltostr(x)  ((x) ? "true" : "false")
 
@@ -411,55 +439,30 @@ _outAppend(StringInfo str, const Append *node)
 static void
 _outMergeAppend(StringInfo str, const MergeAppend *node)
 {
-   int i;
-
WRITE_NODE_TYPE("MERGEAPPEND");
 
_outPlanInfo(str, (const Plan *) node);
 
WRITE_NODE_FIELD(mergeplans);
-
WRITE_INT_FIELD(numCols);
-
-   appendStringInfoString(str, " :sortColIdx");
-   for (i = 0; i < node->numCols; i++)
-   appendStringInfo(str, " %d", node->sortColIdx[i]);
-
-   appendStringInfoString(str, " :sortOperators");
-   for (i = 0; i < node->numCols; i++)
-   appendStringInfo(str, " %u", node->sortOperators[i]);
-
-   appendStringInfoString(str, " :collations");
-   for (i = 0; i < node->numCols; i++)
-   appendStringInfo(str, " %u", node->collations[i]);
-
-   appendStringInfoString(str, " :nullsFirst");
-   for (i = 0; i < node->numCols; i++)
-   appendStringInfo(str, " %s", booltostr(node->nullsFirst[i]));
-
+   WRITE_ATTRNUMBER_ARRAY(sortColIdx, node->numCols);
+   WRITE_OID_ARRAY(sortOperators, node->numCols);
+   WRITE_OID_ARRAY(collations, node->numCols);
+   WRITE_BOOL_ARRAY(nullsFirst, node->numCols);
WRITE_NODE_FIELD(part_prune_info);
 }
 
 static void
 _outRecursiveUnion(StringInfo str, const RecursiveUnion *node)
 {
-   int i;
-
WRITE_NODE_TYPE("RECURSIVEUNION");
 
_outPlanInfo(str, (const Plan *) node);
 
WRITE_INT_FIELD(wtParam);
WRITE_INT_FIELD(numCols);
-
-   appendStringInfoString(str, " :dupColIdx");
-   for (i = 0; i < node->numCols; i++)
-   appendStringInfo(str, " %d", node->dupColIdx[i]);
-
-   appendStringInfoString(str, " :dupOperators");
-   for (i = 0; i < node->numCols; i++)
-   appendStringInfo(str, " %u", node->dupOperators[i]);
-
+   WRITE_ATTRNUMBER_ARRAY(dupColIdx, node->numCols);
+   WRITE_OID_ARRAY(dupOperators, node->numCols);
WRITE_LONG_FIELD(numGroups);
 }
 
@@ -501,8 +504,6 @@ _outGather(StringInfo str, const Gather *node)
 static void
 _outGatherMerge(StringInfo str, const GatherMerge *node)
 {
-   int i;
-
WRITE_NODE_TYPE("GATHERMERGE");
 
_outPlanInfo(str, (const Plan *) node);
@@ -510,23 +511,10 @@ _outGatherMerge(StringInfo str, const GatherMerge *node)
WRITE_INT_FIELD(num_workers);
WRITE_INT_FIELD(rescan_param);
WRITE_INT_FIELD(numCols);
-
-   appendStringInfoString(str, " :sortColIdx");
-   for (i = 0; i < node->numCols; i++)
-   appendStringInfo(str, " %d", node->sortColIdx[i]);
-
-   

Re: [HACKERS] CLUSTER command progress monitor

2018-12-18 Thread Alvaro Herrera
On 2017-Nov-21, Peter Geoghegan wrote:

> On Mon, Oct 2, 2017 at 6:04 AM, Robert Haas  wrote:
> > Progress reporting on sorts seems like a tricky problem to me, as I
> > said before.  In most cases, a sort is going to involve an initial
> > stage where it reads all the input tuples and writes out quicksorted
> > runs, and then a merge phase where it merges all the output tapes into
> > a sorted result.  There are some complexities; for example, if the
> > number of tapes is really large, then we might need multiple merge
> > phases, only the last of which will produce tuples.
> 
> This would ordinarily be the point at which I'd say "but you're very
> unlikely to require multiple passes for an external sort these days".
> But I won't say that on this thread, because CLUSTER generally has
> unusually wide tuples, and so is much more likely to be I/O bound, to
> require multiple passes, etc. (I bet the v10 enhancements
> disproportionately improved CLUSTER performance.)

When the seqscan-and-sort strategy is used, we feed tuplesort with every
tuple from the scan.  Once that's completed, we call `performsort`, then
retrieve tuples.

If we see this in terms of tapes and merges, we can report the total
number of each of those that we have completed.  As far as I understand,
we write one tape to completion, and only then start another one, right?
Since there's no way to know how many tapes/merges are needed in total,
it's not possible to compute a percentage of completion.  That's seems
okay -- we're just telling the user that progress is being made, and we
only report facts not theory.  Perhaps we can (also?) indicate disk I/O
utilization, in terms of the number of blocks written by tuplesort.

I suppose that in order to have tuplesort.c report progress, we would
have to have some kind of API that tuplesort would invoke internally to
indicate events such as "tape started/completed", "merge started/completed".
One idea is to use a callback system; each tuplesort caller could
optionally pass a callback to the "begin" function, for progress
reporting purposes.  Initially only cluster.c would use it, but I
suppose eventually every tuplesort caller would want that.
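
For illustration, the callback interface could have roughly this shape (all
names here are invented for the sketch, not an existing tuplesort API):

typedef enum TuplesortProgressEvent
{
    TUPLESORT_TAPE_STARTED,
    TUPLESORT_TAPE_COMPLETED,
    TUPLESORT_MERGE_STARTED,
    TUPLESORT_MERGE_COMPLETED
} TuplesortProgressEvent;

typedef void (*TuplesortProgressCallback) (TuplesortProgressEvent event,
                                           int64 tapes_done,
                                           int64 merges_done,
                                           void *arg);

/* cluster.c would then do something like
 *     tuplesort_begin_cluster(tupDesc, OldIndex, ..., report_cb, NULL);
 * and report_cb would forward the counters via
 * pgstat_progress_update_param().
 */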

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



[PATCH v20] GSSAPI encryption support

2018-12-18 Thread Robbie Harwood
Hello friends,

Attached please find version 20 of the GSSAPI encryption support.  This
has been rebased onto master (thanks Stephen for calling out ab69ea9).
Other changes since v19 from Stephen's review:

- About 100 lines of new comments

- pgindent run over code (only the stuff I'm changing; it reports other
  problems on master, but if I understand correctly, that's not on me to
  fix here)

Thanks for the feedback!  And speaking of feedback, this patch is in the
"Needs Review" state so if people have additional feedback, that would
be appreciated.  I've CC'd people who I can remember reviewing before;
they should of course feel no obligation to review again if time
commitments/interests do not allow.

Thanks,
--Robbie


signature.asc
Description: PGP signature
>From 6915ae2507bf7910c5eecfbd0b84805531c16a07 Mon Sep 17 00:00:00 2001
From: Robbie Harwood 
Date: Thu, 10 May 2018 16:12:03 -0400
Subject: [PATCH] libpq GSSAPI encryption support

On both the frontend and backend, prepare for GSSAPI encryption
support by moving common code for error handling into a separate file.
Fix a TODO for handling multiple status messages in the process.
Eliminate the OIDs, which have not been needed for some time.

Add frontend and backend encryption support functions.  Keep the
context initiation for authentication-only separate on both the
frontend and backend in order to avoid concerns about changing the
requested flags to include encryption support.

In postmaster, pull GSSAPI authorization checking into a shared
function.  Also share the initiator name between the encryption and
non-encryption codepaths.

Modify pqsecure_write() to take a non-const pointer.  The pointer will
not be modified, but this change (or a const-discarding cast, or a
malloc()+memcpy()) is necessary for GSSAPI due to const/struct
interactions in C.

For HBA, add "hostgss" and "hostnogss" entries that behave similarly
to their SSL counterparts.  "hostgss" requires either "gss", "trust",
or "reject" for its authentication.

Similarly, add a "gssmode" parameter to libpq.  Supported values are
"disable", "require", and "prefer".  Notably, negotiation will only be
attempted if credentials can be acquired.  Move credential acquisition
into its own function to support this behavior.

Finally, add documentation for everything new, and update existing
documentation on connection security.

Thanks to Michael Paquier for the Windows fixes.
---
 doc/src/sgml/client-auth.sgml   |  75 -
 doc/src/sgml/libpq.sgml |  57 +++-
 doc/src/sgml/runtime.sgml   |  77 -
 src/backend/libpq/Makefile  |   4 +
 src/backend/libpq/auth.c| 116 +++
 src/backend/libpq/be-gssapi-common.c|  74 +
 src/backend/libpq/be-gssapi-common.h|  26 ++
 src/backend/libpq/be-secure-gssapi.c| 374 ++
 src/backend/libpq/be-secure.c   |  16 +
 src/backend/libpq/hba.c |  51 ++-
 src/backend/postmaster/pgstat.c |   3 +
 src/backend/postmaster/postmaster.c |  69 -
 src/include/libpq/hba.h |   4 +-
 src/include/libpq/libpq-be.h|  13 +-
 src/include/libpq/libpq.h   |   3 +
 src/include/libpq/pqcomm.h  |   5 +-
 src/include/pgstat.h|   3 +-
 src/interfaces/libpq/Makefile   |   4 +
 src/interfaces/libpq/fe-auth.c  |  82 +
 src/interfaces/libpq/fe-connect.c   | 241 ++-
 src/interfaces/libpq/fe-gssapi-common.c | 130 
 src/interfaces/libpq/fe-gssapi-common.h |  23 ++
 src/interfaces/libpq/fe-secure-gssapi.c | 394 
 src/interfaces/libpq/fe-secure.c|  16 +-
 src/interfaces/libpq/libpq-fe.h |   3 +-
 src/interfaces/libpq/libpq-int.h|  30 +-
 src/tools/msvc/Mkvcbuild.pm |  10 +
 27 files changed, 1707 insertions(+), 196 deletions(-)
 create mode 100644 src/backend/libpq/be-gssapi-common.c
 create mode 100644 src/backend/libpq/be-gssapi-common.h
 create mode 100644 src/backend/libpq/be-secure-gssapi.c
 create mode 100644 src/interfaces/libpq/fe-gssapi-common.c
 create mode 100644 src/interfaces/libpq/fe-gssapi-common.h
 create mode 100644 src/interfaces/libpq/fe-secure-gssapi.c

diff --git a/doc/src/sgml/client-auth.sgml b/doc/src/sgml/client-auth.sgml
index c2114021c3..6f9f2b7560 100644
--- a/doc/src/sgml/client-auth.sgml
+++ b/doc/src/sgml/client-auth.sgml
@@ -108,6 +108,8 @@ hostnossl  database  user
 host   database  user  IP-address  IP-mask  auth-method  auth-options
 hostssldatabase  user  IP-address  IP-mask  auth-method  auth-options
 hostnossl  database  user  IP-address  IP-mask  auth-method  auth-options
+hostgssdatabase  user  IP-address  IP-mask  auth-method  auth-options
+hostnogss  database  user  IP-address  IP-mask  auth-method  auth-options
 
The meaning of the fields is as follows:
 
@@ -128,9 +130,10 @@ hostnossl  database  user
  
   

cfbot run pgindent?

2018-12-18 Thread Robbie Harwood
Hi friends,

First, I've found cfbot really useful as tool for improving code
correctness.  So thanks for that.

Second, since the project does have a defined style checker, do you
think it would be possible to run it as part of cfbot and report errors?

Thanks,
--Robbie


signature.asc
Description: PGP signature


Re: Fixing findDependentObjects()'s dependency on scan order (regressions in DROP diagnostic messages)

2018-12-18 Thread Peter Geoghegan
On Tue, Dec 18, 2018 at 10:26 AM Tom Lane  wrote:
> Yeah, I've been wondering about that as well.  The original intention
> for dependency traversal was that it'd work independently of the ordering
> of entries in pg_depend.  If it's not doing so, I'd call that a bug in
> dependency traversal rather than something the index code needs to be
> responsible for.

It's clearly not doing so. Is somebody else actually going to fix it,
though? I'm quite happy to stand aside to make way for a better
solution, but I don't consider myself particularly qualified to come
up with that better solution. The findDependentObjects() code is
pretty subtle.

> (Note that this statement doesn't disagree with our issues about needing
> to suppress dependency reports in the regression tests; that's because
> the order of reports about independent objects can legitimately depend on
> the index order.  But there shouldn't be any semantic differences.)

The advantage of my admittedly kludgy approach is that it will almost
completely eliminate any further need to suppress acceptable
regression test differences -- those non-semantic output differences
that we're already suppressing semi-regularly. In practice, adding a
trailing attribute to each of the two pg_depend indexes almost
entirely eliminates the need to play whack-a-mole. That has to be an
advantage for the kind of approach that I've suggested.

-- 
Peter Geoghegan



Re: ATTACH/DETACH PARTITION CONCURRENTLY

2018-12-18 Thread Robert Haas
On Mon, Dec 17, 2018 at 6:44 PM Michael Paquier  wrote:
> Agreed.  This patch has value, and somebody else could always take it
> from the point where you were.

OK.  I'll post what I have by the end of the week.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Fixing findDependentObjects()'s dependency on scan order (regressions in DROP diagnostic messages)

2018-12-18 Thread Peter Geoghegan
On Tue, Dec 18, 2018 at 10:07 AM Alvaro Herrera
 wrote:
> Is there any case of this that doesn't involve DEPENDENCY_INTERNAL_AUTO
> entries?  I wonder if I just haven't broken the algorithm when
> introducing that, and I worry that we're adding a complicated kludge to
> paper over that problem.  Maybe instead of the depcreate contortions we
> need to adjust the algorithm to deal with INTERNAL_AUTO objects in a
> different way.

Well, you also have cases like this:

--- a/contrib/earthdistance/expected/earthdistance.out
+++ b/contrib/earthdistance/expected/earthdistance.out
@@ -972,7 +972,7 @@ SELECT abs(cube_distance(ll_to_earth(-30,-90),
'(0)'::cube) / earth() - 1) <

 drop extension cube;  -- fail, earthdistance requires it
 ERROR:  cannot drop extension cube because other objects depend on it
-DETAIL:  extension earthdistance depends on extension cube
+DETAIL:  extension earthdistance depends on function cube_out(cube)

This is a further example of "wrong, not just annoying". Technically
this is a broader problem than DEPENDENCY_INTERNAL_AUTO, I think,
though perhaps not too much broader.
-- 
Peter Geoghegan



Re: Fixing findDependentObjects()'s dependency on scan order (regressions in DROP diagnostic messages)

2018-12-18 Thread Tom Lane
Alvaro Herrera  writes:
> On 2018-Nov-05, Peter Geoghegan wrote:
>> I've realized that my patch to make nbtree keys unique by treating
>> heap TID as a tie-breaker attribute must use ASC ordering, for reasons
>> that I won't go into here. Now that I'm not using DESC ordering, there
>> are changes to a small number of DROP...CASCADE messages that leave
>> users with something much less useful than what they'll see today --
>> see attached patch for full details. Some of these problematic cases
>> involve partitioning:

> Is there any case of this that doesn't involve DEPENDENCY_INTERNAL_AUTO
> entries?  I wonder if I just haven't broken the algorithm when
> introducing that, and I worry that we're adding a complicated kludge to
> paper over that problem.  Maybe instead of the depcreate contortions we
> need to adjust the algorithm to deal with INTERNAL_AUTO objects in a
> different way.

Yeah, I've been wondering about that as well.  The original intention
for dependency traversal was that it'd work independently of the ordering
of entries in pg_depend.  If it's not doing so, I'd call that a bug in
dependency traversal rather than something the index code needs to be
responsible for.

(Note that this statement doesn't disagree with our issues about needing
to suppress dependency reports in the regression tests; that's because
the order of reports about independent objects can legitimately depend on
the index order.  But there shouldn't be any semantic differences.)

regards, tom lane



Re: Fixing findDependentObjects()'s dependency on scan order (regressions in DROP diagnostic messages)

2018-12-18 Thread Alvaro Herrera
On 2018-Nov-05, Peter Geoghegan wrote:

> I've realized that my patch to make nbtree keys unique by treating
> heap TID as a tie-breaker attribute must use ASC ordering, for reasons
> that I won't go into here. Now that I'm not using DESC ordering, there
> are changes to a small number of DROP...CASCADE messages that leave
> users with something much less useful than what they'll see today --
> see attached patch for full details. Some of these problematic cases
> involve partitioning:

Is there any case of this that doesn't involve DEPENDENCY_INTERNAL_AUTO
entries?  I wonder if I just haven't broken the algorithm when
introducing that, and I worry that we're adding a complicated kludge to
paper over that problem.  Maybe instead of the depcreate contortions we
need to adjust the algorithm to deal with INTERNAL_AUTO objects in a
different way.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] CLUSTER command progress monitor

2018-12-18 Thread Alvaro Herrera
Hello Yamada-san,

On 2018-Dec-03, Tatsuro Yamada wrote:

> Thank you for managing the CF and Sorry for the late reply.
> I'll rebase it for the next CF and also I'll clear my head because the patch
> needs design change to address the feedbacks, I guess. Therefore, the status
> is reconsidering the design of the patch. :)

Do you have a new version of this patch?  If not, do you think you'll
have something in time for the upcoming commitfest?

Thanks

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Should new partitions inherit their tablespace from their parent?

2018-12-18 Thread Tom Lane
Michael Paquier  writes:
> On Mon, Dec 17, 2018 at 10:13:42PM -0300, Alvaro Herrera wrote:
>> I think we need to accept the new output.  It is saying that previously
>> we would not invoke the hook on SET TABLESPACE for a partitioned table,
>> which makes sense, and now we do invoke it, which also makes sense.

> Indeed, that looks right.

Agreed, done.

regards, tom lane



Re: dropdb --force

2018-12-18 Thread Tom Lane
Marti Raudsepp  writes:
> I think Filip's approach of setting pg_database.datallowconn='false'
> is pretty clever to avoid the synchronization problem.

Some bull-in-a-china-shop has recently added logic that allows ignoring
datallowconn and connecting anyway, so I'm not sure that that'd provide
a production-quality solution.  On the other hand, maybe we could revert
BGWORKER_BYPASS_ALLOWCONN.

regards, tom lane



Re: psql exit status with multiple -c or -f

2018-12-18 Thread Justin Pryzby
On Tue, Dec 18, 2018 at 05:13:40PM +0900, Kyotaro HORIGUCHI wrote:
> $  psql postgres -v ON_ERROR_STOP=0 -f ~/work/y.txt ; echo $?
> $  psql postgres -v ON_ERROR_STOP=0 < ~/work/y.txt ; echo $?

> c) psql postgres -v ON_ERROR_STOP=0 -c foo -c 'select 1'; echo $?
> d) psql postgres -v ON_ERROR_STOP=0 -c 'select 1' -c foo; echo $?
> 
> (c) returns 0 and (d) returns 1, but both return 1 when
> ON_ERROR_STOP=1.

> The attached second patch lets (c) and (d) behave the same as (a)
> and (b).

I don't think the behavior in the single command case should be changed:

|[pryzbyj@database postgresql]$ ./src/bin/psql/psql ts -c FOO 2>/dev/null ; echo $?
|1
|[pryzbyj@database postgresql]$ patch -p1 
/dev/null
|[pryzbyj@database postgresql]$ ./src/bin/psql/psql ts -c FOO 2>/dev/null ; echo $?
|0

Also, unpatched v11 psql returns the status of the last command:
|[pryzbyj@database postgresql]$ psql ts -xtc 'SELECT 1' -c FOO 2>/dev/null ; echo $?
|?column? | 1
|
|1

patched psql returns 0:
|[pryzbyj@database postgresql]$ ./src/bin/psql/psql ts -xtc 'SELECT 1' -c FOO 2>/dev/null ; echo $?
|?column? | 1
|
|0

I'd prefer to see the exit status of the -f scripts (your cases a and b)
changed to 3 if the last command failed.  I realized that's not what the
documentation says so possibly not backpatchable.
|3 if an error occurred in a script and the variable ON_ERROR_STOP was set.

Think of:

$ cat /tmp/sql 
begin;
foo;
select 1;

$ psql ts -f /tmp/sql ; echo ret=$?
BEGIN
psql:/tmp/sql:2: ERROR:  syntax error at or near "foo"
LINE 1: foo;
^
psql:/tmp/sql:3: ERROR:  current transaction is aborted, commands ignored until end of transaction block
ret=0

I think ON_ERROR_STOP should control whether the script stops, but the exit
status should reflect any error in the last command.  The shell does that
even without set -e.

Justin



Re: dropdb --force

2018-12-18 Thread David Fetter
On Tue, Dec 18, 2018 at 01:25:32PM +0100, Filip Rembiałkowski wrote:
> Hi,
> 
> I propose a simple patch (works-for-me), which adds --force (-f)
> option to dropdb utility.

Nice!

I did something like this in user space back in 2010.

https://people.planetpostgresql.org/dfetter/index.php?/archives/63-Kick-em-out,-and-keep-em-out!.html

so there appears to be interest in such a feature.

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)

2018-12-18 Thread Tom Lane
John Naylor  writes:
> On 12/17/18, Tom Lane  wrote:
>> Also, wouldn't we also adopt this technology for its unreserved keywords,
>> too?

> We wouldn't be forced to, but there might be other reasons to do so.
> Were you thinking of code consistency (within pl_scanner.c or
> globally)? Or something else?

> If we did adopt this setup for plpgsql unreserved keywords,
> ecpg/preproc/ecpg_keywords.c and ecpg/preproc/c_keywords.c would be
> left using the current ScanKeyword struct for search. Using offset
> search for all 5 types of keywords would be globally consistent, but
> it also means additional headers, generated headers, and makefile
> rules.

I'd be kind of inclined to convert all uses of ScanKeyword to the new way,
if only for consistency's sake.  On the other hand, I'm not the one
volunteering to do the work.

regards, tom lane



Re: dropdb --force

2018-12-18 Thread Marti Raudsepp
Hi

> út 18. 12. 2018 v 16:11 odesílatel Filip Rembiałkowski 
>  napsal:
>> Please share opinions if this makes sense at all, and has any chance
>> going upstream.

Clearly since Pavel has another implementation of the same concept,
there is some interest in this feature. :)

On Tue, Dec 18, 2018 at 5:20 PM Pavel Stehule  wrote:
> Still one my customer use a patch that implement FORCE on SQL level. It is 
> necessary under higher load when is not easy to synchronize clients.

I think Filip's approach of setting pg_database.datallowconn='false'
is pretty clever to avoid the synchronization problem. But it's also a
good idea to expose this functionality via DROP DATABASE in SQL, like
Pavel's patch, not just the 'dropdb' binary.

If this is to be accepted into PostgreSQL core, I think the two
approaches should be combined on the server side.
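
For reference, the client-side sequence in Filip's dropdb patch boils down to
roughly the following SQL (a sketch only; 'mydb' is a placeholder, and a
server-side variant would fold these steps into DROP DATABASE itself):

UPDATE pg_catalog.pg_database SET datallowconn = false
 WHERE datname = 'mydb';

SELECT pg_catalog.pg_terminate_backend(pid)
  FROM pg_catalog.pg_stat_activity
 WHERE datname = 'mydb'
   AND pid <> pg_catalog.pg_backend_pid();

-- wait briefly for the terminated sessions to exit, then:
DROP DATABASE mydb;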

Regards,
Marti Raudsepp



Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)

2018-12-18 Thread John Naylor
On 12/17/18, Tom Lane  wrote:
> John Naylor  writes:
>> Since PL/pgSQL uses the core scanner, we'd need to use offsets in its
>> reserved_keywords[], too. Those don't change much, so we can probably
>> get away with hard-coding the offsets and the giant string in that
>> case. (If that's not acceptable, we could separate that out to
>> pl_reserved_kwlist.h and reuse the above tooling to generate
>> pl_reserved_kwlist_{offset,string}.h, but that's more complex.)
>
> plpgsql isn't as stable as all that: people propose new syntax for it
> all the time.  I do not think a hand-maintained array would be pleasant
> at all.

Okay.

> Also, wouldn't we also adopt this technology for its unreserved keywords,
> too?

We wouldn't be forced to, but there might be other reasons to do so.
Were you thinking of code consistency (within pl_scanner.c or
globally)? Or something else?

If we did adopt this setup for plpgsql unreserved keywords,
ecpg/preproc/ecpg_keywords.c and ecpg/preproc/c_keywords.c would be
left using the current ScanKeyword struct for search. Using offset
search for all 5 types of keywords would be globally consistent, but
it also means additional headers, generated headers, and makefile
rules.
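
To illustrate the layout in question (contents and names made up here; the
real arrays would be generated from the *_kwlist.h headers), it is one
concatenated NUL-separated string plus an offsets array, instead of a
ScanKeyword entry holding a string pointer per keyword:

#include <stdint.h>

static const char kw_strings[] =
    "abort\0"
    "absolute\0"
    "access\0";

static const uint16_t kw_offsets[] = {0, 6, 15};

#define KW_COUNT 3

/* the binary search compares its probe against kw_strings + kw_offsets[mid] */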

-John Naylor



Re: Some memory allocations in gin fastupdate code are a bit brain dead

2018-12-18 Thread Tom Lane
David Rowley  writes:
> I recently stumbled upon the following code in ginfast.c:
> while (collector->ntuples + nentries > collector->lentuples)
> {
> collector->lentuples *= 2;
> collector->tuples = (IndexTuple *) repalloc(collector->tuples,
>   sizeof(IndexTuple) * collector->lentuples);
> }

I agree that's pretty stupid, but I wonder whether we should also try
harder in the initial-allocation path.  Right now, if the initial
pass through this code sees say 3 tuples to insert, it will then work
with 3 * 2^n entries in subsequent enlargements, which is likely to
be pretty inefficient.  Should we try to force the initial allocation
to be a power of 2?

Also, do we need to worry about integer overflow?
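
For concreteness, one shape such a fix could take (a sketch only, not a
proposed patch; the power-of-two rounding and the overflow guard here are
illustrative, and INT_MAX comes from <limits.h>):

if (collector->ntuples + nentries > collector->lentuples)
{
    int     needed = collector->ntuples + nentries;
    int     newlen = Max(collector->lentuples, 16);

    while (newlen < needed)
    {
        if (newlen > INT_MAX / 2)       /* avoid overflowing the doubling */
        {
            newlen = needed;
            break;
        }
        newlen *= 2;
    }

    collector->tuples = (collector->tuples == NULL)
        ? (IndexTuple *) palloc(sizeof(IndexTuple) * newlen)
        : (IndexTuple *) repalloc(collector->tuples,
                                  sizeof(IndexTuple) * newlen);
    collector->lentuples = newlen;
}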

regards, tom lane



Re: dropdb --force

2018-12-18 Thread Pavel Stehule
Hi

út 18. 12. 2018 v 16:11 odesílatel Filip Rembiałkowski <
filip.rembialkow...@gmail.com> napsal:

> Hi,
>
> I propose a simple patch (works-for-me), which adds --force (-f)
> option to dropdb utility.
>
> Pros:  This seems to be a desired option for many sysadmins, as this
> thread proves:
> https://dba.stackexchange.com/questions/11893/force-drop-db-while-others-may-be-connected
> Cons: another possible foot-gun for the unwary.
>
> Obviously this patch needs some more work (see TODO note inside).
>
> Please share opinions if this makes sense at all, and has any chance
> going upstream.
>
> The regression tests is simplistic, please help with an example of
> multi-session test so I can follow.
>

Still, one of my customers uses a patch that implements FORCE at the SQL level. It is
necessary under higher load, when it is not easy to synchronize clients.

Regards

Pavel

>
>
> Thanks,
> Filip
>
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 68c43620ef..008cac25cc 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -496,7 +496,7 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 		 * to wait 5 sec. We try to raise warning after 1 minute and and raise
 		 * a error after 5 minutes.
 		 */
-		if (!CountOtherDBBackends(src_dboid, &notherbackends, &npreparedxacts, true))
+		if (!CountOtherDBBackends(src_dboid, &notherbackends, &npreparedxacts, true, false))
 			break;
 
 		if (loops++ % 12 == 0)
@@ -798,7 +798,7 @@ createdb_failure_callback(int code, Datum arg)
  * DROP DATABASE
  */
 void
-dropdb(const char *dbname, bool missing_ok)
+dropdb(const char *dbname, bool missing_ok, bool force)
 {
 	Oid			db_id;
 	bool		db_istemplate;
@@ -806,6 +806,7 @@ dropdb(const char *dbname, bool missing_ok)
 	HeapTuple	tup;
 	int			notherbackends;
 	int			npreparedxacts;
+	int			loops = 0;
 	int			nslots,
 nslots_active;
 	int			nsubscriptions;
@@ -889,12 +890,33 @@ dropdb(const char *dbname, bool missing_ok)
 	 *
 	 * As in CREATE DATABASE, check this after other error conditions.
 	 */
-	if (CountOtherDBBackends(db_id, &notherbackends, &npreparedxacts, false))
-		ereport(ERROR,
-(errcode(ERRCODE_OBJECT_IN_USE),
- errmsg("database \"%s\" is being accessed by other users",
-		dbname),
- errdetail_busy_db(notherbackends, npreparedxacts)));
+	for (;;)
+	{
+		/*
+		 * CountOtherDBBackends check usage of database by other backends and try
+		 * to wait 5 sec. We try to raise warning after 1 minute and and raise
+		 * a error after 5 minutes.
+		 */
+		if (!CountOtherDBBackends(db_id, &notherbackends, &npreparedxacts, true, force))
+			break;
+
+		if (force && loops++ % 12 == 0)
+			ereport(WARNING,
+	(errcode(ERRCODE_OBJECT_IN_USE),
+errmsg("source database \"%s\" is being accessed by other users",
+	   dbname),
+	 errdetail_busy_db(notherbackends, npreparedxacts)));
+
+		/* without "force" flag raise exception immediately, or after 5 minutes */
+		if (!force || loops % 60 == 0)
+			ereport(ERROR,
+	(errcode(ERRCODE_OBJECT_IN_USE),
+errmsg("source database \"%s\" is being accessed by other users",
+	   dbname),
+	 errdetail_busy_db(notherbackends, npreparedxacts)));
+
+		CHECK_FOR_INTERRUPTS();
+	}
 
 	/*
 	 * Check if there are subscriptions defined in the target database.
@@ -1053,7 +1075,7 @@ RenameDatabase(const char *oldname, const char *newname)
 	 *
 	 * As in CREATE DATABASE, check this after other error conditions.
 	 */
-	if (CountOtherDBBackends(db_id, &notherbackends, &npreparedxacts, false))
+	if (CountOtherDBBackends(db_id, &notherbackends, &npreparedxacts, false, false))
 		ereport(ERROR,
 (errcode(ERRCODE_OBJECT_IN_USE),
  errmsg("database \"%s\" is being accessed by other users",
@@ -1183,7 +1205,7 @@ movedb(const char *dbname, const char *tblspcname)
 	 *
 	 * As in CREATE DATABASE, check this after other error conditions.
 	 */
-	if (CountOtherDBBackends(db_id, &notherbackends, &npreparedxacts, false))
+	if (CountOtherDBBackends(db_id, &notherbackends, &npreparedxacts, false, false))
 		ereport(ERROR,
 (errcode(ERRCODE_OBJECT_IN_USE),
  errmsg("database \"%s\" is being accessed by other users",
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 8c8384cd6b..e147e594ab 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -3755,6 +3755,7 @@ _copyDropdbStmt(const DropdbStmt *from)
 
 	COPY_STRING_FIELD(dbname);
 	COPY_SCALAR_FIELD(missing_ok);
+	COPY_SCALAR_FIELD(force);
 
 	return newnode;
 }
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index 68d38fcba1..ba7bee4bb0 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -1655,6 +1655,7 @@ _equalDropdbStmt(const DropdbStmt *a, const DropdbStmt *b)
 {
 	COMPARE_STRING_FIELD(dbname);
 	COMPARE_SCALAR_FIELD(missing_ok);
+	COMPARE_SCALAR_FIELD(force);
 
 	return true;
 }
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 7d0de99baf..358e878c70 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -9817,12 +9817,30 @@ DropdbStmt: DROP DATABASE database_name
 	n->dbname = $3;
 	n->missing_ok = FALSE;

RE: Shared Memory: How to use SYSV rather than MMAP ?

2018-12-18 Thread REIX, Tony
Hi Thomas,


Here is the patch we are now using for enabling SysV shm on AIX, which greatly
improves performance there.

It is a compile-time change.

It seems to me that you'd like this to become a shared_memory_type GUC.
Correct? However, I do not know how to do that.


Even as-is, this patch would greatly improve the performance of PostgreSQL 
v11.1 in the field on AIX machines. So, we'd like this change to be available 
for AIX asap.


What are the next steps to get this patch accepted, or what are your
suggestions for improving it?


Thanks/Regards


Cordialement,

Tony Reix

tony.r...@atos.net

ATOS / Bull SAS
ATOS Expert
IBM Coop Architect & Technical Leader
Office : +33 (0) 4 76 29 72 67
1 rue de Provence - 38432 Échirolles - France
www.atos.net

From: REIX, Tony
Sent: Monday, 26 November 2018 18:00:15
To: Thomas Munro
Cc: Andres Freund; Robert Haas; Pg Hackers; EMPEREUR-MOT, SYLVIE; BERGAMINI, DAMIEN
Subject: RE: Shared Memory: How to use SYSV rather than MMAP ?


Hi Thomas,


You said:


I think we should respect the huge_pages GUC, as we do on Linux and
Windows (since there are downsides to using large pages, maybe not
everyone would want that).  It could even be useful to allow different
page sizes to be requested by GUC (I see that DB2 has an option to use
16GB pages -- yikes).  It also seems like a good idea to have a
shared_memory_type GUC as Andres proposed (see his link), instead of
using a compile time option.  I guess it was made a compile time
option because nobody could imagine wanting to go back to SysV shm!
(I'm still kinda surprised that MAP_ANONYMOUS memory can't be coaxed
into large pages by environment variables or loader controls, since
apparently other things like data segments etc apparently can, though
I can't find any text that says that's the case and I have no AIX
system).


I guess that you are talking about CPP & C variables:
  #ifndef MAP_HUGETLB
  HUGE_PAGES_ON
  HUGE_PAGES_TRY)
in addition to : huge_pages =  in postgresql.conf file.

For now, these variables for Huge Pages are used only for MMAP.
About SysV shared memory, as far as I know, shmget() options for AIX and Linux 
are different.
Moreover, AIX also provides Large Pages (16MB).

About Andres' proposal, I've read his link. However, the patch he proposed:
0001-Add-shared_memory_type-GUC.patch
is no longer available (Attachment not found).

I confirm that I got the SysV Shared Memory by means of a "compile time option".

About "still kinda surprised that MAP_ANONYMOUS memory can't be coaxed
into large pages by environment variables or loader controls" I confirm that,
on AIX, only 4K pages are available for mmap().

I do agree that options in the postgresql.conf file would be the best solution,
since the code for SysV shared memory and MMAP shared memory seems to always be
present.
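
Something along these lines in postgresql.conf is what I would hope for (a
sketch only; the setting name and values here are assumptions, not an existing
GUC):

shared_memory_type = sysv       # or 'mmap'; hypothetical setting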

Regards,

Tony
--- ./src/backend/port/sysv_shmem.c.ORIGIN	2018-11-23 11:05:31 +0100
+++ ./src/backend/port/sysv_shmem.c	2018-11-23 11:16:04 +0100
@@ -63,9 +63,14 @@
  * developer use, this shouldn't be a big problem.  Because of this, we do
  * not worry about supporting anonymous shmem in the EXEC_BACKEND cases below.
  */
+#if !defined(_AIX)
 #ifndef EXEC_BACKEND
 #define USE_ANONYMOUS_SHMEM
 #endif
+// On AIX, 64K pages can be used only with SysV shared memory.
+// Not defining USE_ANONYMOUS_SHMEM on AIX leads to SysV shared memory.
+#endif
+
 
 
 typedef key_t IpcMemoryKey;		/* shared memory key passed to shmget(2) */
@@ -125,7 +130,13 @@
 	}
 #endif
 
-	shmid = shmget(memKey, size, IPC_CREAT | IPC_EXCL | IPCProtection);
+	shmid = shmget(memKey, size, IPC_CREAT | IPC_EXCL | IPCProtection
+#if !defined(_AIX)
+		);
+#else
+	// On AIX, SHM_LGPAGE & SHM_PIN are required in order to be able to use Large Pages
+	  | SHM_LGPAGE | SHM_PIN | S_IRUSR | S_IWUSR);
+#endif
 
 	if (shmid < 0)
 	{
@@ -155,7 +166,13 @@
 		 */
 		if (shmget_errno == EINVAL)
 		{
-			shmid = shmget(memKey, 0, IPC_CREAT | IPC_EXCL | IPCProtection);
+			shmid = shmget(memKey, 0, IPC_CREAT | IPC_EXCL | IPCProtection
+#if !defined(_AIX)
+			);
+#else
+	// On AIX, SHM_LGPAGE & SHM_PIN are required in order to be able to use Large Pages
+			| SHM_LGPAGE | SHM_PIN | S_IRUSR | S_IWUSR);
+#endif
 
 			if (shmid < 0)
 			{
--- ./src/include/storage/dsm_impl.h.ORIGIN	2018-11-23 11:33:45 +0100
+++ ./src/include/storage/dsm_impl.h	2018-11-23 11:34:40 +0100
@@ -30,7 +30,12 @@
 #else
 #ifdef HAVE_SHM_OPEN
 #define USE_DSM_POSIX
+#if !defined(_AIX)
 #define DEFAULT_DYNAMIC_SHARED_MEMORY_TYPE		DSM_IMPL_POSIX
+#else
+// On AIX, 64K pages can be used only with SysV shared memory
+#define DEFAULT_DYNAMIC_SHARED_MEMORY_TYPE		DSM_IMPL_SYSV
+#endif // AIX
 #endif
 #define USE_DSM_SYSV
 

dropdb --force

2018-12-18 Thread Filip Rembiałkowski
Hi,

I propose a simple patch (works-for-me), which adds --force (-f)
option to dropdb utility.

Pros:  This seems to be a desired option for many sysadmins, as this
thread proves: 
https://dba.stackexchange.com/questions/11893/force-drop-db-while-others-may-be-connected
Cons: another possible foot-gun for the unwary.

Obviously this patch needs some more work (see TODO note inside).

Please share opinions on whether this makes sense at all, and whether it has
any chance of going upstream.

The regression test is simplistic; please help with an example of a
multi-session test that I can follow.


Thanks,
Filip
diff --git a/doc/src/sgml/ref/drop_database.sgml b/doc/src/sgml/ref/drop_database.sgml
index 3ac06c984a..797763c28f 100644
--- a/doc/src/sgml/ref/drop_database.sgml
+++ b/doc/src/sgml/ref/drop_database.sgml
@@ -79,7 +79,8 @@ DROP DATABASE [ IF EXISTS ] name
This command cannot be executed while connected to the target
database. Thus, it might be more convenient to use the program
 instead,
-   which is a wrapper around this command.
+   which is a wrapper around this command, and is also able
+   to terminate existing connections.
   
  
 
diff --git a/doc/src/sgml/ref/dropdb.sgml b/doc/src/sgml/ref/dropdb.sgml
index 38f38f01ce..29fb098e19 100644
--- a/doc/src/sgml/ref/dropdb.sgml
+++ b/doc/src/sgml/ref/dropdb.sgml
@@ -42,8 +42,9 @@ PostgreSQL documentation
   
dropdb is a wrapper around the
SQL command .
-   There is no effective difference between dropping databases via
-   this utility and via other methods for accessing the server.
+   Other than the --force option, there is no effective
+   difference between dropping databases via this utility and via other
+   methods for accessing the server.
   
 
  
@@ -86,6 +87,22 @@ PostgreSQL documentation
   
  
 
+ 
+  -f
+  --force
+  
+   
+   Force termination of connected backends before removing the database.
+   
+	   
+	   This will update the datallowconn attribute to
+	   disallow incoming connections to target database, send
+	   SIGTERM to target backends, and sleep some time
+	   to allow the sessions to terminate.
+	   
+  
+ 
+
  
-V
--version
diff --git a/src/bin/scripts/dropdb.c b/src/bin/scripts/dropdb.c
index ba0038891d..4bfd8a9397 100644
--- a/src/bin/scripts/dropdb.c
+++ b/src/bin/scripts/dropdb.c
@@ -15,6 +15,10 @@
 #include "fe_utils/string_utils.h"
 
 
+/* Time to sleep after issuing SIGTERM to backends */
+#define TERMINATE_SLEEP_TIME 1
+
+
 static void help(const char *progname);
 
 
@@ -33,6 +37,7 @@ main(int argc, char *argv[])
 		{"interactive", no_argument, NULL, 'i'},
 		{"if-exists", no_argument, _exists, 1},
 		{"maintenance-db", required_argument, NULL, 2},
+		{"force", no_argument, NULL, 'f'},
 		{NULL, 0, NULL, 0}
 	};
 
@@ -48,6 +53,7 @@ main(int argc, char *argv[])
 	enum trivalue prompt_password = TRI_DEFAULT;
 	bool		echo = false;
 	bool		interactive = false;
+	bool		force = false;
 
 	PQExpBufferData sql;
 
@@ -59,7 +65,7 @@ main(int argc, char *argv[])
 
 	handle_help_version_opts(argc, argv, "dropdb", help);
 
-	while ((c = getopt_long(argc, argv, "h:p:U:wWei", long_options, &optindex)) != -1)
+	while ((c = getopt_long(argc, argv, "h:p:U:wWeif", long_options, &optindex)) != -1)
 	{
 		switch (c)
 		{
@@ -84,6 +90,9 @@ main(int argc, char *argv[])
 			case 'i':
 interactive = true;
 break;
+			case 'f':
+force = true;
+break;
 			case 0:
 /* this covers the long options */
 break;
@@ -121,9 +130,6 @@ main(int argc, char *argv[])
 
 	initPQExpBuffer(&sql);
 
-	appendPQExpBuffer(&sql, "DROP DATABASE %s%s;",
-	  (if_exists ? "IF EXISTS " : ""), fmtId(dbname));
-
 	/* Avoid trying to drop postgres db while we are connected to it. */
 	if (maintenance_db == NULL && strcmp(dbname, "postgres") == 0)
 		maintenance_db = "template1";
@@ -132,6 +138,64 @@ main(int argc, char *argv[])
 	  host, port, username, prompt_password,
 	  progname, echo);
 
+	if (force)
+	{
+		/* TODO: revert this UPDATE in case removal fails for any reason */
+		appendPQExpBufferStr(&sql,
+			 "UPDATE pg_catalog.pg_database\n"
+			 " SET datallowconn = 'false'\n"
+			 " WHERE datname OPERATOR(pg_catalog.=) ");
+		appendStringLiteralConn(&sql, dbname, conn);
+		appendPQExpBufferStr(&sql, ";\n");
+		if (echo)
+			printf("%s\n", sql.data);
+		result = PQexec(conn, sql.data);
+		if (PQresultStatus(result) != PGRES_COMMAND_OK)
+		{
+			fprintf(stderr, _("%s: database removal failed: %s"),
+	progname, PQerrorMessage(conn));
+			PQfinish(conn);
+			exit(1);
+		}
+
+		PQclear(result);
+
+		resetPQExpBuffer(&sql);
+
+		appendPQExpBufferStr(&sql,
+			"SELECT pg_catalog.pg_terminate_backend(pid)\n"
+			" FROM pg_catalog.pg_stat_activity\n"
+			" WHERE pid OPERATOR(pg_catalog.<>) pg_catalog.pg_backend_pid()\n"
+			" AND datname OPERATOR(pg_catalog.=) ");
+		appendStringLiteralConn(&sql, dbname, conn);
+		appendPQExpBufferStr(&sql, ";\n");
+
+		if (echo)
+			printf("%s\n", 

Re: explain plans with information about (modified) gucs

2018-12-18 Thread Tomas Vondra



On 12/18/18 12:43 AM, Andres Freund wrote:
> Hi,
> 
> On 2018-12-18 00:38:16 +0100, Tomas Vondra wrote:
>> On 12/17/18 11:16 PM, Tom Lane wrote:
>>> Tomas Vondra  writes:
 Yeah, I've been thinking about that too. Currently it gets filtered out
 because it's in the CLIENT_CONN_STATEMENT group, but the code only
 includes QUERY_TUNING_*.
 But we don't want to include everything from CLIENT_CONN_STATEMENT,
 because that would include various kinds of timeouts, vacuuming
 parameters, etc.
 And the same issue applies to work_mem, which is in RESOURCES_MEM. And
 of course, there's a lot of unrelated stuff in RESOURCES_MEM.
 So I guess we'll need to enable QUERY_TUNING_* and then selectively a
 couple of individual options from other groups.
>>>
>>> This is putting way too much policy into the mechanism, if you ask me.
>>>
>>
>> Doesn't that depend on how it'd be implemented? I have not envisioned to
>> make these decisions in explain.c, but rather to keep them in guc.c
>> somehow. Say in a function like this:
>>
>> bool guc_affects_query_planning(config_generic *conf);
>>
>> which would be a fairly simple check outlined before (QUERY_TUNING_*
>> plus a couple of individual GUCs). Other use cases might provide similar
>> filters.
> 
> If we were to do that, I'd suggest implementing a GUC_* flag specified
> in the definition of the GUC, rather than a separate function containing
> all the knowledge (but such a function could obviously still be used to
> return such a fact).
> 

Seems reasonable.

> 
>> An alternative would be to somehow track this in the GUC definitions
>> directly (similarly to how we track the group), but that seems rather
>> inflexible and I'm not sure how would that handle ad-hoc use cases.
> 
> Not sure what problem you see here?
> 

My main concern with that was how many flags could we need, if we use
this as the way to implement this and similar use cases. There are 32
bits available, and 24 of them are already used for GUC_* flags. So if
we use this as an "official" way to support similar use cases, we could
run out of space pretty fast.

The callback would also allow ad hoc filtering, which is not very
practical from extensions (e.g. you can't define a new flag, because it
might conflict with something defined in another extension).

But I think that's nonsense - so far we have not seen any such use case,
so it's pretty pointless to design for it. If that changes and some new
use case is proposed in the future, we can rethink this based on it.

I'll go with a new flag, marking all GUCs related to query planning, and
post a new patch soon.
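
Roughly, the flag-based filtering I have in mind would look like this (a
sketch only; the GUC_EXPLAIN name, its bit value, and the function name are
placeholders, not the actual patch):

#include "postgres.h"

#include "commands/explain.h"
#include "utils/guc.h"
#include "utils/guc_tables.h"

#define GUC_EXPLAIN 0x100000    /* assumed spare bit among the GUC_* flags */

static void
show_planner_gucs(ExplainState *es)
{
    struct config_generic **gucs = get_guc_variables();
    int         nguc = GetNumConfigOptions();
    int         i;

    for (i = 0; i < nguc; i++)
    {
        struct config_generic *conf = gucs[i];

        if ((conf->flags & GUC_EXPLAIN) == 0)
            continue;           /* not relevant to query planning */
        if (conf->source == PGC_S_DEFAULT)
            continue;           /* show only modified settings */

        ExplainPropertyText(conf->name,
                            GetConfigOptionByName(conf->name, NULL, false),
                            es);
    }
}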

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2018-12-18 Thread Alexey Kondratov

On 18.12.2018 1:28, Tomas Vondra wrote:

4) There was a problem with marking top-level transaction as having
catalog changes if one of its subtransactions has. It was causing a
problem with DDL statements just after subtransaction start (savepoint),
so data from new columns is not replicated.

5) Similar issue with schema send. You send schema only once per each
sub/transaction (IIRC), while we have to update schema on each catalog
change: invalidation execution, snapshot rebuild, adding new tuple cids.
So I ended up with adding is_schema_send flag to ReorderBufferTXN, since
it is easy to set it inside RB and read in the output plugin. Probably,
we have to choose a better place for this flag.


Hmm. Can you share an example how to trigger these issues?


Test cases inside 014_stream_tough_ddl.pl and old ones (with 
streaming=true option added) should reproduce all these issues. In 
general, it happens in a txn like:


INSERT
SAVEPOINT
ALTER TABLE ... ADD COLUMN
INSERT

then the second insert may see an old version of the catalog.


Interesting. Any idea where does the extra overhead in this particular
case come from? It's hard to deduce that from the single flame graph,
when I don't have anything to compare it with (i.e. the flame graph for
the "normal" case).


I guess that the bottleneck is in disk operations. You can check the 
logical_repl_worker_new_perf.svg flame graph: disk reads (~9%) and 
writes (~26%) take around 35% of CPU time in total. For comparison, 
please see the attached flame graph for the following transaction:


INSERT INTO large_text
SELECT (SELECT string_agg('x', ',')
FROM generate_series(1, 2000)) FROM generate_series(1, 100);

Execution Time: 44519.816 ms
Time: 98333,642 ms (01:38,334)

where disk IO is only ~7-8% in total. So we get very roughly the same 
~x4-5 performance drop here. JFYI, I am using a machine with SSD for tests.


Therefore, it would probably help to write changes on the receiver in bigger 
chunks, rather than each change separately.



So I'm not particularly worried, but I'll look into that. I'd be much
more worried if there was measurable overhead in cases when there's no
streaming happening (either because it's disabled or the memory limit
was not hit).


What I have also just found is that if a table row is large enough to 
be TOASTed, e.g.:


INSERT INTO large_text
SELECT (SELECT string_agg('x', ',')
FROM generate_series(1, 100)) FROM generate_series(1, 1000);

then the logical_work_mem limit is not hit and we neither stream nor spill 
this transaction to disk, while it is still large. In contrast, the 
transaction above (with 100 smaller rows), being comparable in size, 
is streamed. I am not sure that it is easy to add proper accounting of 
TOAST-able columns, but it would be worth it.


--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company



Some memory allocations in gin fastupdate code are a bit brain dead

2018-12-18 Thread David Rowley
I recently stumbled upon the following code in ginfast.c:

while (collector->ntuples + nentries > collector->lentuples)
{
collector->lentuples *= 2;
collector->tuples = (IndexTuple *) repalloc(collector->tuples,
  sizeof(IndexTuple) * collector->lentuples);
}

it does not seem very smart to perform the repalloc() inside the loop
here as there could be many loops before we double the lentuples so
that it's large enough to allow storage of all the required values.

The attached patch changes things around so that the repalloc() is
done outside of the loop. i.e. done only once, after we've determined
the correct size to reallocate it to. I've also added an else
condition so that we only bother checking this case when the tuples[]
array is not already allocated.

I tested with the following:

create table t1 (a int[], b int[]);
create index on t1 using gin (a,b) with (fastupdate = on);
truncate t1;
insert into t1
  select '{1}'::int[], ('{' || string_agg(y::text, ',') || '}')::int[]
  from generate_Series(1,100) x, generate_Series(1,10) y
  group by x order by x desc;

In the above case with an unpatched master, I measured the repalloc()
to only consume 0.6% of the total runtime of the INSERT, so this does
not really improve performance by much, but I thought it was worth
fixing nevertheless.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


fix_gin_fast_insert.patch
Description: Binary data


Re: psql exit status with multiple -c or -f

2018-12-18 Thread Pavel Stehule
út 18. 12. 2018 v 9:14 odesílatel Kyotaro HORIGUCHI <
horiguchi.kyot...@lab.ntt.co.jp> napsal:

> Hello.
>
> At Mon, 17 Dec 2018 11:58:41 -0600, Justin Pryzby 
> wrote in <20181217175841.gs13...@telsasoft.com>
> > Our deployment script failed to notice dozens of commands failed in a
> > transaction block and I only noticed due to keeping full logs and
> monitoring
> > for error_severity>'LOG'.  I would have thought that exit status would be
> > nonzero had an error occured in an earlier script.
> >
> > The docs since PG9.6 say:
> > https://www.postgresql.org/docs/9.6/app-psql.html
> > |Exit Status
> > |
> > |psql returns 0 to the shell if it finished normally, 1 if a fatal error
> of its
> > |own occurs (e.g. out of memory, file not found), 2 if the connection to
> the
> > |server went bad and the session was not interactive, and 3 if an error
> occurred
> > |in a script and the variable ON_ERROR_STOP was set.
> >
> > d5563d7df94488bf0ab52ac0678e8a07e5b8297e
> > psql: Support multiple -c and -f options, and allow mixing them.
> >
> > If there's an error, it returns 1 (although that's not "a fatal error of
> its
> > own").
> >
> > |[pryzbyj@database ~]$ psql ts -c foo 2>/dev/null ; echo $?
> > |1
> >
> > But the error is "lost" if another script or -c runs without failing:
> >
> > |[pryzbyj@database ~]$ psql ts -txqc foo -c SELECT 2>/dev/null ; echo $?
> > |0
>
> As written in the documentation[1]:
>
> > Because of this behavior, putting more than one SQL command in a
> > single -c string often has unexpected results. It's better to use
> > repeated -c commands or feed multiple commands to psql's standard
> > input,
>
> This seems saying that -c is equvalent to each line fed from
> stdin or a line in a script.
>
> The attached 0001 patch makes it clear in app-psql.html.
>
>
> > Note, this works as expected:
> >
> > |[pryzbyj@database ~]$ psql ts -v ON_ERROR_STOP=1 -txqc foo -f
> /dev/null 2>/dev/null ; echo $?
> > |1
>
> The status should be 3, according to the documentation. Addition
> to that, the current behavior looks inconsistent (if) considering
> the equivalency between -c and a script line.
>
> y.txt:
> a)
>   foo;
>   select 1;
> b)
>   select 1;
>   foo;
>
> $  psql postgres -v ON_ERROR_STOP=0 -f ~/work/y.txt ; echo $?
> $  psql postgres -v ON_ERROR_STOP=0 < ~/work/y.txt ; echo $?
>
> All combinations return 0, and 3 when ON_ERROR_STOP=1.
>
> But,
>
> c) psql postgres -v ON_ERROR_STOP=0 -c foo -c 'select 1'; echo $?
> d) psql postgres -v ON_ERROR_STOP=0 -c 'select 1' -c foo; echo $?
>
> (c) returns 0 and (d) returns 1, but both return 1 when
> ON_ERROR_STOP=1.
>
> The attached second patch lets (c) and (d) behave the same as (a)
> and (b).
>
> Does it make sense?
>

+1

Pavel


> regards.
>
> [1]: https://www.postgresql.org/docs/11/app-psql.html
>
> --
> Kyotaro Horiguchi
> NTT Open Source Software Center
>


Re: Catalog views failed to show partitioned table information.

2018-12-18 Thread Suraj Kharage
Thank you for review and commit.

On Tue, Dec 18, 2018 at 1:12 PM Michael Paquier  wrote:

> On Mon, Dec 17, 2018 at 11:01:59AM +0900, Michael Paquier wrote:
> > On Mon, Dec 17, 2018 at 10:22:28AM +0900, Amit Langote wrote:
> >> On 2018/12/15 8:00, Michael Paquier wrote:
> >>> I tend to agree with your comment here.  pg_tables lists partitioned
> >>> tables, but pg_indexes is forgotting about partitioned indexes.  So
> this
> >>> is a good thing to add.
> >>
> >> +1
> >
> > I'll go commit something close to what Suraj is proposing if there are
> > no objections from others.  At least we agree on that part ;)
>
> And this part is done.
> --
> Michael
>


-- 
--

Thanks & Regards,
Suraj kharage,
EnterpriseDB Corporation,
The Postgres Database Company.



"repliation" as database name

2018-12-18 Thread Kyotaro HORIGUCHI
Hello.

We can create a database named "replication".

$ createdb replication

A pg_hba.conf entry with DATABASE="all" is described as 'does not
match "replication"' in the comment there, but it actually does
match, and we can connect to the database "replication". (The
documentation doesn't mention this restriction.)

$ psql replication -At -c 'select current_database()'
replication

We can specify the name replication by quoting it, and then it does
not match a replication connection. This is not documented at all.

pg_hba.conf
> local "replication" all trust
> #local replication all trust  ## commented out

> FATAL:  could not connect to the primary server: FATAL:  no pg_hba.conf entry 
> for replication connection from host "[local]", user "horiguti", SSL off

> $ psql replication -At -c 'select current_database()'
> replication

The same can be said of sameuser, samerole, and even all. I think
this is perfectly sane behavior and worth documenting to some extent,
as long as it doesn't become complex.
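
For illustration, a minimal pg_hba.conf sketch of the behavior described
above (not from any patch; the usual local/trust form, only the database
column matters here):

  local   all             all   trust   # matches psql replication
  local   "replication"   all   trust   # the literal database name only
  local   replication     all   trust   # walsender connections only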

I think that at least the following amendments would be needed:

- Remove '"all" does not match "replication"' and instead say "The "all"
  keyword does not match replication connections."

- Note that a double-quoted database name is taken literally.

Is it worth doing?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [HACKERS] logical decoding of two-phase transactions

2018-12-18 Thread Arseny Sher


Tomas Vondra  writes:

> Hi Nikhil,
>
> Thanks for the updated patch - I've started working on a review, with
> the hope of getting it committed sometime in 2019-01. But the patch
> bit-rotted again a bit (probably due to d3c09b9b), which broke the last
> part. Can you post a fixed version?

Please also note that at some point the thread was split and continued in
another place:
https://www.postgresql.org/message-id/flat/CAMGcDxeqEpWj3fTXwqhSwBdXd2RS9jzwWscO-XbeCfso6ts3%2BQ%40mail.gmail.com

And now we have two branches =(

I hadn't checked whether my concerns were addressed in the latest
version, though.


--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: psql exit status with multiple -c or -f

2018-12-18 Thread Kyotaro HORIGUCHI
Hello.

At Mon, 17 Dec 2018 11:58:41 -0600, Justin Pryzby  wrote 
in <20181217175841.gs13...@telsasoft.com>
> Our deployment script failed to notice dozens of commands failed in a
> transaction block and I only noticed due to keeping full logs and monitoring
> for error_severity>'LOG'.  I would have thought that exit status would be
> nonzero had an error occurred in an earlier script.
> 
> The docs since PG9.6 say:
> https://www.postgresql.org/docs/9.6/app-psql.html
> |Exit Status
> |
> |psql returns 0 to the shell if it finished normally, 1 if a fatal error of 
> its
> |own occurs (e.g. out of memory, file not found), 2 if the connection to the
> |server went bad and the session was not interactive, and 3 if an error 
> occurred
> |in a script and the variable ON_ERROR_STOP was set.
> 
> d5563d7df94488bf0ab52ac0678e8a07e5b8297e
> psql: Support multiple -c and -f options, and allow mixing them.
> 
> If there's an error, it returns 1 (although that's not "a fatal error of its
> own").
> 
> |[pryzbyj@database ~]$ psql ts -c foo 2>/dev/null ; echo $?
> |1
> 
> But the error is "lost" if another script or -c runs without failing:
> 
> |[pryzbyj@database ~]$ psql ts -txqc foo -c SELECT 2>/dev/null ; echo $?
> |0

As written in the documentation[1]:

> Because of this behavior, putting more than one SQL command in a
> single -c string often has unexpected results. It's better to use
> repeated -c commands or feed multiple commands to psql's standard
> input,

This seems to say that -c is equivalent to each line fed from
stdin or each line in a script.

The attached 0001 patch makes it clear in app-psql.html.


> Note, this works as expected:
> 
> |[pryzbyj@database ~]$ psql ts -v ON_ERROR_STOP=1 -txqc foo -f /dev/null 
> 2>/dev/null ; echo $?
> |1

The status should be 3, according to the documentation. In addition
to that, the current behavior looks inconsistent considering
the equivalency between -c and a script line.

y.txt:
a)
  foo;
  select 1;
b)
  select 1;
  foo;

$  psql postgres -v ON_ERROR_STOP=0 -f ~/work/y.txt ; echo $?
$  psql postgres -v ON_ERROR_STOP=0 < ~/work/y.txt ; echo $?

All combinations return 0, and 3 when ON_ERROR_STOP=1.

But, 

c) psql postgres -v ON_ERROR_STOP=0 -c foo -c 'select 1'; echo $?
d) psql postgres -v ON_ERROR_STOP=0 -c 'select 1' -c foo; echo $?

(c) returns 0 and (d) returns 1, but both return 1 when
ON_ERROR_STOP=1.

The attached second patch lets (c) and (d) behave the same as (a)
and (b).
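
For illustration, the exit statuses I would expect for case (d) with the
second patch applied (assuming EXIT_USER is 3, as defined in settings.h):

$ psql postgres -v ON_ERROR_STOP=0 -c 'select 1' -c foo; echo $?
0
$ psql postgres -v ON_ERROR_STOP=1 -c 'select 1' -c foo; echo $?
3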

Does it make sense?

regards.

[1]: https://www.postgresql.org/docs/11/app-psql.html

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 227c6014f3c50445ca1c46c4293085e6e635e91f Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Tue, 18 Dec 2018 16:44:28 +0900
Subject: [PATCH 1/2] Add description on a behavior of psql

psql's '-c command' is equivalent to a line fed from stdin or a line
written in a script. This is suggested in the documentation but not
explicitly described. Add such a description.
---
 doc/src/sgml/ref/psql-ref.sgml | 4 
 1 file changed, 4 insertions(+)

diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml
index 6c76cf2f00..4e94e8b6cf 100644
--- a/doc/src/sgml/ref/psql-ref.sgml
+++ b/doc/src/sgml/ref/psql-ref.sgml
@@ -147,6 +147,10 @@ psql EOF
 SELECT * FROM foo;
 EOF
 
+
+		Every -c command is equivalent to a line in a
+		script. All commands are executed regardless of command errors, and psql
+		returns exit status 0 unless ON_ERROR_STOP is set.
   
   
 
-- 
2.16.3

>From 449ee0b1003fed38ebd29fec6a19dec1cf12cf5e Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Tue, 18 Dec 2018 16:50:11 +0900
Subject: [PATCH 2/2] Unify psql's behavior on -c with scripts

Each -c command is equivalent to a line in a script, but they behave
differently on failure. Fix it.
---
 src/bin/psql/startup.c | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/src/bin/psql/startup.c b/src/bin/psql/startup.c
index e7536a8a06..fc50301fdc 100644
--- a/src/bin/psql/startup.c
+++ b/src/bin/psql/startup.c
@@ -347,7 +347,7 @@ main(int argc, char *argv[])
 	puts(cell->val);
 
 successResult = SendQuery(cell->val)
-	? EXIT_SUCCESS : EXIT_FAILURE;
+	? EXIT_SUCCESS : EXIT_USER;
 			}
 			else if (cell->action == ACT_SINGLE_SLASH)
 			{
@@ -368,7 +368,7 @@ main(int argc, char *argv[])
 cond_stack,
 NULL,
 NULL) != PSQL_CMD_ERROR
-	? EXIT_SUCCESS : EXIT_FAILURE;
+	? EXIT_SUCCESS : EXIT_USER;
 
 psql_scan_destroy(scan_state);
 conditional_stack_destroy(cond_stack);
@@ -383,7 +383,11 @@ main(int argc, char *argv[])
 Assert(false);
 			}
 
-			if (successResult != EXIT_SUCCESS && pset.on_error_stop)
+			/* EXIT_USER should be forgotten if ON_ERROR_STOP is not set */
+			if (successResult == EXIT_USER && !pset.on_error_stop)
+successResult = EXIT_SUCCESS;
+
+			if (successResult != EXIT_SUCCESS)
 break;
 		}
 
-- 
2.16.3



Re: [HACKERS] Block level parallel vacuum

2018-12-18 Thread Masahiko Sawada
On Tue, Nov 27, 2018 at 11:26 AM Amit Kapila  wrote:
>
> On Mon, Nov 26, 2018 at 2:08 PM Masahiko Sawada  wrote:
> >
> > On Sun, Nov 25, 2018 at 2:35 PM Amit Kapila  wrote:
> > >
> > > On Sat, Nov 24, 2018 at 5:47 PM Amit Kapila  
> > > wrote:
> > > > On Tue, Oct 30, 2018 at 2:04 PM Masahiko Sawada  
> > > > wrote:
> > > > >
> > > >
> >
> > Thank you for the comment.
> >
> > > > I could see that you have put a lot of effort on this patch and still
> > > > we are not able to make much progress mainly I guess because of
> > > > relation extension lock problem.  I think we can park that problem for
> > > > some time (as already we have invested quite some time on it), discuss
> > > > a bit about actual parallel vacuum patch and then come back to it.
> > > >
> > >
> > > Today, I was reading this and previous related thread [1] and it seems
> > > to me multiple people Andres [2], Simon [3] have pointed out that
> > > parallelization for index portion is more valuable.  Also, some of the
> > > results [4] indicate the same.  Now, when there are no indexes,
> > > parallelizing heap scans also have benefit, but I think in practice we
> > > will see more cases where the user wants to vacuum tables with
> > > indexes.  So how about if we break this problem in the following way
> > > where each piece give the benefit of its own:
> > > (a) Parallelize index scans wherein the workers will be launched only
> > > to vacuum indexes.  Only one worker per index will be spawned.
> > > (b) Parallelize per-index vacuum.  Each index can be vacuumed by
> > > multiple workers.
> > > (c) Parallelize heap scans where multiple workers will scan the heap,
> > > collect dead TIDs and then launch multiple workers for indexes.
> > >
> > > I think if we break this problem into multiple patches, it will reduce
> > > the scope of each patch and help us in making progress.   Now, it's
> > > been more than 2 years that we are trying to solve this problem, but
> > > still didn't make much progress.  I understand there are various
> > > genuine reasons and all of that work will help us in solving all the
> > > problems in this area.  How about if we first target problem (a) and
> > > once we are done with that we can see which of (b) or (c) we want to
> > > do first?
> >
> > Thank you for suggestion. It seems good to me. We would get a nice
> > performance scalability even by only (a), and vacuum will get more
> > powerful by (b) or (c). Also, (a) would not require to resolve the
> > relation extension lock issue IIUC.
> >
>
> Yes, I also think so.  We do acquire 'relation extension lock' during
> index vacuum, but as part of (a), we are talking one worker per-index,
> so there shouldn't be a problem with respect to deadlocks.
>
> > I'll change the patch and submit
> > to the next CF.
> >
>
> Okay.
>

Attached are the updated patches. I scaled back the scope of this patch.
The patch now includes only feature (a); that is, it executes both index
vacuum and index cleanup in parallel. It also doesn't include
autovacuum support for now.

The PARALLEL option works almost the same as in the previous version of
the patch. In the VACUUM command, we can specify the 'PARALLEL n' option,
where n is the number of parallel workers to request. If n is omitted,
the number of parallel workers defaults to (# of indexes - 1). We can also
specify the parallel degree with the parallel_worker reloption. The number
of parallel workers is capped by Min(# of indexes - 1,
max_maintenance_parallel_workers). That is, parallel vacuum can be
executed for a table only if it has more than one index.
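
For example, the intended usage would look like this (a sketch assuming the
usual parenthesized VACUUM option list; the table name is just an example):

VACUUM (PARALLEL 2) pgbench_accounts;  -- request two parallel workers
VACUUM (PARALLEL) pgbench_accounts;    -- degree defaults to (# of indexes - 1)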

The details of the internal design are written in the comment at the top
of the vacuumlazy.c file. In parallel vacuum mode, we allocate a DSM
segment at the beginning of lazy vacuum, which stores shared information
as well as the dead tuples. When starting either index vacuum or index
cleanup, we launch the parallel workers. The parallel workers perform
either index vacuum or index cleanup on the indexes and exit once all
indexes are done. The leader process then re-initializes the DSM and
re-launches the workers the next time, rather than destroying the parallel
context. After lazy vacuum is done, the leader process exits parallel mode
and updates the index statistics, since we are not allowed to do any
writes during parallel mode.

I've also attached the 0002 patch, which adds parallel lazy vacuum support
to the vacuumdb command.
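
A hypothetical invocation of the new option (assuming -P takes the requested
number of parallel workers; the table and database names are just examples):

$ vacuumdb -P 2 -t pgbench_accounts postgres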

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
From 33a9a44fa4f090d7dd6dd319edcb1cb754064de8 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada 
Date: Tue, 18 Dec 2018 16:08:24 +0900
Subject: [PATCH v9 2/2] Add -P option to vacuumdb command.

---
 doc/src/sgml/ref/vacuumdb.sgml| 16 +
 src/bin/scripts/t/100_vacuumdb.pl | 10 +++-
 src/bin/scripts/vacuumdb.c| 50 ++-
 3 files changed, 74 insertions(+), 2 deletions(-)

diff --git a/doc/src/sgml/ref/vacuumdb.sgml b/doc/src/sgml/ref/vacuumdb.sgml
index 955a17a..0d085a6 100644
--- a/doc/src/sgml/ref/vacuumdb.sgml
+++