Re: Improve eviction algorithm in ReorderBuffer

2024-02-01 Thread Masahiko Sawada
Hi,

On Wed, Jan 31, 2024 at 5:32 PM vignesh C  wrote:
>
> On Tue, 30 Jan 2024 at 13:37, Masahiko Sawada  wrote:
> >
> > On Fri, Jan 26, 2024 at 5:36 PM Masahiko Sawada  
> > wrote:
> > >
> > > On Wed, Dec 20, 2023 at 12:11 PM Amit Kapila  
> > > wrote:
> > > >
> > > > On Wed, Dec 20, 2023 at 6:49 AM Masahiko Sawada  
> > > > wrote:
> > > > >
> > > > > On Tue, Dec 19, 2023 at 8:02 PM Amit Kapila  
> > > > > wrote:
> > > > > >
> > > > > > On Tue, Dec 19, 2023 at 8:31 AM Masahiko Sawada 
> > > > > >  wrote:
> > > > > > >
> > > > > > > On Sun, Dec 17, 2023 at 11:40 AM Amit Kapila 
> > > > > > >  wrote:
> > > > > > > >
> > > > > > > >
> > > > > > > > The individual transactions shouldn't cross
> > > > > > > > 'logical_decoding_work_mem'. I got a bit confused by your 
> > > > > > > > proposal to
> > > > > > > > maintain the lists: "...splitting it into two lists: 
> > > > > > > > transactions
> > > > > > > > consuming 5% < and 5% >=  of the memory limit, and checking the 
> > > > > > > > 5% >=
> > > > > > > > list preferably.". In the previous sentence, what did you mean 
> > > > > > > > by
> > > > > > > > transactions consuming 5% >= of the memory limit? I got the 
> > > > > > > > impression
> > > > > > > > that you are saying to maintain them in a separate transaction 
> > > > > > > > list
> > > > > > > > which doesn't seem to be the case.
> > > > > > >
> > > > > > > I meant that there are three lists in total: the first one
> > > > > > > maintains the transactions consuming more than 10% of
> > > > > > > logical_decoding_work_mem,
> > > > > > >
> > > > > >
> > > > > > How can we have multiple transactions in the list consuming more 
> > > > > > than
> > > > > > 10% of logical_decoding_work_mem? Shouldn't we perform serialization
> > > > > > before any xact reaches logical_decoding_work_mem?
> > > > >
> > > > > Well, suppose logical_decoding_work_mem is set to 64MB; transactions
> > > > > consuming more than 6.4MB are added to the list. So for example, it's
> > > > > possible that the list has three transactions each of which are
> > > > > consuming 10MB while the total memory usage in the reorderbuffer is
> > > > > still 30MB (less than logical_decoding_work_mem).
> > > > >
> > > >
> > > > Thanks for the clarification. I misunderstood the list to have
> > > > transactions greater than 70.4 MB (64 + 6.4) in your example. But one
> > > > thing to note is that maintaining these lists by default can also have
> > > > some overhead unless the list of open transactions crosses a certain
> > > > threshold.
> > > >
> > >
> > > On further analysis, I realized that the approach discussed here might
> > > not be the way to go. The idea of dividing transactions into several
> > > subgroups is to divide a large number of entries into multiple
> > > sub-groups so we can reduce the complexity to search for the
> > > particular entry. Since we assume that there are no big differences in
> > > entries' sizes within a sub-group, we can pick the entry to evict in
> > > O(1). However, what we really need to avoid here is ending up
> > > increasing the number of evictions, because serializing an entry to
> > > disk is generally more costly than searching for an entry in memory.
> > >
> > > I think that it's no problem in a large-entries subgroup but when it
> > > comes to the smallest-entries subgroup, like for entries consuming
> > > less than 5% of the limit, it could end up evicting many entries. For
> > > example, there would be a huge difference between serializing 1 entry
> > > consuming 5% of the memory limit and serializing 5000 entries
> > > consuming 0.001% of the memory limit. Even if we can select 5000
> > > entries quickly, I think the latter would be slower in total. The more
> > > subgroups we create, the more complex the algorithm gets and the more
> > > overhead it could cause. So I think we need to search for the largest
> > > entry in order to minimize the number of evictions anyway.
> > >
> > > Looking for data structures and algorithms, I think binaryheap with
> > > some improvements could be promising. I mentioned before why we cannot
> > > use the current binaryheap[1]. The missing pieces are efficient ways
> > > to remove the arbitrary entry and to update the arbitrary entry's key.
> > > The current binaryheap provides binaryheap_remove_node(), which is
> > > O(log n), but it requires the entry's position in the binaryheap. We
> > > can know the entry's position just after binaryheap_add_unordered()
> > > but it might be changed after heapify. Searching the node's position
> > > is O(n). So the improvement idea is to add a hash table to the
> > > binaryheap to track each entry's position, so that we can remove an
> > > arbitrary entry in O(log n) and also update an arbitrary entry's key
> > > in O(log n). This is known as the indexed
> > > priority queue. I've attached the patch for that (0001 and 0002).
> > >
> > > That way, in terms of 
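
To make the indexed-binaryheap idea above concrete, here is a rough,
illustrative sketch (this is not the attached 0001/0002 patches, and it uses
dynahash for illustration while the actual patches use simplehash): a plain
max-heap array plus a hash table that remembers each entry's slot, so removing
or re-keying an arbitrary entry needs only an O(1) lookup followed by an
O(log n) sift. All names are made up, and the hash_create() setup
(keysize = sizeof(void *)) is omitted.

#include "postgres.h"
#include "utils/hsearch.h"

typedef struct heap_pos_entry
{
    void       *node;          /* hash key: the stored entry */
    int         pos;           /* its current slot in nodes[] */
} heap_pos_entry;

typedef struct indexed_heap
{
    void      **nodes;         /* max-heap in array form */
    int         nnodes;
    HTAB       *pos_index;     /* node -> heap_pos_entry */
} indexed_heap;

/* Whenever the heap moves a node, record its new slot in the index. */
static void
ih_place(indexed_heap *h, int pos, void *node)
{
    heap_pos_entry *e;
    bool        found;

    h->nodes[pos] = node;
    e = hash_search(h->pos_index, &node, HASH_ENTER, &found);
    e->pos = pos;
}

/* Remove an arbitrary node: O(1) slot lookup, then one sift. */
static void
ih_remove(indexed_heap *h, void *node)
{
    heap_pos_entry *e = hash_search(h->pos_index, &node, HASH_FIND, NULL);
    int         pos = e->pos;

    hash_search(h->pos_index, &node, HASH_REMOVE, NULL);
    if (pos < --h->nnodes)
    {
        ih_place(h, pos, h->nodes[h->nnodes]);
        /* sift the moved node up or down as needed (omitted here) */
    }
}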

Re: Commitfest 2024-01 is now closed

2024-02-01 Thread Richard Guo
On Fri, Feb 2, 2024 at 2:41 PM Michael Paquier  wrote:

> On Fri, Feb 02, 2024 at 11:56:36AM +0530, Amit Kapila wrote:
> > On Fri, Feb 2, 2024 at 10:58 AM Tom Lane  wrote:
> >> Thanks for all the work you did running it!  CFM is typically a
> >> thankless exercise in being a nag, but I thought you put in more
> >> than the usual amount of effort to keep things moving.
> >>
> >
> > +1. Thanks a lot for all the hard work.
>
> +1.  Thanks!


+1.  Thanks!

Thanks
Richard


Re: Make COPY format extendable: Extract COPY TO format implementations

2024-02-01 Thread Sutou Kouhei
Hi,

In 
  "Re: Make COPY format extendable: Extract COPY TO format implementations" on 
Fri, 2 Feb 2024 15:27:15 +0800,
  Junwang Zhao  wrote:

> I agree CopyToRoutine should be placed into CopyToStateData, but
> why set it after ProcessCopyOptions? The implementation of
> CopyToGetRoutine doesn't make sense if we want to support custom
> formats in the future.
> 
> Seems the refactor of v11 only considered performance but not
> *extendable copy format*.

Right.
We focus on performance for now. And then we will focus on
extendability. [1]

[1] 
https://www.postgresql.org/message-id/flat/20240130.171511.2014195814665030502.kou%40clear-code.com#757a48c273f140081656ec8eb69f502b

> If V7 and V10 have no performance reduction, then I think V6 is also
> good with performance, since most of the time goes to CopyToOneRow
> and CopyFromOneRow.

Don't worry. I'll re-submit changes in the v6 patch set
again after the current patch set that focuses on
performance is merged.

> I just think we should take the *extendable* into consideration at
> the beginning.

Introducing Copy{To,From}Routine is also valuable for
extendability. We can improve extendability later. Let's
focus on only performance for now to introduce
Copy{To,From}Routine.


Thanks,
-- 
kou




Re: An improvement on parallel DISTINCT

2024-02-01 Thread Richard Guo
On Fri, Feb 2, 2024 at 11:26 AM David Rowley  wrote:

> In light of this, do you still think it's worthwhile making this change?
>
> For me, I think all it's going to result in is extra planner work
> without any performance gains.


Hmm, with the query below, I can see that the new plan is cheaper than
the old plan, and the cost difference exceeds STD_FUZZ_FACTOR.

create table t (a int, b int);
insert into t select i%10, i from generate_series(1,1000)i;
analyze t;

-- on master
explain (costs on) select distinct a from t order by a limit 1;
                                          QUERY PLAN
-----------------------------------------------------------------------------------------------
 Limit  (cost=120188.50..120188.51 rows=1 width=4)
   ->  Sort  (cost=120188.50..120436.95 rows=99379 width=4)
         Sort Key: a
         ->  HashAggregate  (cost=118697.82..119691.61 rows=99379 width=4)
               Group Key: a
               ->  Gather  (cost=97331.33..118200.92 rows=198758 width=4)
                     Workers Planned: 2
                     ->  HashAggregate  (cost=96331.33..97325.12 rows=99379 width=4)
                           Group Key: a
                           ->  Parallel Seq Scan on t  (cost=0.00..85914.67 rows=417 width=4)
(10 rows)

-- on patched
explain (costs on) select distinct a from t order by a limit 1;
                                          QUERY PLAN
--------------------------------------------------------------------------------------------
 Limit  (cost=106573.93..106574.17 rows=1 width=4)
   ->  Unique  (cost=106573.93..130260.88 rows=99379 width=4)
         ->  Gather Merge  (cost=106573.93..129763.98 rows=198758 width=4)
               Workers Planned: 2
               ->  Sort  (cost=105573.91..105822.35 rows=99379 width=4)
                     Sort Key: a
                     ->  HashAggregate  (cost=96331.33..97325.12 rows=99379 width=4)
                           Group Key: a
                           ->  Parallel Seq Scan on t  (cost=0.00..85914.67 rows=417 width=4)
(9 rows)

It seems that including a LIMIT clause can potentially favor the new
plan.

Thanks
Richard


Re: Make COPY format extendable: Extract COPY TO format implementations

2024-02-01 Thread Sutou Kouhei
Hi,

In 
  "Re: Make COPY format extendable: Extract COPY TO format implementations" on 
Fri, 2 Feb 2024 15:21:31 +0900,
  Michael Paquier  wrote:

> I have done a review of v10, see v11 attached which is still WIP, with
> the patches for COPY TO and COPY FROM merged together.  Note that I'm
> thinking to merge them into a single commit.

OK. I don't have a strong opinion for commit unit.

> @@ -74,11 +75,11 @@ typedef struct CopyFormatOptions
>  bool              convert_selectively;  /* do selective binary conversion? */
>  CopyOnErrorChoice on_error;      /* what to do when error happened */
>  List             *convert_select; /* list of column names (can be NIL) */
> +const CopyToRoutine *to_routine;  /* callback routines for COPY TO */
>  } CopyFormatOptions;
> 
> Adding the routines to the structure for the format options is in my
> opinion incorrect.  The elements of this structure are first processed
> in the option deparsing path, and then we need to use the options to
> guess which routines we need.

This was discussed with Sawada-san a bit before. [1][2]

[1] 
https://www.postgresql.org/message-id/flat/CAD21AoBmNiWwrspuedgAPgbAqsn7e7NoZYF6gNnYBf%2BgXEk9Mg%40mail.gmail.com#bfd19262d261c67058fdb8d64e6a723c
[2] 
https://www.postgresql.org/message-id/flat/20240130.144531.1257430878438173740.kou%40clear-code.com#fc55392d77f400fc74e42686fe7e348a

I kept the routines in CopyFormatOptions for custom option
processing. But I should have not cared about it in this
patch set because this patch set doesn't include custom
option processing.

So I'm OK that we move the routines to
Copy{From,To}StateData.

> This also led to a strange separation with
> ProcessCopyOptionFormatFrom() and ProcessCopyOptionFormatTo() to fit
> the hole in-between.

They are also for custom option processing. We don't need to
care about them in this patch set too.

> copyapi.h needs more documentation, like what is expected for
> extension developers when using these, what are the arguments, etc.  I
> have added what I had in mind for now.

Thanks! I'm not good at writing documentation in English...

> +typedef char *(*PostpareColumnValue) (CopyFromState cstate, char *string, int m);
> 
> CopyReadAttributes and PostpareColumnValue are also callbacks specific
> to text and csv, except that they are used within the per-row
> callbacks.  The same can be said about CopyAttributeOutHeaderFunction.
> It seems to me that it would be less confusing to store pointers to
> them in the routine structures, where the final picture involves not
> having multiple layers of APIs like CopyToCSVStart,
> CopyAttributeOutTextValue, etc.  These *have* to be documented
> properly in copyapi.h, and this is much easier now that cstate stores
> the routine pointers.  That would also make simpler function stacks.
> Note that I have not changed that in the v11 attached.
> 
> This business with the extra callbacks required for csv and text is my
> main point of contention, but I'd be OK once the model of the APIs is
> more linear, with everything in Copy{From,To}State.  The changes would
> be rather simple, and I'd be OK to put my hands on it.  Just,
> Sutou-san, would you agree with my last point about these extra
> callbacks?

I'm OK with the approach. But how about adding the extra callbacks to
Copy{From,To}StateData rather than Copy{From,To}Routines, like
CopyToStateData::data_dest_cb and CopyFromStateData::data_source_cb?
They are only needed for "text" and "csv", so we don't need to add them
to Copy{From,To}Routines, which keeps the set of required callbacks
minimal.

What is the best next action for us? Do you want to
complete the WIP v11 patch set yourself (and commit it)?
Or should I take it over?


Thanks,
-- 
kou




Re: Make COPY format extendable: Extract COPY TO format implementations

2024-02-01 Thread Junwang Zhao
On Fri, Feb 2, 2024 at 2:21 PM Michael Paquier  wrote:
>
> On Fri, Feb 02, 2024 at 09:40:56AM +0900, Sutou Kouhei wrote:
> > Thanks. It'll help us.
>
> I have done a review of v10, see v11 attached which is still WIP, with
> the patches for COPY TO and COPY FROM merged together.  Note that I'm
> thinking to merge them into a single commit.
>
> @@ -74,11 +75,11 @@ typedef struct CopyFormatOptions
>  bool              convert_selectively;  /* do selective binary conversion? */
>  CopyOnErrorChoice on_error;      /* what to do when error happened */
>  List             *convert_select; /* list of column names (can be NIL) */
> +const CopyToRoutine *to_routine;  /* callback routines for COPY TO */
>  } CopyFormatOptions;
>
> Adding the routines to the structure for the format options is in my
> opinion incorrect.  The elements of this structure are first processed
> in the option deparsing path, and then we need to use the options to
> guess which routines we need.  A more natural location is cstate
> itself, so as the pointer to the routines is isolated within copyto.c

I agree CopyToRoutine should be placed into CopyToStateData, but
why set it after ProcessCopyOptions? The implementation of
CopyToGetRoutine doesn't make sense if we want to support custom
formats in the future.

Seems the refactor of v11 only considered performance but not
*extendable copy format*.

> and copyfrom_internal.h.  My point is: the routines are an
> implementation detail that the centralized copy.c has no need to know
> about.  This also led to a strange separation with
> ProcessCopyOptionFormatFrom() and ProcessCopyOptionFormatTo() to fit
> the hole in-between.
>
> The separation between cstate and the format-related fields could be
> much better, though I am not sure if it is worth doing as it
> introduces more duplication.  For example, max_fields and raw_fields
> are specific to text and csv, while binary does not care much.
> Perhaps this is just useful to be for custom formats.

I think those can be placed in format specific fields by utilizing the opaque
space, but yeah, this will introduce duplication.

>
> copyapi.h needs more documentation, like what is expected for
> extension developers when using these, what are the arguments, etc.  I
> have added what I had in mind for now.
>
> +typedef char *(*PostpareColumnValue) (CopyFromState cstate, char *string, int m);
>
> CopyReadAttributes and PostpareColumnValue are also callbacks specific
> to text and csv, except that they are used within the per-row
> callbacks.  The same can be said about CopyAttributeOutHeaderFunction.
> It seems to me that it would be less confusing to store pointers to
> them in the routine structures, where the final picture involves not
> having multiple layers of APIs like CopyToCSVStart,
> CopyAttributeOutTextValue, etc.  These *have* to be documented
> properly in copyapi.h, and this is much easier now that cstate stores
> the routine pointers.  That would also make simpler function stacks.
> Note that I have not changed that in the v11 attached.
>
> This business with the extra callbacks required for csv and text is my
> main point of contention, but I'd be OK once the model of the APIs is
> more linear, with everything in Copy{From,To}State.  The changes would
> be rather simple, and I'd be OK to put my hands on it.  Just,
> Sutou-san, would you agree with my last point about these extra
> callbacks?
> --
> Michael

If V7 and V10 have no performance reduction, then I think V6 is also
good with performance, since most of the time goes to CopyToOneRow
and CopyFromOneRow.

I just think we should take the *extendable* into consideration at
the beginning.

-- 
Regards
Junwang Zhao




Re: Is this a problem in GenericXLogFinish()?

2024-02-01 Thread Michael Paquier
On Fri, Dec 01, 2023 at 03:27:33PM +0530, Amit Kapila wrote:
> Pushed!

Amit, this has been applied as of 861f86beea1c, and I got pinged about
the fact this triggers inconsistencies because we always set the LSN
of the write buffer (wbuf in _hash_freeovflpage) but
XLogRegisterBuffer() would *not* be called when the two following
conditions happen:
- When xlrec.ntups <= 0.
- When !xlrec.is_prim_bucket_same_wrt && !xlrec.is_prev_bucket_same_wrt

And it seems to me that there is still a bug here: there should be no
point in setting the LSN on the write buffer if we don't register it
in WAL at all, no?
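
A sketch of the kind of guard being suggested, reusing the condition names
from this message and the surrounding variables of _hash_freeovflpage()
(this is illustrative only, not necessarily the actual fix): stamp the page
LSN on wbuf only when the buffer really was registered in the WAL record.

    /* only set the LSN if XLogRegisterBuffer() was called for wbuf */
    if (xlrec.ntups > 0 ||
        xlrec.is_prim_bucket_same_wrt ||
        xlrec.is_prev_bucket_same_wrt)
        PageSetLSN(BufferGetPage(wbuf), recptr);
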
--
Michael




Re: Small fix on COPY ON_ERROR document

2024-02-01 Thread Yugo NAGATA
On Fri, 02 Feb 2024 11:29:41 +0900
torikoshia  wrote:

> On 2024-02-01 15:16, Yugo NAGATA wrote:
> > On Mon, 29 Jan 2024 15:47:25 +0900
> > Yugo NAGATA  wrote:
> > 
> >> On Sun, 28 Jan 2024 19:14:58 -0700
> >> "David G. Johnston"  wrote:
> >> 
> >> > > Also, I think "invalid input syntax" is a bit ambiguous. For example,
> >> > > COPY FROM raises an error when the number of input column does not 
> >> > > match
> >> > > to the table schema, but this error is not ignored by ON_ERROR while
> >> > > this seems to fall into the category of "invalid input syntax".
> >> >
> >> >
> >> >
> >> > It is literally the error text that appears if one were not to ignore it.
> >> > It isn’t a category of errors.  But I’m open to ideas here.  But being
> >> > explicit with what on actually sees in the system seemed preferable to
> >> > inventing new classification terms not otherwise used.
> >> 
> >> Thank you for the explanation! I understood the words were from the error
> >> messages
> >> that users actually see. However, as Torikoshi-san said in [1], errors 
> >> other
> >> than valid input syntax (e.g. range error) can be also ignored, 
> >> therefore it
> >> would be better to describe to be ignored errors more specifically.
> >> 
> >> [1] 
> >> https://www.postgresql.org/message-id/7f1457497fa3bf9dfe486f162d1c8ec6%40oss.nttdata.com
> >> 
> >> >
> >> > >
> >> > > So, keeping consistency with the existing description, we can say:
> >> > >
> >> > > "Specifies which how to behave when encountering an error due to
> >> > >  column values unacceptable to the input function of each attribute's
> >> > >  data type."
> >> >
> >> >
> >> > Yeah, I was considering something along those lines as an option as well.
> >> > But I’d rather add that wording to the glossary.
> >> 
> >> Although I am still not convinced that we have to introduce the words
> >> "soft error" into the documentation, I don't mind if there are no other
> >> opposing opinions.
> > 
> > Attached is an updated patch v3, which is a version that uses the above
> > wording instead of "soft error".
> > 
> >> >
> >> > > Currently, ON_ERROR doesn't support other soft errors, so it can 
> >> > > explain
> >> > > it more simply without introducing the new concept, "soft error" to 
> >> > > users.
> >> > >
> >> > >
> >> > Good point.  Seems we should define what user-facing errors are ignored
> >> > anywhere in the system and if we aren’t consistently leveraging these in
> >> > all areas/commands make the necessary qualifications in those specific
> >> > places.
> >> >
> >> 
> >> > > I think "left in a deleted state" is also unclear for users because 
> >> > > this
> >> > > explains the internal state but not how it looks from the user's view. How
> >> > > about
> >> > > leaving the explanation "These rows will not be visible or accessible" 
> >> > > in
> >> > > the existing statement?
> >> > >
> >> >
> >> > Just visible then, I don’t like an “or” there and as tuples at least they
> >> > are accessible to the system, in vacuum especially.  But I expected the
> >> > user to understand “as if you deleted it” as their operational concept 
> >> > more
> >> > readily than visible.  I think this will be read by people who haven’t 
> >> > read
> >> > MVCC to fully understand what visible means but know enough to run vacuum
> >> > to clean up updated and deleted data as a rule.
> >> 
> >> Ok, I agree we can omit "or accessible". How do you like the 
> >> following?
> >> Still redundant?
> >> 
> >>  "If the command fails, these rows are left in a deleted state;
> >>   these rows will not be visible, but they still occupy disk space. "
> > 
> > Also, the above statement is used in the patch.
> 
> Thanks for updating the patch!
> 
> I like your description which doesn't use the word soft error.

Thank you for your comments!

> 
> Here are minor comments:
> 
> > +  ignore means discard the input row and 
> > continue with the next one.
> > +  The default is stop
> 
> Is "." required at the end of the line?
>
> >  An NOTICE level context message containing the 
> > ignored row count is
> 
> Should 'An' be 'A'?
> 
> Also, I wasn't sure about the necessity of 'context'.
> It might be possible to just say "A NOTICE message containing the 
> ignored row count.."
> considering below existing descriptions:
> 
>doc/src/sgml/pltcl.sgml: a NOTICE message each 
> time a supported command is
>doc/src/sgml/pltcl.sgml- executed:
> 
>doc/src/sgml/plpgsql.sgml: This example trigger simply raises a 
> NOTICE message
>doc/src/sgml/plpgsql.sgml- each time a supported command is 
> executed.

I attached an updated patch including the fixes you pointed out above.

Regards,
Yugo Nagata

> -- 
> Regards,
> 
> --
> Atsushi Torikoshi
> NTT DATA Group Corporation


-- 
Yugo NAGATA 
diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml
index 21a5c4a052..3c2feaa11a 100644
--- a/doc/src/sgml/ref/copy.sgml
+++ b/doc/src/sgml/ref/copy.sgml
@@ -90,6 +90,13 @@ COPY { 

Re: Synchronizing slots from primary to standby

2024-02-01 Thread Amit Kapila
On Thu, Feb 1, 2024 at 5:29 PM shveta malik  wrote:
>
> On Thu, Feb 1, 2024 at 2:35 PM Amit Kapila  wrote:
> >
> > Agreed, and I am fine with merging 0001, 0002, and 0004 as suggested
> > by you though I have a few minor comments on 0002 and 0004. I was
> > thinking about what will be a logical way to split the slot sync
> > worker patch (combined result of 0001, 0002, and 0004), and one idea
> > occurred to me is that we can have the first patch as
> > synchronize_solts() API and the functionality required to implement
> > that API then the second patch would be a slot sync worker which uses
> > that API to synchronize slots and does all the required validations.
> > Any thoughts?
>
> If we shift 'synchronize_slots()' to the first patch but there is no
> caller of it, we may have a compiler warning for the same. The only
> way it can be done is if we temporarily add SQL function on standby
> which uses 'synchronize_slots()'. This SQL function can then be
> removed in later patches where we actually have a caller for
> 'synchronize_slots'.
>

Can such a SQL function, say pg_synchronize_slots(), which can sync all
slots that have the failover flag set, be useful in general apart from
just writing tests for this new API? I am thinking maybe users want
more control over when to sync the slots and write their bgworker or
simply do it just before shutdown once (sort of planned switchover) or
at some other pre-defined times. BTW, we also have
pg_log_standby_snapshot() which otherwise would be done periodically
by background processes.

>
> 1) Re-arranged the patches:
> 1.1) 'libpqrc' related changes (from v74-001 and v74-004) are
> separated out in v75-001 as those are independent changes.

Bertrand, Sawada-San, and others, do you see a problem with such a
split? Can we go ahead with v75_0001 separately after fixing the open
comments?

-- 
With Regards,
Amit Kapila.




Re: Synchronizing slots from primary to standby

2024-02-01 Thread Peter Smith
Here are some review comments for v750002.

(this is a WIP but this is what I found so far...)

==
doc/src/sgml/protocol.sgml

1.
> > 2. ALTER_REPLICATION_SLOT slot_name ( option [, ...] ) #
> >
> >   
> > -  If true, the slot is enabled to be synced to the standbys.
> > +  If true, the slot is enabled to be synced to the standbys
> > +  so that logical replication can be resumed after failover.
> >   
> >
> > This also should have the sentence "The default is false.", e.g. the
> > same as the same option in CREATE_REPLICATION_SLOT says.
>
> I have not added this. I feel the default value related details should
> be present in the 'CREATE' part, it is not meaningful for the "ALTER"
> part. ALTER does not have any defaults, it just modifies the options
> given by the user.

You are correct. My mistake.

==
src/backend/postmaster/bgworker.c

2.
 #include "replication/logicalworker.h"
+#include "replication/worker_internal.h"
 #include "storage/dsm.h"

Is this change needed when the rest of the code is removed?

==
src/backend/replication/logical/slotsync.c

3. synchronize_one_slot

> > 3.
> > + /*
> > + * Make sure that concerned WAL is received and flushed before syncing
> > + * slot to target lsn received from the primary server.
> > + *
> > + * This check will never pass if on the primary server, user has
> > + * configured standby_slot_names GUC correctly, otherwise this can hit
> > + * frequently.
> > + */
> > + latestFlushPtr = GetStandbyFlushRecPtr(NULL);
> > + if (remote_slot->confirmed_lsn > latestFlushPtr)
> >
> > BEFORE
> > This check will never pass if on the primary server, user has
> > configured standby_slot_names GUC correctly, otherwise this can hit
> > frequently.
> >
> > SUGGESTION (simpler way to say the same thing?)
> > This will always be the case unless the standby_slot_names GUC is not
> > correctly configured on the primary server.
>
> It is not true. It will not hit this condition "always" but has higher
> chances to hit it when standby_slot_names is not configured. I think
> you meant 'unless the standby_slot_names GUC is correctly configured'.
> I feel the current comment gives clear info (less confusing) and thus
> I have not changed it for the time being. I can consider if I get more
> comments there.

Hmm. I meant what I wrote. The "This" of my suggested text refers to
the previous sentence in the comment (not about "hitting" ?? your
condition).

TBH, regardless of the wording you choose, I think it will be much
clearer to move the comment to be inside the if.

SUGGESTION
/*
 * Make sure that concerned WAL is received and flushed before syncing
 * slot to target lsn received from the primary server.
 */
latestFlushPtr = GetStandbyFlushRecPtr(NULL);
if (remote_slot->confirmed_lsn > latestFlushPtr)
{
  /*
   * Can get here only if the GUC 'standby_slot_names' on the primary
   * server was not configured correctly.
   */
...
}

~~~

4.
+static bool
+validate_parameters_and_get_dbname(char **dbname)
+{
+ /*
+ * A physical replication slot(primary_slot_name) is required on the
+ * primary to ensure that the rows needed by the standby are not removed
+ * after restarting, so that the synchronized slot on the standby will not
+ * be invalidated.
+ */
+ if (PrimarySlotName == NULL || *PrimarySlotName == '\0')
+ {
+ ereport(LOG,
+ /* translator: %s is a GUC variable name */
+ errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("bad configuration for slot synchronization"),
+ errhint("\"%s\" must be defined.", "primary_slot_name"));
+ return false;
+ }
+
+ /*
+ * hot_standby_feedback must be enabled to cooperate with the physical
+ * replication slot, which allows informing the primary about the xmin and
+ * catalog_xmin values on the standby.
+ */
+ if (!hot_standby_feedback)
+ {
+ ereport(LOG,
+ /* translator: %s is a GUC variable name */
+ errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("bad configuration for slot synchronization"),
+ errhint("\"%s\" must be enabled.", "hot_standby_feedback"));
+ return false;
+ }
+
+ /*
+ * Logical decoding requires wal_level >= logical and we currently only
+ * synchronize logical slots.
+ */
+ if (wal_level < WAL_LEVEL_LOGICAL)
+ {
+ ereport(LOG,
+ errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("bad configuration for slot synchronization"),
+ errhint("\"wal_level\" must be >= logical."));
+ return false;
+ }
+
+ /*
+ * The primary_conninfo is required to make connection to primary for
+ * getting slots information.
+ */
+ if (PrimaryConnInfo == NULL || *PrimaryConnInfo == '\0')
+ {
+ ereport(LOG,
+ /* translator: %s is a GUC variable name */
+ errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("bad configuration for slot synchronization"),
+ errhint("\"%s\" must be defined.", "primary_conninfo"));
+ return false;
+ }
+
+ /*
+ * The slot sync worker needs a database connection for walrcv_exec to
+ * work.
+ */
+ *dbname = walrcv_get_dbname_from_conninfo(PrimaryConnInfo);
+ if (*dbname == 

Re: Commitfest 2024-01 is now closed

2024-02-01 Thread Masahiko Sawada
On Fri, Feb 2, 2024 at 2:19 PM vignesh C  wrote:
>
> Hi,
>
> Thanks a lot to all the members who participated in the commitfest.
>
> Here are the final numbers at the end of the commitfest:
> status              |  w1  |  w2  |  w3  |  w4  |  End
> --------------------+------+------+------+------+------
> Needs review        |  238 |  213 |  181 |  146 |  N/A
> Waiting on Author   |   44 |   46 |   52 |   65 |  N/A
> Ready for Committer |   27 |   27 |   26 |   24 |  N/A
> Committed           |   36 |   46 |   57 |   71 |   75
> Moved to next CF    |    1 |    3 |    4 |    4 |  208
> Withdrawn           |    2 |    4 |   12 |   14 |   16
> RWF                 |    3 |   12 |   18 |   26 |   51
> Rejected            |    1 |    1 |    2 |    2 |    2
> Total               |  352 |  352 |  352 |  352 |  352
>
> Committed patches comparison with previous January commitfests:
> 2024: 75 committed
> 2023: 70 committed
> 2022: 58 committed
> 2021: 56 committed
> 2020: 49 committed
>
>  A special thanks to the reviewers/committers who spent tireless
> effort in moving the patches forward.
>

Thank you for all the hard work, Vignesh!

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Avoid unncessary always true test (src/backend/storage/buffer/bufmgr.c)

2024-02-01 Thread Michael Paquier
On Fri, Feb 02, 2024 at 12:04:39AM +0530, vignesh C wrote:
> The patch which you submitted has been awaiting your attention for
> quite some time now.  As such, we have moved it to "Returned with
> Feedback" and removed it from the reviewing queue. Depending on
> timing, this may be reversible.  Kindly address the feedback you have
> received, and resubmit the patch to the next CommitFest.

Even with that, it seems to me that this is not required now that
21d9c3ee4ef7 outlines better how long SMgrRelation pointers should
live, no?
--
Michael




Re: Improve eviction algorithm in ReorderBuffer

2024-02-01 Thread Masahiko Sawada
On Wed, Jan 31, 2024 at 2:18 PM Hayato Kuroda (Fujitsu)
 wrote:
>
> Dear Sawada-san,
>
> I have started to read your patches. Here are my initial comments.
> At least, all subscription tests passed on my env.

Thank you for the review comments!

>
> A comment for 0001:
>
> 01.
> ```
> +static void
> +bh_enlarge_node_array(binaryheap *heap)
> +{
> +    if (heap->bh_size < heap->bh_space)
> +        return;
> +
> +    heap->bh_space *= 2;
> +    heap->bh_nodes = repalloc(heap->bh_nodes,
> +                              sizeof(bh_node_type) * heap->bh_space);
> +}
> ```
>
> I'm not sure it is OK to use repalloc() for enlarging bh_nodes. This data
> structure is a public one, and arbitrary code and extensions can directly
> refer to bh_nodes. But if the array is repalloc()'d, the pointer would be
> updated, so their references would become dangling pointers.

Hmm, I'm not sure this is a case we really need to worry about, and I
cannot come up with a good use case where extensions refer to bh_nodes
directly rather than going through binaryheap. In the PostgreSQL code,
many Nodes already have pointers and are exposed.

> I think the internal of the structure should be a private one in this case.
>
> Comments for 0002:
>
> 02.
> ```
> +#include "utils/palloc.h"
> ```
>
> Is it really needed? I'm not sure who refers to it.

Seems not, will remove.

>
> 03.
> ```
> typedef struct bh_nodeidx_entry
> {
>     bh_node_type key;
>     char         status;
>     int          idx;
> } bh_nodeidx_entry;
> ```
>
> Sorry if it is a stupid question. Can you tell me how "status" is used?
> None of binaryheap and reorderbuffer components refer it.

It's required by simplehash.h

>
> 04.
> ```
>  extern binaryheap *binaryheap_allocate(int capacity,
> binaryheap_comparator compare,
> -   void *arg);
> +   bool indexed, void *arg);
> ```
>
> I felt pre-existing API should not be changed. How about adding
> binaryheap_allocate_extended() or something which can specify the `bool 
> indexed`?
> binaryheap_allocate() sets heap->bh_indexed to false.

I'm really not sure it's worth inventing a
binaryheap_allocate_extended() function just for preserving API
compatibility. I think it's generally a good idea to have
xxx_extended() function to increase readability and usability, for
example, for the case where the same (kind of default) arguments are
passed in most cases and the function is called from many places.
However, we have only a handful of binaryheap_allocate() callers, and I
believe that the change would not hurt the existing callers.

>
> 05.
> ```
> +extern void binaryheap_update_up(binaryheap *heap, bh_node_type d);
> +extern void binaryheap_update_down(binaryheap *heap, bh_node_type d);
> ```
>
> IIUC, callers must consider whether the node should be shifted up or down and
> use the appropriate function, right? I felt it may not be user-friendly.

Right, I couldn't come up with a better interface.

Another idea I've considered was that the caller provides a callback
function where it can compare the old and new keys. For example, in
reorderbuffer case, we call like:

binaryheap_update(rb->txn_heap, PointerGetDatum(txn),
                  ReorderBufferTXNUpdateCompare, (void *) &old_size);

Then in ReorderBufferTXNUpdateCompare(),
ReorderBufferTXN *txn = (ReorderBufferTXN *) a;
Size old_size = *(Size *) b;
(compare txn->size to "b" ...)

However it seems complicated...

>
> Comments for 0003:
>
> 06.
> ```
> This commit changes the eviction algorithm in ReorderBuffer to use
> max-heap with transaction size,a nd use two strategies depending on
> the number of transactions being decoded.
> ```
>
> s/a nd/ and/
>
> 07.
> ```
> It could be too expensive to pudate max-heap while preserving the heap
> property each time the transaction's memory counter is updated, as it
> could happen very frquently. So when the number of transactions being
> decoded is small, we add the transactions to max-heap but don't
> preserve the heap property, which is O(1). We heapify the max-heap
> just before picking the largest transaction, which is O(n). This
> strategy minimizes the overheads of updating the transaction's memory
> counter.
> ```
>
> s/pudate/update/

Will fix them.

>
> 08.
> IIUC, if more than 1024 transactions are running but they have a small amount
> of changes, the performance may be degraded, right? Do you have a result in
> such a case?

I've run a benchmark test that I shared before[1]. Here are the results of
decoding a transaction that has 1M subtransactions, each of which has 1
INSERT:

HEAD:
1810.192 ms

HEAD w/ patch:
2001.094 ms

I set logical_decoding_work_mem to a large enough value not to evict
any transactions. I can see about a 10% performance regression in
this case.
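
To illustrate the two-strategy idea from the commit message quoted above, a
rough sketch could look like the following. The threshold value, the
n_txns/txn_heap fields, and the function names are illustrative only;
binaryheap_update_up()/_down() are the new functions proposed in 0002, while
binaryheap_build() and binaryheap_first() are existing binaryheap APIs.

#include "postgres.h"
#include "lib/binaryheap.h"
#include "replication/reorderbuffer.h"

#define MANY_TXNS_THRESHOLD 1024    /* hypothetical cut-over point */

/* Called whenever a transaction's memory counter changes. */
static void
on_txn_size_changed(ReorderBuffer *rb, ReorderBufferTXN *txn, bool grew)
{
    if (rb->n_txns < MANY_TXNS_THRESHOLD)
        return;                 /* O(1): defer; heapify lazily before eviction */

    /* many transactions: keep the max-heap ordered incrementally, O(log n) */
    if (grew)
        binaryheap_update_up(rb->txn_heap, PointerGetDatum(txn));
    else
        binaryheap_update_down(rb->txn_heap, PointerGetDatum(txn));
}

/* Called only when we actually need to evict the largest transaction. */
static ReorderBufferTXN *
pick_largest_txn(ReorderBuffer *rb)
{
    if (rb->n_txns < MANY_TXNS_THRESHOLD)
        binaryheap_build(rb->txn_heap);     /* O(n) lazy heapify */

    return (ReorderBufferTXN *) DatumGetPointer(binaryheap_first(rb->txn_heap));
}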

Regards,

[1] 

Re: Commitfest 2024-01 is now closed

2024-02-01 Thread Michael Paquier
On Fri, Feb 02, 2024 at 11:56:36AM +0530, Amit Kapila wrote:
> On Fri, Feb 2, 2024 at 10:58 AM Tom Lane  wrote:
>> Thanks for all the work you did running it!  CFM is typically a
>> thankless exercise in being a nag, but I thought you put in more
>> than the usual amount of effort to keep things moving.
>>
> 
> +1. Thanks a lot for all the hard work.

+1.  Thanks!
--
Michael




Re: Synchronizing slots from primary to standby

2024-02-01 Thread Masahiko Sawada
On Fri, Feb 2, 2024 at 1:58 PM Amit Kapila  wrote:
>
> On Fri, Feb 2, 2024 at 6:46 AM Masahiko Sawada  wrote:
> >
> > On Thu, Feb 1, 2024 at 12:51 PM Amit Kapila  wrote:
> > >
> > >
> > > > BTW I've tested the following switch/fail-back scenario but it seems
> > > > not to work fine. Am I missing something?
> > > >
> > > > Setup:
> > > > node1 is the primary, node2 is the physical standby for node1, and
> > > > node3 is the subscriber connecting to node1.
> > > >
> > > > Steps:
> > > > 1. [node1]: create a table and a publication for the table.
> > > > 2. [node2]: set enable_syncslot = on and start (to receive WALs from 
> > > > node1).
> > > > 3. [node3]: create a subscription with failover = true for the 
> > > > publication.
> > > > 4. [node2]: promote to the new standby.
> > > > 5. [node3]: alter subscription to connect the new primary, node2.
> > > > 6. [node1]: stop, set enable_syncslot = on (and other required
> > > > parameters), then start as a new standby.
> > > >
> > > > Then I got the error "exiting from slot synchronization because same
> > > > name slot "test_sub" already exists on the standby".
> > > >
> > > > The logical replication slot that was created on the old primary
> > > > (node1) has been synchronized to the old standby (node2). Therefore on
> > > > node2, the slot's "synced" field is true. However, once node1 starts
> > > > as the new standby with slot synchronization, the slotsync worker
> > > > cannot synchronize the slot because the slot's "synced" field on the
> > > > primary is false.
> > > >
> > >
> > > Yeah, we avoided doing anything in this case because the user could
> > > have manually created another slot with the same name on standby.
> > > Unlike WAL slots can be modified on standby as we allow decoding on
> > > standby, so we can't allow to overwrite the existing slots. We won't
> > > be able to distinguish whether the existing slot was a slot that the
> > > user wants to sync with primary or a slot created on standby to
> > > perform decoding. I think in this case user first needs to drop the
> > > slot on new standby.
> >
> > Yes, but if we do a switch-back further (i.e. in the above case, node1
> > goes back to being the primary again and node2 becomes the standby
> > again), the user doesn't need to remove failover slots since they are
> > already marked as "synced".
>
> But, I think in this case node-2's timeline will be ahead of node-1,
> so will we be able to make node-2 follow node-1 again without any
> additional steps? One thing is not clear to me after promotion the
> timeline changes in WAL, so the locations in slots will be as per new
> timelines, after that will it be safe to sync slots from the new
> primary to old-primary?

In order for node-1 to go back to the primary again, it needs to be
promoted. That is, the node-1's timeline increments and node-2 follows
node-1.

>
> In general, I think after failover, we recommend running pg_rewind if
> the old primary has to follow the new primary to account for
> divergence in WAL. So, not sure we can safely start syncing slots in
> old-primary from new-primary, consider that in the new primary, the
> same name slot may have dropped/re-created multiple times.

Right. And I missed the point that all replication slots are removed
after pg_rewind. It would not be a problem in a failover case. But
probably we still need to consider a switchover case (i.e. switching roles
with clean shutdowns) since it doesn't require running pg_rewind?

> We can
> probably reset all the fields of the existing slot the first time
> syncing for an existing slot or do something like that but I think it
> would be better to just re-create the slot.
>
> >
>  I wonder if we could do something automatically to
> > reduce the user's operation.
>
> One possibility is that we forcefully drop/re-create the slot or
> directly overwrite the slot contents but that would probably be better
> done via some GUC or slot-level parameter. I feel we should leave this
> for another day, for the first version, we can document that an error
> will occur if the same name slots on standby exist, so users need to
> ensure that there shouldn't be an existing same name slots on standby
> before sync.
>

Hmm, I'm afraid it might not be user-friendly. But probably we can
leave it for now as it's not impossible.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Commitfest 2024-01 is now closed

2024-02-01 Thread Amit Kapila
On Fri, Feb 2, 2024 at 10:58 AM Tom Lane  wrote:
>
> vignesh C  writes:
> > Thanks a lot to all the members who participated in the commitfest.
>
> Thanks for all the work you did running it!  CFM is typically a
> thankless exercise in being a nag, but I thought you put in more
> than the usual amount of effort to keep things moving.
>

+1. Thanks a lot for all the hard work.

-- 
With Regards,
Amit Kapila.




Re: Make COPY format extendable: Extract COPY TO format implementations

2024-02-01 Thread Michael Paquier
On Fri, Feb 02, 2024 at 09:40:56AM +0900, Sutou Kouhei wrote:
> Thanks. It'll help us.

I have done a review of v10, see v11 attached which is still WIP, with
the patches for COPY TO and COPY FROM merged together.  Note that I'm
thinking to merge them into a single commit.

@@ -74,11 +75,11 @@ typedef struct CopyFormatOptions
 bool              convert_selectively;  /* do selective binary conversion? */
 CopyOnErrorChoice on_error;      /* what to do when error happened */
 List             *convert_select; /* list of column names (can be NIL) */
+const CopyToRoutine *to_routine;  /* callback routines for COPY TO */
 } CopyFormatOptions;

Adding the routines to the structure for the format options is in my
opinion incorrect.  The elements of this structure are first processed
in the option deparsing path, and then we need to use the options to
guess which routines we need.  A more natural location is cstate
itself, so as the pointer to the routines is isolated within copyto.c
and copyfrom_internal.h.  My point is: the routines are an
implementation detail that the centralized copy.c has no need to know
about.  This also led to a strange separation with
ProcessCopyOptionFormatFrom() and ProcessCopyOptionFormatTo() to fit
the hole in-between.

The separation between cstate and the format-related fields could be
much better, though I am not sure if it is worth doing as it
introduces more duplication.  For example, max_fields and raw_fields
are specific to text and csv, while binary does not care much.
Perhaps this is just useful to be for custom formats.

copyapi.h needs more documentation, like what is expected for
extension developers when using these, what are the arguments, etc.  I
have added what I had in mind for now.

+typedef char *(*PostpareColumnValue) (CopyFromState cstate, char *string, int m);

CopyReadAttributes and PostpareColumnValue are also callbacks specific
to text and csv, except that they are used within the per-row
callbacks.  The same can be said about CopyAttributeOutHeaderFunction.
It seems to me that it would be less confusing to store pointers to
them in the routine structures, where the final picture involves not
having multiple layers of APIs like CopyToCSVStart,
CopyAttributeOutTextValue, etc.  These *have* to be documented
properly in copyapi.h, and this is much easier now that cstate stores
the routine pointers.  That would also make simpler function stacks.
Note that I have not changed that in the v11 attached.

This business with the extra callbacks required for csv and text is my
main point of contention, but I'd be OK once the model of the APIs is
more linear, with everything in Copy{From,To}State.  The changes would
be rather simple, and I'd be OK to put my hands on it.  Just,
Sutou-san, would you agree with my last point about these extra
callbacks?
--
Michael
From 7aca58d0f7adfd146d287059b1d11b47acdfa758 Mon Sep 17 00:00:00 2001
From: Sutou Kouhei 
Date: Wed, 31 Jan 2024 13:37:02 +0900
Subject: [PATCH v11] Extract COPY FROM/TO format implementations

This doesn't change the current behavior.  This just introduces a set of
copy routines called CopyFromRoutine and CopyToRoutine, which simply hold
function pointers for a format implementation, like TupleTableSlotOps, and
uses them for the existing "text", "csv" and "binary" format implementations.
There are plans to extend this further in the future with custom formats.

This improves performance when a relation has many attributes as this
eliminates format-based if branches.  The more the attributes, the
better the performance gain.  Blah add some numbers.
---
 src/include/commands/copyapi.h   |  82 
 src/include/commands/copyfrom_internal.h |   8 +
 src/backend/commands/copyfrom.c  | 219 +++---
 src/backend/commands/copyfromparse.c | 439 
 src/backend/commands/copyto.c| 496 ---
 src/tools/pgindent/typedefs.list |   2 +
 6 files changed, 870 insertions(+), 376 deletions(-)
 create mode 100644 src/include/commands/copyapi.h

diff --git a/src/include/commands/copyapi.h b/src/include/commands/copyapi.h
new file mode 100644
index 00..3c0a6ba7a1
--- /dev/null
+++ b/src/include/commands/copyapi.h
@@ -0,0 +1,82 @@
+/*-
+ *
+ * copyapi.h
+ *	  API for COPY TO/FROM handlers
+ *
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/commands/copyapi.h
+ *
+ *-
+ */
+#ifndef COPYAPI_H
+#define COPYAPI_H
+
+#include "executor/tuptable.h"
+#include "nodes/execnodes.h"
+
+/* These are private in commands/copy[from|to].c */
+typedef struct CopyFromStateData *CopyFromState;
+typedef struct CopyToStateData *CopyToState;
+
+/*
+ * API structure for a COPY FROM format implementation.   Note this 
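
The attached copyapi.h excerpt is cut off above. For orientation, a minimal
illustrative sketch of the kind of per-format callback table discussed in
this thread (analogous to TupleTableSlotOps) might look like the following;
the member names, the callback signatures, and the comment about CSV are
made up here and need not match the actual patch.

#include "executor/tuptable.h"
#include "nodes/execnodes.h"

/* forward declaration, as in the excerpt above */
typedef struct CopyToStateData *CopyToState;

typedef struct CopyToRoutine
{
    /* emit any format-specific header; called once before the first row */
    void        (*start_fn) (CopyToState cstate, TupleDesc tupDesc);

    /* emit one row; called per tuple from the COPY TO loop */
    void        (*one_row_fn) (CopyToState cstate, TupleTableSlot *slot);

    /* emit any trailer and flush; called once after the last row */
    void        (*end_fn) (CopyToState cstate);
} CopyToRoutine;

/*
 * copyto.c would then keep one static const CopyToRoutine per built-in
 * format (text, csv, binary) and dispatch through a pointer stored in the
 * CopyToStateData, which is the placement agreed on in this thread.
 */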

Re: Synchronizing slots from primary to standby

2024-02-01 Thread Amit Kapila
On Fri, Feb 2, 2024 at 9:50 AM Peter Smith  wrote:
>
> Here are some review comments for v750001.
>
> ~~~
>
> 2.
> This patch also implements a new API libpqrcv_get_dbname_from_conninfo()
> to extract database name from the given connection-info
>
> ~
>
> /extract database name/the extract database name/
>

I think it should be "..extract the database name.."

> ==
> .../libpqwalreceiver/libpqwalreceiver.c
>

>
> 4. libpqrcv_connect
>
>  /*
> - * Establish the connection to the primary server for XLOG streaming
> + * Establish the connection to the primary server.
> + *
> + * The connection established could be either a replication one or
> + * a non-replication one based on input argument 'replication'. And further
> + * if it is a replication connection, it could be either logical or physical
> + * based on input argument 'logical'.
>
> That first comment ("could be either a replication one or...") seemed
> a bit meaningless (e.g. it like saying "this boolean argument can be
> true or false") because it doesn't describe what is the meaning of a
> "replication connection" versus what is a "non-replication
> connection".
>

The replication connection is a term already used in the code and
docs. For example, see the error message: "pg_hba.conf rejects
replication connection for host ..". It means that for communication
the connection would use replication protocol instead of the normal
(one used by queries) protocol. The other possibility could be to
individually explain each parameter but I think that is not what we
follow in this or related functions. I feel we can use a simple
comment like: "This API can be used for both replication and regular
connections."

> ~~~
>
> 5.
> /* We can not have logical without replication */
> Assert(replication || !logical);
>
> if (replication)
> {
> keys[++i] = "replication";
> vals[i] = logical ? "database" : "true";
>
> if (!logical)
> {
> /*
> * The database name is ignored by the server in replication mode,
> * but specify "replication" for .pgpass lookup.
> */
> keys[++i] = "dbname";
> vals[i] = "replication";
> }
> }
>
> keys[++i] = "fallback_application_name";
> vals[i] = appname;
> if (logical)
> {
> ...
> }
>
> ~
>
> The Assert already says we cannot be 'logical' if not 'replication',
> therefore IMO it seemed strange that the code was not refactored to
> bring that 2nd "if (logical)" code to within the scope of the "if
> (replication)".
>
> e.g. Can't you do something like this:
>
> Assert(replication || !logical);
>
> if (replication)
> {
>   ...
>   if (logical)
>   {
> ...
>   }
>   else
>   {
> ...
>   }
> }
> keys[++i] = "fallback_application_name";
> vals[i] = appname;
>

+1.

> ~~~
>
> 6. libpqrcv_get_dbname_from_conninfo
>
> + for (PQconninfoOption *opt = opts; opt->keyword != NULL; ++opt)
> + {
> + /*
> + * If multiple dbnames are specified, then the last one will be
> + * returned
> + */
> + if (strcmp(opt->keyword, "dbname") == 0 && opt->val &&
> + opt->val[0] != '\0')
> + dbname = pstrdup(opt->val);
> + }
>
> Should you also pfree the old dbname instead of gathering a bunch of
> strdups if there happened to be multiple dbnames specified ?
>
> SUGGESTION
> if (strcmp(opt->keyword, "dbname") == 0 && opt->val && *opt->val)
> {
>if (dbname)
> pfree(dbname);
>dbname = pstrdup(opt->val);
> }
>

Makes sense, and shouldn't we also call PQconninfoFree(opts) at the
end of libpqrcv_get_dbname_from_conninfo(), similar to
libpqrcv_check_conninfo()?
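
Putting the two suggestions together (pfree of a previously copied value, and
PQconninfoFree at the end), the loop in libpqrcv_get_dbname_from_conninfo()
could end up looking roughly like this sketch:

    for (PQconninfoOption *opt = opts; opt->keyword != NULL; ++opt)
    {
        /* If multiple dbnames are specified, the last one wins. */
        if (strcmp(opt->keyword, "dbname") == 0 && opt->val &&
            opt->val[0] != '\0')
        {
            if (dbname)
                pfree(dbname);
            dbname = pstrdup(opt->val);
        }
    }
    PQconninfoFree(opts);
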

-- 
With Regards,
Amit Kapila.




Re: Add new COPY option REJECT_LIMIT

2024-02-01 Thread torikoshia

On 2024-01-27 00:20, David G. Johnston wrote:

Thanks for your comments!


On Fri, Jan 26, 2024 at 2:49 AM torikoshia
 wrote:


Hi,

9e2d870 enabled the COPY command to skip soft errors, and I think we can
add another option which specifies the maximum tolerable number of soft
errors.

I remember this was discussed in [1], and feel it would be useful when
loading 'dirty' data but there is a limit to how dirty it can be.

Attached a patch for this.

What do you think?


I'm opposed to adding this particular feature.

When implementing this kind of business rule I'd need the option to
specify a percentage, not just an absolute value.


Yeah, it seems useful for some cases.
Actually, Greenplum allows specifying not only the max number of bad
rows but also a percentage[1].


I may be wrong, but considering that some data loaders support something like
reject_limit (Redshift supports MAXERROR[2], pg_bulkload supports
PARSE_ERRORS[3]), specifying the "number" of bad rows might also be
useful.


I think we can implement reject_limit specified by percentage by simply
calculating the ratio of skipped to processed rows at the end of CopyFrom(),
like this:

  if (cstate->opts.reject_limit > 0 &&
      (double) skipped / (processed + skipped) > cstate->opts.reject_limit_percent)
      ereport(ERROR,
              (errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
               errmsg("exceeded the ratio specified by ..




I would focus on trying to put the data required to make this kind of
determination into a place where applications implementing such
business rules and monitoring can readily get at it.  The "ERRORS TO"
and maybe a corresponding "STATS TO" option where a table can be
specified for the system to place the problematic data and stats about
the copy itself.


It'd be nice to have such informative tables, but I believe the benefit
of reject_limit is that it fails the entire load when the threshold is
exceeded.
I imagine that if we just had error and stats information tables for COPY,
users would have to delete the loaded rows themselves once they confirmed
there were too many errors in these tables.



[1]https://docs.vmware.com/en/VMware-Greenplum/7/greenplum-database/admin_guide-load-topics-g-handling-load-errors.html
[2]https://docs.aws.amazon.com/redshift/latest/dg/copy-parameters-data-load.html
[3]https://ossc-db.github.io/pg_bulkload/pg_bulkload.html

--
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation




Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

2024-02-01 Thread Dilip Kumar
On Thu, Feb 1, 2024 at 4:34 PM Dilip Kumar  wrote:
>
> On Thu, Feb 1, 2024 at 4:12 PM Dilip Kumar  wrote:
> >
> > On Thu, Feb 1, 2024 at 3:44 PM Alvaro Herrera  
> > wrote:
>
> > Okay.
> > >
> > > While I have your attention -- if you could give a look to the 0001
> > > patch I posted, I would appreciate it.
> > >
> >
> > I will look into it.  Thanks.
>
> Some quick observations,
>
> Do we need the two write barriers below at the end of the functions,
> given that the next instruction is separated by the function boundary?
>
> @@ -766,14 +766,11 @@ StartupCLOG(void)
>   ..
> - XactCtl->shared->latest_page_number = pageno;
> -
> - LWLockRelease(XactSLRULock);
> + pg_atomic_init_u64(&XactCtl->shared->latest_page_number, pageno);
> + pg_write_barrier();
>  }
>
> /*
>   * Initialize member's idea of the latest page number.
>   */
>   pageno = MXOffsetToMemberPage(offset);
> - MultiXactMemberCtl->shared->latest_page_number = pageno;
> + pg_atomic_init_u64(&MultiXactMemberCtl->shared->latest_page_number,
> +                    pageno);
> +
> + pg_write_barrier();
>  }
>

I have checked the patch and it looks fine to me other than the above
question related to memory barrier usage one more question about the
same, basically below to instances 1 and 2 look similar but in 1 you
are not using the memory write_barrier whereas in 2 you are using the
write_barrier, why is it so?  I mean why the reordering can not happen
in 1 and it may happen in 2?

1.
+ pg_atomic_write_u64(&CommitTsCtl->shared->latest_page_number,
+ trunc->pageno);

  SimpleLruTruncate(CommitTsCtl, trunc->pageno);

vs
2.

  - shared->latest_page_number = pageno;
+ pg_atomic_write_u64(&shared->latest_page_number, pageno);
+ pg_write_barrier();

  /* update the stats counter of zeroed pages */
  pgstat_count_slru_page_zeroed(shared->slru_stats_idx);
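
As background for the barrier question, here is a generic illustration of
the usual write/read barrier pairing; it is not taken from the patch, and
the struct and field names are made up. The writer's pg_write_barrier()
orders "data before flag", and the reader's matching pg_read_barrier()
orders "flag before data".

#include "postgres.h"
#include "port/atomics.h"

/* Illustrative only: this struct and its fields are hypothetical. */
typedef struct DemoShared
{
    uint64      latest_page_number;
    bool        page_ready;
} DemoShared;

static void
writer(DemoShared *shared, uint64 pageno)
{
    shared->latest_page_number = pageno;    /* 1: publish the data */
    pg_write_barrier();                     /* order store 1 before store 2 */
    shared->page_ready = true;              /* 2: publish the flag */
}

static uint64
reader(DemoShared *shared)
{
    if (shared->page_ready)                 /* saw store 2 ... */
    {
        pg_read_barrier();                  /* ... so store 1 must be visible too */
        return shared->latest_page_number;
    }
    return 0;
}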



-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Synchronizing slots from primary to standby

2024-02-01 Thread Bertrand Drouvot
Hi,

On Thu, Feb 01, 2024 at 05:29:15PM +0530, shveta malik wrote:
> Attached v75 patch-set. Changes are:
> 
> 1) Re-arranged the patches:
> 1.1) 'libpqrc' related changes (from v74-001 and v74-004) are
> separated out in v75-001 as those are independent changes.
> 1.2) 'Add logical slot sync capability', 'Slot sync worker as special
> process' and 'App-name changes' are now merged to single patch which
> makes v75-002.
> 1.3) 'Wait for physical Standby confirmation' and 'Failover Validation
> Document' patches are maintained as is (v75-003 and v75-004 now).

Thanks!

I only looked at the commit message for v75-0002 and see that it has changed
since the comment done in [1], but it still does not look correct to me.

"
If a logical slot on the primary is valid but is invalidated on the standby,
then that slot is dropped and recreated on the standby in next sync-cycle
provided the slot still exists on the primary server. It is okay to recreate
such slots as long as these are not consumable on the standby (which is the
case currently). This situation may occur due to the following reasons:
- The max_slot_wal_keep_size on the standby is insufficient to retain WAL
  records from the restart_lsn of the slot.
- primary_slot_name is temporarily reset to null and the physical slot is
  removed.
- The primary changes wal_level to a level lower than logical.
"

If a logical decoding slot "still exists on the primary server" then the primary
cannot change wal_level to lower than logical; one would get something
like:

"FATAL:  logical replication slot "logical_slot" exists, but wal_level < 
logical"

and then slots won't get invalidated on the standby. I have the feeling that the
wal_level conflict part may need to be explained separately? (I think it's not
possible that they end up being re-created on the standby for this conflict;
they will simply be removed, as it would mean the counterpart slot on the
primary does not exist anymore).

[1]: 
https://www.postgresql.org/message-id/ZYWdSIeAMQQcLmVT%40ip-10-97-1-34.eu-west-3.compute.internal

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Race condition in FetchTableStates() breaks synchronization of subscription tables

2024-02-01 Thread Alexander Lakhin

Hello Vignesh and Hou-san,

01.02.2024 07:59, vignesh C wrote:

Here is an updated patch which changes the boolean variable to a
tri-state enum and set stable state to valid only if no invalidations
have been occurred while the list is being prepared.



While testing the v3 patch, I observed another anomaly with the 014_binary
test. When I run the test in parallel, I see that sometimes one of the test
instances runs much longer than others. For example:
49  All tests successful.
49  Files=1, Tests=8,  4 wallclock secs ( 0.03 usr  0.01 sys + 0.43 cusr  0.30 csys =  0.77 CPU)
49  Result: PASS
12  All tests successful.
12  Files=1, Tests=8, 184 wallclock secs ( 0.02 usr  0.01 sys + 0.46 cusr  0.40 csys =  0.89 CPU)
12  Result: PASS

As far as I can see, this happens due to another race condition, this time
in launcher.c.
For such a three-minute case I see in _subscriber.log:
2024-02-01 14:33:13.604 UTC [949255] DEBUG:  relation 
"public.test_mismatching_types" does not exist
2024-02-01 14:33:13.604 UTC [949255] CONTEXT:  processing remote data for replication origin "pg_16398" during message 
type "INSERT" in transaction 757, finished at 0/153C838
2024-02-01 14:33:13.604 UTC [949255] ERROR:  logical replication target relation "public.test_mismatching_types" does 
not exist
2024-02-01 14:33:13.604 UTC [949255] CONTEXT:  processing remote data for replication origin "pg_16398" during message 
type "INSERT" in transaction 757, finished at 0/153C838

...
2024-02-01 14:33:13.605 UTC [949276] 014_binary.pl LOG:  statement: CREATE 
TABLE public.test_mismatching_types (
    a int PRIMARY KEY
    );
2024-02-01 14:33:13.605 UTC [942451] DEBUG:  unregistering background worker "logical replication apply worker for 
subscription 16398"
2024-02-01 14:33:13.605 UTC [942451] LOG:  background worker "logical replication apply worker" (PID 949255) exited with 
exit code 1

...
2024-02-01 14:33:13.607 UTC [949276] 014_binary.pl LOG:  statement: ALTER 
SUBSCRIPTION tsub REFRESH PUBLICATION;
...
2024-02-01 14:36:13.642 UTC [942527] DEBUG:  starting logical replication worker for 
subscription "tsub"
(there is no interesting activity between 14:33:13 and 14:36:13)

So the logical replication apply worker exited because CREATE TABLE had not
been executed on the subscriber yet, and a new replication worker started
because a timeout occurred in WaitLatch(..., wait_time, ...) inside
ApplyLauncherMain() (wait_time in this case is DEFAULT_NAPTIME_PER_CYCLE
(180 sec)).

But in a normal (fast) case the same WaitLatch exits due to MyLatch being set
as a result of:
logical replication apply worker| logicalrep_worker_onexit() ->
  ApplyLauncherWakeup() -> kill(LogicalRepCtx->launcher_pid, SIGUSR1) ->
launcher| procsignal_sigusr1_handler() -> SetLatch(MyLatch)).

In a bad case, I see that SetLatch() is called as well, but then the latch
gets reset by the following code in WaitForReplicationWorkerAttach():
    rc = WaitLatch(MyLatch,
   WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
   10L, WAIT_EVENT_BGWORKER_STARTUP);

    if (rc & WL_LATCH_SET)
    {
    ResetLatch(MyLatch);
    CHECK_FOR_INTERRUPTS();
    }

With pg_usleep(30); added just before ResetLatch and
 $node_subscriber->safe_psql(
 'postgres', qq(
+SELECT pg_sleep(0.5);
 CREATE TABLE public.test_mismatching_types (
in 014_binary.pl, I can see the anomaly without running tests in parallel,
just when running that test in a loop:
for ((i=1;i<=10;i++)); do echo "iteration $i"; make -s check -C src/test/subscription/ 
PROVE_TESTS="t/014*"; done
...
iteration 2
# +++ tap check in src/test/subscription +++
t/014_binary.pl .. ok
All tests successful.
Files=1, Tests=8,  5 wallclock secs ( 0.00 usr  0.00 sys +  0.24 cusr  0.18 
csys =  0.42 CPU)
Result: PASS
iteration 3
# +++ tap check in src/test/subscription +++
t/014_binary.pl .. ok
All tests successful.
Files=1, Tests=8, 183 wallclock secs ( 0.02 usr  0.00 sys +  0.28 cusr  0.25 
csys =  0.55 CPU)
Result: PASS
...

In other words, the abnormal test execution takes place due to the
following events:
1. logicalrep worker launcher launches replication worker and waits for it
 to attach:
 ApplyLauncherMain() -> logicalrep_worker_launch() -> 
WaitForReplicationWorkerAttach()
2. logicalrep worker exits due to some error (logical replication target
 relation does not exist, in our case) and sends a signal to set the latch
 for launcher
3. launcher sets the latch in procsignal_sigusr1_handler(), but then resets
 it inside WaitForReplicationWorkerAttach()
4. launcher gets back to ApplyLauncherMain() where it waits for the latch
 (not set) or a timeout (which happens after DEFAULT_NAPTIME_PER_CYCLE ms).
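
To make the sequence above concrete, here is a toy, self-contained C sketch
(not PostgreSQL code; the latch is modelled as a plain boolean and the
function names are invented) showing how the wakeup set in step 2 is
consumed in step 3, leaving nothing for step 4:

#include <stdbool.h>
#include <stdio.h>

static bool latch_is_set = false;

static void SetLatchToy(void)   { latch_is_set = true;  }  /* like SetLatch() */
static void ResetLatchToy(void) { latch_is_set = false; }  /* like ResetLatch() */

int
main(void)
{
    /* 1. launcher launches the worker and waits for it to attach
     *    (WaitForReplicationWorkerAttach). */

    /* 2. the worker hits an error, exits, and signals the launcher,
     *    whose SIGUSR1 handler sets the launcher's latch. */
    SetLatchToy();

    /* 3. the attach-wait loop wakes up, sees WL_LATCH_SET, and resets
     *    the latch, consuming the wakeup meant for the main loop. */
    if (latch_is_set)
        ResetLatchToy();

    /* 4. back in ApplyLauncherMain() the latch is no longer set, so the
     *    launcher waits for the full DEFAULT_NAPTIME_PER_CYCLE (180 s)
     *    before noticing that the worker needs to be restarted. */
    printf("latch at main-loop wait: %s\n",
           latch_is_set ? "set (restart immediately)" : "not set (sleep 180 s)");
    return 0;
}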

Moreover, with that sleep in WaitForReplicationWorkerAttach() added, the
test 027_nosuperuser executes for 3+ minutes on each run for me.
And we can find such abnormal execution on the buildfarm too:

Re: SQL:2011 application time

2024-02-01 Thread jian he
On Mon, Jan 29, 2024 at 8:00 AM jian he  wrote:
>
> I fixed your tests; some of them can be simplified (mainly, the
> primary key constraint is unnecessary for the failing tests).
> Also, in your foreign key patch's tests, the table temporal_rng is created at
> line 141 and we use it at around line 320,
> so it's hard to see the definition of temporal_rng.  I drop the table
> and recreate it,
> so people can view the patch with its tests more easily.
>
I've attached a new patch that further simplifies the tests (scoped to the
v24 patch's 0002 and 0003).
Please ignore the previous email's attachments.

I've only applied v24's 0002 and 0003.
It seems doc/src/sgml/ref/create_table.sgml lacks
an explanation of `temporal_interval`.

Since foreign key ON {UPDATE | DELETE} {CASCADE, SET NULL, SET DEFAULT} is
not yet supported,
v24-0003's create_table.sgml should reflect that.

+ /*
+ * For FKs with PERIOD we need an operator and aggregate function
+ * to check whether the referencing row's range is contained
+ * by the aggregated ranges of the referenced row(s).
+ * For rangetypes this is fk.periodatt <@ range_agg(pk.periodatt).
+ * FKs will look these up at "runtime", but we should make sure
+ * the lookup works here.
+ */
+ if (is_temporal)
+ FindFKPeriodOpersAndProcs(opclasses[numpks - 1], &periodoperoid,
+ &periodprocoid);

Within the function ATAddForeignKeyConstraint, you call
FindFKPeriodOpersAndProcs,
but never use the computed outputs: periodoperoid, periodprocoid,
opclasses.
We validate these (periodoperoid, periodprocoid) in
lookupTRIOperAndProc and FindFKPeriodOpersAndProcs.
I'm not sure whether the FindFKPeriodOpersAndProcs call in
ATAddForeignKeyConstraint is necessary.

+ * Check if all key values in OLD and NEW are "equivalent":
+ * For normal FKs we check for equality.
+ * For temporal FKs we check that the PK side is a superset of its old
value,
+ * or the FK side is a subset.
"or the FK side is a subset."  is misleading, should it be something
like "or the FK side is a subset of X"?

+ if (indexStruct->indisexclusion) return i - 1;
+ else return i;

I believe our style should be (with proper indent)
if (indexStruct->indisexclusion)
return i - 1;
else
return i;

in transformFkeyCheckAttrs
+ if (found && is_temporal)
+ {
+ found = false;
+ for (j = 0; j < numattrs + 1; j++)
+ {
+ if (periodattnum == indexStruct->indkey.values[j])
+ {
+ opclasses[numattrs] = indclass->values[j];
+ found = true;
+ break;
+ }
+ }
+ }

can be simplified:
{
found = false;
if (periodattnum == indexStruct->indkey.values[numattrs])
{
opclasses[numattrs] = indclass->values[numattrs];
found = true;
}
}

Also, I'm wondering about the `if (!found)` part at the end of the function
transformFkeyCheckAttrs:
do we need another error message to handle the case where is_temporal is true?


@@ -212,8 +213,11 @@ typedef struct NewConstraint
  ConstrType contype; /* CHECK or FOREIGN */
  Oid refrelid; /* PK rel, if FOREIGN */
  Oid refindid; /* OID of PK's index, if FOREIGN */
+ bool conwithperiod; /* Whether the new FOREIGN KEY uses PERIOD */
  Oid conid; /* OID of pg_constraint entry, if FOREIGN */
  Node   *qual; /* Check expr or CONSTR_FOREIGN Constraint */
+ Oid   *operoids; /* oper oids for FOREIGN KEY with PERIOD */
+ Oid   *procoids; /* proc oids for FOREIGN KEY with PERIOD */
  ExprState  *qualstate; /* Execution state for CHECK expr */
 } NewConstraint;
A primary key can have only one WITHOUT OVERLAPS column,
so *operoids and *procoids
can be replaced with just
`operoids, procoids`.
Also, these two elements in struct NewConstraint are not used in v24's 0002 and 0003.
From 8ba03aa6a442d57bd0f2117e32e703fb211b68fd Mon Sep 17 00:00:00 2001
From: jian he 
Date: Fri, 2 Feb 2024 10:31:16 +0800
Subject: [PATCH v1 1/1] refactor temporal FOREIGN KEYs test

make related tests closer together and remove unnecessary primary key
constraints, improving the tests' readability
---
 .../regress/expected/without_overlaps.out | 76 ++-
 src/test/regress/sql/without_overlaps.sql | 76 ++-
 2 files changed, 82 insertions(+), 70 deletions(-)

diff --git a/src/test/regress/expected/without_overlaps.out b/src/test/regress/expected/without_overlaps.out
index c633738c..2d0f5e21 100644
--- a/src/test/regress/expected/without_overlaps.out
+++ b/src/test/regress/expected/without_overlaps.out
@@ -421,53 +421,58 @@ DROP TABLE temporal3;
 --
 -- test FOREIGN KEY, range references range
 --
+--test table setup.
+DROP TABLE IF EXISTS temporal_rng;
+CREATE TABLE temporal_rng (id int4range, valid_at tsrange);
+ALTER TABLE temporal_rng
+	ADD CONSTRAINT temporal_rng_pk
+	PRIMARY KEY (id, valid_at WITHOUT OVERLAPS);
 -- Can't create a FK with a mismatched range type
 CREATE TABLE temporal_fk_rng2rng (
 	id int4range,
 	valid_at int4range,
 	parent_id int4range,
-	CONSTRAINT temporal_fk_rng2rng_pk2 PRIMARY KEY (id, valid_at WITHOUT OVERLAPS),
 	CONSTRAINT temporal_fk_rng2rng_fk2 FOREIGN KEY (parent_id, PERIOD valid_at)
 		REFERENCES temporal_rng (id, PERIOD valid_at)
 );
 ERROR:  foreign key constraint "temporal_fk_rng2rng_fk2" cannot be implemented
 DETAIL:  Key 

Re: More new SQL/JSON item methods

2024-02-01 Thread Jeevan Chalke
On Thu, Feb 1, 2024 at 11:25 AM Kyotaro Horiguchi 
wrote:

> At Thu, 1 Feb 2024 09:22:22 +0530, Jeevan Chalke <
> jeevan.cha...@enterprisedb.com> wrote in
> > On Thu, Feb 1, 2024 at 7:24 AM Kyotaro Horiguchi <
> horikyota@gmail.com>
> > wrote:
> >
> > > At Thu, 01 Feb 2024 10:49:57 +0900 (JST), Kyotaro Horiguchi <
> > > horikyota@gmail.com> wrote in
> > > > By the way, while playing with this feature, I noticed the following
> > > > error message:
> > > >
> > > > > select jsonb_path_query('1.1' , '$.boolean()');
> > > > > ERROR:  numeric argument of jsonpath item method .boolean() is out
> of
> > > range for type boolean
> > > >
> > > > The error message seems a bit off to me. For example, "argument '1.1'
> > > > is invalid for type [bB]oolean" seems more appropriate for this
> > > > specific issue. (I'm not certain about our policy on the spelling of
> > > > Boolean..)
> > >
> > > Or, following our general convention, it would be spelled as:
> > >
> > > 'invalid argument for type Boolean: "1.1"'
> > >
> >
> > jsonpath way:
>
> Hmm. I see.
>
> > ERROR:  argument of jsonpath item method .boolean() is invalid for type
> > boolean
> >
> > or, if we add input value, then
> >
> > ERROR:  argument "1.1" of jsonpath item method .boolean() is invalid for
> > type boolean
> >
> > And this should work for all the error types, like out of range, not
> valid,
> > invalid input, etc, etc. Also, we don't need separate error messages for
> > string input as well, which currently has the following form:
> >
> > "string argument of jsonpath item method .%s() is not a valid
> > representation.."
>
> Agreed.
>

Attached are patches based on the discussion.


>
> regards.
>
> --
> Kyotaro Horiguchi
> NTT Open Source Software Center
>


-- 
Jeevan Chalke

*Principal, Manager, Product Development*



edbpostgres.com


v1-0002-Improve-error-message-for-NaN-or-Infinity-inputs-.patch
Description: Binary data


v1-0001-Merge-error-messages-in-the-same-pattern-in-jsonp.patch
Description: Binary data


v1-0003-Unify-error-messages-for-various-jsonpath-item-me.patch
Description: Binary data


v1-0004-Show-input-value-in-the-error-message-of-various-.patch
Description: Binary data


Re: Commitfest 2024-01 is now closed

2024-02-01 Thread Tom Lane
vignesh C  writes:
> Thanks a lot to all the members who participated in the commitfest.

Thanks for all the work you did running it!  CFM is typically a
thankless exercise in being a nag, but I thought you put in more
than the usual amount of effort to keep things moving.

regards, tom lane




Re: Synchronizing slots from primary to standby

2024-02-01 Thread Bertrand Drouvot
Hi,

On Thu, Feb 01, 2024 at 04:12:43PM +0530, Amit Kapila wrote:
> On Thu, Jan 25, 2024 at 11:26 AM Bertrand Drouvot
>  wrote:
> >
> > On Wed, Jan 24, 2024 at 04:09:15PM +0530, shveta malik wrote:
> > > On Wed, Jan 24, 2024 at 2:38 PM Bertrand Drouvot
> > >  wrote:
> > > >
> > > > I also see Sawada-San's point and I'd vote for 
> > > > "sync_replication_slots". Then for
> > > > the current feature I think "failover" and "on" should be the values to 
> > > > turn the
> > > > feature on (assuming "on" would mean "all kind of supported slots").
> > >
> > > Even if others agree and we change this GUC name to
> > > "sync_replication_slots", I feel we should keep the values as "on" and
> > > "off" currently, where "on" would mean 'sync failover slots' (docs can
> > > state that clearly).
> >
> > I gave more thoughts on it and I think the values should only be "failover" 
> > or
> > "off".
> >
> > The reason is that if we allow "on" and change the "on" behavior in future
> > versions (to support more than failover slots) then that would change the 
> > behavior
> > for the ones that used "on".
> >
> 
> I again thought on this point and feel that even if we start to sync
> say physical slots their purpose would also be to allow
> failover/switchover, otherwise, there is no use of syncing the slots.

Yeah, I think this is a good point.

> So, by that theory, we can just go for naming it as
> sync_failover_slots or simply sync_slots with values 'off' and 'on'.
> Now, if these are used for switchover then there is an argument that
> adding 'failover' in the GUC name could be confusing but I feel
> 'failover' is used widely enough that it shouldn't be a problem for
> users to understand, otherwise, we can go with simple name like
> sync_slots as well.
> 

I agree and "on"/"off" looks enough to me now. As far the GUC name I've the
feeling that "replication" should be part of it, and think that 
sync_replication_slots
is fine. The reason behind is that "sync_slots" could be confusing if in the 
future other kind of "slot" (other than replication ones) are added in the 
engine.

Thoughts?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Commitfest 2024-01 is now closed

2024-02-01 Thread vignesh C
Hi,

Thanks a lot to all the members who participated in the commitfest.

Here are the final numbers at the end of the commitfest:
status              |  w1  |  w2  |  w3  |  w4  |  End
--------------------+------+------+------+------+------
Needs review        |  238 |  213 |  181 |  146 |  N/A
Waiting on Author   |   44 |   46 |   52 |   65 |  N/A
Ready for Committer |   27 |   27 |   26 |   24 |  N/A
Committed           |   36 |   46 |   57 |   71 |   75
Moved to next CF    |    1 |    3 |    4 |    4 |  208
Withdrawn           |    2 |    4 |   12 |   14 |   16
RWF                 |    3 |   12 |   18 |   26 |   51
Rejected            |    1 |    1 |    2 |    2 |    2
Total               |  352 |  352 |  352 |  352 |  352

Committed patches comparison with previous January commitfests:
2024: 75 committed
2023: 70 committed
2022: 58 committed
2021: 56 committed
2020: 49 committed

 A special thanks to the reviewers/committers who spent tireless
effort in moving the patches forward.

Regards,
Vignesh




Re: Improve eviction algorithm in ReorderBuffer

2024-02-01 Thread Shubham Khanna
On Fri, Jan 26, 2024 at 2:07 PM Masahiko Sawada  wrote:
>
> On Wed, Dec 20, 2023 at 12:11 PM Amit Kapila  wrote:
> >
> > On Wed, Dec 20, 2023 at 6:49 AM Masahiko Sawada  
> > wrote:
> > >
> > > On Tue, Dec 19, 2023 at 8:02 PM Amit Kapila  
> > > wrote:
> > > >
> > > > On Tue, Dec 19, 2023 at 8:31 AM Masahiko Sawada  
> > > > wrote:
> > > > >
> > > > > On Sun, Dec 17, 2023 at 11:40 AM Amit Kapila 
> > > > >  wrote:
> > > > > >
> > > > > >
> > > > > > The individual transactions shouldn't cross
> > > > > > 'logical_decoding_work_mem'. I got a bit confused by your proposal 
> > > > > > to
> > > > > > maintain the lists: "...splitting it into two lists: transactions
> > > > > > consuming 5% < and 5% >=  of the memory limit, and checking the 5% 
> > > > > > >=
> > > > > > list preferably.". In the previous sentence, what did you mean by
> > > > > > transactions consuming 5% >= of the memory limit? I got the 
> > > > > > impression
> > > > > > that you are saying to maintain them in a separate transaction list
> > > > > > which doesn't seems to be the case.
> > > > >
> > > > > I wanted to mean that there are three lists in total: the first one
> > > > > maintain the transactions consuming more than 10% of
> > > > > logical_decoding_work_mem,
> > > > >
> > > >
> > > > How can we have multiple transactions in the list consuming more than
> > > > 10% of logical_decoding_work_mem? Shouldn't we perform serialization
> > > > before any xact reaches logical_decoding_work_mem?
> > >
> > > Well, suppose logical_decoding_work_mem is set to 64MB, transactions
> > > consuming more than 6.4MB are added to the list. So for example, it's
> > > possible that the list has three transactions each of which are
> > > consuming 10MB while the total memory usage in the reorderbuffer is
> > > still 30MB (less than logical_decoding_work_mem).
> > >
> >
> > Thanks for the clarification. I misunderstood the list to have
> > transactions greater than 70.4 MB (64 + 6.4) in your example. But one
> > thing to note is that maintaining these lists by default can also have
> > some overhead unless the list of open transactions crosses a certain
> > threshold.
> >
>
> On further analysis, I realized that the approach discussed here might
> not be the way to go. The idea of dividing transactions into several
> subgroups is to divide a large number of entries into multiple
> sub-groups so we can reduce the complexity to search for the
> particular entry. Since we assume that there are no big differences in
> entries' sizes within a sub-group, we can pick the entry to evict in
> O(1). However, what we really need to avoid here is that we end up
> increasing the number of times to evict entries because serializing an
> entry to the disk is more costly than searching an entry on memory in
> general.
>
> I think that it's no problem in a large-entries subgroup but when it
> comes to the smallest-entries subgroup, like for entries consuming
> less than 5% of the limit, it could end up evicting many entries. For
> example, there would be a huge difference between serializing 1 entry
> consuming 5% of the memory limit and serializing 5000 entries
> consuming 0.001% of the memory limit. Even if we can select 5000
> entries quickly, I think the latter would be slower in total. The more
> subgroups we create, the more the algorithm gets complex and the
> overheads could cause. So I think we need to search for the largest
> entry in order to minimize the number of evictions anyway.
>
> Looking for data structures and algorithms, I think binaryheap with
> some improvements could be promising. I mentioned before why we cannot
> use the current binaryheap[1]. The missing pieces are efficient ways
> to remove the arbitrary entry and to update the arbitrary entry's key.
> The current binaryheap provides binaryheap_remove_node(), which is
> O(log n), but it requires the entry's position in the binaryheap. We
> can know the entry's position just after binaryheap_add_unordered()
> but it might be changed after heapify. Searching the node's position
> is O(n). So the improvement idea is to add a hash table to the
> binaryheap so that it can track the positions for each entry so that
> we can remove the arbitrary entry in O(log n) and also update the
> arbitrary entry's key in O(log n). This is known as the indexed
> priority queue. I've attached the patch for that (0001 and 0002).
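
To illustrate the indexed-binaryheap idea described in the quoted text above,
here is a minimal, self-contained C sketch. It is not the attached 0001/0002
patches: all names are invented, and for brevity each entry stores its own
heap position instead of using the separate hash table the patches add; the
point is only that tracking positions makes updating or removing an arbitrary
entry O(log n) while the largest entry stays readable in O(1):

#include <stdio.h>

typedef struct TxnEntry
{
    long        size;           /* key: transaction's memory usage */
    int         pos;            /* current index in Heap.nodes[] */
} TxnEntry;

typedef struct Heap
{
    TxnEntry  **nodes;
    int         nnodes;
} Heap;

static void
swap_nodes(Heap *h, int a, int b)
{
    TxnEntry   *tmp = h->nodes[a];

    h->nodes[a] = h->nodes[b];
    h->nodes[b] = tmp;
    h->nodes[a]->pos = a;
    h->nodes[b]->pos = b;
}

static void
sift_up(Heap *h, int i)
{
    while (i > 0 && h->nodes[(i - 1) / 2]->size < h->nodes[i]->size)
    {
        swap_nodes(h, i, (i - 1) / 2);
        i = (i - 1) / 2;
    }
}

static void
sift_down(Heap *h, int i)
{
    for (;;)
    {
        int     largest = i, l = 2 * i + 1, r = 2 * i + 2;

        if (l < h->nnodes && h->nodes[l]->size > h->nodes[largest]->size)
            largest = l;
        if (r < h->nnodes && h->nodes[r]->size > h->nodes[largest]->size)
            largest = r;
        if (largest == i)
            break;
        swap_nodes(h, i, largest);
        i = largest;
    }
}

static void
heap_add(Heap *h, TxnEntry *e)          /* O(log n) */
{
    e->pos = h->nnodes;
    h->nodes[h->nnodes++] = e;
    sift_up(h, e->pos);
}

static void
heap_update(Heap *h, TxnEntry *e, long newsize)     /* O(log n) */
{
    e->size = newsize;
    sift_up(h, e->pos);                 /* the entry knows its own position */
    sift_down(h, e->pos);
}

static void
heap_remove(Heap *h, TxnEntry *e)       /* O(log n), arbitrary entry */
{
    int     i = e->pos;

    h->nnodes--;
    if (i < h->nnodes)
    {
        h->nodes[i] = h->nodes[h->nnodes];
        h->nodes[i]->pos = i;
        sift_up(h, i);
        sift_down(h, i);
    }
}

int
main(void)
{
    TxnEntry    txns[4] = {{100, 0}, {400, 0}, {50, 0}, {300, 0}};
    TxnEntry   *slots[4];
    Heap        h = {slots, 0};

    for (int i = 0; i < 4; i++)
        heap_add(&h, &txns[i]);
    heap_update(&h, &txns[2], 500);     /* this txn grew; it bubbles to the top */
    heap_remove(&h, &txns[1]);          /* this txn finished; drop it */
    printf("largest txn: %ld\n", h.nodes[0]->size);     /* prints 500 */
    return 0;
}
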
>
> That way, in terms of reorderbuffer, we can update and remove the
> transaction's memory usage in O(log n) (in worst case and O(1) in
> average) and then pick the largest transaction in O(1). Since we might
> need to call ReorderBufferSerializeTXN() even in non-streaming case,
> we need to maintain the binaryheap anyway. I've attached the patch for
> that (0003).
>
> Here are test script for many sub-transactions case:
>
> create table test (c int);
> create or replace function testfn (cnt int) returns void as $$
> begin
>   for i in 1..cnt loop
> begin
>   insert into test 

Re: Synchronizing slots from primary to standby

2024-02-01 Thread Amit Kapila
On Fri, Feb 2, 2024 at 6:46 AM Masahiko Sawada  wrote:
>
> On Thu, Feb 1, 2024 at 12:51 PM Amit Kapila  wrote:
> >
> >
> > > BTW I've tested the following switch/fail-back scenario but it seems
> > > not to work fine. Am I missing something?
> > >
> > > Setup:
> > > node1 is the primary, node2 is the physical standby for node1, and
> > > node3 is the subscriber connecting to node1.
> > >
> > > Steps:
> > > 1. [node1]: create a table and a publication for the table.
> > > 2. [node2]: set enable_syncslot = on and start (to receive WALs from 
> > > node1).
> > > 3. [node3]: create a subscription with failover = true for the 
> > > publication.
> > > 4. [node2]: promote to the new standby.
> > > 5. [node3]: alter subscription to connect the new primary, node2.
> > > 6. [node1]: stop, set enable_syncslot = on (and other required
> > > parameters), then start as a new standby.
> > >
> > > Then I got the error "exiting from slot synchronization because same
> > > name slot "test_sub" already exists on the standby".
> > >
> > > The logical replication slot that was created on the old primary
> > > (node1) has been synchronized to the old standby (node2). Therefore on
> > > node2, the slot's "synced" field is true. However, once node1 starts
> > > as the new standby with slot synchronization, the slotsync worker
> > > cannot synchronize the slot because the slot's "synced" field on the
> > > primary is false.
> > >
> >
> > Yeah, we avoided doing anything in this case because the user could
> > have manually created another slot with the same name on standby.
> > Unlike WAL slots can be modified on standby as we allow decoding on
> > standby, so we can't allow to overwrite the existing slots. We won't
> > be able to distinguish whether the existing slot was a slot that the
> > user wants to sync with primary or a slot created on standby to
> > perform decoding. I think in this case user first needs to drop the
> > slot on new standby.
>
> Yes, but if we do a further switch-back (i.e. in the above case, node1
> goes back to being the primary and node2 becomes the standby again), the
> user doesn't need to remove failover slots since they are already
> marked as "synced".

But I think in this case node-2's timeline will be ahead of node-1's,
so will we be able to make node-2 follow node-1 again without any
additional steps? One thing that is not clear to me: after promotion the
timeline changes in the WAL, so the locations in the slots will be as per the
new timeline; after that, will it be safe to sync slots from the new
primary to the old primary?

In general, I think after failover, we recommend running pg_rewind if
the old primary has to follow the new primary, to account for
divergence in WAL. So I'm not sure we can safely start syncing slots on the
old primary from the new primary; consider that on the new primary, the
same-name slot may have been dropped/re-created multiple times. We could
probably reset all the fields of the existing slot the first time we
sync an existing slot, or do something like that, but I think it
would be better to just re-create the slot.

>
> I wonder if we could do something automatically to
> reduce the user's operation.

One possibility is that we forcefully drop/re-create the slot or
directly overwrite the slot contents, but that would probably be better
done via some GUC or slot-level parameter. I feel we should leave this
for another day; for the first version, we can document that an error
will occur if a slot with the same name exists on the standby, so users need to
ensure that there isn't an existing slot with the same name on the standby
before the sync.

-- 
With Regards,
Amit Kapila.




Re: POC: GROUP BY optimization

2024-02-01 Thread Andrei Lepikhov

On 2/2/2024 11:06, Richard Guo wrote:


On Fri, Feb 2, 2024 at 11:32 AM Richard Guo  wrote:


On Fri, Feb 2, 2024 at 10:02 AM Tom Lane  wrote:

One of the test cases added by this commit has not been very
stable in the buildfarm.  Latest example is here:


https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion=2024-02-01%2021%3A28%3A04
 


and I've seen similar failures intermittently on other machines.

I'd suggest building this test atop a table that is more stable
than pg_class.  You're just waving a red flag in front of a bull
if you expect stable statistics from that during a regression run.
Nor do I see any particular reason for pg_class to be especially
suited to the test.


Yeah, it's not a good practice to use pg_class in this place.  While
looking through the test cases added by this commit, I noticed some
other minor issues that are not great.  Such as

* The table 'btg' is inserted with 1 tuples, which seems a bit
expensive for a test.  I don't think we need such a big table to test
what we want.

* I don't see why we need to manipulate GUC max_parallel_workers and
max_parallel_workers_per_gather.

* I think we'd better write the tests with the keywords being all upper
or all lower.  A mixed use of upper and lower is not great. Such as in

     explain (COSTS OFF) SELECT x,y FROM btg GROUP BY x,y,z,w;

* Some comments for the test queries are not easy to read.

* For this statement

     CREATE INDEX idx_y_x_z ON btg(y,x,w);

I think the index name would cause confusion.  It creates an index on
columns y, x and w, but the name indicates an index on y, x and z.

I'd like to write a draft patch for the fixes.


Here is the draft patch that fixes the issues I complained about in
upthread.
I went through the patch. It looks like it doesn't break anything. Why do
you prefer to use count(*) in EXPLAIN instead of a plain targetlist, like
"SELECT x,y,..."?

Also, regarding the test mentioned by Tom:
1. I see that PG uses an IndexScan on (x,y), so column x will already be
sorted before the MergeJoin. Why not use Incremental Sort on (x,z,w)
instead of a full sort?
2. As a note, IMO this test shows the near-future potential of this
feature: it is cheaper to use grouping order (w,z) instead of (z,w).


--
regards,
Andrei Lepikhov
Postgres Professional





Re: Synchronizing slots from primary to standby

2024-02-01 Thread Peter Smith
Here are some review comments for v75-0001.

==
Commit message

1.
This patch provides support for non-replication connection
in libpqrcv_connect().

~

1a.
/connection/connections/

~

1b.
Maybe there needs to be a few more sentences just to describe what you
mean by "non-replication connection".

~

1c.
IIUC although the 'replication' parameter is added, in this patch
AFAICT every call to the connect function is still passing that
argument as true. If that's correct, probably this patch comment
should emphasise that this patch doesn't change any functionality at
all but is just preparation for later patches which *will* pass false
for the replication arg.

~~~

2.
This patch also implements a new API libpqrcv_get_dbname_from_conninfo()
to extract database name from the given connection-info

~

/extract database name/extract the database name/

==
.../libpqwalreceiver/libpqwalreceiver.c

3.
+ * Apart from walreceiver, the libpq-specific routines here are now being used
+ * by logical replication worker as well.

/worker/workers/

~~~

4. libpqrcv_connect

 /*
- * Establish the connection to the primary server for XLOG streaming
+ * Establish the connection to the primary server.
+ *
+ * The connection established could be either a replication one or
+ * a non-replication one based on input argument 'replication'. And further
+ * if it is a replication connection, it could be either logical or physical
+ * based on input argument 'logical'.

That first comment ("could be either a replication one or...") seemed
a bit meaningless (e.g. it is like saying "this boolean argument can be
true or false") because it doesn't describe what is the meaning of a
"replication connection" versus what is a "non-replication
connection".

~~~

5.
/* We can not have logical without replication */
Assert(replication || !logical);

if (replication)
{
keys[++i] = "replication";
vals[i] = logical ? "database" : "true";

if (!logical)
{
/*
* The database name is ignored by the server in replication mode,
* but specify "replication" for .pgpass lookup.
*/
keys[++i] = "dbname";
vals[i] = "replication";
}
}

keys[++i] = "fallback_application_name";
vals[i] = appname;
if (logical)
{
...
}

~

The Assert already says we cannot be 'logical' if not 'replication',
therefore IMO it seemed strange that the code was not refactored to
bring that 2nd "if (logical)" code to within the scope of the "if
(replication)".

e.g. Can't you do something like this:

Assert(replication || !logical);

if (replication)
{
  ...
  if (logical)
  {
...
  }
  else
  {
...
  }
}
keys[++i] = "fallback_application_name";
vals[i] = appname;

~~~

6. libpqrcv_get_dbname_from_conninfo

+ for (PQconninfoOption *opt = opts; opt->keyword != NULL; ++opt)
+ {
+ /*
+ * If multiple dbnames are specified, then the last one will be
+ * returned
+ */
+ if (strcmp(opt->keyword, "dbname") == 0 && opt->val &&
+ opt->val[0] != '\0')
+ dbname = pstrdup(opt->val);
+ }

Should you also pfree the old dbname instead of gathering a bunch of
strdups if there happened to be multiple dbnames specified?

SUGGESTION
if (strcmp(opt->keyword, "dbname") == 0 && opt->val && *opt->val)
{
   if (dbname)
pfree(dbname);
   dbname = pstrdup(opt->val);
}

==
src/include/replication/walreceiver.h

7.
/*
 * walrcv_connect_fn
 *
 * Establish connection to a cluster.  'logical' is true if the
 * connection is logical, and false if the connection is physical.
 * 'appname' is a name associated to the connection, to use for example
 * with fallback_application_name or application_name.  Returns the
 * details about the connection established, as defined by
 * WalReceiverConn for each WAL receiver module.  On error, NULL is
 * returned with 'err' including the error generated.
 */
typedef WalReceiverConn *(*walrcv_connect_fn) (const char *conninfo,
   bool replication,
   bool logical,
   bool must_use_password,
   const char *appname,
   char **err);

~

The comment is missing any description of the new parameter 'replication'.

~~~

8.
+/*
+ * walrcv_get_dbname_from_conninfo_fn
+ *
+ * Returns the dbid from the primary_conninfo
+ */
+typedef char *(*walrcv_get_dbname_from_conninfo_fn) (const char *conninfo);
+

/dbid/database name/

==
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: POC: GROUP BY optimization

2024-02-01 Thread Richard Guo
On Fri, Feb 2, 2024 at 11:32 AM Richard Guo  wrote:

> On Fri, Feb 2, 2024 at 10:02 AM Tom Lane  wrote:
>
>> One of the test cases added by this commit has not been very
>> stable in the buildfarm.  Latest example is here:
>>
>>
>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion=2024-02-01%2021%3A28%3A04
>>
>> and I've seen similar failures intermittently on other machines.
>>
>> I'd suggest building this test atop a table that is more stable
>> than pg_class.  You're just waving a red flag in front of a bull
>> if you expect stable statistics from that during a regression run.
>> Nor do I see any particular reason for pg_class to be especially
>> suited to the test.
>
>
> Yeah, it's not a good practice to use pg_class in this place.  While
> looking through the test cases added by this commit, I noticed some
> other minor issues that are not great.  Such as
>
> * The table 'btg' is inserted with 1 tuples, which seems a bit
> expensive for a test.  I don't think we need such a big table to test
> what we want.
>
> * I don't see why we need to manipulate GUC max_parallel_workers and
> max_parallel_workers_per_gather.
>
> * I think we'd better write the tests with the keywords being all upper
> or all lower.  A mixed use of upper and lower is not great. Such as in
>
> explain (COSTS OFF) SELECT x,y FROM btg GROUP BY x,y,z,w;
>
> * Some comments for the test queries are not easy to read.
>
> * For this statement
>
> CREATE INDEX idx_y_x_z ON btg(y,x,w);
>
> I think the index name would cause confusion.  It creates an index on
> columns y, x and w, but the name indicates an index on y, x and z.
>
> I'd like to write a draft patch for the fixes.
>

Here is the draft patch that fixes the issues I complained about in
upthread.

Thanks
Richard


v1-0001-Multiple-revises-for-the-GROUP-BY-reordering-tests.patch
Description: Binary data


Re: POC: GROUP BY optimization

2024-02-01 Thread Andrei Lepikhov

On 2/2/2024 09:02, Tom Lane wrote:

Alexander Korotkov  writes:

I'm going to push this if there are no objections.


One of the test cases added by this commit has not been very
stable in the buildfarm.  Latest example is here:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion=2024-02-01%2021%3A28%3A04

and I've seen similar failures intermittently on other machines.

I'd suggest building this test atop a table that is more stable
than pg_class.  You're just waving a red flag in front of a bull
if you expect stable statistics from that during a regression run.
Nor do I see any particular reason for pg_class to be especially
suited to the test.

Yeah, it is my fault. Please see the attached patch fixing that.
--
regards,
Andrei Lepikhov
Postgres Professional
From 11a049d95ee48e38ad569aab7663d8de91f946ad Mon Sep 17 00:00:00 2001
From: "Andrey V. Lepikhov" 
Date: Fri, 2 Feb 2024 10:39:55 +0700
Subject: [PATCH] Replace the GROUP-BY optimization test with the same based on
 something less volatile than the pg_class relation.

---
 src/test/regress/expected/aggregates.out | 32 +++-
 src/test/regress/sql/aggregates.sql  |  9 +++
 2 files changed, 18 insertions(+), 23 deletions(-)

diff --git a/src/test/regress/expected/aggregates.out 
b/src/test/regress/expected/aggregates.out
index 7a73c19314..c2e1b8c9ed 100644
--- a/src/test/regress/expected/aggregates.out
+++ b/src/test/regress/expected/aggregates.out
@@ -2873,7 +2873,6 @@ SELECT y,x,array_agg(distinct w) FROM btg WHERE y < 0 
GROUP BY x,y;
 (6 rows)
 
 RESET enable_incremental_sort;
-DROP TABLE btg;
 -- The case, when scanning sort order correspond to aggregate sort order but
 -- can not be found in the group-by list
 CREATE TABLE agg_sort_order (c1 int PRIMARY KEY, c2 int);
@@ -2901,32 +2900,31 @@ DROP TABLE agg_sort_order CASCADE;
 SET enable_hashjoin = off;
 SET enable_nestloop = off;
 explain (COSTS OFF)
-SELECT c1.relname,c1.relpages
-FROM pg_class c1 JOIN pg_class c2 ON (c1.relname=c2.relname AND 
c1.relpages=c2.relpages)
-GROUP BY c1.reltuples,c1.relpages,c1.relname
-ORDER BY c1.relpages, c1.relname, c1.relpages*c1.relpages;
- QUERY PLAN
  
--
+SELECT b1.x,b1.w FROM btg b1 JOIN btg b2 ON (b1.z=b2.z AND b1.w=b2.w)
+GROUP BY b1.x,b1.z,b1.w ORDER BY b1.z, b1.w, b1.x*b1.x;
+QUERY PLAN 
+---
  Incremental Sort
-   Sort Key: c1.relpages, c1.relname, ((c1.relpages * c1.relpages))
-   Presorted Key: c1.relpages, c1.relname
+   Sort Key: b1.z, b1.w, ((b1.x * b1.x))
+   Presorted Key: b1.z, b1.w
->  Group
- Group Key: c1.relpages, c1.relname, c1.reltuples
+ Group Key: b1.z, b1.w, b1.x
  ->  Incremental Sort
-   Sort Key: c1.relpages, c1.relname, c1.reltuples
-   Presorted Key: c1.relpages, c1.relname
+   Sort Key: b1.z, b1.w, b1.x
+   Presorted Key: b1.z, b1.w
->  Merge Join
- Merge Cond: ((c1.relpages = c2.relpages) AND (c1.relname 
= c2.relname))
+ Merge Cond: ((b1.z = b2.z) AND (b1.w = b2.w))
  ->  Sort
-   Sort Key: c1.relpages, c1.relname
-   ->  Seq Scan on pg_class c1
+   Sort Key: b1.z, b1.w
+   ->  Seq Scan on btg b1
  ->  Sort
-   Sort Key: c2.relpages, c2.relname
-   ->  Seq Scan on pg_class c2
+   Sort Key: b2.z, b2.w
+   ->  Seq Scan on btg b2
 (16 rows)
 
 RESET enable_hashjoin;
 RESET enable_nestloop;
+DROP TABLE btg;
 RESET enable_hashagg;
 RESET max_parallel_workers;
 RESET max_parallel_workers_per_gather;
diff --git a/src/test/regress/sql/aggregates.sql 
b/src/test/regress/sql/aggregates.sql
index 916dbf908f..3548fbb8db 100644
--- a/src/test/regress/sql/aggregates.sql
+++ b/src/test/regress/sql/aggregates.sql
@@ -1229,8 +1229,6 @@ EXPLAIN (VERBOSE, COSTS OFF)
 SELECT y,x,array_agg(distinct w) FROM btg WHERE y < 0 GROUP BY x,y;
 RESET enable_incremental_sort;
 
-DROP TABLE btg;
-
 -- The case, when scanning sort order correspond to aggregate sort order but
 -- can not be found in the group-by list
 CREATE TABLE agg_sort_order (c1 int PRIMARY KEY, c2 int);
@@ -1245,13 +1243,12 @@ DROP TABLE agg_sort_order CASCADE;
 SET enable_hashjoin = off;
 SET enable_nestloop = off;
 explain (COSTS OFF)
-SELECT c1.relname,c1.relpages
-FROM pg_class c1 JOIN pg_class c2 ON (c1.relname=c2.relname AND 
c1.relpages=c2.relpages)
-GROUP BY c1.reltuples,c1.relpages,c1.relname
-ORDER BY c1.relpages, c1.relname, c1.relpages*c1.relpages;
+SELECT b1.x,b1.w FROM btg b1 JOIN 

Re: POC: GROUP BY optimization

2024-02-01 Thread Richard Guo
On Fri, Feb 2, 2024 at 10:02 AM Tom Lane  wrote:

> Alexander Korotkov  writes:
> > I'm going to push this if there are no objections.
>
> One of the test cases added by this commit has not been very
> stable in the buildfarm.  Latest example is here:
>
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion=2024-02-01%2021%3A28%3A04
>
> and I've seen similar failures intermittently on other machines.
>
> I'd suggest building this test atop a table that is more stable
> than pg_class.  You're just waving a red flag in front of a bull
> if you expect stable statistics from that during a regression run.
> Nor do I see any particular reason for pg_class to be especially
> suited to the test.


Yeah, it's not a good practice to use pg_class in this place.  While
looking through the test cases added by this commit, I noticed some
other minor issues that are not great.  Such as

* The table 'btg' is inserted with 1 tuples, which seems a bit
expensive for a test.  I don't think we need such a big table to test
what we want.

* I don't see why we need to manipulate GUC max_parallel_workers and
max_parallel_workers_per_gather.

* I think we'd better write the tests with the keywords being all upper
or all lower.  A mixed use of upper and lower is not great. Such as in

explain (COSTS OFF) SELECT x,y FROM btg GROUP BY x,y,z,w;

* Some comments for the test queries are not easy to read.

* For this statement

CREATE INDEX idx_y_x_z ON btg(y,x,w);

I think the index name would cause confusion.  It creates an index on
columns y, x and w, but the name indicates an index on y, x and z.

I'd like to write a draft patch for the fixes.

Thanks
Richard


Re: An improvement on parallel DISTINCT

2024-02-01 Thread David Rowley
On Wed, 27 Dec 2023 at 00:23, Richard Guo  wrote:
> -- on master
> EXPLAIN (costs off)
> SELECT DISTINCT four FROM tenk1;
>  QUERY PLAN
> 
>  Unique
>->  Sort
>  Sort Key: four
>  ->  Gather
>Workers Planned: 2
>->  HashAggregate
>  Group Key: four
>  ->  Parallel Seq Scan on tenk1
> (8 rows)
>
> -- on patched
> EXPLAIN (costs off)
> SELECT DISTINCT four FROM tenk1;
>  QUERY PLAN
> 
>  Unique
>->  Gather Merge
>  Workers Planned: 2
>  ->  Sort
>Sort Key: four
>->  HashAggregate
>  Group Key: four
>  ->  Parallel Seq Scan on tenk1
> (8 rows)
>
> I believe the second plan is better.

I wonder if this change is worthwhile. The sort is only required at
all because the planner opted to HashAggregate in phase1, of which the
rows are output unordered. If phase1 was done by Group Aggregate, then
no sorting would be needed.  The only reason the planner didn't Hash
Aggregate for phase2 is because of the order we generate the distinct
paths and because of STD_FUZZ_FACTOR.

Look at the costs of the above plan:

Unique  (cost=397.24..397.28 rows=4 width=4)

if I enable_sort=0; then I get a cheaper plan:

 HashAggregate  (cost=397.14..397.18 rows=4 width=4)

If we add more rows then the cost of sorting will grow faster than the
cost of hash aggregate due to the O(N log2 N) part of our sort
costing.
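
As a rough standalone illustration of that growth-rate claim (this is not
PostgreSQL's cost model and uses none of its costing constants; it only shows
how an N * log2(N) term pulls away from a linear one as rows are added):

#include <math.h>
#include <stdio.h>

int
main(void)
{
    /* A sort path carries an N * log2(N) component, while a hash
     * aggregate's input cost grows roughly linearly in N, so the gap
     * widens as N grows. */
    for (double n = 1e4; n <= 1e8; n *= 10)
        printf("N=%.0e  N*log2(N)=%.3e  ratio=%.1f\n",
               n, n * log2(n), log2(n));
    return 0;
}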

If I drop the index on tenk1(hundred), I only need to go to the
"hundred" column to have it switch to Hash Aggregate on the 2nd phase.
This is because the number of distinct groups costs the paths for
Group Aggregate and Hash Aggregate more than STD_FUZZ_FACTOR apart.
Adjusting the STD_FUZZ_FACTOR with the following means Hash Aggregate
is used for both phases.

-#define STD_FUZZ_FACTOR 1.01
+#define STD_FUZZ_FACTOR 1.001

In light of this, do you still think it's worthwhile making this change?

For me, I think all it's going to result in is extra planner work
without any performance gains.

> Attached is a patch that includes this change and also eliminates the
> usage of root->upper_targets[] in the core code.  It also makes some
> tweaks for the comment.

We should fix that.  We can consider it independently from the other
change you're proposing.

David




Re: Small fix on COPY ON_ERROR document

2024-02-01 Thread torikoshia

On 2024-02-01 15:16, Yugo NAGATA wrote:

On Mon, 29 Jan 2024 15:47:25 +0900
Yugo NAGATA  wrote:


On Sun, 28 Jan 2024 19:14:58 -0700
"David G. Johnston"  wrote:

> > Also, I think "invalid input syntax" is a bit ambiguous. For example,
> > COPY FROM raises an error when the number of input columns does not match
> > the table schema, but this error is not ignored by ON_ERROR while
> > this seems to fall into the category of "invalid input syntax".
>
>
>
> It is literally the error text that appears if one were not to ignore it.
> It isn’t a category of errors.  But I’m open to ideas here.  But being
> explicit with what one actually sees in the system seemed preferable to
> inventing new classification terms not otherwise used.

Thank you for the explanation! I understood the words were from the error
messages that users actually see. However, as Torikoshi-san said in [1],
errors other than invalid input syntax (e.g. range errors) can also be
ignored, therefore it would be better to describe the errors to be ignored
more specifically.

[1] 
https://www.postgresql.org/message-id/7f1457497fa3bf9dfe486f162d1c8ec6%40oss.nttdata.com


>
> >
> > So, keeping consistency with the existing description, we can say:
> >
> > "Specifies which how to behave when encountering an error due to
> >  column values unacceptable to the input function of each attribute's
> >  data type."
>
>
> Yeah, I was considering something along those lines as an option as well.
> But I’d rather add that wording to the glossary.

Although I am still not convinced that we have to introduce the words
"soft error" into the documentation, I don't mind if there are no other
opposing opinions.


Attached is an updated patch v3, which uses the above
wording instead of "soft error".


>
> > Currently, ON_ERROR doesn't support other soft errors, so it can explain
> > it more simply without introducing the new concept, "soft error" to users.
> >
> >
> Good point.  Seems we should define what user-facing errors are ignored
> anywhere in the system and if we aren’t consistently leveraging these in
> all areas/commands make the necessary qualifications in those specific
> places.
>

> > I think "left in a deleted state" is also unclear for users because this
> > explains the internal state but not how it looks from the user's view. How about
> > leaving the explanation "These rows will not be visible or accessible" in
> > the existing statement?
> >
>
> Just visible then, I don’t like an “or” there and as tuples at least they
> are accessible to the system, in vacuum especially.  But I expected the
> user to understand “as if you deleted it” as their operational concept more
> readily than visible.  I think this will be read by people who haven’t read
> MVCC to fully understand what visible means but know enough to run vacuum
> to clean up updated and deleted data as a rule.

Ok, I agree we can omit "or accessible". How do you like the
following?

Still redundant?

 "If the command fails, these rows are left in a deleted state;
  these rows will not be visible, but they still occupy disk space. "


Also, the above statement is used in the patch.


Thanks for updating the patch!

I like your description which doesn't use the word soft error.


Here are minor comments:

+  ignore means discard the input row and 
continue with the next one.

+  The default is stop


Is "." required at the end of the line?

 An NOTICE level context message containing the 
ignored row count is


Should 'An' be 'A'?

Also, I wasn't sure about the necessity of 'context'.
It might be possible to just say "A NOTICE message containing the 
ignored row count.."

considering below existing descriptions:

  doc/src/sgml/pltcl.sgml: a NOTICE message each 
time a supported command is

  doc/src/sgml/pltcl.sgml- executed:

  doc/src/sgml/plpgsql.sgml: This example trigger simply raises a 
NOTICE message
  doc/src/sgml/plpgsql.sgml- each time a supported command is 
executed.


--
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation




Re: Oversight in reparameterize_path_by_child leading to executor crash

2024-02-01 Thread Richard Guo
On Fri, Feb 2, 2024 at 1:45 AM Tom Lane  wrote:

> I pushed v12 (with some cosmetic adjustments) into the back branches,
> since we're getting close to the February release freeze.  We still
> need to deal with the larger fix for HEAD.  Please re-post that
> one so that the cfbot knows which is the patch-of-record.


Thanks for the adjustments and pushing!

Here is the patch for HEAD.  I simply re-posted v10.  Nothing has
changed.

Thanks
Richard


v10-0001-Postpone-reparameterization-of-paths-until-when-creating-plans.patch
Description: Binary data


Re: set_cheapest without checking pathlist

2024-02-01 Thread James Coleman
On Thu, Feb 1, 2024 at 1:36 AM Richard Guo  wrote:
>
>
> On Thu, Feb 1, 2024 at 11:37 AM David Rowley  wrote:
>>
>> On Thu, 1 Feb 2024 at 16:29, Richard Guo  wrote:
>> > On Thu, Feb 1, 2024 at 10:04 AM James Coleman  wrote:
>> >> I don't see any inherent reason why we must always assume that
>> >> gather_grouping_paths will always result in having at least one entry
>> >> in pathlist. If, for example, we've disabled incremental sort and the
>> >> cheapest partial path happens to already be sorted, then I don't
>> >> believe we'll add any paths. And if that happens then set_cheapest
>> >> will error with the message "could not devise a query plan for the
>> >> given query". So I propose we add a similar guard to this call point.
>> >
>> >
>> > I don't believe that would happen.  It seems to me that we should, at
>> > the very least, have a path which is Gather on top of the cheapest
>> > partial path (see generate_gather_paths), as long as the
>> > partially_grouped_rel has partial paths.
>>
>> It will have partial paths because it's nested inside "if
>> (partially_grouped_rel && partially_grouped_rel->partial_pathlist)"
>
>
> Right.  And that leads to the conclusion that gather_grouping_paths will
> always generate at least one path for partially_grouped_rel.  So it
> seems to me that the check added in the patch is not necessary.  But
> maybe we can add an Assert or a comment clarifying that the pathlist of
> partially_grouped_rel cannot be empty here.

Yes, that's true currently. I don't particularly love that we have the
knowledge of that baked in at this level, but it won't break anything
currently.

I'll put this as a patch in the parallelization patch series
referenced earlier since in that series my changes result in
generate_gather_paths() not necessarily always being able to build a
path.
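
For reference, the two shapes discussed above would look roughly like this at
that call site (a schematic fragment only, not a complete patch;
partially_grouped_rel and the planner helpers come from the surrounding core
code):

/* Option 1: guard the call so an empty pathlist cannot make
 * set_cheapest() fail with "could not devise a query plan". */
if (partially_grouped_rel->pathlist != NIL)
    set_cheapest(partially_grouped_rel);

/* Option 2: keep the call unguarded but document the current invariant
 * that gather_grouping_paths() always added at least one path. */
Assert(partially_grouped_rel->pathlist != NIL);
set_cheapest(partially_grouped_rel);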

Regards,
James Coleman




Re: speed up a logical replica setup

2024-02-01 Thread Euler Taveira
On Thu, Feb 1, 2024, at 9:47 AM, Hayato Kuroda (Fujitsu) wrote:
> I made fix patches to solve the issues reported by Fabrízio.
> 
> * v13-0001: Same as v11-0001 made by Euler.
> * v13-0002: Fixes ERRORs while dropping replication slots [1].
>    If you want to see the code on which we have agreement, please apply
> up to 0002.
> === experimental patches ===
> * v13-0003: Avoids using replication connections. The issue [2] was solved 
> on my env.
> * v13-0004: Removes -P option and use primary_conninfo instead.
> * v13-0005: Refactors data structures

Thanks for rebasing the proposed patches. I'm attaching a new patch.

As I said in the previous email [1] I fixed some issues from your previous 
review:

* use pg_fatal() if possible. There are some cases where it couldn't replace
  pg_log_error() + exit(1) because a hint is required.
* pg_resetwal output. Send standard output to /dev/null to avoid extra message.
* check privileges. Make sure the current role can execute CREATE SUBSCRIPTION
  and pg_replication_origin_advance().
* log directory. Refactor code that setup the log file used as server log.
* run with restricted token (Windows).
* v13-0002. Merged.
* v13-0003. I applied a modified version. I returned only the required
  information for each query.
* group command-line options into a new struct CreateSubscriberOptions. The
  exception is the dry_run option.
* rename functions that obtain system identifier.

WIP

I'm still working on the data structures to group options. I don't like the way
they were grouped in v13-0005. There are too many levels to reach the database
name. The setup_subscriber() function requires the 3 data structures.

The documentation update is almost there. I will include the modifications in
the next patch.

Regarding v13-0004, it seems like a good UI; that's why I wrote a comment about it.
However, it comes with a restriction that requires a similar HBA rule for both
regular and replication connections. Is that an acceptable restriction? We might
paint ourselves into a corner. A reasonable proposal is not to remove this
option; instead, it should be optional. If it is not provided, primary_conninfo
is used.

[1] 
https://www.postgresql.org/message-id/80ce3f65-7ca3-471e-b1a4-24ac529ff4ea%40app.fastmail.com


--
Euler Taveira
EDB   https://www.enterprisedb.com/
From 2666b5f660c64d699a0be440c3701c7be00ec309 Mon Sep 17 00:00:00 2001
From: Euler Taveira 
Date: Mon, 5 Jun 2023 14:39:40 -0400
Subject: [PATCH v14] Creates a new logical replica from a standby server

A new tool called pg_createsubscriber can convert a physical replica into a
logical replica. It runs on the target server and should be able to
connect to the source server (publisher) and the target server
(subscriber).

The conversion requires a few steps. Check if the target data directory
has the same system identifier as the source data directory. Stop the
target server if it is running as a standby server. Create one
replication slot per specified database on the source server. One
additional replication slot is created at the end to get the consistent
LSN (This consistent LSN will be used as (a) a stopping point for the
recovery process and (b) a starting point for the subscriptions). Write
recovery parameters into the target data directory and start the target
server (Wait until the target server is promoted). Create one
publication (FOR ALL TABLES) per specified database on the source
server. Create one subscription per specified database on the target
server (Use replication slot and publication created in a previous step.
Don't enable the subscriptions yet). Set the replication progress to
the consistent LSN obtained in a previous step. Enable the
subscription for each specified database on the target server.
Stop the target server. Change the system identifier on the target
server.

Depending on your workload and database size, creating a logical replica
might not be an option due to resource constraints (the WAL backlog should be
available until all table data is synchronized). The initial data copy
and the replication progress tend to be faster on a physical replica.
The purpose of this tool is to speed up a logical replica setup.
---
 doc/src/sgml/ref/allfiles.sgml|1 +
 doc/src/sgml/ref/pg_createsubscriber.sgml |  320 +++
 doc/src/sgml/reference.sgml   |1 +
 src/bin/pg_basebackup/.gitignore  |1 +
 src/bin/pg_basebackup/Makefile|8 +-
 src/bin/pg_basebackup/meson.build |   19 +
 src/bin/pg_basebackup/pg_createsubscriber.c   | 1876 +
 .../t/040_pg_createsubscriber.pl  |   44 +
 .../t/041_pg_createsubscriber_standby.pl  |  139 ++
 src/tools/pgindent/typedefs.list  |2 +
 10 files changed, 2410 insertions(+), 1 deletion(-)
 create mode 100644 doc/src/sgml/ref/pg_createsubscriber.sgml
 create mode 100644 src/bin/pg_basebackup/pg_createsubscriber.c
 create mode 100644 

Re: POC: GROUP BY optimization

2024-02-01 Thread Tom Lane
Alexander Korotkov  writes:
> I'm going to push this if there are no objections.

One of the test cases added by this commit has not been very
stable in the buildfarm.  Latest example is here:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion=2024-02-01%2021%3A28%3A04

and I've seen similar failures intermittently on other machines.

I'd suggest building this test atop a table that is more stable
than pg_class.  You're just waving a red flag in front of a bull
if you expect stable statistics from that during a regression run.
Nor do I see any particular reason for pg_class to be especially
suited to the test.

regards, tom lane




Re: Synchronizing slots from primary to standby

2024-02-01 Thread Masahiko Sawada
On Thu, Feb 1, 2024 at 12:51 PM Amit Kapila  wrote:
>
> On Wed, Jan 31, 2024 at 9:20 PM Masahiko Sawada  wrote:
> >
> > On Wed, Jan 31, 2024 at 7:42 PM Amit Kapila  wrote:
> > >
> > >
> > > Considering my previous where we don't want to restart for a required
> > > parameter change, isn't it better to avoid repeated restart (say when
> > > the user gave an invalid dbname)? BTW, I think this restart interval
> > > is added based on your previous complaint [1].
> >
> > I think it's useful that the slotsync worker restarts immediately when
> > a required parameter is changed but waits to restart when it exits
> > with an error. IIUC the apply worker does so; if it restarts due to a
> > subscription parameter change, it resets the last-start time so that
> > the launcher will restart it without waiting.
> >
>
> Agreed, this idea sounds good to me.
>
> > >
> > > >
> > > > ---
> > > > When I dropped a database on the primary that has a failover slot, I
> > > > got the following logs on the standby:
> > > >
> > > > 2024-01-31 17:25:21.750 JST [1103933] FATAL:  replication slot "s" is
> > > > active for PID 1103935
> > > > 2024-01-31 17:25:21.750 JST [1103933] CONTEXT:  WAL redo at 0/3020D20
> > > > for Database/DROP: dir 1663/16384
> > > > 2024-01-31 17:25:21.751 JST [1103930] LOG:  startup process (PID
> > > > 1103933) exited with exit code 1
> > > >
> > > > It seems that because the slotsync worker created the slot on the
> > > > standby, the slot's active_pid is still valid.
> > > >
> > >
> > > But we release the slot after sync. And we do take a shared lock on
> > > the database to make the startup process wait for slotsync. There is
> > > one gap which is that we don't reset active_pid for temp slots in
> > > ReplicationSlotRelease(), so for temp slots such an error can occur
> > > but OTOH, we immediately make the slot persistent after sync. As per
> > > my understanding, it is only possible to get this error if the initial
> > > sync doesn't happen and the slot remains temporary. Is that your case?
> > > How did reproduce this?
> >
> > I created a failover slot manually on the primary and dropped the
> > database where the failover slot is created. So this would not happen
> > in normal cases.
> >
>
> Right, it won't happen in normal cases (say for walsender). This can
> happen in some cases even without this patch as noted in comments just
> above active_pid check in ReplicationSlotsDropDBSlots(). Now, we need
> to think whether we should just update the comments above active_pid
> check to explain this case or try to engineer some solution for this
> not-so-common case. I guess if we want a solution we need to stop
> slotsync worker temporarily till the drop database WAL is applied or
> something like that.
>
> > BTW I've tested the following switch/fail-back scenario but it seems
> > not to work fine. Am I missing something?
> >
> > Setup:
> > node1 is the primary, node2 is the physical standby for node1, and
> > node3 is the subscriber connecting to node1.
> >
> > Steps:
> > 1. [node1]: create a table and a publication for the table.
> > 2. [node2]: set enable_syncslot = on and start (to receive WALs from node1).
> > 3. [node3]: create a subscription with failover = true for the publication.
> > 4. [node2]: promote to the new standby.
> > 5. [node3]: alter subscription to connect the new primary, node2.
> > 6. [node1]: stop, set enable_syncslot = on (and other required
> > parameters), then start as a new standby.
> >
> > Then I got the error "exiting from slot synchronization because same
> > name slot "test_sub" already exists on the standby".
> >
> > The logical replication slot that was created on the old primary
> > (node1) has been synchronized to the old standby (node2). Therefore on
> > node2, the slot's "synced" field is true. However, once node1 starts
> > as the new standby with slot synchronization, the slotsync worker
> > cannot synchronize the slot because the slot's "synced" field on the
> > primary is false.
> >
>
> Yeah, we avoided doing anything in this case because the user could
> have manually created another slot with the same name on the standby.
> Unlike WAL, slots can be modified on the standby since we allow
> decoding on standby, so we can't simply overwrite existing slots. We
> won't be able to distinguish whether the existing slot is one that the
> user wants to sync with the primary or a slot created on the standby to
> perform decoding. I think in this case the user first needs to drop the
> slot on the new standby.

Yes, but if we then switch back (i.e. in the above case, node1 becomes
the primary again and node2 becomes the standby again), the user doesn't
need to remove the failover slots since they are already marked as
"synced". I wonder if we could do something automatically to reduce the
operations required of the user. Also, if we support the slot
synchronization feature on a cascading standby in the future, this
procedure will have to change.
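
For reference, the scenario above corresponds roughly to the following
commands (just a sketch; the node, table, publication and subscription
names are made up):

-- node1 (old primary), step 1
CREATE TABLE tbl (id int);
CREATE PUBLICATION test_pub FOR TABLE tbl;

-- node3 (subscriber), step 3: initially pointing at node1
CREATE SUBSCRIPTION test_sub
    CONNECTION 'host=node1 dbname=postgres'
    PUBLICATION test_pub
    WITH (failover = true);

-- node2 (standby of node1, enable_syncslot = on), step 4
SELECT pg_promote();

-- node3, step 5: repoint the subscription at the new primary
ALTER SUBSCRIPTION test_sub CONNECTION 'host=node2 dbname=postgres';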

Regards,

--
Masahiko Sawada
Amazon Web 

Re: Make COPY format extendable: Extract COPY TO format implementations

2024-02-01 Thread Michael Paquier
On Fri, Feb 02, 2024 at 06:51:02AM +0900, Michael Paquier wrote:
> I am going to try to plug in some rusage() calls in the backend for
> the COPY paths.  I hope that gives more precision about the backend
> activity.  I'll post that with more numbers.

And here they are with log_statement_stats enabled to get rusage() for
these queries:
 test |  user_s  | system_s | elapsed_s 
--+--+--+---
 head_to_bin_1col | 1.639761 | 0.007998 |  1.647762
 v7_to_bin_1col   | 1.645499 | 0.004003 |  1.649498
 v10_to_bin_1col  | 1.639466 | 0.004008 |  1.643488

 head_to_bin_10col| 7.486369 | 0.056007 |  7.542485
 v7_to_bin_10col  | 7.314341 | 0.039990 |  7.354743
 v10_to_bin_10col | 7.329355 | 0.052007 |  7.381408

 head_to_text_1col| 1.581140 | 0.012000 |  1.593166
 v7_to_text_1col  | 1.615441 | 0.003992 |  1.619446
 v10_to_text_1col | 1.613443 | 0.00 |  1.613454

 head_to_text_10col   | 5.897014 | 0.011990 |  5.909063
 v7_to_text_10col | 5.722872 | 0.016014 |  5.738979
 v10_to_text_10col| 5.762286 | 0.011993 |  5.774265

 head_from_bin_1col   | 1.524038 | 0.02 |  1.544046
 v7_from_bin_1col | 1.551367 | 0.016015 |  1.567408
 v10_from_bin_1col| 1.560087 | 0.016001 |  1.576115

 head_from_bin_10col  | 5.238444 | 0.139993 |  5.378595
 v7_from_bin_10col| 5.170503 | 0.076021 |  5.246588
 v10_from_bin_10col   | 5.106496 | 0.112020 |  5.218565

 head_from_text_1col  | 1.664124 | 0.003998 |  1.668172
 v7_from_text_1col| 1.720616 | 0.007990 |  1.728617
 v10_from_text_1col   | 1.683950 | 0.007990 |  1.692098

 head_from_text_10col | 4.859651 | 0.015996 |  4.875747
 v7_from_text_10col   | 4.775975 | 0.032000 |  4.808051
 v10_from_text_10col  | 4.737512 | 0.028012 |  4.765522
(24 rows)

I'm looking at this table, and what I can see is still a lot of
variance in the tests with tables involving 1 attribute.  However, a
second thing stands out to me here: there is a speedup in the
10-attribute case for both COPY FROM and COPY TO, and for both
formats.  The data posted at [1] is showing me the same trend.  In
short, let's move on with this split refactoring with the per-row
callbacks.  That clearly shows benefits.
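
For reference, the numbers above come from log_statement_stats output; a
sketch of the kind of statement being measured (the table and file names
here are made up, not the exact benchmark):

SET log_statement_stats = on;
-- rusage (user/system/elapsed) for each statement then shows up in the log
COPY tab_10col TO '/dev/null' WITH (FORMAT binary);
COPY tab_10col FROM '/tmp/tab_10col.bin' WITH (FORMAT binary);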

[1] https://www.postgresql.org/message-id/zbr6piwuvhdtf...@paquier.xyz
--
Michael


signature.asc
Description: PGP signature


Re: set_cheapest without checking pathlist

2024-02-01 Thread Richard Guo
On Thu, Feb 1, 2024 at 5:03 PM David Rowley  wrote:

> On Thu, 1 Feb 2024 at 19:36, Richard Guo  wrote:
> > maybe we can add an Assert or a comment clarifying that the pathlist of
> > partially_grouped_rel cannot be empty here.
>
> There'd be no need to Assert that as set_cheapest() will raise an
> error when given a rel with an empty pathlist.
>
> I don't think putting an Assert that would fail in the same situation
> that we'd later ERROR out on would be a good idea. It'd make it harder
> to debug the issue.


Fair point.

Thanks
Richard


Re: Fix log_line_prefix to display the transaction id (%x) for statements not in a transaction block

2024-02-01 Thread Quan Zongliang


A little tweak to the code.
GetTopTransactionIdIfAny() != InvalidTransactionId
changed to
TransactionIdIsValid(GetTopTransactionIdIfAny())


On 2024/1/12 08:51, jian he wrote:

Hi

...

with patch:
src3=# explain(analyze, costs off) select 1 from pg_sleep(10);
2024-01-12 08:43:14.750 CST [5739] jian@src3/psql XID:0 LOG:
duration: 10010.167 ms  plan:
 Query Text: explain(analyze, costs off) select 1 from pg_sleep(10);
 Function Scan on pg_sleep  (cost=0.00..0.01 rows=1 width=4)
(actual time=10010.155..10010.159 rows=1 loops=1)
2024-01-12 08:43:14.750 CST [5739] jian@src3/psql XID:0 LOG:
statement: explain(analyze, costs off) select 1 from pg_sleep(10);
  QUERY PLAN
-
  Function Scan on pg_sleep (actual time=10010.155..10010.159 rows=1 loops=1)
  Planning Time: 0.115 ms
  Execution Time: 10010.227 ms
(3 rows)

This problem does exist for a statement that takes a long time to run.
The XID is assigned only at the first tuple change. If the user wants to
see it in the single statement log entry, they have to wait until the
statement has finished executing, and we don't know how long that will
take. It is not appropriate to output the log twice just because of the
xid. Besides, without parsing log_line_prefix we don't know whether the
user wants to see the xid at all.

diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 1a34bd3715..bd08b64450 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -1064,8 +1064,9 @@ exec_simple_query(const char *query_string)
 */
parsetree_list = pg_parse_query(query_string);
 
-   /* Log immediately if dictated by log_statement */
-   if (check_log_statement(parsetree_list))
+   /* Log immediately if dictated by log_statement and XID assigned. */
+   if (TransactionIdIsValid(GetTopTransactionIdIfAny()) &&
+   check_log_statement(parsetree_list))
{
ereport(LOG,
(errmsg("statement: %s", query_string),
@@ -1282,6 +1283,16 @@ exec_simple_query(const char *query_string)
 
PortalDrop(portal, false);
 
+   /* Log if dictated by log_statement and has not been logged. */
+   if (!was_logged && check_log_statement(parsetree_list))
+   {
+   ereport(LOG,
+   (errmsg("statement: %s", query_string),
+   errhidestmt(true),
+   errdetail_execute(parsetree_list)));
+   was_logged = true;
+   }
+
if (lnext(parsetree_list, parsetree_item) == NULL)
{
/*
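
For illustration, one way to observe the intended effect (a sketch only;
the prefix string and the table are just examples):

-- include the transaction id in the log line prefix
ALTER SYSTEM SET log_line_prefix = '%m [%p] %q%u@%d XID:%x ';
ALTER SYSTEM SET log_statement = 'mod';
SELECT pg_reload_conf();

-- Without the patch the "statement:" line is emitted before an XID is
-- assigned, so %x shows 0; with the patch it should carry the real XID.
UPDATE tbl SET val = val + 1;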


Re: Make COPY format extendable: Extract COPY TO format implementations

2024-02-01 Thread Sutou Kouhei
Hi,

In 
  "Re: Make COPY format extendable: Extract COPY TO format implementations" on 
Fri, 2 Feb 2024 06:51:02 +0900,
  Michael Paquier  wrote:

> On Fri, Feb 02, 2024 at 12:19:51AM +0900, Sutou Kouhei wrote:
>> Here are some numbers on my local machine (Note that my
>> local machine isn't suitable for benchmark as I said
>> before. Each number is median of "\watch 15" results):
>>>
>> I'll measure again on my local machine later. I'll stop
>> other processes such as Web browser, editor and so on as
>> much as possible when I do.
> 
> Thanks for compiling some numbers.  This is showing a lot of variance.
> Especially, these two lines in table 2 are showing surprising results
> for v7:
>   direction  format  n_columns   master        v7      v10
>        from     csv          1  917.973  1695.401  871.991
>        from  binary          1  841.104  1422.012  820.786

Here are more numbers:

1:
 direction  format  n_columns    master        v7       v10
        to    text          1  1053.844   978.998   956.575
        to     csv          1  1091.316  1020.584  1098.314
        to  binary          1  1034.685   969.224   980.458
        to    text         10  4216.264  3886.515  4111.417
        to     csv         10  4649.228  4530.882  4682.988
        to  binary         10  4219.228   4189.99  4211.942
      from    text          1   851.697   896.968   890.458
      from     csv          1   890.229   936.231    887.15
      from  binary          1   784.407    817.07   938.736
      from    text         10  2549.056  2233.899  2630.892
      from     csv         10  2809.441  2868.411  2895.196
      from  binary         10  2985.674  3027.522    3397.5

2:
 direction  format  n_columns    master        v7       v10
        to    text          1  1013.764  1011.968   940.855
        to     csv          1  1060.431  1065.468   1040.68
        to  binary          1  1013.652  1009.956   965.675
        to    text         10  4411.484  4031.571  3896.836
        to     csv         10  4739.625   4715.81  4631.002
        to  binary         10  4374.077  4357.942  4227.215
      from    text          1   955.078   922.346   866.222
      from     csv          1  1040.717   986.524   905.657
      from  binary          1   849.316   864.859   833.152
      from    text         10  2703.209  2361.651  2533.992
      from     csv         10   2990.35  3059.167  2930.632
      from  binary         10  3008.375  3368.714  3055.723

3:
 direction  format  n_columns    master        v7       v10
        to    text          1  1084.756  1003.822   994.409
        to     csv          1    1092.4  1062.536  1079.027
        to  binary          1  1046.774   994.168   993.633
        to    text         10   4363.51  3978.205  4124.359
        to     csv         10  4866.762  4616.001  4715.052
        to  binary         10  4382.412  4363.269  4213.456
      from    text          1   852.976   907.315   860.749
      from     csv          1   925.187   962.632   897.833
      from  binary          1   824.997   897.046   828.231
      from    text         10   2591.07  2358.541  2540.431
      from     csv         10  2907.033  3018.486  2915.997
      from  binary         10  3069.027   3209.21  3119.128

Other processes were stopped while I measured these, but I'm
not sure these numbers are any more reliable than before...
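
(For reference, each cell is the median of a psql "\watch 15" run,
roughly along these lines -- a sketch only, the table name is an
assumption:

\timing on
COPY copy_bench TO '/dev/null' WITH (FORMAT csv) \watch 15
)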

> I am going to try to plug in some rusage() calls in the backend for
> the COPY paths.  I hope that gives more precision about the backend
> activity.  I'll post that with more numbers.

Thanks. It'll help us.


-- 
kou




Re: Where can I find the doxyfile?

2024-02-01 Thread John Morris
Here is the updated patch. It should fix the meson issue when no doxygen is 
present.


doxygen-v3.patch
Description: doxygen-v3.patch


Checking MINIMUM_VERSION_FOR_WAL_SUMMARIES

2024-02-01 Thread Artur Zakirov
Hi hackers,

while reading the source code of the new incremental backup functionality,
I noticed that the following condition may be unintentional:

/*
 * For newer server versions, likewise create pg_wal/summaries
 */
if (PQserverVersion(conn) < MINIMUM_VERSION_FOR_WAL_SUMMARIES)
{
...

if (pg_mkdir_p(summarydir, pg_dir_create_mode) != 0 &&
errno != EEXIST)
pg_fatal("could not create directory \"%s\": %m", summarydir);
}

This is from src/bin/pg_basebackup/pg_basebackup.c.

Is the condition correct? Shouldn't it be ">="? Otherwise the function
will create "/summaries" only for older PostgreSQL versions.

I've attached a patch to fix it in case this is a typo.

-- 
Artur
From 24227fdcad1fbdb67e38537bc1d70f65d9bdab05 Mon Sep 17 00:00:00 2001
From: Artur Zakirov 
Date: Fri, 2 Feb 2024 01:04:27 +0100
Subject: [PATCH v1] Fix checking MINIMUM_VERSION_FOR_WAL_SUMMARIES for
 creating pg_wal/summaries directory

---
 src/bin/pg_basebackup/pg_basebackup.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index 77489af518..62eba4a962 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -700,7 +700,7 @@ StartLogStreamer(char *startpos, uint32 timeline, char *sysidentifier,
 		/*
 		 * For newer server versions, likewise create pg_wal/summaries
 		 */
-		if (PQserverVersion(conn) < MINIMUM_VERSION_FOR_WAL_SUMMARIES)
+		if (PQserverVersion(conn) >= MINIMUM_VERSION_FOR_WAL_SUMMARIES)
 		{
 			char		summarydir[MAXPGPATH];
 
-- 
2.40.1



[Doc] Improve hostssl related descriptions and option presentation

2024-02-01 Thread David G. Johnston
Motivated by a recent complaint [1] I found the hostssl related material in
our docs quite verbose and even repetitive.  Some of that is normal since
we have both an overview/walk-through section as well as a reference
section.  But the overview in particular was self-repetitive.  Here is a
first pass at some improvements.  The commit message contains both the
intent of the patch and some thoughts/questions still being considered.

David J.

[1]
https://www.postgresql.org/message-id/170672433664.663.11895343120533141715%40wrigleys.postgresql.org
From 5bd53027d5f1cbe9021c9b4f7abe3be5792589dd Mon Sep 17 00:00:00 2001
From: "David G. Johnston" 
Date: Thu, 1 Feb 2024 15:24:30 -0700
Subject: [PATCH] docs: improve hostssl related descriptions and option
 presentation

runtime.sgml discussion regarding SSL was repetitive with itself
and presented in a problematic order - with discussion about
setting up the server configuration happening before a broad
overview of what all of the moving parts are.  Put describing
the two approaches first, with links to the reference material,
and then discuss the implementation detail of a trusted root
certificate.

client-auth.sgml treatment of hostssl ssl auth options and
the cert auth method is confusing.  They are related and so
discuss them once, together.  The cert method page provides
a nice place to isolate these, in addition to a largely
repeated discussion on the runtime.sgml page which points
back to this one spot for the extra layer of details for,
in particular, the user mapping.

NOTES:

I left the "comparison is done with the DN in RFC"
paragraph content as-is though I think its presence or
content needs re-evaluation.

I moved the wording regarding "trust+clientcert" to
the runtime page as that material fits better in
overview than reference.  I also changed it from
words to an actual pg_hba.conf line; which seems
a bit out of place but going with it for now.

The method pages are all children of the pg_hba.conf
page so if you land on cert from the runtime page
going from their directly to pg_hba.conf, or
realizing that you are basically in a section
that only pertains to pg_hba.conf, is not that
obvious.  Seems like these pages should have
xrefs added to them giving that context and
linking back to pg_hba.conf directly.
---
 doc/src/sgml/client-auth.sgml | 106 ++
 doc/src/sgml/runtime.sgml |  87 
 2 files changed, 92 insertions(+), 101 deletions(-)

diff --git a/doc/src/sgml/client-auth.sgml b/doc/src/sgml/client-auth.sgml
index 56747c0e36..638c8e7057 100644
--- a/doc/src/sgml/client-auth.sgml
+++ b/doc/src/sgml/client-auth.sgml
@@ -621,8 +621,9 @@ include_dir directory
 cert
 
  
-  Authenticate using SSL client certificates. See
-   for details.
+  Authenticate the hostssl connection attempt using only the SSL certificate.
+  See  for details regarding the auth-options
+  any hostssl connection line can specify.
  
 

@@ -660,45 +661,15 @@ include_dir directory
After the auth-method field, there can be field(s) of
the form name=value that
specify options for the authentication method. Details about which
-   options are available for which authentication methods appear below.
+   options are available for which authentication methods appear on their
+   individual detail pages.
   
 
   
-   In addition to the method-specific options listed below, there is a
-   method-independent authentication option clientcert, which
-   can be specified in any hostssl record.
-   This option can be set to verify-ca or
-   verify-full. Both options require the client
-   to present a valid (trusted) SSL certificate, while
-   verify-full additionally enforces that the
-   cn (Common Name) in the certificate matches
-   the username or an applicable mapping.
-   This behavior is similar to the cert authentication
-   method (see ) but enables pairing
-   the verification of client certificates with any authentication
-   method that supports hostssl entries.
-  
-  
-   On any record using client certificate authentication (i.e. one
-   using the cert authentication method or one
-   using the clientcert option), you can specify
-   which part of the client certificate credentials to match using
-   the clientname option. This option can have one
-   of two values. If you specify clientname=CN, which
-   is the default, the username is matched against the certificate's
-   Common Name (CN). If instead you specify
-   clientname=DN the username is matched against the
-   entire Distinguished Name (DN) of the certificate.
-   This option is probably best used in conjunction with a username map.
-   The comparison is done with the DN in
-   

Re: Adding comments to help understand psql hidden queries

2024-02-01 Thread David Christensen
On Thu, Feb 1, 2024 at 4:34 PM Greg Sabino Mullane  wrote:
>
> The --echo-hidden flag in psql is used to show people the way psql 
> performs its magic for its backslash commands. None of them has more magic 
> than "\d relation", but it suffers from needing a lot of separate queries to 
> gather all of the information it needs. Unfortunately, those queries can get 
> overwhelming, and it is hard to figure out which one does what, especially for those 
> not already very familiar with the system catalogs. Attached is a patch to 
> add a small SQL comment to the top of each SELECT query inside 
> describeOneTableDetail. All other functions use a single query, and thus need 
> no additional context. But "\d mytable" has the potential to run over a dozen 
> SQL queries! The new format looks like this:
>
> / QUERY */
> /* Get information about row-level policies */
> SELECT pol.polname, pol.polpermissive,
>   CASE WHEN pol.polroles = '{0}' THEN NULL ELSE 
> pg_catalog.array_to_string(array(select rolname from pg_catalog.pg_roles 
> where oid = any (pol.polroles) order by 1),',') END,
>   pg_catalog.pg_get_expr(pol.polqual, pol.polrelid),
>   pg_catalog.pg_get_expr(pol.polwithcheck, pol.polrelid),
>   CASE pol.polcmd
> WHEN 'r' THEN 'SELECT'
> WHEN 'a' THEN 'INSERT'
> WHEN 'w' THEN 'UPDATE'
> WHEN 'd' THEN 'DELETE'
> END AS cmd
> FROM pg_catalog.pg_policy pol
> WHERE pol.polrelid = '134384' ORDER BY 1;
> //
>
> Cheers,
> Greg

Thanks, this looks like some helpful information. In the same vein,
I'm including a patch which adds information about the command that
generates the given query as well (atop your commit).  This will
modify the query line to include the command itself:

/ QUERY (\dRs) */

Best,

David


0001-Add-output-of-the-command-that-got-us-here-to-the-QU.patch
Description: Binary data


Re: Call pqPipelineFlush from PQsendFlushRequest

2024-02-01 Thread Jelte Fennema-Nio
On Thu, 1 Feb 2024 at 21:00, Tom Lane  wrote:
>Note that the request is not itself flushed to the server 
> automatically;
>use PQflush if necessary.
>
> Doesn't that require some rewording?

I agree that the current wording is slightly incorrect, but I think I
prefer we keep it this way. The fact that we actually DO flush when
some internal buffer is filled up seems more like an implementation
detail than behavior that people should actually be depending upon.
And even knowing the actual behavior, still the only way to know that
your data is flushed is by calling PQflush (since a user has no way of
checking if we automatically flushed the internal buffer).




Re: Make COPY format extendable: Extract COPY TO format implementations

2024-02-01 Thread Michael Paquier
On Fri, Feb 02, 2024 at 12:19:51AM +0900, Sutou Kouhei wrote:
> Here are some numbers on my local machine (Note that my
> local machine isn't suitable for benchmark as I said
> before. Each number is median of "\watch 15" results):
>>
> I'll measure again on my local machine later. I'll stop
> other processes such as Web browser, editor and so on as
> much as possible when I do.

Thanks for compiling some numbers.  This is showing a lot of variance.
Especially, these two lines in table 2 are showing surprising results
for v7:
  direction  format  n_columns   master        v7      v10
       from     csv          1  917.973  1695.401  871.991
       from  binary          1  841.104  1422.012  820.786

I am going to try to plug in some rusage() calls in the backend for
the COPY paths.  I hope that gives more precision about the backend
activity.  I'll post that with more numbers.
--
Michael


signature.asc
Description: PGP signature


Re: ALTER TABLE SET ACCESS METHOD on partitioned tables

2024-02-01 Thread Michael Paquier
On Thu, Feb 01, 2024 at 04:50:49PM +0100, Alvaro Herrera wrote:
> I think this works similarly but not identically to tablespace defaults,
> and the difference could be confusing.  You seem to have made it so that
> the partitioned table _always_ have a table AM, so the partitions can
> always inherit from it.  I think it would be more sensible to _allow_
> partitioned tables to have one, but not mandatory; if they don't have
> it, then a partition created from it would use default_table_access_method.

You mean to allow a value of 0 in pg_class.relam on a partitioned
table, so that any partitions created on it use the default AM from the
GUC at creation time?  Yes, this inconsistency was bothering me as well
in the patch.
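
To make the two behaviours concrete, a sketch of what is being discussed
(the proposal under discussion, not what the current patch does; "my_am"
is a placeholder access method name):

SET default_table_access_method = heap;

-- Case 1: the partitioned table keeps relam = 0; the partition picks up
-- default_table_access_method at creation time.
CREATE TABLE parent (id int) PARTITION BY RANGE (id);
CREATE TABLE part1 PARTITION OF parent FOR VALUES FROM (0) TO (100);

-- Case 2: an explicit AM on the partitioned table; new partitions would
-- inherit it regardless of the GUC.
ALTER TABLE parent SET ACCESS METHOD my_am;
CREATE TABLE part2 PARTITION OF parent FOR VALUES FROM (100) TO (200);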
--
Michael


signature.asc
Description: PGP signature


Fwd: Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-02-01 Thread Kartyshov Ivan

Updated, rebased, fixed CI and added documentation.
We are left with two different solutions. Please help me choose the best one.

1) Classic (wait_classic_v6.patch)
https://www.postgresql.org/message-id/3cc883048264c2e9af022033925ff8db%40postgrespro.ru
==
advantages: multiple wait events, separate WAIT FOR statement
disadvantages: new words in grammar



WAIT FOR  [ANY | ALL] event [, ...]
BEGIN [ WORK | TRANSACTION ] [ transaction_mode [, ...] ]
[ WAIT FOR [ANY | ALL] event [, ...]]
event:
LSN value
TIMEOUT number_of_milliseconds
timestamp



2) After style: Kyotaro and Freund (wait_after_within_v5.patch)
https://www.postgresql.org/message-id/d3ff2e363af60b345f82396992595a03%40postgrespro.ru
==
advantages: no new words in grammar
disadvantages: a little harder to understand, fewer events to wait for



AFTER lsn_event [ WITHIN delay_milliseconds ] [, ...]
BEGIN [ WORK | TRANSACTION ] [ transaction_mode [, ...] ]
[ AFTER lsn_event [ WITHIN delay_milliseconds ]]
START [ WORK | TRANSACTION ] [ transaction_mode [, ...] ]
[ AFTER lsn_event [ WITHIN delay_milliseconds ]]
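
For a quick feel of both, example statements (sketches based on the
grammars above; the exact syntax and punctuation may still change):

-- 1) WAIT FOR style
WAIT FOR LSN '0/3F07A611', TIMEOUT 1000;
BEGIN WAIT FOR LSN '0/3F07A611';

-- 2) AFTER ... WITHIN style
AFTER LSN '0/3F07A611' WITHIN 1000;
BEGIN AFTER '0/3F07A611' WITHIN 1000;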




Regards

--
Ivan Kartyshov
Postgres Professional: www.postgrespro.com

diff --git a/doc/src/sgml/ref/after.sgml b/doc/src/sgml/ref/after.sgml
new file mode 100644
index 00..def0ec6be3
--- /dev/null
+++ b/doc/src/sgml/ref/after.sgml
@@ -0,0 +1,118 @@
+
+
+
+ 
+  AFTER
+ 
+
+ 
+  AFTER
+  7
+  SQL - Language Statements
+ 
+
+ 
+  AFTER
+  AFTER the target LSN to be replayed and for specified time to timeout
+ 
+
+ 
+
+AFTER LSN [WITHIN timeout_in_milliseconds]
+
+AFTER LSN 'lsn_number'
+AFTER LSN 'lsn_number' WITHIN wait_timeout
+
+ 
+
+ 
+  Description
+
+  
+   AFTER provides a simple interprocess communication
+   mechanism to wait for the target log sequence number 
+   (LSN) on standby in PostgreSQL
+   databases with master-standby asynchronous replication. AFTER
+   command waits for the specified LSN to be replayed.
+   If no timeout was specified, wait time is unlimited.
+   Waiting can be interrupted using Ctrl+C, or
+   by shutting down the postgres server.
+  
+
+ 
+
+ 
+  Parameters
+
+  
+   
+LSN
+
+ 
+  Specify the target log sequence number to wait for.
+ 
+
+   
+
+   
+TIMEOUT wait_timeout
+
+ 
+  Limit the time interval to AFTER the LSN to be replayed.
+  The specified wait_timeout must be an integer
+  and is measured in milliseconds.
+ 
+
+   
+
+  
+ 
+
+ 
+  Examples
+
+  
+   Run AFTER from psql,
+   limiting wait time to 1 milliseconds:
+
+
+AFTER '0/3F07A6B1' WITHIN 1;
+NOTICE:  LSN is not reached. Try to increase wait time.
+LSN reached
+-
+ f
+(1 row)
+
+  
+
+  
+   Wait until the specified LSN is replayed:
+
+AFTER '0/3F07A611';
+LSN reached
+-
+ t
+(1 row)
+
+  
+
+  
+   Limit LSN wait time to 50 milliseconds,
+   and then cancel the command if LSN was not reached:
+
+AFTER LSN '0/3F0FF791' WITHIN 50;
+^CCancel request sent
+NOTICE:  LSN is not reached. Try to increase wait time.
+ERROR:  canceling statement due to user request
+ LSN reached
+-
+ f
+(1 row)
+
+
+
+
+
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index 4a42999b18..ff2e6e0f3f 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -6,6 +6,7 @@ Complete list of usable sgml source files in this directory.
 
 
 
+
 
 
 
@@ -188,6 +189,7 @@ Complete list of usable sgml source files in this directory.
 
 
 
+
 
 
 
diff --git a/doc/src/sgml/ref/begin.sgml b/doc/src/sgml/ref/begin.sgml
index 016b021487..a2794763b1 100644
--- a/doc/src/sgml/ref/begin.sgml
+++ b/doc/src/sgml/ref/begin.sgml
@@ -21,13 +21,16 @@ PostgreSQL documentation
 
  
 
-BEGIN [ WORK | TRANSACTION ] [ transaction_mode [, ...] ]
+BEGIN [ WORK | TRANSACTION ] [ transaction_mode [, ...] ] wait_event
 
 where transaction_mode is one of:
 
 ISOLATION LEVEL { SERIALIZABLE | REPEATABLE READ | READ COMMITTED | READ UNCOMMITTED }
 READ WRITE | READ ONLY
 [ NOT ] DEFERRABLE
+
+where wait_event is:
+AFTER lsn_value [ WITHIN number_of_milliseconds ]
 
  
 
diff --git a/doc/src/sgml/ref/start_transaction.sgml b/doc/src/sgml/ref/start_transaction.sgml
index 74ccd7e345..46a3bcf1a8 100644
--- a/doc/src/sgml/ref/start_transaction.sgml
+++ b/doc/src/sgml/ref/start_transaction.sgml
@@ -21,13 +21,16 @@ PostgreSQL documentation
 
  
 
-START TRANSACTION [ transaction_mode [, ...] ]
+START TRANSACTION [ transaction_mode [, ...] ] wait_event
 
 where transaction_mode is one of:
 
 ISOLATION LEVEL { SERIALIZABLE | REPEATABLE READ | READ COMMITTED | READ UNCOMMITTED }
 READ WRITE | READ ONLY
 [ NOT ] DEFERRABLE
+
+where wait_event is:
+AFTER lsn_value [ WITHIN number_of_milliseconds ]
 
  
 
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index aa94f6adf6..7c94cf9de1 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -34,6 +34,7 @@
   
 
   

Re: [PATCH] LockAcquireExtended improvement

2024-02-01 Thread Robert Haas
On Thu, Feb 1, 2024 at 2:16 AM Jingxian Li  wrote:
> According to what you said, I resubmitted a patch which splits the ProcSleep
> logic into two parts, the former is responsible for inserting self to
> WaitQueue,
> the latter is responsible for deadlock detection and processing, and the
> former part is directly called by LockAcquireExtended before nowait fails.
> In this way the nowait case can also benefit from adjusting the insertion
> order of WaitQueue.

I don't have time for a full review of this patch right now
unfortunately, but just looking at it quickly:

- It will be helpful if you write a clear commit message. If it gets
committed, there is a high chance the committer will rewrite your
message, but in the meantime it will help understanding.

- The comment for InsertSelfIntoWaitQueue needs improvement. It is
only one line. And it says "Insert self into queue if dontWait is
false" but then someone will wonder why the function would ever be
called with dontWait = true.

- Between the comments and the commit message, the division of
responsibilities between InsertSelfIntoWaitQueue and ProcSleep needs
to be clearly explained. Right now I don't think it is.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Collation version tracking for macOS

2024-02-01 Thread Robert Haas
On Sun, Jan 21, 2024 at 1:58 PM Jeff Davis  wrote:
> I rendered the docs I wrote as an HTML page and attached it to this
> thread, to make it easier for others to read and comment. It's
> basically a tool for experts who are willing to devote effort to
> managing their collations and ICU libraries. Is that what we want?
>
> At an implementation level, did I get the extension APIs right? I
> considered making the API simpler, but that would require the extension
> to do quite a bit more work (including a lot of redundant work) to use
> ICU properly.

Not that I'm the most qualified person to have an opinion on this
topic, but did you intend to attach this stuff to this email, or is it
somewhere else?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: 003_extrafiles.pl test fails on Windows with the newer Perl versions

2024-02-01 Thread Andrew Dunstan



On 2024-02-01 Th 12:32, Tom Lane wrote:

Seems like the back branches are still not quite there: I'm seeing

Name "PostgreSQL::Test::Utils::windows_os" used only once: possible typo at 
t/003_extrafiles.pl line 74.

in v12 and v13, while it's

Name "PostgreSQL::Test::Utils::windows_os" used only once: possible typo at 
t/003_extrafiles.pl line 85.

in v14.  v15 and up are OK.




*sigh*


Have pushed a fix. Thanks for noticing.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Flushing large data immediately in pqcomm

2024-02-01 Thread Robert Haas
On Thu, Feb 1, 2024 at 10:52 AM Robert Haas  wrote:
> On Wed, Jan 31, 2024 at 10:24 PM Andres Freund  wrote:
> > While not perfect - e.g. because networks might use jumbo packets / large 
> > MTUs
> > and we don't know how many outstanding bytes there are locally, I think a
> > decent heuristic could be to always try to send at least one packet worth of
> > data at once (something like ~1400 bytes), even if that requires copying 
> > some
> > of the input data. It might not be sent on its own, but it should make it
> > reasonably unlikely to end up with tiny tiny packets.
>
> I think that COULD be a decent heuristic but I think it should be
> TESTED, including against the ~3 or so other heuristics proposed on
> this thread, before we make a decision.
>
> I literally mentioned the Ethernet frame size as one of the things
> that we should test whether it's relevant in the exact email to which
> you're replying, and you replied by proposing that as a heuristic, but
> also criticizing me for wanting more research before we settle on
> something. Are we just supposed to assume that your heuristic is
> better than the others proposed here without testing anything, or,
> like, what? I don't think this needs to be a completely exhaustive or
> exhausting process, but I think trying a few different things out and
> seeing what happens is smart.

There was probably a better way to phrase this email ... the sentiment
is sincere, but there was almost certainly a way of writing it that
didn't sound like I'm super-annoyed.

Apologies for that.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Call pqPipelineFlush from PQsendFlushRequest

2024-02-01 Thread Tom Lane
Alvaro Herrera  writes:
> On 2023-Nov-07, Michael Paquier wrote:
>> Indeed, it looks a bit strange that there is no flush if the buffer
>> threshold is reached once the message is sent, so your suggestion
>> sounds right.  Alvaro?

> Pushed, thanks.

I observe that this patch did not touch libpq.sgml, which still says

   Note that the request is not itself flushed to the server automatically;
   use PQflush if necessary.

Doesn't that require some rewording?

regards, tom lane




Re: Add connection active, idle time to pg_stat_activity

2024-02-01 Thread Sergey Dudoladov
Hi again,

> It looks like we can check increments in all fields playing with
transactions in tests.

I've added such tests.

Regards,
Sergey
From 66ac1efe5424aa1385744a60047ffd44d42dd244 Mon Sep 17 00:00:00 2001
From: Sergey Dudoladov 
Date: Thu, 1 Feb 2024 16:11:36 +0100
Subject: [PATCH] Add pg_stat_session

Author: Sergey Dudoladov

Adds pg_stat_session view to track statistics accumulated during
 lifetime of a session (client backend).

catversion bump is necessary due to a new view / function

Reviewed-by: Aleksander Alekseev, Bertrand Drouvot, Atsushi
Torikoshi, and Andrei Zubkov

Discussion:
https://www.postgresql.org/message-id/flat/CA%2BFpmFcJF0vwi-SWW0wYO-c-FbhyawLq4tCpRDCJJ8Bq%3Dja-gA%40mail.gmail.com
---
 doc/src/sgml/monitoring.sgml| 134 
 src/backend/catalog/system_views.sql|  13 ++
 src/backend/utils/activity/backend_status.c |  64 --
 src/backend/utils/adt/pgstatfuncs.c |  70 ++
 src/include/catalog/pg_proc.dat |   9 ++
 src/include/utils/backend_status.h  |  32 +
 src/test/regress/expected/rules.out |  10 ++
 src/test/regress/expected/sysviews.out  |  37 ++
 src/test/regress/sql/sysviews.sql   |  17 +++
 9 files changed, 373 insertions(+), 13 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index d9b8b37585..b10423428a 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -414,6 +414,20 @@ postgres   27093  0.0  0.0  30096  2752 ?Ss   11:34   0:00 postgres: ser
See .
   
  
+
+ 
+  
+   pg_stat_session
+   pg_stat_session
+  
+  
+   One row per client backend, showing information related to
+   the currently accumulated activity of that process, such as time spent in
+   a certain state.
+   See 
+   pg_stat_session for details.
+  
+ 
 

   
@@ -4589,6 +4603,110 @@ description | Waiting for a newly initialized WAL file to reach durable storage

   
 
+  
+   pg_stat_session View
+   
+
+ 
+  
+   Column Type
+  
+  
+   Description
+  
+ 
+
+
+
+ 
+  
+   pid integer
+  
+  
+   Process ID of this client backend.
+  
+ 
+
+ 
+  
+   active_time double precision
+  
+  
+   Time in milliseconds this backend spent in the running or fastpath state.
+  
+ 
+
+
+  
+   active_count bigint
+  
+  
+   Number of times this backend switched to the running or fastpath state.
+  
+ 
+
+ 
+  
+   idle_time double precision
+  
+  
+   Time in milliseconds this backend spent in the idle state.
+  
+ 
+
+ 
+  
+   idle_count bigint
+  
+  
+   Number of times this backend switched to the idle state.
+  
+ 
+
+ 
+  
+   idle_in_transaction_time double precision
+  
+  
+   Time in milliseconds this backend spent in the idle in transaction
+   state.
+  
+ 
+
+ 
+  
+   idle_in_transaction_count bigint
+  
+  
+   Number of times this backend switched to the idle in transaction
+   state.
+  
+ 
+
+ 
+  
+   idle_in_transaction_aborted_time double precision
+  
+  
+   Time in milliseconds this backend spent in the idle in transaction (aborted)
+   state.
+  
+ 
+
+ 
+  
+   idle_in_transaction_aborted_count bigint
+  
+  
+   Number of times this backend switched to the idle in transaction (aborted)
+   state.
+  
+ 
+
+
+   
+  
+
  
 
  
@@ -5128,6 +5246,22 @@ FROM pg_stat_get_backend_idset() AS backendid;

   
 
+  
+   
+
+ pg_stat_get_session
+
+pg_stat_get_session ( integer )
+setof record
+   
+   
+Returns a record of information about the client backend with the specified
+process ID, or one record for each active backend in the system
+if NULL is specified.  The fields returned are a
+subset of those in the pg_stat_session view.
+   
+  
+
   

 
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 6791bff9dd..79db3af28b 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -893,6 +893,19 @@ CREATE VIEW pg_stat_activity AS
 LEFT JOIN pg_database AS D ON (S.datid = D.oid)
 LEFT JOIN pg_authid AS U ON (S.usesysid = U.oid);
 
+CREATE VIEW pg_stat_session AS
+SELECT
+s.pid,
+s.active_time,
+s.active_count,
+s.idle_time,
+s.idle_count,
+s.idle_in_transaction_time,
+s.idle_in_transaction_count,
+s.idle_in_transaction_aborted_time,
+
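
For completeness, a usage sketch of the proposed view (column names as in
the patch above):

-- time spent and number of state changes for the current backend
SELECT pid, active_time, active_count,
       idle_time, idle_in_transaction_time
  FROM pg_stat_session
 WHERE pid = pg_backend_pid();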

Re: Reducing connection overhead in pg_upgrade compat check phase

2024-02-01 Thread vignesh C
On Sat, 27 Jan 2024 at 09:10, vignesh C  wrote:
>
> On Fri, 27 Oct 2023 at 18:50, Daniel Gustafsson  wrote:
> >
> > Attached is a v10 rebase of this patch which had undergone significant 
> > bitrot
> > due to recent changes in the pg_upgrade check phase.  This brings in the
> > changes into the proposed structure without changes to queries, with no
> > additional changes to the proposed functionality.
> >
> > Testing with a completely empty v11 cluster fresh from initdb as the old
> > cluster shows a significant speedup (averaged over multiple runs, adjusted 
> > for
> > outliers):
> >
> > patched:  53.59ms (52.78ms, 52.49ms, 55.49ms)
> > master : 125.87ms (125.23 ms, 125.67ms, 126.67ms)
> >
> > Using a similarly empty cluster from master as the old cluster shows a 
> > smaller
> > speedup, which is expected since many checks only run for older versions:
> >
> > patched: 33.36ms (32.82ms, 33.78ms, 33.47ms)
> > master : 44.87ms (44.73ms, 44.90ms 44.99ms)
> >
> > The latter case is still pretty interesting IMO since it can speed up 
> > testing
> > where every millisecond gained matters.
>
> CFBot shows that the patch does not apply anymore as in [1]:
> === Applying patches on top of PostgreSQL commit ID
> 55627ba2d334ce98e1f5916354c46472d414bda6 ===
> === applying patch
> ./v10-0001-pg_upgrade-run-all-data-type-checks-per-connecti.patch
> patching file src/bin/pg_upgrade/check.c
> Hunk #2 FAILED at 24.
> ...
> 1 out of 7 hunks FAILED -- saving rejects to file 
> src/bin/pg_upgrade/check.c.rej
>
> Please post an updated version for the same.

With no update to the thread and the patch still not applying I'm
marking this as returned with feedback.  Please feel free to resubmit
to the next CF when there is a new version of the patch.

Regards,
Vignesh




Re: PATCH: Using BRIN indexes for sorted output

2024-02-01 Thread vignesh C
On Sun, 21 Jan 2024 at 07:32, vignesh C  wrote:
>
> On Wed, 2 Aug 2023 at 21:34, Tomas Vondra  
> wrote:
> >
> >
> >
> > On 8/2/23 17:25, Sergey Dudoladov wrote:
> > > Hello,
> > >
> > >> Parallel version is not supported, but I think it should be possible.
> > >
> > > @Tomas are you working on this ? If not, I would like to give it a try.
> > >
> >
> > Feel free to try. Just keep it in a separate part/patch, to make it
> > easier to combine the work later.
> >
> > >> static void
> > >> AssertCheckRanges(BrinSortState *node)
> > >> {
> > >> #ifdef USE_ASSERT_CHECKING
> > >>
> > >> #endif
> > >> }
> > >
> > > I guess it should not be empty at the ongoing development stage.
> > >
> > > Attached a small modification of the patch with a draft of the docs.
> > >
> >
> > Thanks. FWIW it's generally better to always post the whole patch
> > series, otherwise the cfbot gets confused as it's unable to combine
> > stuff from different messages.
>
> Are we planning to take this patch forward? It has been nearly 5
> months since the last discussion on this. If the interest has gone
> down and if there are no plans to handle this I'm thinking of
> returning this commitfest entry in this commitfest and can be opened
> when there is more interest.

Since the author or no one else showed interest in taking it forward
and the patch had no activity for more than 5 months, I have changed
the status to RWF. Feel free to add a new CF entry when someone is
planning to resume work more actively by starting off with a rebased
version.

Regards,
Vignesh




Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

2024-02-01 Thread vignesh C
On Sat, 20 Jan 2024 at 09:03, vignesh C  wrote:
>
> On Mon, 20 Feb 2023 at 16:06, David Geier  wrote:
> >
> > Hi!
> >
> > On 2/14/23 13:48, David Geier wrote:
> > >
> > > It still fails.
> > >
> > > I'll get Cirrus-CI working on my own Github fork so I can make sure it
> > > really compiles on all platforms before I submit a new version.
> >
> > It took some time until Cirrus CI allowed me to run tests against my new
> > GitHub account (there's a 3 days freeze to avoid people from getting
> > Cirrus CI nodes to mine bitcoins :-D). Attached now the latest patch
> > which passes builds, rebased on latest master.
> >
> > I also reviewed the first two patches a while ago in [1]. I hope we can
> > progress with them to further reduce the size of this patch set.
> >
> > Beyond that: I could work on support for more OSs (e.g. starting with
> > Windows). Is there appetite for that or do we rather want to instead
> > start with a smaller patch?
>
> Are we planning to continue on this and take it further?
> I'm seeing that there has been no activity in this thread for nearly 1
> year now, I'm planning to close this in the current commitfest unless
> someone is planning to take it forward.

Since the author or no one else showed interest in taking it forward
and the patch had no activity for more than 1 year, I have changed the
status to RWF. Feel free to add a new CF entry when someone is
planning to resume work more actively.

Regards,
Vignesh




Re: Avoid unncessary always true test (src/backend/storage/buffer/bufmgr.c)

2024-02-01 Thread vignesh C
On Mon, 4 Sept 2023 at 21:40, Peter Eisentraut  wrote:
>
> On 30.06.23 20:48, Karina Litskevich wrote:
> > So as for me calling LockRelationForExtension() and
> > UnlockRelationForExtension()
> > without testing eb.rel first looks more like a bug here. However, they
> > are never
> > actually called with eb.rel=NULL because of the EB_* flags, so there is
> > no bug
> > here. I believe we should add Assert checking that when eb.rel is NULL,
> > flags
> > are such that we won't use eb.rel. And yes, we can remove unnecessary checks
> > where the flags value guaranty us that eb.rel is not NULL.
> >
> > And another minor observation. It seems to me that we don't need a
> > "could have
> > been closed while waiting for lock" in ExtendBufferedRelShared(), because I
> > believe the comment above already explains why updating eb.smgr:
> >
> >   * Note that another backend might have extended the relation by the time
> >   * we get the lock.
> >
> > I attached the new version of the patch as I see it.
>
> This patch version looks the most sensible to me.  But as commented
> further downthread, some explanation around the added assertions would
> be good.

The patch which you submitted has been awaiting your attention for
quite some time now.  As such, we have moved it to "Returned with
Feedback" and removed it from the reviewing queue. Depending on
timing, this may be reversible.  Kindly address the feedback you have
received, and resubmit the patch to the next CommitFest.

Regards,
Vignesh




Re: Add system identifier to backup manifest

2024-02-01 Thread Robert Haas
On Thu, Feb 1, 2024 at 2:18 AM Amul Sul  wrote:
> I intended to minimize the out param of parse_manifest_file(), which currently
> returns manifest_files_hash and manifest_wal_range, and I need two more --
> manifest versions and the system identifier.

Sure, but you could do context.ht = manifest_data->files instead of
context.manifest = manifest_data. The question isn't whether you
should return the whole manifest_data from parse_manifest_file -- I
agree with that decision -- but rather whether you should feed the
whole thing through into the context, or just the file hash.

> Yeah, we can do that, but I think it is a bit inefficient to have a strcmp()
> check for the pg_control file on each verify_backup_file() call, even though
> we know that path.  Also, I think we need additional handling to ensure that
> the system identifier has been verified in verify_backup_file() -- what if the
> pg_control file itself is missing from the backup?  That might be a rare case,
> but it is possible.
>
> For now, we can do the system identifier validation after
> verify_backup_directory().

Yes, that's another option, but I don't think it's as good.

Suppose you do it that way. Then what will happen when the file is
altogether missing or inaccessible? I think verify_backup_file() will
complain, and then you'll have to do something ugly to avoid having
verify_system_identifier() emit the same complaint all over again.
Remember, unless you find some complicated way of passing data around,
it won't know whether verify_backup_file() emitted a warning or not --
it can redo the stat() and see what happens, but it's not absolutely
guaranteed to be the same as what happened before. Making sure that
you always emit any given complaint once rather than twice or zero
times is going to be tricky.

It seems more natural to me to just accept the (small) cost of a
strcmp() in verify_backup_file(). If the initial stat() fails, it
emits whatever complaint is appropriate and returns and the logic to
check the system identifier is never reached. If it succeeds, you can
proceed to try to open the file and do what you need to do.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: pgbench - adding pl/pgsql versions of tests

2024-02-01 Thread vignesh C
On Fri, 18 Aug 2023 at 23:04, Hannu Krosing  wrote:
>
> I will address the comments here over this coming weekend.

The patch which you submitted has been awaiting your attention for
quite some time now.  As such, we have moved it to "Returned with
Feedback" and removed it from the reviewing queue. Depending on
timing, this may be reversible.  Kindly address the feedback you have
received, and resubmit the patch to the next CommitFest.

Regards,
Vignesh




Re: Where can I find the doxyfile?

2024-02-01 Thread vignesh C
On Thu, 18 Jan 2024 at 06:39, vignesh C  wrote:
>
> On Tue, 7 Nov 2023 at 12:23, John Morris  wrote:
> >
> > Another update, this time with an abbreviated Doxyfile. Rather than 
> > including the full Doxyfile, this updated version only includes modified 
> > settings. It is more compact and more likely to survive across future 
> > doxygen versions.
>
> Meson build is failing in CFbot at [1] and [2] with:
> Message: Install doxygen if you wish to create Doxygen documentation
> Program dot found: NO
> Configuring Doxyfile using configuration
>
> doc/doxygen/meson.build:52:0: ERROR: Tried to use not-found external
> program in "command"
>

With no update to the thread and the build still failing I'm marking
this as returned with feedback.  Please feel free to resubmit to the
next CF when there is a new version of the patch.

Regards,
Vignesh




Re: Should the archiver process always make sure that the timeline history files exist in the archive?

2024-02-01 Thread vignesh C
On Thu, 11 Jan 2024 at 20:38, vignesh C  wrote:
>
> On Tue, 29 Aug 2023 at 06:29, Jimmy Yih  wrote:
> >
> > Thanks for the insightful response! I have attached an updated patch
> > that moves the proposed logic to the end of StartupXLOG where it seems
> > more correct to do this. It also helps with backporting (if it's
> > needed) since the archiver process only has access to shared memory
> > starting from Postgres 14.
> >
> > Kyotaro Horiguchi  wrote:
> > > A. The OP suggests archiving the timeline history file for the current
> > >  timeline every time the archiver starts. However, I don't think we
> > >  want to keep archiving the same file over and over. (Granted, we're
> > >  not always perfect at avoiding that..)
> >
> > With the updated proposed patch, we'll be checking if the current
> > timeline history file needs to be archived at the end of StartupXLOG
> > if archiving is enabled. If it detects that a .ready or .done file
> > already exists, then it won't do anything (which will be the common
> > case). I agree though that this may be an excessive check since it'll
> > be a no-op the majority of the time. However, it shouldn't execute
> > often and seems like a quick safe preventive measure. Could you give
> > more details on why this would be too cumbersome?
> >
> > Kyotaro Horiguchi  wrote:
> > > B. Given that the steps valid, I concur to what is described in the
> > >  test script provided: standbys don't really need that history file
> > >  for the initial TLI (though I have yet to fully verify this).  If the
> > >  walreceiver just overlooks a fetch error for this file, the standby
> > >  can successfully start. (Just skipping the first history file seems
> > >  to work, but it feels a tad aggressive to me.)
> >
> > This was my initial thought as well but I wasn't sure if it was okay
> > to overlook the fetch error. Initial testing and brainstorming seems
> > to show that it's okay. I think the main bad thing is that these new
> > standbys will not have their initial timeline history files which can
> > be useful for administration. I've attached a patch that attempts this
> > approach if we want to switch to this approach as the solution. The
> > patch contains an updated TAP test as well to better showcase the
> > issue and fix.
> >
> > Kyotaro Horiguchi  wrote:
> > > C. If those steps aren't valid, we might want to add a note stating
> > >  that -X none basebackups do need the timeline history file for the
> > >  initial TLI.
> >
> > The difficult thing about only documenting this is that it forces the
> > user to manually store and track the timeline history files. It can be
> > a bit cumbersome for WAL archiving users to recognize this scenario
> > when they're just trying to optimize their basebackups by using
> > -Xnone. But then again -Xnone does seem like it's designed for
> > advanced users so this might be okay.
> >
> > Kyotaro Horiguchi  wrote:
> > > And don't forget to enable archive mode before the latest timeline
> > > switch if any.
> >
> > This might not be reasonable since a user could've been using
> > streaming replication and doing failover/failbacks as part of general
> > high availability to manage their Postgres without knowing they were
> > going to enable WAL archiving later on. The user would need to
> > configure archiving and force a failover which may not be
> > straightforward.
>
> I have changed the status of the patch to "Waiting on Author" as
> Robert's suggestions have not yet been addressed. Feel free to address
> the suggestions and update the status accordingly.

The patch which you submitted has been awaiting your attention for
quite some time now.  As such, we have moved it to "Returned with
Feedback" and removed it from the reviewing queue. Depending on
timing, this may be reversible.  Kindly address the feedback you have
received, and resubmit the patch to the next CommitFest.

Regards,
Vignesh




Re: Issue in postgres_fdw causing unnecessary wait for cancel request reply

2024-02-01 Thread vignesh C
On Thu, 11 Jan 2024 at 20:00, vignesh C  wrote:
>
> On Thu, 13 Apr 2023 at 23:34, Fujii Masao  wrote:
> >
> >
> >
> > On 2023/04/13 11:00, Kyotaro Horiguchi wrote:
> > > Agreed, it seems to be a leftover when we moved to parse_int_param()
> > > in that area.
> >
> > It looks like there was an oversight in commit e7a2217978. I've attached a 
> > patch (0002) that updates PQconnectPoll() to use parse_int_param() for 
> > parsing the keepalives parameter.
> >
> > As this change is not directly related to the bug fix, it may not be 
> > necessary to back-patch it to the stable versions, I think. Thought?
> >
> >
> > >> To clarify, are you suggesting that PQgetCancel() should
> > >> only parse the parameters for TCP connections
> > >> if cancel->raddr.addr.ss_family != AF_UNIX?
> > >> If so, I think that's a good idea.
> > >
> > > You're right. I used connip in the diff because I thought it provided
> > > the same condition, but in a simpler way.
> >
> > I made a modification to the 0001 patch. It will now allow PQgetCancel() to 
> > parse and interpret TCP connection parameters only when the connection is 
> > not made through a Unix-domain socket.
> >
> >
> > > However, I notcied that PQgetCancel() doesn't set errbuf.. So, I'm
> > > fine with your proposal.
> >
> > Ok.
> >
> >
> > >> I think it is important to inform the user when an error
> > >> occurs and a cancel request cannot be sent, as this information
> > >> can help them identify the cause of the problem (such as
> > >> setting an overly large value for the keepalives parameter).
> > >
> > > Although I view it as an internal error, I agree with emitting some
> > > error messages in that situation.
> >
> > Ok.
>
> I have changed the status of the patch to "Waiting on Author" as all
> the issues are not addressed. Feel free to address them and change the
> status accordingly.

The patch which you submitted has been awaiting your attention for
quite some time now.  As such, we have moved it to "Returned with
Feedback" and removed it from the reviewing queue. Depending on
timing, this may be reversible.  Kindly address the feedback you have
received, and resubmit the patch to the next CommitFest.

Regards,
Vignesh




Re: Assertion failure in SnapBuildInitialSnapshot()

2024-02-01 Thread vignesh C
On Thu, 11 Jan 2024 at 19:55, vignesh C  wrote:
>
> On Thu, 9 Feb 2023 at 12:02, Masahiko Sawada  wrote:
> >
> > On Wed, Feb 8, 2023 at 1:13 PM Amit Kapila  wrote:
> > >
> > > On Wed, Feb 8, 2023 at 1:19 AM Andres Freund  wrote:
> > > >
> > > > On 2023-02-01 11:23:57 +0530, Amit Kapila wrote:
> > > > > On Tue, Jan 31, 2023 at 6:08 PM Masahiko Sawada 
> > > > >  wrote:
> > > > > >
> > > > > > Attached updated patches.
> > > > > >
> > > > >
> > > > > Thanks, Andres, others, do you see a better way to fix this problem? I
> > > > > have reproduced it manually and the steps are shared at [1] and
> > > > > Sawada-San also reproduced it, see [2].
> > > > >
> > > > > [1] - 
> > > > > https://www.postgresql.org/message-id/CAA4eK1KDFeh%3DZbvSWPx%3Dir2QOXBxJbH0K8YqifDtG3xJENLR%2Bw%40mail.gmail.com
> > > > > [2] - 
> > > > > https://www.postgresql.org/message-id/CAD21AoDKJBB6p4X-%2B057Vz44Xyc-zDFbWJ%2Bg9FL6qAF5PC2iFg%40mail.gmail.com
> > > >
> > > > Hm. It's worrysome to now hold ProcArrayLock exclusively while 
> > > > iterating over
> > > > the slots. ReplicationSlotsComputeRequiredXmin() can be called at a
> > > > non-neglegible frequency.  Callers like CreateInitDecodingContext(), 
> > > > that pass
> > > > already_locked=true worry me a lot less, because obviously that's not a 
> > > > very
> > > > frequent operation.
> > > >
> > > > This is particularly not great because we need to acquire
> > > > ReplicationSlotControlLock while already holding ProcArrayLock.
> > > >
> > > >
> > > > But clearly there's a pretty large hole in the lock protection right 
> > > > now. I'm
> > > > a bit confused about why we (Robert and I, or just I) thought it's ok 
> > > > to do it
> > > > this way.
> > > >
> > > >
> > > > I wonder if we could instead invert the locks, and hold
> > > > ReplicationSlotControlLock until after 
> > > > ProcArraySetReplicationSlotXmin(), and
> > > > acquire ProcArrayLock just for ProcArraySetReplicationSlotXmin().
> > > >
> > >
> > > Along with inverting, doesn't this mean that we need to acquire
> > > ReplicationSlotControlLock in Exclusive mode instead of acquiring it
> > > in shared mode? My understanding of the above locking scheme is that
> > > in CreateInitDecodingContext(), we acquire ReplicationSlotControlLock
> > > in Exclusive mode before acquiring ProcArrayLock in Exclusive mode and
> > > release it after releasing ProcArrayLock. Then,
> > > ReplicationSlotsComputeRequiredXmin() acquires
> > > ReplicationSlotControlLock in Exclusive mode only when already_locked
> > > is false and releases it after a call to
> > > ProcArraySetReplicationSlotXmin(). ProcArraySetReplicationSlotXmin()
> > > won't change.
> >
> > I've attached the patch of this idea for discussion. In
> > GetOldestSafeDecodingTransactionId() called by
> > CreateInitDecodingContext(), we hold ReplicationSlotControlLock,
> > ProcArrayLock, and XidGenLock at a time. So we would need to be
> > careful about the ordering.
>
> I have changed the status of the patch to "Waiting on Author" as
> Robert's issues were not addressed yet. Feel free to change the status
> accordingly after addressing them.

The patch which you submitted has been awaiting your attention for
quite some time now.  As such, we have moved it to "Returned with
Feedback" and removed it from the reviewing queue. Depending on
timing, this may be reversible.  Kindly address the feedback you have
received, and resubmit the patch to the next CommitFest.

Regards,
Vignesh




Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

2024-02-01 Thread vignesh C
On Thu, 11 Jan 2024 at 19:50, vignesh C  wrote:
>
> On Tue, 17 Oct 2023 at 04:18, Thomas Munro  wrote:
> >
> > I pushed the retry-loop-in-frontend-executables patch and the
> > missing-locking-in-SQL-functions patch yesterday.  That leaves the
> > backup ones, which I've rebased and attached, no change.  It sounds
> > like we need some more healthy debate about that backup label idea
> > that would mean we don't need these (two birds with one stone), so
> > I'll just leave these here to keep the cfbot happy in the meantime.
>
> I have changed the status of this to "Waiting on Author" as Robert's
> comments have not yet been handled. Feel free to post an updated
> version and change the status accordingly.

The patch which you submitted has been awaiting your attention for
quite some time now.  As such, we have moved it to "Returned with
Feedback" and removed it from the reviewing queue. Depending on
timing, this may be reversible.  Kindly address the feedback you have
received, and resubmit the patch to the next CommitFest.

Regards,
Vignesh




Re: [PATCH] Support % wildcard in extension upgrade filenames

2024-02-01 Thread Sandro Santilli
On Thu, Feb 01, 2024 at 04:31:26PM +0530, vignesh C wrote:
> 
> With no update to the thread and the tests still failing I'm marking
> this as returned with feedback.  Please feel free to resubmit to the
> next CF when there is a new version of the patch.

Ok, no problem. Thank you.

--strk;


signature.asc
Description: PGP signature


Re: Allow parallel plan for referential integrity checks?

2024-02-01 Thread vignesh C
On Sun, 21 Jan 2024 at 07:56, vignesh C  wrote:
>
> On Fri, 18 Aug 2023 at 16:29, Juan José Santamaría Flecha
>  wrote:
> >
> >
> > On Thu, Aug 17, 2023 at 3:51 PM Frédéric Yhuel  
> > wrote:
> >>
> >> On 8/17/23 14:00, Frédéric Yhuel wrote:
> >> > On 8/17/23 09:32, Frédéric Yhuel wrote:
> >> >> On 8/10/23 17:06, Juan José Santamaría Flecha wrote:
> >> >>> Recently I restored a database from a directory format backup and
> >> >>> having this feature would have been quite useful
> >> >>
> >> >> Thanks for resuming work on this patch. I forgot to mention this in my
> >> >> original email, but the motivation was also to speed up the restore
> >> >> process. Parallelizing the FK checks could make a huge difference in
> >> >> certain cases. We should probably provide such a test case (with perf
> >> >> numbers), and maybe this is what Robert asked for.
> >> >
> >> > I have attached two scripts which demonstrate the following problems:
> >
> >
> > Thanks for the scripts, but I think Robert's concerns come from the safety, 
> > and not the performance, of the parallel operation.
> >
> > Proving its vulnerability could be easy with a counter example, but 
> > assuring its safety is trickier. What test would suffice to do that?
>
> I'm seeing that there has been no activity in this thread for more
> than 5 months, I'm planning to close this in the current commitfest
> unless someone is planning to take it forward. It can be opened again
> when there is more interest.

Since neither the author nor anyone else showed interest in taking it
forward and the patch had no activity for more than 5 months, I have changed
the status to RWF. Feel free to add a new CF entry when someone is
planning to resume work more actively.

Regards,
Vignesh




Re: Oversight in reparameterize_path_by_child leading to executor crash

2024-02-01 Thread Tom Lane
Richard Guo  writes:
> How about the attached v12 patch?  But I'm a little concerned about
> omitting PVC_RECURSE_AGGREGATES and PVC_RECURSE_WINDOWFUNCS in this
> case.  Is it possible that aggregates or window functions appear in the
> tablesample expression?

No, they can't appear there; it'd make no sense to allow such things
at the relation scan level, and we don't.

regression=# select * from float8_tbl tablesample system(count(*));
ERROR:  aggregate functions are not allowed in functions in FROM
LINE 1: select * from float8_tbl tablesample system(count(*));
^
regression=# select * from float8_tbl tablesample system(count(*) over ());
ERROR:  window functions are not allowed in functions in FROM
LINE 1: select * from float8_tbl tablesample system(count(*) over ()...
^

I pushed v12 (with some cosmetic adjustments) into the back branches,
since we're getting close to the February release freeze.  We still
need to deal with the larger fix for HEAD.  Please re-post that
one so that the cfbot knows which is the patch-of-record.

regards, tom lane




Re: Asynchronous execution support for Custom Scan

2024-02-01 Thread vignesh C
On Sun, 14 Jan 2024 at 11:21, vignesh C  wrote:
>
> On Fri, 2 Dec 2022 at 05:05, Kohei KaiGai  wrote:
> >
> > > > IIUC, we already can use ctid in the where clause on the latest
> > > > PostgreSQL, can't we?
> > >
> > > Oh, sorry, I missed the TidRangeScan. My apologies for the noise.
> > >
> > I made the ctidscan extension when we developed CustomScan API
> > towards v9.5 or v9.6, IIRC. It would make sense just as an example of
> > the CustomScan API (e.g., PG-Strom's code is too large and complicated
> > to learn this API from); however, it makes no sense from the standpoint
> > of the functionality.
>
> This patch has been moving from one CF to another CF without any
> discussion happening. It has been more than one year since any
> activity in this thread. I don't see there is much interest in this
> patch. I prefer to return this patch in this CF unless someone is
> interested in the patch and takes it forward.

Since neither the author nor anyone else showed interest in taking it forward
and the patch had no activity for more than 1 year, I have changed the
status to RWF. Feel free to add a new CF entry when someone is
planning to work on this.

Regards,
Vignesh




Re: Add connection active, idle time to pg_stat_activity

2024-02-01 Thread Sergey Dudoladov
Hi all,

@Andrei Zubkov
I've modified the patch to address most of your comments.

>  I have a thought about other possible improvements fitting this patch.
> I think the pg_stat_session view is able to add one more dimension of
> monitoring - a dimension of sessions.

I would like to remind here about the initial scope of this patch. The main
goal of it was to ease tracking "idle in transactions" connections, a
feature that would really help in my work. The "pg_stat_session" came into
play only because the "pg_stat_activity" was seen as an unsuitable place
for the relevant counters. With that I still would like to maintaint the
focus on committing the "idle in transactions" part of pg_stat_session
first.

Regards,
Sergey
From 05f5117be52b613bb9d4833eec38e152c6f9 Mon Sep 17 00:00:00 2001
From: Sergey Dudoladov 
Date: Thu, 1 Feb 2024 16:11:36 +0100
Subject: [PATCH] Add pg_stat_session

Author: Sergey Dudoladov

Adds pg_stat_session view to track statistics accumulated during
 lifetime of a session (client backend).

catversion bump is necessary due to a new view / function

Reviewed-by: Aleksander Alekseev, Bertrand Drouvot, Atsushi
Torikoshi, and Andrei Zubkov

Discussion:
https://www.postgresql.org/message-id/flat/CA%2BFpmFcJF0vwi-SWW0wYO-c-FbhyawLq4tCpRDCJJ8Bq%3Dja-gA%40mail.gmail.com
---
 doc/src/sgml/monitoring.sgml| 134 
 src/backend/catalog/system_views.sql|  13 ++
 src/backend/utils/activity/backend_status.c |  64 --
 src/backend/utils/adt/pgstatfuncs.c |  70 ++
 src/include/catalog/pg_proc.dat |   9 ++
 src/include/utils/backend_status.h  |  32 +
 src/test/regress/expected/rules.out |  10 ++
 src/test/regress/expected/sysviews.out  |   7 +
 src/test/regress/sql/sysviews.sql   |   3 +
 9 files changed, 329 insertions(+), 13 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index d9b8b37585..b10423428a 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -414,6 +414,20 @@ postgres   27093  0.0  0.0  30096  2752 ?Ss   11:34   0:00 postgres: ser
See .
   
  
+
+ 
+  
+   pg_stat_session
+   pg_stat_session
+  
+  
+   One row per client backend, showing information related to
+   the currently accumulated activity of that process, such as time spent in
+   a certain state.
+   See 
+   pg_stat_session for details.
+  
+ 
 

   
@@ -4589,6 +4603,110 @@ description | Waiting for a newly initialized WAL file to reach durable storage

   

+  
+   pg_stat_session View
+   
+
+ 
+  
+   Column Type
+  
+  
+   Description
+  
+ 
+
+
+
+ 
+  
+   pid integer
+  
+  
+   Process ID of this client backend.
+  
+ 
+
+ 
+  
+   active_time double precision
+  
+  
+   Time in milliseconds this backend spent in the running or fastpath state.
+  
+ 
+
+
+  
+   active_count bigint
+  
+  
+   Number of times this backend switched to the running or fastpath state.
+  
+ 
+
+ 
+  
+   idle_time double precision
+  
+  
+   Time in milliseconds this backend spent in the idle state.
+  
+ 
+
+ 
+  
+   idle_count bigint
+  
+  
+   Number of times this backend switched to the idle state.
+  
+ 
+
+ 
+  
+   idle_in_transaction_time double precision
+  
+  
+   Time in milliseconds this backend spent in the idle in transaction
+   state.
+  
+ 
+
+ 
+  
+   idle_in_transaction_count bigint
+  
+  
+   Number of times this backend switched to the idle in transaction
+   state.
+  
+ 
+
+ 
+  
+   idle_in_transaction_aborted_time double precision
+  
+  
+   Time in milliseconds this backend spent in the idle in transaction (aborted)
+   state.
+  
+ 
+
+ 
+  
+   idle_in_transaction_aborted_count bigint
+  
+  
+   Number of times this backend switched to the idle in transaction (aborted)
+   state.
+  
+ 
+
+
+   
+  
+
  

  
@@ -5128,6 +5246,22 @@ FROM pg_stat_get_backend_idset() AS backendid;

   

+  
+   
+
+ pg_stat_get_session
+
+pg_stat_get_session ( integer )
+setof record
+   
+   
+Returns a record of information about the client backend with the specified
+process ID, or one record for each active backend in the system
+if NULL is specified.  The fields returned are a
+subset of those in the pg_stat_session view.
+   
+  
+
   

 
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 6791bff9dd..79db3af28b 100644
--- 

Re: 003_extrafiles.pl test fails on Windows with the newer Perl versions

2024-02-01 Thread Tom Lane
Seems like the back branches are still not quite there: I'm seeing

Name "PostgreSQL::Test::Utils::windows_os" used only once: possible typo at 
t/003_extrafiles.pl line 74.

in v12 and v13, while it's

Name "PostgreSQL::Test::Utils::windows_os" used only once: possible typo at 
t/003_extrafiles.pl line 85.

in v14.  v15 and up are OK.

regards, tom lane




Re: GenBKI emits useless open;close for catalogs without rows

2024-02-01 Thread vignesh C
On Wed, 8 Nov 2023 at 12:50, Peter Eisentraut  wrote:
>
> On 08.11.23 08:16, Peter Eisentraut wrote:
> > On 19.09.23 20:05, Heikki Linnakangas wrote:
> >> One thing caught my eye though: We currently have an "open" command
> >> after every "create". Except for bootstrap relations; creating a
> >> bootstrap relation opens it implicitly. That seems like a weird
> >> inconsistency. If we make "create" to always open the relation, we can
> >> both make it more consistent and save a few lines. That's maybe worth
> >> doing, per the attached. It removes the "open" command altogether, as
> >> it's not needed anymore.
> >
> > This seems like a good improvement to me.
> >
> > It would restrict the bootstrap language so that you can only manipulate
> > a table right after creating it, but I don't see why that wouldn't be
> > sufficient.
>
> Then again, this sort of achieves the opposite of what Matthias was
> aiming for: You are now forcing some relations to be opened even though
> we will end up closing it right away.
>
> (In any case, documentation in bki.sgml would need to be updated for
> this patch.)

I have changed the status of the patch to WOA, feel free to update the
status once Peter's documentation comments are addressed.

Regards,
Vignesh




Re: Using AF_UNIX sockets always for tests on Windows

2024-02-01 Thread vignesh C
On Sun, 21 Jan 2024 at 18:01, vignesh C  wrote:
>
> On Wed, 22 Mar 2023 at 09:59, Thomas Munro  wrote:
> >
> > Thanks Tom, Andres, Juan José, Andrew for your feedback.  I agree that
> > it's a better "OS harmonisation" outcome if we can choose both ways on
> > both OSes.  I will leave the 0003 patch aside for now due to lack of
> > time, but here's a rebase of the first two patches.  Since this is
> > really just more cleanup-obsolete-stuff background work, I'm going to
> > move it to the next CF.
> >
> > 0001 -- Teach mkdtemp() to care about permissions on Windows.
> > Reviewed by Juan José, who confirmed my blind-coded understanding and
> > showed an alternative version but didn't suggest doing it that way.
> > It's a little confusing that NULL "attributes" means something
> > different from NULL "descriptor", so I figured I should highlight that
> > difference more clearly with some new comments.  I guess one question
> > is "why should we expect the calling process to have the default
> > access token?"
> >
> > 0002 -- Use AF_UNIX for pg_regress.  This one removes a couple of
> > hundred Windows-only lines that set up SSPI stuff, and some comments
> > about a signal handling hazard that -- as far as I can tell by reading
> > manuals -- was never really true.
> >
> > 0003 -- TAP testing adjustments needs some more work based on feedback
> > already given, not included in this revision.
>
> The patch does not apply anymore:
> patch -p1 < v3-0002-Always-use-AF_UNIX-sockets-in-pg_regress-on-Windo.patch
> patching file src/test/regress/pg_regress.c
> ...
> Hunk #6 FAILED at 781.
> ...
> 1 out of 10 hunks FAILED -- saving rejects to file
> src/test/regress/pg_regress.c.rej

With no update to the thread and the patch not applying I'm marking
this as returned with feedback. Please feel free to resubmit to the
next CF when there is a new version of the patch.

Regards,
Vignesh




Re: Unified File API

2024-02-01 Thread vignesh C
On Sat, 6 Jan 2024 at 22:58, vignesh C  wrote:
>
> On Thu, 29 Jun 2023 at 13:20, John Morris  wrote:
> >
> > Background
> >
> > ==
> >
> > PostgreSQL has an amazing variety of routines for accessing files. Consider 
> > just the “open file” routines.
> >PathNameOpenFile, OpenTemporaryFile, BasicOpenFile, open, fopen, 
> > BufFileCreateFileSet,
> >
> >BufFileOpenFileSet, AllocateFile, OpenTransientFile,  FileSetCreate, 
> > FileSetOpen, mdcreate, mdopen,
> >
> >Smgr_open,
> >
> >
> >
> > On the downside, “amazing variety” also means it is somewhat confusing and 
> > difficult to add new features.
> > Someday, we’d like to add encryption or compression to the various 
> > PostgreSQL files.
> > To do that, we need to bring all the relevant files into a common file API 
> > where we can implement
> > the new features.
> >
> >
> >
> > Goals of Patch
> >
> > =
> >
> > 1) Unify file access so most of “the other” files can go through a common 
> > interface, allowing new features
> > like checksums, encryption or compression to be added transparently. 2) Do 
> > it in a way which doesn’t
> > change the logic of current code. 3) Convert a reasonable set of callers to 
> > use the new interface.
> >
> >
> >
> > Note the focus is on the “other” files.  The buffer cache and the WAL have 
> > similar needs,
> > but they are being done in a separate project.  (yes, the two projects are 
> > coordinating)
> >
> > Patch 0001. Create a common file API.
> >
> > ===
> >
> > Currently, PostgreSQL files feed into three funnels.  1) system file 
> > descriptors (read/write/open),
> > 2) C library buffered files (fread/fwrite/fopen), and 3) virtual file 
> > descriptors (FileRead/FileWrite/PathNameOpenFile).
> > Of these three, virtual file descriptors (VFDs) are the most common. They 
> > are also the
> > only funnel which is implemented by PostgreSQL.
> >
> >
> >
> > Decision: Choose VFDs as the common interface.
> >
> >
> >
> > Problem: VFDs are random access only.
> >
> > Solution: Add sequential read/write code on top of VFDs.  (FileReadSeq, 
> > FileWriteSeq, FileSeek, FileTell, O_APPEND)
> >
> >
> >
> > Problem: VFDs have minimal error handling (based on errno.)
> >
> > Solution: Add an “ferror” style interface (FileError, FileEof, 
> > FileErrorCode, FileErrorMsg)
> >
> >
> >
> > Problem: Must maintain compatibility with existing error handling code.
> >
> > Solution: save and restore errno to minimize changes to existing code.
> >
> >
> >
> > Patch 0002. Update code to use the common file API
> >
> > ===
> >
> > The second patch alters callers so they use VFDs rather than system or C 
> > library files.
> > It doesn’t modify all callers, but it does capture many of the files which 
> > need
> > to be encrypted or compressed. This is definitely WIP.
> >
> >
> >
> > Future (not too far away)
> >
> > =
> >
> > Looking ahead, there will be another set of patches which inject buffering 
> > and encryption into
> > the VFD interface. The future patches will build on the current work and 
> > introduce new “oflags”
> >
> > to enable encryption and buffering.
> >
> >
> > Compression is also a possibility, but currently lower priority and a bit 
> > tricky for random access files.
> > Let us know if you have a use case.
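
For a sense of how the proposed interface might be used, here is a hypothetical
sketch built only from the function names listed above (FileReadSeq,
FileWriteSeq, FileError, FileErrorMsg); their exact signatures are assumptions:

/* Hypothetical usage of the proposed sequential VFD API. */
static void
copy_file_sketch(const char *srcpath, const char *dstpath)
{
    File    src = PathNameOpenFile(srcpath, O_RDONLY | PG_BINARY);
    File    dst = PathNameOpenFile(dstpath, O_WRONLY | O_CREAT | O_APPEND | PG_BINARY);
    char    buf[BLCKSZ];
    int     nread;

    /* FileReadSeq/FileWriteSeq maintain a per-VFD position internally. */
    while ((nread = FileReadSeq(src, buf, sizeof(buf))) > 0)
        FileWriteSeq(dst, buf, nread);

    /* "ferror"-style error reporting instead of checking errno directly. */
    if (FileError(src) || FileError(dst))
        elog(ERROR, "file copy failed: %s",
             FileErrorMsg(FileError(src) ? src : dst));

    FileClose(src);
    FileClose(dst);
}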
>
> CFbot shows few compilation warnings/error at [1]:
> [15:54:06.825] ../src/backend/storage/file/fd.c:2420:11: warning:
> unused variable 'save_errno' [-Wunused-variable]
> [15:54:06.825] int ret, save_errno;
> [15:54:06.825] ^
> [15:54:06.825] ../src/backend/storage/file/fd.c:4026:29: error: use of
> undeclared identifier 'MAXIMUM_VFD'
> [15:54:06.825] Assert(file >= 0 && file < MAXIMUM_VFD);
> [15:54:06.825] ^
> [15:54:06.825] 1 warning and 1 error generated.


With no update to the thread and the compilation still failing I'm
marking this as returned with feedback.  Please feel free to resubmit
to the next CF when there is a new version of the patch.

Regards,
Vignesh




Re: Move bki file pre-processing from initdb to bootstrap

2024-02-01 Thread vignesh C
On Sat, 11 Nov 2023 at 00:03, Krishnakumar R  wrote:
>
> Thank you for review, Peter.
>
> Makes sense on the split part. Was starting to think along the same lines at 
> the end of the last iteration. Will come back shortly.
>
> On Fri, Nov 10, 2023 at 12:48 AM Peter Eisentraut  
> wrote:
>>
>> On 17.10.23 03:32, Krishnakumar R wrote:
>> >> The version comparison has been moved from initdb to bootstrap. This
>> >> created some compatibility problems with windows tests. For now I kept
>> >> the version check to not have \n added, which worked fine and serves
>> >> the purpose. However hoping to have something better in v3 in addition
>> >> to addressing any other comments.
>> >
>> > With help from Thomas, figured out that on windows fopen uses binary
>> > mode in the backend which causes issues with EOL. Please find the
>> > attached patch updated with a fix for this.
>>
>> I suggest that this patch set be split up into three incremental parts:
>>
>> 1. Move some build-time settings from initdb to postgres.bki.
>> 2. The database locale handling.
>> 3. The bki file handling.
>>
>> Each of these topics really needs a separate detailed consideration.

The patch which you submitted has been awaiting your attention for
quite some time now.  As such, we have moved it to "Returned with
Feedback" and removed it from the reviewing queue. Depending on
timing, this may be reversible.  Kindly address the feedback you have
received, and resubmit the patch to the next CommitFest.

Regards,
Vignesh




Re: Forbid the use of invalidated physical slots in streaming replication.

2024-02-01 Thread vignesh C
On Mon, 8 Jan 2024 at 10:25, vignesh C  wrote:
>
> On Fri, 8 Dec 2023 at 19:15, Ashutosh Bapat
>  wrote:
> >
> > > > > pg_replication_slot could be set back to null.
> > > >
> > > > In this case, since the basebackup was taken after the slot was 
> > > > invalidated, it
> > > > does not require the WAL that was removed. But it seems that once the
> > > > streaming starts, the slot sprints to life again and gets validated 
> > > > again. Here's
> > > > pg_replication_slot output after the standby starts.
> > >
> > > Actually, it doesn't bring the invalidated slot back to life completely.
> > > The slot's view data looks valid while the 'invalidated' flag of this 
> > > slot is still
> > > RS_INVAL_WAL_REMOVED (users are not aware of it).
> > >
> >
> > I was misled by the code in pg_get_replication_slots(). I did not
> > read it till the following
> >
> > --- code 
> > case WALAVAIL_REMOVED:
> >
> > /*
> > * If we read the restart_lsn long enough ago, maybe that file
> > * has been removed by now. However, the walsender could have
> > * moved forward enough that it jumped to another file after
> > * we looked. If checkpointer signalled the process to
> > * termination, then it's definitely lost; but if a process is
> > * still alive, then "unreserved" seems more appropriate.
> > *
> > * If we do change it, save the state for safe_wal_size below.
> > */
> > --- code ---
> >
> > I see now how an invalid slot's wal status can be reported as
> > unreserved. So I think it's a bug.
> >
> > >
> > > >
> > > > >
> > > > > So, I think it's a bug and one idea to fix is to check the validity of
> > > > > the physical slot in
> > > > > StartReplication() after acquiring the slot like what the attachment
> > > > > does, what do you think ?
> > > >
> > > > I am not sure whether that's necessarily a bug. Of course, we don't 
> > > > expect
> > > > invalid slots to be used but in this case I don't see any harm.
> > > > The invalid slot has been revived and has all the properties set just 
> > > > like a valid
> > > > slot. So it must be considered in
> > > > ReplicationSlotsComputeRequiredXmin() as well. I haven't verified it 
> > > > myself
> > > > though. In case the WAL is really lost and is requested by the standby 
> > > > it will
> > > > throw an error "requested WAL segment [0-9A-F]+ has already been
> > > > removed". So no harm there as well.
> > >
> > > Since the 'invalidated' field of the slot is still 
> > > (RS_INVAL_WAL_REMOVED), even
> > > if the walsender advances the restart_lsn, the slot will not be 
> > > considered in the
> > > ReplicationSlotsComputeRequiredXmin(), so the WALs and Rows are not safe
> > > and that's why I think it's a bug.
> > >
> > > After looking closer, it seems this behavior started from 15f8203 which 
> > > introduced the
> > > ReplicationSlotInvalidationCause 'invalidated', after that we check the 
> > > invalidated enum
> > > in Xmin/Lsn computation function.
> > >
> > > If we want to go back to previous behavior, we need to revert/adjust the 
> > > check
> > > for invalidated in ReplicationSlotsComputeRequiredXmin(), but since the 
> > > logical
> > > decoding(walsender) disallow using invalidated slot, so I feel it's 
> > > consistent
> > > to do similar check for physical one. Besides, 
> > > pg_replication_slot_advance()
> > > also disallow passing invalidated slot to it as well. What do you think ?
> >
> > What happens if you run your script on a build prior to 15f8203?
> > Purely from reading the code, it looks like the physical slot would
> > sprint back to life since its restart_lsn would be updated. But I am
> > not able to see what happens to invalidated_at. It probably remains a
> > valid LSN and the slot would still not be considered in xmin
> > calculation. It will be good to be compatible to pre-15f8203
> > behaviour.
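
For reference, a minimal sketch of the fix idea mentioned earlier in the thread
(rejecting an invalidated physical slot when streaming starts); the message
wording is an assumption and this is not the attached patch:

/*
 * Hypothetical check in StartReplication(), after the physical slot has
 * been acquired: refuse to stream from a slot that was invalidated, e.g.
 * because its required WAL has been removed.
 */
if (MyReplicationSlot->data.invalidated != RS_INVAL_NONE)
    ereport(ERROR,
            (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
             errmsg("cannot read from replication slot \"%s\"",
                    NameStr(MyReplicationSlot->data.name)),
             errdetail("This replication slot has been invalidated and can no longer be used.")));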
>
> I have changed the patch status to "Waiting on Author", as some of the
> queries requested by Ashutosh are not yet addressed.

The patch which you submitted has been awaiting your attention for
quite some time now.  As such, we have moved it to "Returned with
Feedback" and removed it from the reviewing queue. Depending on
timing, this may be reversible.  Kindly address the feedback you have
received, and resubmit the patch to the next CommitFest.

Regards,
Vignesh




Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

2024-02-01 Thread Michail Nikolaev
> > > I just realised there is one issue with this design: We can't cheaply
> > > reset the snapshot during the second table scan:
> > > It is critically important that the second scan of R/CIC uses an index
> > > contents summary (made with index_bulk_delete) that was created while
> > > the current snapshot was already registered.
> >
> > > So, the "reset the snapshot every so often" trick cannot be applied in
> > > phase 3 (the rescan), or we'd have to do an index_bulk_delete call
> > > every time we reset the snapshot. Rescanning might be worth the cost
> > > (e.g. when using BRIN), but that is very unlikely.
> >
> > Hm, I think it is still possible. We could just manually recheck the
> > tuples we see
> > against the snapshot currently used for the scan. If the "old" snapshot can
> > also see the tuple (HeapTupleSatisfiesHistoricMVCC), then search for it in
> > the index summary.
> That's an interesting method.
>
> How would this deal with tuples not visible to the old snapshot?
> Presumably we can assume they're newer than that snapshot (the old
> snapshot didn't have it, but the new one does, so it's committed after
> the old snapshot, making them newer), so that backend must have
> inserted it into the index already, right?

I made a draft of the patch and this idea is not working.

The problem is generally the same:

* reference snapshot sees tuple X
* reference snapshot is used to create index summary (but there is no
tuple X in the index summary)
* tuple X is updated to Y creating a HOT-chain
* we started scan with new temporary snapshot (it sees Y, X is too old for it)
* tuple X is pruned from HOT-chain because it is not protected by any snapshot
* we see tuple Y in the scan with temporary snapshot
* it is not in the index summary - so, we need to check if
reference snapshot can see it
* there is no way to understand if the reference snapshot was able
to see tuple X - because we need the full HOT chain (with X tuple) for
that

Best regards,
Michail.




Re: Flushing large data immediately in pqcomm

2024-02-01 Thread Robert Haas
On Wed, Jan 31, 2024 at 10:24 PM Andres Freund  wrote:
> While not perfect - e.g. because networks might use jumbo packets / large MTUs
> and we don't know how many outstanding bytes there are locally, I think a
> decent heuristic could be to always try to send at least one packet worth of
> data at once (something like ~1400 bytes), even if that requires copying some
> of the input data. It might not be sent on its own, but it should make it
> reasonably unlikely to end up with tiny tiny packets.

I think that COULD be a decent heuristic but I think it should be
TESTED, including against the ~3 or so other heuristics proposed on
this thread, before we make a decision.

I literally mentioned the Ethernet frame size as one of the things
that we should test whether it's relevant in the exact email to which
you're replying, and you replied by proposing that as a heuristic, but
also criticizing me for wanting more research before we settle on
something. Are we just supposed to assume that your heuristic is
better than the others proposed here without testing anything, or,
like, what? I don't think this needs to be a completely exhaustive or
exhausting process, but I think trying a few different things out and
seeing what happens is smart.
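
To make the idea concrete, here is a rough sketch of the kind of heuristic
Andres describes, for the path that flushes large payloads directly: top the
send buffer up to roughly one packet's worth of data before flushing, so the
previously buffered bytes never go out as a tiny packet on their own. The
1400-byte threshold and the secure_write_all() helper are assumptions, and
this is untested:

#define PQ_MIN_SEND_HINT    1400    /* assumed ~payload of one Ethernet frame */

static int
putbytes_large_sketch(const char *data, size_t len)
{
    size_t      buffered = PqSendPointer - PqSendStart;

    if (buffered > 0 && buffered < PQ_MIN_SEND_HINT)
    {
        /* Copy just enough of the input to reach the hint (or all of it). */
        size_t      top_up = Min(len, PQ_MIN_SEND_HINT - buffered);

        memcpy(PqSendBuffer + PqSendPointer, data, top_up);
        PqSendPointer += top_up;
        data += top_up;
        len -= top_up;
    }

    if (internal_flush() != 0)  /* send the buffered bytes */
        return EOF;
    if (len > 0 && secure_write_all(data, len) != 0)    /* send the rest directly */
        return EOF;
    return 0;
}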

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: ALTER TABLE SET ACCESS METHOD on partitioned tables

2024-02-01 Thread Alvaro Herrera
> @@ -947,20 +947,22 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid 
> ownerId,
>* a type of relation that needs one, use the default.
>*/
>   if (stmt->accessMethod != NULL)
> + accessMethodId = get_table_am_oid(stmt->accessMethod, false);
> + else if (RELKIND_HAS_TABLE_AM(relkind) || relkind == 
> RELKIND_PARTITIONED_TABLE)
>   {
> - accessMethod = stmt->accessMethod;
> -
> - if (partitioned)
> - ereport(ERROR,
> - (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> -  errmsg("specifying a table access 
> method is not supported on a partitioned table")));
> + if (stmt->partbound)
> + {
> + /*
> +  * For partitions, if no access method is specified, 
> use the AM of
> +  * the parent table.
> +  */
> + Assert(list_length(inheritOids) == 1);
> + accessMethodId = 
> get_rel_relam(linitial_oid(inheritOids));
> + Assert(OidIsValid(accessMethodId));
> + }
> + else
> + accessMethodId = 
> get_table_am_oid(default_table_access_method, false);
>   }

I think this works similarly but not identically to tablespace defaults,
and the difference could be confusing.  You seem to have made it so that
the partitioned table _always_ has a table AM, so the partitions can
always inherit from it.  I think it would be more sensible to _allow_
partitioned tables to have one, but not to require it; if they don't have
one, then a partition created from them would use default_table_access_method.
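
To make the suggestion concrete, here is a rough sketch of how the quoted
DefineRelation() hunk could look if a partitioned table's AM were optional,
with partitions falling back to default_table_access_method when the parent
has none (an illustration only, not a reviewed patch):

if (stmt->accessMethod != NULL)
    accessMethodId = get_table_am_oid(stmt->accessMethod, false);
else if (stmt->partbound && RELKIND_HAS_TABLE_AM(relkind))
{
    /*
     * Partition with no explicit AM: inherit the parent's AM if the parent
     * has one, else fall back to the default.
     */
    Assert(list_length(inheritOids) == 1);
    accessMethodId = get_rel_relam(linitial_oid(inheritOids));
    if (!OidIsValid(accessMethodId))
        accessMethodId = get_table_am_oid(default_table_access_method, false);
}
else if (RELKIND_HAS_TABLE_AM(relkind))
    accessMethodId = get_table_am_oid(default_table_access_method, false);
else
    accessMethodId = InvalidOid;    /* e.g. partitioned table without an AM */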

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
Bob [Floyd] used to say that he was planning to get a Ph.D. by the "green
stamp method," namely by saving envelopes addressed to him as 'Dr. Floyd'.
After collecting 500 such letters, he mused, a university somewhere in
Arizona would probably grant him a degree.  (Don Knuth)




Re: Skip collecting decoded changes of already-aborted transactions

2024-02-01 Thread vignesh C
On Tue, 3 Oct 2023 at 15:54, vignesh C  wrote:
>
> On Mon, 3 Jul 2023 at 07:16, Masahiko Sawada  wrote:
> >
> > On Fri, Jun 23, 2023 at 12:39 PM Dilip Kumar  wrote:
> > >
> > > On Fri, Jun 9, 2023 at 10:47 AM Masahiko Sawada  
> > > wrote:
> > > >
> > > > Hi,
> > > >
> > > > In logical decoding, we don't need to collect decoded changes of
> > > > aborted transactions. While streaming changes, we can detect
> > > > concurrent abort of the (sub)transaction but there is no mechanism to
> > > > skip decoding changes of transactions that are known to already be
> > > > aborted. With the attached WIP patch, we check CLOG when decoding the
> > > > transaction for the first time. If it's already known to be aborted,
> > > > we skip collecting decoded changes of such transactions. That way,
> > > > when the logical replication is behind or restarts, we don't need to
> > > > decode large transactions that already aborted, which helps improve
> > > > the decoding performance.
> > > >
> > > +1 for the idea of checking the transaction status only when we need
> > > to flush it to the disk or send it downstream (if streaming in
> > > progress is enabled).  Although this check is costly, since we are
> > > planning it only for large transactions it is worth it if we can
> > > occasionally avoid disk or network I/O for aborted transactions.
> > >
> >
> > Thanks.
> >
> > I've attached the updated patch. With this patch, we check the
> > transaction status only for large transactions at eviction time. For
> > regression test purposes, I disable this transaction status check when
> > logical_replication_mode is set to 'immediate'.
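
For orientation, a minimal sketch of the check being discussed; the name
ReorderBufferCheckTXNAbort appears in the compile error below, but the body
here is an assumption rather than the attached patch:

/*
 * Hypothetical sketch: before evicting (serializing or streaming) a large
 * transaction, check whether it is already known to have aborted, and if so
 * discard its decoded changes instead of spilling them.
 */
static bool
ReorderBufferCheckTXNAbort_sketch(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
    if (TransactionIdIsValid(txn->xid) &&
        TransactionIdDidAbort(txn->xid))
    {
        /* Discard all collected changes; no need to spill or stream them. */
        ReorderBufferTruncateTXN(rb, txn, true);
        return true;
    }

    return false;
}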
>
> Maybe some changes are missing in the patch, which is causing the
> following errors:
> reorderbuffer.c: In function ‘ReorderBufferCheckTXNAbort’:
> reorderbuffer.c:3584:22: error: ‘logical_replication_mode’ undeclared
> (first use in this function)
>  3584 | if (unlikely(logical_replication_mode ==
> LOGICAL_REP_MODE_IMMEDIATE))
>   |  ^~~~

With no update to the thread and the compilation still failing I'm
marking this as returned with feedback.  Please feel free to resubmit
to the next CF when there is a new version of the patch.

Regards,
Vignesh




Re: Volatile write caches on macOS and Windows, redux

2024-02-01 Thread vignesh C
On Mon, 22 Jan 2024 at 07:46, Peter Smith  wrote:
>
> 2024-01 Commitfest.
>
> Hi, this patch was marked in CF as "Needs Review" [1], but there has
> been no activity on this thread for 6+ months.
>
> Is anything else planned, or can you post something to elicit more
> interest in the patch? Otherwise, if nothing happens then the CF entry
> will be closed ("Returned with feedback") at the end of this CF.

With no update to the thread and the patch not applying I'm marking
this as returned with feedback.  Please feel free to resubmit to the
next CF when there is a new version of the patch.

Regards,
Vignesh




Re: Request for comment on setting binary format output per session

2024-02-01 Thread vignesh C
On Sat, 27 Jan 2024 at 07:45, vignesh C  wrote:
>
> On Mon, 31 Jul 2023 at 21:58, Dave Cramer  wrote:
> >
> >
> > Dave Cramer
> >
> >
> > On Mon, 10 Jul 2023 at 03:56, Daniel Gustafsson  wrote:
> >>
> >> > On 25 Apr 2023, at 16:47, Dave Cramer  wrote:
> >>
> >> > Patch attached with comments removed
> >>
> >> This patch no longer applies, please submit a rebased version on top of 
> >> HEAD.
> >
> >
> > Rebased see attached
>
> CFBot shows that the patch does not apply anymore as in [1]:
> === Applying patches on top of PostgreSQL commit ID
> fba2112b1569fd001a9e54dfdd73fd3cb8f16140 ===
> === applying patch ./0001-Created-protocol.h.patch
> patching file src/backend/access/common/printsimple.c
> Hunk #1 succeeded at 22 with fuzz 2 (offset 1 line).
> Hunk #2 FAILED at 33.
> Hunk #3 FAILED at 66.
> 2 out of 3 hunks FAILED -- saving rejects to file
> src/backend/access/common/printsimple.c.rej
> patching file src/backend/access/transam/parallel.c
> Hunk #1 succeeded at 34 (offset 1 line).
> Hunk #2 FAILED at 1128.
> Hunk #3 FAILED at 1138.
> Hunk #4 FAILED at 1184.
> Hunk #5 succeeded at 1205 (offset 4 lines).
> Hunk #6 FAILED at 1218.
> Hunk #7 FAILED at 1373.
> Hunk #8 FAILED at 1551.
> 6 out of 8 hunks FAILED -- saving rejects to file
> src/backend/access/transam/parallel.c.rej
>
> Please post an updated version for the same.
>
> [1] - http://cfbot.cputube.org/patch_46_3777.log

The patch which you submitted has been awaiting your attention for
quite some time now.  As such, we have moved it to "Returned with
Feedback" and removed it from the reviewing queue. Depending on
timing, this may be reversible.  Kindly address the feedback you have
received, and resubmit the patch to the next CommitFest.

Regards,
Vignesh




Re: Possibility to disable `ALTER SYSTEM`

2024-02-01 Thread Robert Haas
On Thu, Feb 1, 2024 at 7:33 AM Bruce Momjian  wrote:
> On Tue, Jan 30, 2024 at 04:25:12PM -0500, Robert Haas wrote:
> > I don't think we should pretend like one of the two paragraphs above
> > is valid and the other is hot garbage. That's not solving anything. We
> > can't resolve the tension between those two things in either direction
> > by somebody hammering on the side of the argument that they believe to
> > be correct and ignoring the other one.
>
> What if we generate log messages when certain commands are used, like
> ALTER TABLE?  We could have GUC which controls which commands are
> logged.

Well, as I understand it, that doesn't solve the problem here. The
problem some people want to solve here seems to be:

On my system, the PostgreSQL configuration parameters are being
managed by $EXTERNAL_TOOL. Therefore, they should not be managed by
PostgreSQL itself. Therefore, if someone uses ALTER SYSTEM, they've
made a mistake, so we should give them an ERROR telling them that,
like:

ERROR: you're supposed to update the configuration via k8s, not ALTER
SYSTEM, you dummy!
DETAIL: Stop being an idiot.

The exact message text might need some wordsmithing. :-)

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Collation version tracking for macOS

2024-02-01 Thread vignesh C
On Mon, 22 Jan 2024 at 00:28, Jeff Davis  wrote:
>
> On Sat, 2024-01-20 at 07:40 +0530, vignesh C wrote:
> > This thread has been idle for a year now, It has stalled after a lot
> > of discussion.
> > @Jeff Davis: Do you want to try to restart the discussion by posting
> > an updated version and see what happens?
>
> Thank you for following up. Yes, I'd like to find a path forward here,
> but I need some validation from others on my approach.

Let's start by posting a rebased version to fix the CFBot patch apply
issue as in [1]:

=== Applying patches on top of PostgreSQL commit ID
402388946fb3ac54f0fd5944d7e177ef7737eab2 ===
=== applying patch
./v8-0001-Support-multiple-ICU-collation-provider-libraries.patch
patching file src/backend/commands/collationcmds.c
Hunk #1 FAILED at 566.

1 out of 4 hunks FAILED -- saving rejects to file
src/backend/commands/collationcmds.c.rej
patching file src/backend/utils/adt/formatting.c
Hunk #1 succeeded at 1575 (offset 9 lines).
Hunk #2 succeeded at 1587 (offset 9 lines).
Hunk #3 succeeded at 1605 (offset 9 lines).
Hunk #4 succeeded at 1700 (offset 3 lines).
Hunk #5 succeeded at 1819 (offset -1 lines).
Hunk #6 succeeded at 1939 (offset -5 lines).
patching file src/backend/utils/adt/pg_locale.c
Hunk #1 FAILED at 70.
...
Hunk #31 FAILED at 2886.
Hunk #32 FAILED at 2902.
22 out of 32 hunks FAILED -- saving rejects to file
src/backend/utils/adt/pg_locale.c.rej

[1] - http://cfbot.cputube.org/patch_46_3956.log

Regards,
Vignesh




Re: Make COPY format extendable: Extract COPY TO format implementations

2024-02-01 Thread Sutou Kouhei
Hi,

Thanks for preparing the benchmarks.

In 
  "Re: Make COPY format extendable: Extract COPY TO format implementations" on 
Thu, 1 Feb 2024 12:49:59 +0900,
  Michael Paquier  wrote:

> On Thu, Feb 01, 2024 at 10:57:58AM +0900, Michael Paquier wrote:
>> And here are the results I get for text and binary (ms, average of 15
>> queries after discarding the three highest and three lowest values):
>>   test   | master |  v7  | v10  
>> -++--+--
>>  from_bin_1col   | 1575   | 1546 | 1575
>>  from_bin_10col  | 5364   | 5208 | 5230
>>  from_text_1col  | 1690   | 1715 | 1684
>>  from_text_10col | 4875   | 4793 | 4757
>>  to_bin_1col | 1717   | 1730 | 1731
>>  to_bin_10col| 7728   | 7707 | 7513
>>  to_text_1col| 1710   | 1730 | 1698
>>  to_text_10col   | 5998   | 5960 | 5987
> 
> Here are some numbers from a second local machine:
>   test   | master |  v7  | v10  
> -++--+--
>  from_bin_1col   | 508| 467  | 461
>  from_bin_10col  | 2192   | 2083 | 2098
>  from_text_1col  | 510| 499  | 517
>  from_text_10col | 1970   | 1678 | 1654
>  to_bin_1col | 575| 577  | 573
>  to_bin_10col| 2680   | 2678 | 2722
>  to_text_1col| 516| 506  | 527
>  to_text_10col   | 2250   | 2245 | 2235
> 
> This is confirming a speedup with COPY FROM for both text and binary,
> with more impact with a larger number of attributes.  That's harder to
> conclude about COPY TO in both cases, but at least I'm not seeing any
> regression even with some variance caused by what looks like noise.
> We need more numbers from more people.  Sutou-san or Sawada-san, or
> any volunteers?

Here are some numbers on my local machine (note that my
local machine isn't suitable for benchmarking, as I said
before; each number is the median of "\watch 15" results):

1:
 direction  format  n_columns    master        v7       v10
 to         text            1  1077.254  1016.953  1028.434
 to         csv             1   1079.88  1055.545   1053.95
 to         binary          1  1051.247   1033.93   1003.44
 to         text           10  4373.168  3980.442   3955.94
 to         csv            10  4753.842    4719.2  4677.643
 to         binary         10  4598.374  4431.238  4285.757
 from       text            1   875.729   916.526   869.283
 from       csv             1   909.355  1001.277   918.655
 from       binary          1   872.943   907.778   859.433
 from       text           10  2594.429  2345.292  2587.603
 from       csv            10  2968.972  3039.544  2964.468
 from       binary         10   3072.01  3109.267  3093.983

2:
 direction  format  n_columns    master        v7       v10
 to         text            1  1061.908   988.768   978.291
 to         csv             1  1095.109  1037.015  1041.613
 to         binary          1  1076.992  1000.212   983.318
 to         text           10  4336.517  3901.833  3841.789
 to         csv            10  4679.411  4640.975  4570.774
 to         binary         10   4465.04  4508.063  4261.749
 from       text            1   866.689    917.54   830.417
 from       csv             1   917.973  1695.401   871.991
 from       binary          1   841.104  1422.012   820.786
 from       text           10  2523.607  3147.738  2517.505
 from       csv            10  2917.018  3042.685  2950.338
 from       binary         10  2998.051  3128.542  3018.954

3:
 direction  format  n_columns    master        v7       v10
 to         text            1  1021.168  1031.183   962.945
 to         csv             1  1076.549  1069.661  1060.258
 to         binary          1  1024.611  1022.143   975.768
 to         text           10   4327.24  3936.703  4049.893
 to         csv            10  4620.436  4531.676  4685.672
 to         binary         10  4457.165  4390.992  4301.463
 from       text            1   887.532   907.365   888.892
 from       csv             1   945.167   1012.29   895.921
 from       binary          1    853.06   854.652   849.661
 from       text           10  2660.509  2304.256  2527.071
 from       csv            10  2913.644  2968.204  2935.081
 from       binary         10  3020.812  3081.162  3090.803

I'll measure again on my local machine later. I'll stop
other processes such as Web browser, editor and so on as
much as possible when I do.


Thanks,
-- 
kou




Re: Moving forward with TDE [PATCH v3]

2024-02-01 Thread vignesh C
On Mon, 22 Jan 2024 at 11:47, Peter Smith  wrote:
>
> 2024-01 Commitfest.
>
> Hi, This patch has a CF status of "Needs Review" [1], but it seems
> there were CFbot test failures last time it was run [2]. Please have a
> look and post an updated version if necessary.

The patch which you submitted has been awaiting your attention for
quite some time now.  As such, we have moved it to "Returned with
Feedback" and removed it from the reviewing queue. Depending on
timing, this may be reversible.  Kindly address the feedback you have
received, and resubmit the patch to the next CommitFest.

Regards,
Vignesh




Re: [PATCH] ltree hash functions

2024-02-01 Thread vignesh C
On Wed, 6 Dec 2023 at 04:08, Tommy Pavlicek  wrote:
>
> Thanks.
>
> I've attached the latest version that updates the naming in line with
> the convention.
>
> On Mon, Dec 4, 2023 at 12:46 AM jian he  wrote:
> >
> > On Fri, Dec 1, 2023 at 8:44 AM Tommy Pavlicek  wrote:
> > >
> > >
> > > Patch updated for those comments (and a touch of cleanup in the tests) 
> > > attached.
> >
> > hash_ltree would be a better name than ltree_hash, and similar logic
> > applies to ltree_hash_extended.
> > That would be the convention. see: 
> > https://stackoverflow.com/a/69650940/15603477
> >
> >
> > Other than that, it looks good.

CFBot shows one of the tests is failing as in [1]:
diff -U3 /tmp/cirrus-ci-build/contrib/ltree/expected/ltree.out
/tmp/cirrus-ci-build/build-32/testrun/ltree/regress/results/ltree.out
--- /tmp/cirrus-ci-build/contrib/ltree/expected/ltree.out 2024-01-31
15:18:42.893039599 +
+++ /tmp/cirrus-ci-build/build-32/testrun/ltree/regress/results/ltree.out
2024-01-31 15:23:25.309028749 +
@@ -1442,9 +1442,14 @@
('0.1.2'::ltree), ('0'::ltree), ('0_asd.1_ASD'::ltree)) x(v)
 WHERE  hash_ltree(v)::bit(32) != hash_ltree_extended(v, 0)::bit(32)
OR hash_ltree(v)::bit(32) = hash_ltree_extended(v, 1)::bit(32);
- value | standard | extended0 | extended1
+--+---+---
-(0 rows)
+value| standard |
extended0 |extended1
+-+--+--+--
+ 0   | 10001010100010011011 |
0101100001000100011001011011 | 010110000100010001101001
+ 0.1 | 101010001010110001001110 |
0000100010001100110111010101 | 0000100010001101100011010101
+ 0.1.2   | 0111010101110100 |
1010111001110101100011010111 | 101011100111011100100011
+ 0   | 10001010100010011011 |
0101100001000100011001011011 | 010110000100010001101001
+ 0_asd.1_ASD | 011000101000101001001101 |
0000100010001100110111010101 | 0000100010001101100011010101
+(5 rows)

Please post an updated version for the same.

[1] - 
https://api.cirrus-ci.com/v1/artifact/task/5572544858685440/testrun/build-32/testrun/ltree/regress/regression.diffs

Regards,
Vignesh




Re: Refactoring backend fork+exec code

2024-02-01 Thread Heikki Linnakangas

On 30/01/2024 02:08, Heikki Linnakangas wrote:

On 29/01/2024 17:54, reid.thomp...@crunchydata.com wrote:

On Thu, 2024-01-25 at 01:51 +0200, Heikki Linnakangas wrote:


And here we go. BackendID is now a 1-based index directly into the
PGPROC array.


Would it be worthwhile to also note in this comment FIRST_AUX_PROC's
and IsAuxProcess()'s dependency on B_ARCHIVER and it's location in the
enum table?


Yeah, that might be in order. Although looking closer, it's only used in
IsAuxProcess, which is only used in one sanity check in
AuxProcessMain(). And even that gets refactored away by the later
patches in this thread. So on second thoughts, I'll just remove it
altogether.

I spent some more time on the 'lastBackend' optimization in sinvaladt.c.
I realized that it became very useless with these patches, because aux
processes are allocated pgprocno's after all the slots for regular
backends. There are always aux processes running, so lastBackend would
always have a value close to the max anyway. I replaced that with a
dense 'pgprocnos' array that keeps track of the exact slots that are in
use. I'm not 100% sure this is worth the effort; does any real world
workload send shared invalidations so frequently that this matters? In
any case, this should avoid the regression if such a workload exists.

New patch set attached. I think these are ready to be committed, but
would appreciate a final review.


contrib/amcheck 003_cic_2pc.pl test failures revealed a bug that 
required some reworking:


In a PGPROC entry for a prepared xact, the PGPROC's backendID needs to 
be the original backend's ID, because the prepared xact is holding the 
lock on the original virtual transaction id. When a transaction's 
ownership is moved from the original backend's PGPROC entry to the 
prepared xact PGPROC entry, the backendID needs to be copied over. My 
patch removed the field altogether, so it was not copied over, which 
made it look like the original VXID lock was released at prepare.


I fixed that by adding back the backendID field. For regular backends, 
it's always equal to pgprocno + 1, but for prepared xacts, it's the 
original backend's ID. To make that less confusing, I moved the 
backendID and lxid fields together under a 'vxid' struct. The two fields 
together form the virtual transaction ID, and that's the only context 
where the 'backendID' field should now be looked at.
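
A minimal sketch of the regrouping described above (field names follow the
description; the exact layout is an assumption, not the committed struct):

struct PGPROC
{
    /* ... */
    struct
    {
        BackendId   backendId;  /* equal to pgprocno + 1 for regular
                                 * backends; the original backend's ID for
                                 * prepared xacts */
        LocalTransactionId lxid;    /* local id of top-level transaction */
    }           vxid;           /* the two fields form the virtual xid */
    /* ... */
};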


I also squashed the 'lastBackend' changes in sinvaladt.c to the main patch.

--
Heikki Linnakangas
Neon (https://neon.tech)
From 96c583b32db843fb07d38fd78f1e205882a78b01 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Wed, 24 Jan 2024 23:15:55 +0200
Subject: [PATCH v9 1/4] Remove superfluous 'pgprocno' field from PGPROC

It was always just the index of the PGPROC entry from the beginning of
the proc array. Introduce a macro to compute it from the pointer
instead.

Reviewed-by: XXX
Discussion: https://www.postgresql.org/message-id/8171f1aa-496f-46a6-afc3-c46fe7a9b407%40iki.fi
---
 src/backend/access/transam/clog.c |  4 ++--
 src/backend/access/transam/twophase.c | 11 +--
 src/backend/access/transam/xlog.c |  2 +-
 src/backend/postmaster/bgwriter.c |  2 +-
 src/backend/postmaster/pgarch.c   |  2 +-
 src/backend/postmaster/walsummarizer.c|  2 +-
 src/backend/storage/buffer/bufmgr.c   |  6 +++---
 src/backend/storage/ipc/procarray.c   |  6 +++---
 src/backend/storage/lmgr/condition_variable.c | 12 ++--
 src/backend/storage/lmgr/lwlock.c |  6 +++---
 src/backend/storage/lmgr/predicate.c  |  2 +-
 src/backend/storage/lmgr/proc.c   |  1 -
 src/include/storage/lock.h|  2 +-
 src/include/storage/proc.h|  6 +-
 14 files changed, 29 insertions(+), 35 deletions(-)

diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index f6e7da7ffc9..7550309c25a 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -458,7 +458,7 @@ TransactionGroupUpdateXidStatus(TransactionId xid, XidStatus status,
 		 * less efficiently.
 		 */
 		if (nextidx != INVALID_PGPROCNO &&
-			ProcGlobal->allProcs[nextidx].clogGroupMemberPage != proc->clogGroupMemberPage)
+			GetPGProcByNumber(nextidx)->clogGroupMemberPage != proc->clogGroupMemberPage)
 		{
 			/*
 			 * Ensure that this proc is not a member of any clog group that
@@ -473,7 +473,7 @@ TransactionGroupUpdateXidStatus(TransactionId xid, XidStatus status,
 
 		if (pg_atomic_compare_exchange_u32(>clogGroupFirst,
 		   ,
-		   (uint32) proc->pgprocno))
+		   (uint32) GetNumberFromPGProc(proc)))
 			break;
 	}
 
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 8426458f7f5..234c8d08ebc 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -284,7 +284,7 @@ 

Re: Possibility to disable `ALTER SYSTEM`

2024-02-01 Thread Bruce Momjian
On Tue, Jan 30, 2024 at 04:25:12PM -0500, Robert Haas wrote:
> I don't think we should pretend like one of the two paragraphs above
> is valid and the other is hot garbage. That's not solving anything. We
> can't resolve the tension between those two things in either direction
> by somebody hammering on the side of the argument that they believe to
> be correct and ignoring the other one.

What if we generate log messages when certain commands are used, like
ALTER TABLE?  We could have GUC which controls which commands are
logged.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: Fix bugs not to discard statistics when changing stats_fetch_consistency

2024-02-01 Thread Shinya Kato

On 2024-02-01 17:33, Michael Paquier wrote:

On Thu, Jan 11, 2024 at 06:18:38PM +0900, Shinya Kato wrote:

Hi, hackers


(Sorry for the delay, this thread was on my TODO list for some time.)

There is below description in docs for stats_fetch_consistency.
"Changing this parameter in a transaction discards the statistics 
snapshot."


However, if stats_fetch_consistency is changed in a transaction, the
statistics snapshot is not discarded in some cases.


Yep, you're right.  This is inconsistent with the documentation where
we need to clean up the cached data when changing this GUC.  I was
considering a few options regarding the location of the extra
pgstat_clear_snapshot(), but at the end I see the point in doing it in
a path even if it creates a duplicate with pgstat_build_snapshot()
when pgstat_fetch_consistency is using the snapshot mode.  A location
better than your patch is pgstat_snapshot_fixed(), though, so as new
stats kinds will be able to get the call.

I have been banging my head on my desk for a bit when thinking about a
way to test that in a predictable way, until I remembered that these
stats are only flushed at commit, so this requires at least two
sessions, with one of them having a transaction opened while
manipulating stats_fetch_consistency.  TAP would be one option, but
I'm not really tempted about spending more cycles with a
background_psql just for this case.  If I'm missing something, feel
free.
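
For illustration, a rough sketch of the idea: discard the cached statistics
whenever the fetch-consistency mode no longer matches the one the cache was
built with. The helper below is an assumption, not the committed fix:

static void
pgstat_discard_stale_snapshot_sketch(void)
{
    static int  snapshot_mode = -1; /* mode used to build the cached data */

    if (snapshot_mode != pgstat_fetch_consistency)
    {
        /* Matches the documented behavior: changing the GUC discards it. */
        pgstat_clear_snapshot();
        snapshot_mode = pgstat_fetch_consistency;
    }
}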


Thank you for the review and pushing!
I think it is better and more concise than my implementation.

--
Regards,
Shinya Kato
NTT DATA GROUP CORPORATION




Re: Synchronizing slots from primary to standby

2024-02-01 Thread shveta malik
On Wed, Jan 31, 2024 at 2:02 PM Masahiko Sawada  wrote:
>
> ---
> +static char *
> +wait_for_valid_params_and_get_dbname(void)
> +{
> +   char   *dbname;
> +   int rc;
> +
> +   /* Sanity check. */
> +   Assert(enable_syncslot);
> +
> +   for (;;)
> +   {
> +   if (validate_parameters_and_get_dbname())
> +   break;
> +   ereport(LOG, errmsg("skipping slot synchronization"));
> +
> +   ProcessSlotSyncInterrupts(NULL);
>
> When reading this function, I expected that the slotsync worker would
> resume working once the parameters became valid, but it was not
> correct. For example, if I changed hot_standby_feedback from off to
> on, the slotsync worker reads the config file, exits, and then
> restarts. Given that the slotsync worker ends up exiting on parameter
> changes anyway, why do we want to have it wait for parameters to
> become valid? IIUC even if the slotsync worker exits when a parameter
> is not valid, it restarts at some intervals.

Thanks for the feedback. Changed this functionality in v75. Now we do
not exit from wait_for_valid_params_and_get_dbname() on a GUC change. We
re-validate the new values and, if they are valid, carry on with
slot syncing; otherwise we continue waiting.
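
A rough sketch of that revised loop, as described above (the wait/retry
details are assumptions; this is not the v75 patch itself):

for (;;)
{
    if (validate_parameters_and_get_dbname())
        break;              /* parameters became valid; start syncing */

    ereport(LOG, errmsg("skipping slot synchronization"));

    /* Handles SIGHUP so the next validation sees the new GUC values. */
    ProcessSlotSyncInterrupts(NULL);

    /* Wait and re-check instead of exiting the worker. */
    (void) WaitLatch(MyLatch,
                     WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                     10 * 1000L, 0 /* wait_event_info */);
    ResetLatch(MyLatch);
}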

> ---
> +bool
> +SlotSyncWorkerCanRestart(void)
> +{
> +#define SLOTSYNC_RESTART_INTERVAL_SEC 10
> +
>
> IIUC depending on how busy the postmaster is and the timing, the user
> could wait for 1 min to re-launch the slotsync worker. But I think the
> user might want to re-launch the slotsync worker more quickly for
> example when the slotsync worker restarts due to parameter changes.
> IIUC SlotSyncWorkerCanRestart() doesn't consider the fact that the
> slotsync worker previously exited with 0 or 1.

Modified this in v75. As you suggested in [1], we reset
last_start_time on GUC change before proc_exit, so that the postmaster
restarts the worker immediately without waiting.

> ---
> +   /* We are a normal standby */
> +   valid = DatumGetBool(slot_getattr(tupslot, 2, ));
> +   Assert(!isnull);
>
> What do you mean by "normal standby"?
>
> ---
> +   appendStringInfo(,
> +"SELECT pg_is_in_recovery(), count(*) = 1"
> +" FROM pg_replication_slots"
> +" WHERE slot_type='physical' AND slot_name=%s",
> +quote_literal_cstr(PrimarySlotName));
>
> I think we need to make "pg_replication_slots" schema-qualified.

Modified.

> ---
> +   errdetail("The primary server slot \"%s\" specified by"
> + " \"%s\" is not valid.",
> + PrimarySlotName, "primary_slot_name"));
>
> and
>
> +   errmsg("slot sync worker will shutdown because"
> +  " %s is disabled", "enable_syncslot"));
>
> It's better to write it in one line for better greppability.

Modified.

[1]: 
https://www.postgresql.org/message-id/CAD21AoAv6FwZ6UPNTj6%3D7A%2B3O2m4utzfL8ZGS6X1EGexikG66A%40mail.gmail.com

thanks
Shveta




Re: Synchronizing slots from primary to standby

2024-02-01 Thread shveta malik
On Thu, Feb 1, 2024 at 11:21 AM Peter Smith  wrote:
>
> Here are some review comments for v740001.

Thanks Peter for the feedback.

> ==
> src/sgml/logicaldecoding.sgml
>
> 1.
> +   
> +Replication Slot Synchronization
> +
> + A logical replication slot on the primary can be synchronized to the hot
> + standby by enabling the failover option during slot
> + creation and setting
> +  linkend="guc-enable-syncslot">enable_syncslot
> + on the standby. For the synchronization
> + to work, it is mandatory to have a physical replication slot between the
> + primary and the standby, and
> +  linkend="guc-hot-standby-feedback">hot_standby_feedback
> + must be enabled on the standby. It is also necessary to specify a valid
> + dbname in the
> +  linkend="guc-primary-conninfo">primary_conninfo
> + string, which is used for slot synchronization and is ignored
> for streaming.
> +
>
> IMO we don't need to repeat that last part ", which is used for slot
> synchronization and is ignored for streaming." because that is a
> detail about the primary_conninfo GUC, and the same information is
> already described in that GUC section.

Modified in v75.

> ==
>
> 2. ALTER_REPLICATION_SLOT slot_name ( option [, ...] ) #
>
>   
> -  If true, the slot is enabled to be synced to the standbys.
> +  If true, the slot is enabled to be synced to the standbys
> +  so that logical replication can be resumed after failover.
>   
>
> This also should have the sentence "The default is false.", e.g. the
> same as the same option in CREATE_REPLICATION_SLOT says.

I have not added this. I feel the default-value details should be present
in the 'CREATE' part; they are not meaningful for the 'ALTER' part. ALTER
does not have any defaults; it just modifies the options given by the user.

> ==
> synchronize_one_slot
>
> 3.
> + /*
> + * Make sure that concerned WAL is received and flushed before syncing
> + * slot to target lsn received from the primary server.
> + *
> + * This check will never pass if on the primary server, user has
> + * configured standby_slot_names GUC correctly, otherwise this can hit
> + * frequently.
> + */
> + latestFlushPtr = GetStandbyFlushRecPtr(NULL);
> + if (remote_slot->confirmed_lsn > latestFlushPtr)
>
> BEFORE
> This check will never pass if on the primary server, user has
> configured standby_slot_names GUC correctly, otherwise this can hit
> frequently.
>
> SUGGESTION (simpler way to say the same thing?)
> This will always be the case unless the standby_slot_names GUC is not
> correctly configured on the primary server.

That is not quite true. It will not hit this condition "always", but it has
a higher chance of hitting it when standby_slot_names is not configured. I
think you meant 'unless the standby_slot_names GUC is correctly configured'.
I feel the current comment gives clearer (less confusing) information, so I
have not changed it for the time being. I can reconsider if I get more
comments there.

> 4.
> + /* User created slot with the same name exists, raise ERROR. */
>
> /User created/User-created/

Modified.

> ~~~
>
> 5. synchronize_slots, and also drop_obsolete_slots
>
> + /*
> + * Use shared lock to prevent a conflict with
> + * ReplicationSlotsDropDBSlots(), trying to drop the same slot while
> + * drop-database operation.
> + */
>
> (same code comment is in a couple of places)
>
> SUGGESTION (while -> during, etc.)
>
> Use a shared lock to prevent conflicting with
> ReplicationSlotsDropDBSlots() trying to drop the same slot during a
> drop-database operation.

Modified.

> ~~~
>
> 6. validate_parameters_and_get_dbname
>
> strcmp() just for the empty string "" might be overkill.
>
> 6a.
> + if (PrimarySlotName == NULL || strcmp(PrimarySlotName, "") == 0)
>
> SUGGESTION
> if (PrimarySlotName == NULL || *PrimarySlotName == '\0')
>
> ~~
>
> 6b.
> + if (PrimaryConnInfo == NULL || strcmp(PrimaryConnInfo, "") == 0)
>
> SUGGESTION
> if (PrimaryConnInfo == NULL || *PrimaryConnInfo == '\0')

Modified.

thanks
Shveta




Re: Synchronizing slots from primary to standby

2024-02-01 Thread Amit Kapila
On Wed, Jan 31, 2024 at 10:40 AM Peter Smith  wrote:
>
> On Wed, Jan 31, 2024 at 2:18 PM Amit Kapila  wrote:
> >
> > >
> > > I think it is correct to say all those *other* properties (create_slot,
> > > enabled, copy_data) are forced to false because those otherwise have
> > > default true values.
> > >
> >
> > So, when connect=false, won't the user have to explicitly provide such
> > values (create_slot, enabled, etc.) as false? If so, is using 'force'
> > strictly correct?
>
> Perhaps the original docs text could be worded differently; I think
> the word "force" here just meant setting connection=false
> forces/causes/makes those other options behave "as if" they had been
> set to false without the user explicitly doing anything to them.
>

Okay, I see your point. Let's remove the 'failover' from this part of
the sentence.
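
For context, this is the existing CREATE SUBSCRIPTION behaviour being
described (the subscription, publication, and connection string below
are just example names):

    -- No connection is made to the publisher when connect = false, so the
    -- options that would need a connection default to false (and cannot be
    -- combined with connect = false when set to true):
    CREATE SUBSCRIPTION mysub
        CONNECTION 'host=publisher dbname=postgres'
        PUBLICATION mypub
        WITH (connect = false);
    -- in effect: enabled = false, create_slot = false, copy_data = false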

-- 
With Regards,
Amit Kapila.




Re: When extended query protocol ends?

2024-02-01 Thread Tatsuo Ishii
>> Hello Dave,
>>
>> > Tatsuo Ishii  writes:
>> >> Below is outputs from "pgproto" command coming with Pgpool-II.
>> >> (Lines starting "FE" represents a message from frontend to backend.
>> >> Lines starting "BE" represents a message from backend to frontend.)
>> >
>> >> FE=> Parse(stmt="", query="SET extra_float_digits = 3")
>> >> FE=> Bind(stmt="", portal="")
>> >> FE=> Execute(portal="")
>> >> FE=> Parse(stmt="", query="SET extra_float_digits = 3")
>> >> FE=> Bind(stmt="", portal="")
>> >> FE=> Execute(portal="")
>> >> FE=> Query (query="SET extra_float_digits = 3")
>> >> <= BE ParseComplete
>> >> <= BE BindComplete
>> >> <= BE CommandComplete(SET)
>> >> <= BE ParseComplete
>> >> <= BE BindComplete
>> >> <= BE CommandComplete(SET)
>> >> <= BE CommandComplete(SET)
>> >> <= BE ReadyForQuery(I)
>> >> FE=> Terminate
>> >
>> >> As you can see, two "SET extra_float_digits = 3" commands are sent in the
>> >> extended query protocol, then one "SET extra_float_digits = 3" follows
>> >> in the simple query protocol. No sync message is sent. However, I get
>> >> ReadyForQuery at the end. It seems the extended query protocol is
>> >> ended by a simple query protocol message instead of a sync message.
>> >
>> >> My question is, is this legal in terms of frontend/backend protocol?
>> >
>> > I think it's poor practice, at best.  You should end the
>> > extended-protocol query cycle before invoking simple query.
>>
>> From [1] I think the JDBC driver sends something like below if
>> autosave=always option is specified.
>>
>> "BEGIN READ ONLY" Parse/Bind/Eexecute (in the extended query protocol)
>> "SAVEPOINT PGJDBC_AUTOSAVE" (in the simple query protocol)
>>
>> It seems the SAVEPOINT is sent without finishing the extended query
>> protocol (i.e. without a Sync message). Is it possible for the JDBC
>> driver to issue a Sync message before sending SAVEPOINT in the simple
>> query protocol? Alternatively, SAVEPOINT could be sent using the
>> extended query protocol.
>>
>> [1]
>> https://www.pgpool.net/pipermail/pgpool-general/2023-December/009051.html
>>
>>
> Hi Tatsuo,
> 
> Yes, it would be possible.
> 
> Can you create an issue on github? Issues · pgjdbc/pgjdbc (github.com)
> 

Sure.

https://github.com/pgjdbc/pgjdbc/issues/3107
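
For reference, a corrected sequence would look something like the
following hypothetical pgproto-style trace, where a Sync ends the
extended-protocol cycle before the simple-protocol SAVEPOINT:

FE=> Parse(stmt="", query="BEGIN READ ONLY")
FE=> Bind(stmt="", portal="")
FE=> Execute(portal="")
FE=> Sync
<= BE ParseComplete
<= BE BindComplete
<= BE CommandComplete(BEGIN)
<= BE ReadyForQuery(T)
FE=> Query (query="SAVEPOINT PGJDBC_AUTOSAVE")
<= BE CommandComplete(SAVEPOINT)
<= BE ReadyForQuery(T)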

Best regards,
--
Tatsuo Ishii
SRA OSS LLC
English: http://www.sraoss.co.jp/index_en/
Japanese: http://www.sraoss.co.jp




Re: [PoC/RFC] Multiple passwords, interval expirations

2024-02-01 Thread vignesh C
On Sat, 27 Jan 2024 at 07:18, vignesh C  wrote:
>
> On Tue, 10 Oct 2023 at 16:37, Gurjeet Singh  wrote:
> >
> > > On Mon, Oct 9, 2023 at 2:31 AM Gurjeet Singh  wrote:
> > > >
> > > > Next steps:
> > > > - Break the patch into a series of smaller patches.
> > > > - Add TAP tests (test the ability to actually login with these 
> > > > passwords)
> > > > - Add/update documentation
> > > > - Add more regression tests
> >
> > Please see attached the v4 of the patchset that introduces the notion
> > of named passwords slots, namely 'first' and 'second' passwords, and
> > allows users to address each of these passwords separately for the
> > purposes of adding, dropping, or assigning expiration times.
> >
> > Apart from the changes described by each patch's commit title, one
> > significant change since v3 is that now (included in v4-0002...patch)
> > it is not allowed for a role to have a mix of types of passwords.
> > When adding a password, the patch ensures that the password being
> > added uses the same hashing algorithm (md5 or scram-sha-256) as the
> > existing password, if any.  Having all passwords of the same type
> > helps the server pick the corresponding authentication method during
> > connection attempt.
> >
> > The v3 patch also had a few bugs that were exposed by cfbot's
> > automatic run. All those bugs have now been fixed, and the latest run
> > on the v4 branch [1] on my private Git repo shows a clean run [1].
> >
> > The list of patches, and their commit titles are as follows:
> >
> > > v4-0001-...patch Add new columns to pg_authid
> > > v4-0002-...patch Update password verification infrastructure to handle 
> > > two passwords
> > > v4-0003-...patch Added SQL support for ALTER ROLE to manage two passwords
> > > v4-0004-...patch Updated pg_dumpall to support exporting a role's second 
> > > password
> > > v4-0005-...patch Update system views pg_roles and pg_shadow
> > > v4-0006-...patch Updated pg_authid catalog documentation
> > > v4-0007-...patch Updated psql's describe-roles meta-command
> > > v4-0008-...patch Added documentation for ALTER ROLE command
> > > v4-0009-...patch Added TAP tests to prove that a role can use two 
> > > passwords to login
> > > v4-0010-...patch pgindent run
> > > v4-0011-...patch Run pgperltidy on files changed by this patchset
> >
> > Running pgperltidy updated many perl files unrelated to this patch, so
> > in the last patch I chose to include only the one perl file that is
> > affected by this patchset.
>
> CFBot shows that the patch does not apply anymore as in [1]:
> === Applying patches on top of PostgreSQL commit ID
> 4d969b2f85e1fd00e860366f101fd3e3160aab41 ===
> === applying patch
> ./v4-0002-Update-password-verification-infrastructure-to-ha.patch
> ...
> patching file src/backend/libpq/auth.c
> Hunk #4 FAILED at 828.
> Hunk #5 succeeded at 886 (offset -2 lines).
> Hunk #6 succeeded at 907 (offset -2 lines).
> 1 out of 6 hunks FAILED -- saving rejects to file src/backend/libpq/auth.c.rej
>
> Please post an updated version for the same.

The patch which you submitted has been awaiting your attention for
quite some time now.  As such, we have moved it to "Returned with
Feedback" and removed it from the reviewing queue. Depending on
timing, this may be reversible.  Kindly address the feedback you have
received, and resubmit the patch to the next CommitFest.

Regards,
Vignesh



