Re: Allow logical failover slots to wait on synchronous replication

2024-09-23 Thread Amit Kapila
On Sat, Sep 21, 2024 at 6:34 AM John H  wrote:
>
> On Fri, Sep 20, 2024 at 2:44 AM shveta malik  wrote:
> > > >
> > >
> > > The difference is that the purpose of 'synchronized_standby_slots' is
> > > to mention slot names for which the user expects logical walsenders to
> > > wait before sending the logical changes to subscribers. OTOH,
> > > 'synchronous_standby_names' has a different purpose as well. It is not
> > > clear to me if the users would be interested in syncing failover slots
> > > to all the standbys mentioned in 'synchronous_standby_names'.
> > >
> >
> > Okay, I see your point. But not sure what could be the solution here
> > except documenting. But let me think more.
> >
>
> That's a great find. I didn't consider mixed physical and logical
> replicas in synchronous_standby_names.
> I wonder if there are users running synchronous_standby_names with a
> mix of logical and
> physical replicas and what the use case would be.
>

I am also not aware of actual use cases of mixing physical and
logical synchronous standbys, but as we provide that functionality, we
can't ignore it. BTW, I am also not sure if users would like the slots
to be synced on all the standbys mentioned in
synchronous_standby_names. And even if they are, it is better to have
an explicit way of letting users specify it.

One possible approach is to extend the syntax of
"synchronized_standby_slots" along the lines of
"synchronous_standby_names" so that users can directly specify the
slots to wait for and avoid waiting for more standbys than required.
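
Just to illustrate the shape of what I mean (this quorum-style value is
not accepted by the existing GUC; it is purely a sketch of the proposed
extension, borrowing the synchronous_standby_names grammar):

-- hypothetical syntax sketch; today synchronized_standby_slots only
-- takes a plain comma-separated list of slot names
ALTER SYSTEM SET synchronized_standby_slots = 'FIRST 2 (slot_a, slot_b, slot_c)';

With something like this, a logical walsender would wait only until any
two of the three listed slots have confirmed the WAL, rather than for
all the standbys in synchronous_standby_names.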

-- 
With Regards,
Amit Kapila.




Re: Clock-skew management in logical replication

2024-09-23 Thread Amit Kapila
On Sun, Sep 22, 2024 at 7:24 PM Joe Conway  wrote:
>
> On 9/21/24 01:31, shihao zhong wrote:
> > Nisha Moond  writes:
> >> Thoughts? Looking forward to hearing others' opinions!
> >
> > Had a productive conversation with Amit Kaplia today about time skew
> > in distributed systems, and wanted to share some thoughts.
> > Essentially, we're grappling with the classic distributed snapshot
> > problem. In a multi-active environment, where multiple nodes can
> > independently process transactions,  it becomes crucial to determine
> > the visibility of these transactions across the system.  Time skew,
> > where different machines have different timestamps make it a hard
> > problem. How can we ensure consistent transaction ordering and
> > visibility when time itself is unreliable?
> >
> > As you mentioned, there are several ways to tackle the time skew
> > problem in distributed systems. These approaches generally fall into
> > three main categories:
> >
> > 1. Centralized Timestamps (Timestamp Oracle)
> > 2. Atomic Clocks (True Time)
> > 3. Hybrid Logical Clocks
> > 4 Local Clocks
>
> > I recommend .. implement a pluggable time access method. This
> > allows users to integrate with different time services as needed.
>
> Huge +1
>

One idea to provide user control over the timestamps used for the
'latest_write_wins' strategy could be to let users specify the values
in a special column of the table that will be used to resolve
conflicts.

CREATE TABLE foo(c1 int, c2 timestamp default conflict_fn, CHECK CONFLICTS(c2));

Now, for column c2, users can provide their own function that supplies
a value for each row, which can then be used to resolve conflicts. If a
table-level conflict column is provided, it will be used to resolve
conflicts; otherwise, the default commit timestamp provided by the
commit_ts module will be used.
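
As a minimal sketch (the CHECK CONFLICTS syntax above is only proposed,
and conflict_fn is just a placeholder name), the user-provided function
could be as simple as:

-- hypothetical user-provided conflict-timestamp function; it could
-- instead consult an external time service such as a timestamp oracle
CREATE FUNCTION conflict_fn() RETURNS timestamp
AS $$ SELECT clock_timestamp()::timestamp $$
LANGUAGE sql;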

On the apply-side, we will use a condition like:
if (source_new_column_value > replica_current_column_value ||
    operation.type == "delete")
    apply_update();

In the above example, source_new_column_value and
replica_current_column_value will be column c2 on the publisher and
subscriber respectively. Note that in this case we allowed deletes to
always win, as the delete operation doesn't update the column values.
We could choose a different strategy for applying deletes, such as
comparing the existing column values as well.

Note that MySQL [1] and Oracle's TimesTen [2] provide a similar
strategy at the table level for conflict resolution to avoid reliance
on system clocks.

Though this provides a way for users to control the values required for
conflict resolution, I prefer a simple approach, at least for the first
version, which is to document that users should ensure time
synchronization via NTP. Even Oracle mentions the same in their docs
[3] (See from: "It is critical to ensure that the clocks on all
databases are identical to one another and it’s recommended that all
database servers are configured to maintain accurate time through a
time server using the network time protocol (NTP). Even in
environments where databases span different time zones, all database
clocks must be set to the same time zone or Coordinated Universal Time
(UTC) must be used to maintain accurate time. Failure to maintain
accurate and synchronized time across the databases in an
active-active replication environment will result in data integrity
issues.")

[1] - 
https://dev.mysql.com/doc/refman/9.0/en/mysql-cluster-replication-schema.html#ndb-replication-ndb-replication
[2] - 
https://docs.oracle.com/en/database/other-databases/timesten/22.1/replication/configuring-timestamp-comparison.html#GUID-C8B0580B-B577-435F-8726-4AF341A09806
[3] - 
https://www.oracle.com/cn/a/tech/docs/technical-resources/wp-oracle-goldengate-activeactive-final2-1.pdf

-- 
With Regards,
Amit Kapila.




Re: Using per-transaction memory contexts for storing decoded tuples

2024-09-22 Thread Amit Kapila
On Fri, Sep 20, 2024 at 10:53 PM Masahiko Sawada  wrote:
>
> On Thu, Sep 19, 2024 at 10:46 PM Amit Kapila  wrote:
> >
> > On Fri, Sep 20, 2024 at 5:13 AM David Rowley  wrote:
> > >
> > > On Fri, 20 Sept 2024 at 05:03, Masahiko Sawada  
> > > wrote:
> > > > I've done other benchmarking tests while changing the memory block
> > > > sizes from 8kB to 8MB. I measured the execution time of logical
> > > > decoding of one transaction that inserted 10M rows. I set
> > > > logical_decoding_work_mem large enough to avoid spilling behavior. In
> > > > this scenario, we allocate many memory chunks while decoding the
> > > > transaction and resulting in calling more malloc() in smaller memory
> > > > block sizes. Here are results (an average of 3 executions):
> > >
> > > I was interested in seeing the memory consumption with the test that
> > > was causing an OOM due to the GenerationBlock fragmentation.
> > >
> >
> > +1. That test will be helpful.
>
> Sure. Here are results of peak memory usage in bytes reported by
> MemoryContextMemAllocated() (when rb->size shows 43MB):
>
> 8kB:   52,371,328
> 16kB: 52,887,424
> 32kB: 53,428,096
> 64kB: 55,099,264
> 128kB:   86,163,328
> 256kB: 149,340,032
> 512kB: 273,334,144
> 1MB:523,419,520
> 2MB: 1,021,493,120
> 4MB: 1,984,085,888
> 8MB: 2,130,886,528
>
> Probably we can increase the size to 64kB?
>

Yeah, but before deciding on a particular size, we need more testing
on different platforms as suggested by David.

-- 
With Regards,
Amit Kapila.




Re: Using per-transaction memory contexts for storing decoded tuples

2024-09-22 Thread Amit Kapila
On Sun, Sep 22, 2024 at 11:27 AM David Rowley  wrote:
>
> On Fri, 20 Sept 2024 at 17:46, Amit Kapila  wrote:
> >
> > On Fri, Sep 20, 2024 at 5:13 AM David Rowley  wrote:
> > > In general, it's a bit annoying to have to code around this
> > > GenerationContext fragmentation issue.
> >
> > Right, and I am also slightly afraid that this may cause some
> > regression in other cases where defrag wouldn't help.
>
> Yeah, that's certainly a possibility. I was hoping that
> MemoryContextMemAllocated() being much larger than logical_work_mem
> could only happen when there is fragmentation, but certainly, you
> could be wasting effort trying to defrag transactions where the
> changes all arrive in WAL consecutively and there is no
> defragmentation. It might be some other large transaction that's
> causing the context's allocations to be fragmented. I don't have any
> good ideas on how to avoid wasting effort on non-problematic
> transactions. Maybe there's something that could be done if we knew
> the LSN of the first and last change and the gap between the LSNs was
> much larger than the WAL space used for this transaction. That would
> likely require tracking way more stuff than we do now, however.
>

With more information tracking, we could avoid defragmenting some
non-problematic transactions, but it would still be difficult to be
sure that we don't harm many cases, because only a few interleaving
small transactions are needed to make the memory non-contiguous. We
can try to think of ideas for implementing defragmentation in our code
if we can first prove that smaller block sizes cause problems.

> With the smaller blocks idea, I'm a bit concerned that using smaller
> blocks could cause regressions on systems that are better at releasing
> memory back to the OS after free() as no doubt malloc() would often be
> slower on those systems. There have been some complaints recently
> about glibc being a bit too happy to keep hold of memory after free()
> and I wondered if that was the reason why the small block test does
> not cause much of a performance regression. I wonder how the small
> block test would look on Mac, FreeBSD or Windows. I think it would be
> risky to assume that all is well with reducing the block size after
> testing on a single platform.
>

Good point. We need extensive testing on different platforms, as you
suggest, to verify whether smaller block sizes cause any regressions.

-- 
With Regards,
Amit Kapila.




Re: Pgoutput not capturing the generated columns

2024-09-22 Thread Amit Kapila
On Mon, Sep 23, 2024 at 4:10 AM Peter Smith  wrote:
>
> On Fri, Sep 20, 2024 at 2:26 PM Amit Kapila  wrote:
> >
> > On Fri, Sep 20, 2024 at 4:16 AM Peter Smith  wrote:
> > >
> > > On Fri, Sep 20, 2024 at 3:26 AM Masahiko Sawada  
> > > wrote:
> > > >
> > > > On Thu, Sep 19, 2024 at 2:32 AM Amit Kapila  
> > > > wrote:
> > > > >
> > > > >
> > > > > Users can use a publication like "create publication pub1 for table
> > > > > t1(c1, c2), t2;" where they want t1's generated column to be published
> > > > > but not for t2. They can specify the generated column name in the
> > > > > column list of t1 in that case even though the rest of the tables
> > > > > won't publish generated columns.
> > > >
> > > > Agreed.
> > > >
> > > > I think that users can use the publish_generated_column option when
> > > > they want to publish all generated columns, instead of specifying all
> > > > the columns in the column list. It's another advantage of this option
> > > > that it will also include the future generated columns.
> > > >
> > >
> > > OK. Let me give some examples below to help understand this idea.
> > >
> > > Please correct me if these are incorrect.
> > >
> > > Examples, when publish_generated_columns=true:
> > >
> > > CREATE PUBLICATION pub1 FOR t1(a,b,gen2), t2 WITH
> > > (publish_generated_columns=true)
> > > t1 -> publishes a, b, gen2 (e.g. what column list says)
> > > t2 -> publishes c, d + ALSO gen1, gen2
> > >
> > > CREATE PUBLICATION pub1 FOR t1, t2(gen1) WITH 
> > > (publish_generated_columns=true)
> > > t1 -> publishes a, b + ALSO gen1, gen2
> > > t2 -> publishes gen1 (e.g. what column list says)
> > >
> >
> > These two could be controversial because one could expect that if
> > "publish_generated_columns=true" then publish generated columns
> > irrespective of whether they are mentioned in column_list. I am of the
> > opinion that column_list should take priority the results should be as
> > mentioned by you but let us see if anyone thinks otherwise.
> >
> > >
> > > ==
> > >
> > > The idea LGTM, although now the parameter name
> > > ('publish_generated_columns') seems a bit misleading since sometimes
> > > generated columns get published "irrespective of the option".
> > >
> > > So, I think the original parameter name 'include_generated_columns'
> > > might be better here because IMO "include" seems more like "add them
> > > if they are not already specified", which is exactly what this idea is
> > > doing.
> > >
> >
> > I still prefer 'publish_generated_columns' because it matches with
> > other publication option names. One can also deduce from
> > 'include_generated_columns' that add all the generated columns even
> > when some of them are specified in column_list.
> >
>
> Fair point. Anyway, to avoid surprises it will be important for the
> precedence rules to be documented clearly (probably with some
> examples),
>

Yeah, one or two examples would be good, but we can have a separate
doc patch that clearly mentions all the rules.

-- 
With Regards,
Amit Kapila.




Re: Pgoutput not capturing the generated columns

2024-09-22 Thread Amit Kapila
On Sat, Sep 21, 2024 at 3:19 AM Masahiko Sawada  wrote:
>
> On Thu, Sep 19, 2024 at 9:26 PM Amit Kapila  wrote:
> >
> > >
> > > OK. Let me give some examples below to help understand this idea.
> > >
> > > Please correct me if these are incorrect.
> > >
> > > Examples, when publish_generated_columns=true:
> > >
> > > CREATE PUBLICATION pub1 FOR t1(a,b,gen2), t2 WITH
> > > (publish_generated_columns=true)
> > > t1 -> publishes a, b, gen2 (e.g. what column list says)
> > > t2 -> publishes c, d + ALSO gen1, gen2
> > >
> > > CREATE PUBLICATION pub1 FOR t1, t2(gen1) WITH 
> > > (publish_generated_columns=true)
> > > t1 -> publishes a, b + ALSO gen1, gen2
> > > t2 -> publishes gen1 (e.g. what column list says)
> > >
> >
> > These two could be controversial because one could expect that if
> > "publish_generated_columns=true" then publish generated columns
> > irrespective of whether they are mentioned in column_list. I am of the
> > opinion that column_list should take priority the results should be as
> > mentioned by you but let us see if anyone thinks otherwise.
>
> I agree with Amit. We also publish t2's future generated column in the
> first example and t1's future generated columns in the second example.
>

Right, it would be good to have at least one test that shows future
generated columns also get published wherever applicable (like where
column_list is not given and publish_generated_columns is true).
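
A rough sketch of such a test could be the following
(publish_generated_columns is the option proposed in this thread, so
this is illustrative only):

-- a generated column added after the publication exists should also
-- be published when no column list is given and the option is true
CREATE TABLE t2 (c int, d int);
CREATE PUBLICATION pub2 FOR TABLE t2 WITH (publish_generated_columns = true);
ALTER TABLE t2 ADD COLUMN gen3 int GENERATED ALWAYS AS (c + d) STORED;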

-- 
With Regards,
Amit Kapila.




Re: Documentation to upgrade logical replication cluster

2024-09-20 Thread Amit Kapila
On Mon, May 6, 2024 at 10:40 AM vignesh C  wrote:
>
> The v9 version patch was not applying on top of HEAD because of few
> commits, the updated v10 version patch is rebased on top of HEAD.
>

Let's say publisher is in node1 and subscriber is
+  in node2. The subscriber node2 has
+  two subscriptions sub1_node1_node2 and
+  sub2_node1_node2 which are subscribing the changes
+  from node1.

Do we need to show multiple subscriptions? You are following the same
steps for both subscriptions, so showing them twice may not add much
value. You can write the steps for one and add a note saying they have
to be repeated for the other subscriptions present.

-- 
With Regards,
Amit Kapila.




Re: Conflict detection for update_deleted in logical replication

2024-09-20 Thread Amit Kapila
On Fri, Sep 20, 2024 at 8:25 AM Zhijie Hou (Fujitsu)
 wrote:
>
> Apart from the vacuum_defer_cleanup_age idea.
>

I think you meant to say vacuum_committs_age idea.

> we’ve given more thought to our
> approach for retaining dead tuples and have come up with another idea that can
> reliably detect conflicts without requiring users to choose a wise value for
> the vacuum_committs_age. This new idea could also reduce the performance
> impact. Thanks a lot to Amit for off-list discussion.
>
> The concept of the new idea is that, the dead tuples are only useful to detect
> conflicts when applying *concurrent* transactions from remotes. Any subsequent
> UPDATE from a remote node after removing the dead tuples should have a later
> timestamp, meaning it's reasonable to detect an update_missing scenario and
> convert the UPDATE to an INSERT when applying it.
>
> To achieve above, we can create an additional replication slot on the
> subscriber side, maintained by the apply worker. This slot is used to retain
> the dead tuples. The apply worker will advance the slot.xmin after confirming
> that all the concurrent transaction on publisher has been applied locally.
>
> The process of advancing the slot.xmin could be:
>
> 1) the apply worker call GetRunningTransactionData() to get the
> 'oldestRunningXid' and consider this as 'candidate_xmin'.
> 2) the apply worker send a new message to walsender to request the latest wal
> flush position(GetFlushRecPtr) on publisher, and save it to
> 'candidate_remote_wal_lsn'. Here we could introduce a new feedback message or
> extend the existing keepalive message(e,g extends the requestReply bit in
> keepalive message to add a 'request_wal_position' value)
> 3) The apply worker can continue to apply changes. After applying all the WALs
> upto 'candidate_remote_wal_lsn', the apply worker can then advance the
> slot.xmin to 'candidate_xmin'.
>
> This approach ensures that dead tuples are not removed until all concurrent
> transactions have been applied. It can be effective for both bidirectional and
> non-bidirectional replication cases.
>
> We could introduce a boolean subscription option (retain_dead_tuples) to
> control whether this feature is enabled. Each subscription intending to detect
> update-delete conflicts should set retain_dead_tuples to true.
>

As each apply worker needs a separate slot to retain deleted rows, the
requirement for slots will increase. The other possibility is to
maintain one slot by the launcher or some other central process that
traverses all subscriptions and remembers the ones marked with
retain_dead_tuples (let's call this list retain_sub_list). Then, using
running_transactions, get the oldest running_xact, get the remote
flush location from the other node (publisher node), and store those
as candidate values (candidate_xmin and candidate_remote_wal_lsn) in
the slot. We can probably reuse the existing candidate variables of
the slot. Next, we can check the remote_flush locations from all the
origins corresponding to subscriptions in retain_sub_list and, if all
are ahead of candidate_remote_wal_lsn, update the slot's xmin to
candidate_xmin.

I think in the above idea we can add an optimization to combine the
requests for the remote WAL LSN from different subscriptions pointing
to the same node, to avoid sending multiple requests to that node. I
am not sure if using pg_subscription.subconninfo is sufficient for
this; if not, we can probably leave out this optimization.

If this idea is feasible, it would reduce the number of slots required
to retain the deleted rows, but the launcher needs to get the remote
WAL location corresponding to each publisher node. There are two ways
to achieve that: (a) the launcher requests one of the apply workers
corresponding to subscriptions pointing to the same publisher node to
get this information; (b) the launcher launches another worker to get
the remote WAL flush location.
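
For reference, the proposed subscription option from above would then
be used roughly like this (retain_dead_tuples doesn't exist yet; the
names and connection string are illustrative only):

-- illustrative only; each subscription that wants update_deleted
-- detection opts in, and the centrally maintained slot retains the
-- dead tuples needed to detect it
CREATE SUBSCRIPTION sub1
    CONNECTION 'host=publisher_node dbname=postgres'
    PUBLICATION pub1
    WITH (retain_dead_tuples = true);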

-- 
With Regards,
Amit Kapila.




Re: Using per-transaction memory contexts for storing decoded tuples

2024-09-19 Thread Amit Kapila
On Fri, Sep 20, 2024 at 5:13 AM David Rowley  wrote:
>
> On Fri, 20 Sept 2024 at 05:03, Masahiko Sawada  wrote:
> > I've done other benchmarking tests while changing the memory block
> > sizes from 8kB to 8MB. I measured the execution time of logical
> > decoding of one transaction that inserted 10M rows. I set
> > logical_decoding_work_mem large enough to avoid spilling behavior. In
> > this scenario, we allocate many memory chunks while decoding the
> > transaction and resulting in calling more malloc() in smaller memory
> > block sizes. Here are results (an average of 3 executions):
>
> I was interested in seeing the memory consumption with the test that
> was causing an OOM due to the GenerationBlock fragmentation.
>

+1. That test will be helpful.

> > The fact that we're using rb->size and txn->size to choose the
> > transactions to evict could make this idea less attractive. Even if we
> > defragment transactions, rb->size and txn->size don't change.
> > Therefore, it doesn't mean we can avoid evicting transactions due to
> > full of logical_decoding_work_mem, but just mean the amount of
> > allocated memory might have been reduced.
>
> I had roughly imagined that you'd add an extra field to store
> txn->size when the last fragmentation was done and only defrag the
> transaction when the ReorderBufferTXN->size is, say, double the last
> size when the changes were last defragmented. Of course, if the first
> defragmentation was enough to drop MemoryContextMemAllocated() below
> the concerning threshold above logical_decoding_work_mem then further
> defrags wouldn't be done, at least, until such times as the
> MemoryContextMemAllocated() became concerning again.  If you didn't
> want to a full Size variable for the defragmentation size, you could
> just store a uint8 to store which power of 2 ReorderBufferTXN->size
> was when it was last defragmented. There are plenty of holds in that
> struct that could be filled without enlarging the struct.
>
> In general, it's a bit annoying to have to code around this
> GenerationContext fragmentation issue.
>

Right, and I am also slightly afraid that this may cause some
regression in other cases where defrag wouldn't help.

-- 
With Regards,
Amit Kapila.




Re: Using per-transaction memory contexts for storing decoded tuples

2024-09-19 Thread Amit Kapila
On Thu, Sep 19, 2024 at 10:33 PM Masahiko Sawada  wrote:
>
> On Wed, Sep 18, 2024 at 8:55 PM Amit Kapila  wrote:
> >
> > On Thu, Sep 19, 2024 at 6:46 AM David Rowley  wrote:
> > >
> > > On Thu, 19 Sept 2024 at 11:54, Masahiko Sawada  
> > > wrote:
> > > > I've done some benchmark tests for three different code bases with
> > > > different test cases. In short, reducing the generation memory context
> > > > block size to 8kB seems to be promising; it mitigates the problem
> > > > while keeping a similar performance.
> > >
> > > Did you try any sizes between 8KB and 8MB?  1000x reduction seems
> > > quite large a jump. There is additional overhead from having more
> > > blocks. It means more malloc() work and more free() work when deleting
> > > a context. It would be nice to see some numbers with all powers of 2
> > > between 8KB and 8MB.  I imagine the returns are diminishing as the
> > > block size is reduced further.
> > >
> >
> > Good idea.
>
> Agreed.
>
> I've done other benchmarking tests while changing the memory block
> sizes from 8kB to 8MB. I measured the execution time of logical
> decoding of one transaction that inserted 10M rows. I set
> logical_decoding_work_mem large enough to avoid spilling behavior. In
> this scenario, we allocate many memory chunks while decoding the
> transaction and resulting in calling more malloc() in smaller memory
> block sizes. Here are results (an average of 3 executions):
>
> 8kB: 19747.870 ms
> 16kB: 19780.025 ms
> 32kB: 19760.575 ms
> 64kB: 19772.387 ms
> 128kB: 19825.385 ms
> 256kB: 19781.118 ms
> 512kB: 19808.138 ms
> 1MB: 19757.640 ms
> 2MB: 19801.429 ms
> 4MB: 19673.996 ms
> 8MB: 19643.547 ms
>
> Interestingly, there were no noticeable differences in the execution
> time. I've checked the number of allocated memory blocks in each case
> and more blocks are allocated in smaller block size cases. For
> example, when the logical decoding used the maximum memory (about
> 1.5GB), we allocated about 80k blocks in 8kb memory block size case
> and 80 blocks in 8MB memory block cases.
>

What exactly do these test results mean? Do you want to prove that
there is no regression by using smaller block sizes?

-- 
With Regards,
Amit Kapila.




Re: Pgoutput not capturing the generated columns

2024-09-19 Thread Amit Kapila
On Thu, Sep 19, 2024 at 10:56 PM Masahiko Sawada  wrote:
>
>
> Given that we publish the generated columns if they are mentioned in
> the column list, can we separate the patch into two if it helps
> reviews? One is to allow logical replication to publish generated
> columns if they are explicitly mentioned in the column list. The
> second patch is to introduce the publish_generated_columns option.
>

It sounds like a reasonable idea to me but I haven't looked at the
feasibility of the same. So, if it is possible without much effort, we
should split the patch as per your suggestion.

-- 
With Regards,
Amit Kapila.




Re: Pgoutput not capturing the generated columns

2024-09-19 Thread Amit Kapila
On Fri, Sep 20, 2024 at 4:16 AM Peter Smith  wrote:
>
> On Fri, Sep 20, 2024 at 3:26 AM Masahiko Sawada  wrote:
> >
> > On Thu, Sep 19, 2024 at 2:32 AM Amit Kapila  wrote:
> > >
> > >
> > > Users can use a publication like "create publication pub1 for table
> > > t1(c1, c2), t2;" where they want t1's generated column to be published
> > > but not for t2. They can specify the generated column name in the
> > > column list of t1 in that case even though the rest of the tables
> > > won't publish generated columns.
> >
> > Agreed.
> >
> > I think that users can use the publish_generated_column option when
> > they want to publish all generated columns, instead of specifying all
> > the columns in the column list. It's another advantage of this option
> > that it will also include the future generated columns.
> >
>
> OK. Let me give some examples below to help understand this idea.
>
> Please correct me if these are incorrect.
>
> Examples, when publish_generated_columns=true:
>
> CREATE PUBLICATION pub1 FOR t1(a,b,gen2), t2 WITH
> (publish_generated_columns=true)
> t1 -> publishes a, b, gen2 (e.g. what column list says)
> t2 -> publishes c, d + ALSO gen1, gen2
>
> CREATE PUBLICATION pub1 FOR t1, t2(gen1) WITH (publish_generated_columns=true)
> t1 -> publishes a, b + ALSO gen1, gen2
> t2 -> publishes gen1 (e.g. what column list says)
>

These two could be controversial because one could expect that if
"publish_generated_columns=true" then generated columns are published
irrespective of whether they are mentioned in the column list. I am of
the opinion that the column list should take priority and the results
should be as you mentioned, but let us see if anyone thinks otherwise.
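
To make this concrete, assuming a t1 with two generated columns and the
patch under discussion (which allows generated columns in a column
list), the column-list-takes-priority reading gives:

-- illustrative schema matching the quoted examples
CREATE TABLE t1 (a int, b int,
                 gen1 int GENERATED ALWAYS AS (a * 2) STORED,
                 gen2 int GENERATED ALWAYS AS (b * 2) STORED);
-- column list wins: only a, b and gen2 are published for t1, even
-- though publish_generated_columns (the proposed option) is true
CREATE PUBLICATION pub1 FOR TABLE t1 (a, b, gen2)
    WITH (publish_generated_columns = true);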

>
> ==
>
> The idea LGTM, although now the parameter name
> ('publish_generated_columns') seems a bit misleading since sometimes
> generated columns get published "irrespective of the option".
>
> So, I think the original parameter name 'include_generated_columns'
> might be better here because IMO "include" seems more like "add them
> if they are not already specified", which is exactly what this idea is
> doing.
>

I still prefer 'publish_generated_columns' because it matches the
other publication option names. One could also read
'include_generated_columns' as implying that all the generated columns
are added even when some of them are already specified in the column
list.

-- 
With Regards,
Amit Kapila.




Re: Pgoutput not capturing the generated columns

2024-09-19 Thread Amit Kapila
On Tue, Sep 17, 2024 at 12:04 PM Peter Smith  wrote:
>
> On Tue, Sep 17, 2024 at 4:15 PM Masahiko Sawada  wrote:
> >
> > On Mon, Sep 16, 2024 at 8:09 PM Peter Smith  wrote:
> > >
> > > I thought that the option "publish_generated_columns" is more related
> > > to "column lists" than "row filters".
> > >
> > > Let's say table 't1' has columns 'a', 'b', 'c', 'gen1', 'gen2'.
> > >
> >
> > > And
> > > PUBLICATION pub2 FOR TABLE t1 WITH (publish_generated_columns = false);
> > > is equivalent to
> > > PUBLICATION pub2 FOR TABLE t1(a,b,c);
> >
> > This makes sense to me as it preserves the current behavior.
> >
> > > Then:
> > > PUBLICATION pub1 FOR TABLE t1 WITH (publish_generated_columns = true);
> > > is equivalent to
> > > PUBLICATION pub1 FOR TABLE t1(a,b,c,gen1,gen2);
> >
> > This also makes sense. It would also include future generated columns.
> >
> > > So, I would expect this to fail because the SUBSCRIPTION docs say
> > > "Subscriptions having several publications in which the same table has
> > > been published with different column lists are not supported."
> >
> > So I agree that it would raise an error if users subscribe to both
> > pub1 and pub2.
> >
> > And looking back at your examples,
> >
> > > > > e.g.1
> > > > > -
> > > > > CREATE PUBLICATION pub1 FOR TABLE t1 WITH (publish_generated_columns 
> > > > > = true);
> > > > > CREATE PUBLICATION pub2 FOR TABLE t1 WITH (publish_generated_columns 
> > > > > = false);
> > > > > CREATE SUBSCRIPTION sub ... PUBLICATIONS pub1,pub2;
> > > > > -
> > > > >
> > > > > e.g.2
> > > > > -
> > > > > CREATE PUBLICATION pub1 FOR ALL TABLES WITH 
> > > > > (publish_generated_columns = true);
> > > > > CREATE PUBLICATION pub2 FOR TABLE t1 WITH (publish_generated_columns 
> > > > > = false);
> > > > > CREATE SUBSCRIPTION sub ... PUBLICATIONS pub1,pub2;
> > > > > -
> >
> > Both examples would not be supported.
> >
> > > > >
> > > > > The CREATE SUBSCRIPTION docs [1] only says "Subscriptions having
> > > > > several publications in which the same table has been published with
> > > > > different column lists are not supported."
> > > > >
> > > > > Perhaps the user is supposed to deduce that the example above would
> > > > > work OK if table 't1' has no generated cols. OTOH, if it did have
> > > > > generated cols then the PUBLICATION column lists must be different and
> > > > > therefore it is "not supported" (??).
> > > >
> > > > With the patch, how should this feature work when users specify a
> > > > generated column to the column list and set publish_generated_column =
> > > > false, in the first place? raise an error (as we do today)? or always
> > > > send NULL?
> > >
> > > For this scenario, I suggested (see [1] #3) that the code could give a
> > > WARNING. As I wrote up-thread: This combination doesn't seem
> > > like something a user would do intentionally, so just silently
> > > ignoring it (which the current patch does) is likely going to give
> > > someone unexpected results/grief.
> >
> > It gives a WARNING, and then publishes the specified generated column
> > data (even if publish_generated_column = false)?


I think that the column list should take priority, and we should
publish the generated column if it is mentioned in the column list,
irrespective of the option.

> > If so, it would mean
> > that specifying the generated column to the column list means to
> > publish its data regardless of the publish_generated_column parameter
> > value.
> >
>
> No. I meant only it can give the WARNING to tell the user user  "Hey,
> there is a conflict here because you said publish_generated_column=
> false, but you also specified gencols in the column list".
>

Users can use a publication like "create publication pub1 for table
t1(c1, c2), t2;" where they want t1's generated column to be published
but not t2's. They can specify the generated column name in the
column list of t1 in that case, even though the rest of the tables
won't publish generated columns.

-- 
With Regards,
Amit Kapila.




Re: Pgoutput not capturing the generated columns

2024-09-19 Thread Amit Kapila
On Thu, Aug 29, 2024 at 8:44 AM Masahiko Sawada  wrote:
>
> On Wed, Aug 28, 2024 at 1:06 AM Amit Kapila  wrote:
> >
> > >
> > > As Euler mentioned earlier, I think it's a decision not to replicate
> > > generated columns because we don't know the target table on the
> > > subscriber has the same expression and there could be locale issues
> > > even if it looks the same. I can see that a benefit of this proposal
> > > would be to save cost to compute generated column values if the user
> > > wants the target table on the subscriber to have exactly the same data
> > > as the publisher's one. Are there other benefits or use cases?
> > >
> >
> > The cost is one but the other is the user may not want the data to be
> > different based on volatile functions like timeofday()
>
> Shouldn't the generation expression be immutable?
>
> > or the table on
> > subscriber won't have the column marked as generated.
>
> Yeah, it would be another use case.
>

While speaking with one of the decoding output plugin users, I learned
that this feature will be useful when replicating data to a
non-postgres database using the plugin output, especially when the
other database doesn't have a generated column concept.

-- 
With Regards,
Amit Kapila.




Re: Allow logical failover slots to wait on synchronous replication

2024-09-18 Thread Amit Kapila
On Tue, Sep 17, 2024 at 9:08 AM shveta malik  wrote:
>
> On Mon, Sep 16, 2024 at 4:04 PM Amit Kapila  wrote:
> >
> > On Mon, Sep 16, 2024 at 2:55 PM shveta malik  wrote:
> > >
> > > On Mon, Sep 16, 2024 at 11:13 AM Amit Kapila  
> > > wrote:
> > > >
> > >
> > > > Another question aside from the above point, what if someone has
> > > > specified logical subscribers in synchronous_standby_names? In the
> > > > case of synchronized_standby_slots, we won't proceed with such slots.
> > > >
> > >
> > > Yes, it is a possibility. I have missed this point earlier. Now I
> > > tried a case where I give a mix of logical subscriber and physical
> > > standby in 'synchronous_standby_names' on pgHead, it even takes that
> > > 'mix' configuration and starts waiting accordingly.
> > >
> > > synchronous_standby_names = 'FIRST 2(logicalsub_1, phy_standby_1,
> > > phy_standby_2)';
> > >
> >
> > This should not happen as we don't support syncing failover slots on
> > logical subscribers.
>
> +1
>
> > The other point to consider here is that the user
> > may not have set 'sync_replication_slots' on all the physical standbys
> > mentioned in 'synchronous_standby_names' and in that case, it doesn't
> > make sense to wait for WAL to get flushed on those standbys. What do
> > you think?
> >
>
> Yes, it is a possibility. But then it is a possibility in case of
> 'synchronized_standby_slots' as well. User may always configure one of
> the standbys in  'synchronized_standby_slots' while may not configure
> slot-sync GUCs on that standby (hot_standby_feedback,
> sync_replication_slots etc). In such a case, logical replication is
> dependent upon the concerned physical standby even though latter is
> not syncing failover slots.
>

The difference is that the purpose of 'synchronized_standby_slots' is
to mention the slot names for which the user expects logical
walsenders to wait before sending the logical changes to subscribers.
OTOH, 'synchronous_standby_names' also serves a different purpose. It
is not clear to me whether users would be interested in syncing
failover slots to all the standbys mentioned in
'synchronous_standby_names'.

-- 
With Regards,
Amit Kapila.




Re: Add contrib/pg_logicalsnapinspect

2024-09-18 Thread Amit Kapila
On Tue, Sep 17, 2024 at 12:44 PM Bertrand Drouvot
 wrote:
>
> On Tue, Sep 17, 2024 at 10:18:35AM +0530, shveta malik wrote:
> > Thanks for addressing the comments. I have not started reviewing v4
> > yet, but here are few more comments on v3:
> >
>
> > 4)
> > Most of the output columns in pg_get_logical_snapshot_info() look
> > self-explanatory except 'state'. Should we have meaningful 'text' here
> > corresponding to SnapBuildState? Similar to what we do for
> > 'invalidation_reason' in pg_replication_slots. (SlotInvalidationCauses
> > for ReplicationSlotInvalidationCause)
>
> Yeah we could. I was not sure about that (and that was my first remark in [1])
> , as the module is mainly for debugging purpose, I was thinking that the one
> using it could refer to "snapbuild.h". Let's see what others think.
>

Displaying the 'text' for the state column makes it easy to
understand. So, +1 for doing it that way.

-- 
With Regards,
Amit Kapila.




Re: Using per-transaction memory contexts for storing decoded tuples

2024-09-18 Thread Amit Kapila
On Thu, Sep 19, 2024 at 6:46 AM David Rowley  wrote:
>
> On Thu, 19 Sept 2024 at 11:54, Masahiko Sawada  wrote:
> > I've done some benchmark tests for three different code bases with
> > different test cases. In short, reducing the generation memory context
> > block size to 8kB seems to be promising; it mitigates the problem
> > while keeping a similar performance.
>
> Did you try any sizes between 8KB and 8MB?  1000x reduction seems
> quite large a jump. There is additional overhead from having more
> blocks. It means more malloc() work and more free() work when deleting
> a context. It would be nice to see some numbers with all powers of 2
> between 8KB and 8MB.  I imagine the returns are diminishing as the
> block size is reduced further.
>

Good idea.

> Another alternative idea would be to defragment transactions with a
> large number of changes after they grow to some predefined size.
> Defragmentation would just be a matter of performing
> palloc/memcpy/pfree for each change. If that's all done at once, all
> the changes for that transaction would be contiguous in memory. If
> you're smart about what the trigger point is for performing the
> defragmentation then I imagine there's not much risk of performance
> regression for the general case.  For example, you might only want to
> trigger it when MemoryContextMemAllocated() for the generation context
> exceeds logical_decoding_work_mem by some factor and only do it for
> transactions where the size of the changes exceeds some threshold.
>

After collecting the changes that exceed 'logical_decoding_work_mem',
one can choose to stream the transaction and free the changes to avoid
hitting this problem; however, we can use that or some other constant
to decide the point of defragmentation. The other point we need to
think about with this idea is whether we actually need any
defragmentation at all. This will depend on whether there are
concurrent transactions being decoded. This would require benchmarking
to see the performance impact.

-- 
With Regards,
Amit Kapila.




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-09-18 Thread Amit Kapila
On Mon, Sep 16, 2024 at 10:41 PM Bharath Rupireddy
 wrote:
>
> Thanks for looking into this.
>
> On Mon, Sep 16, 2024 at 4:54 PM Amit Kapila  wrote:
> >
> > Why raise the ERROR just for timeout invalidation here and why not if
> > the slot is invalidated for other reasons? This raises the question of
> > what happens before this patch if the invalid slot is used from places
> > where we call ReplicationSlotAcquire(). I did a brief code analysis
> > and found that for StartLogicalReplication(), even if the error won't
> > occur in ReplicationSlotAcquire(), it would have been caught in
> > CreateDecodingContext(). I think that is where we should also add this
> > new error. Similarly, pg_logical_slot_get_changes_guts() and other
> > logical replication functions should be calling
> > CreateDecodingContext() which can raise the new ERROR. I am not sure
> > about how the invalid slots are handled during physical replication,
> > please check the behavior of that before this patch.
>
> When physical slots are invalidated due to wal_removed reason, the failure 
> happens at a much later point for the streaming standbys while reading the 
> requested WAL files like the following:
>
> 2024-09-16 16:29:52.416 UTC [876059] FATAL:  could not receive data from WAL 
> stream: ERROR:  requested WAL segment 00010005 has already 
> been removed
> 2024-09-16 16:29:52.416 UTC [872418] LOG:  waiting for WAL to become 
> available at 0/5002000
>
> At this point, despite the slot being invalidated, its wal_status can still 
> come back to 'unreserved' even from 'lost', and the standby can catch up if 
> removed WAL files are copied either by manually or by a tool/script to the 
> primary's pg_wal directory. IOW, the physical slots invalidated due to 
> wal_removed are *somehow* recoverable unlike the logical slots.
>
> IIUC, the invalidation of a slot implies that it is not guaranteed to hold 
> any resources like WAL and XMINs. Does it also imply that the slot must be 
> unusable?
>

If we can't hold the dead rows against the xmin of the invalid slot,
then how can we make it usable even after copying the required WAL?

-- 
With Regards,
Amit Kapila.




Re: Conflict detection for update_deleted in logical replication

2024-09-17 Thread Amit Kapila
On Tue, Sep 17, 2024 at 11:24 PM Masahiko Sawada  wrote:
>
> On Mon, Sep 16, 2024 at 11:53 PM Amit Kapila  wrote:
> >
> > On Tue, Sep 17, 2024 at 6:08 AM Masahiko Sawada  
> > wrote:
> >
> > I haven't thought about the implementation details yet but I think
> > during pruning (for example in heap_prune_satisfies_vacuum()), apart
> > from checking if the tuple satisfies
> > HeapTupleSatisfiesVacuumHorizon(), we should also check if the tuple's
> > committs is greater than configured vacuum_committs_age (for the
> > table) to decide whether tuple can be removed.
>
> Sounds very costly. I think we need to do performance tests. Even if
> the vacuum gets slower only on the particular table having the
> vacuum_committs_age setting, it would affect overall autovacuum
> performance. Also, it would affect HOT pruning performance.
>

Agreed that we should do some performance testing and additionally
think of a better way to implement this. I think the cost won't be
much if the tuples to be removed are from a single transaction because
the required commit_ts information would be cached, but when the
tuples are from different transactions, we could see a noticeable
impact. We need to test before saying anything concrete on this.

> >
> > > > but IIUC this value doesn’t need to be significant; it
> > > > can be limited to just a few minutes. The one which is sufficient to
> > > > handle replication delays caused by network lag or other factors,
> > > > assuming clock skew has already been addressed.
> > >
> > > I think that in a non-bidirectional case the value could need to be a
> > > large number. Is that right?
> > >
> >
> > As per my understanding, even for non-bidirectional cases, the value
> > should be small. For example, in the case, pointed out by Shveta [1],
> > where the updates from 2 nodes are received by a third node, this
> > setting is expected to be small. This setting primarily deals with
> > concurrent transactions on multiple nodes, so it should be small but I
> > could be missing something.
> >
>
> I might be missing something but the scenario I was thinking of is
> something below.
>
> Suppose that we setup uni-directional logical replication between Node
> A and Node B (e.g., Node A -> Node B) and both nodes have the same row
> with key = 1:
>
> Node A:
> T1: UPDATE t SET val = 2 WHERE key = 1; (10:00 AM)
>   -> This change is applied on Node B at 10:01 AM.
>
> Node B:
> T2: DELETE FROM t WHERE key = 1; (05:00 AM)
>
> If a vacuum runs on Node B at 06:00 AM, the change of T1 coming from
> Node A would raise an "update_missing" conflict. On the other hand, if
> a vacuum runs on Node B at 11:00 AM, the change would raise an
> "update_deleted" conflict. It looks whether we detect an
> "update_deleted" or an "updated_missing" depends on the timing of
> vacuum, and to avoid such a situation, we would need to set
> vacuum_committs_age to more than 5 hours.
>

Yeah, in this case, it would detect a different conflict (if we don't
set vacuum_committs_age to greater than 5 hours), but as per my
understanding, the primary purpose of conflict detection and
resolution is to avoid data inconsistency in a bi-directional setup.
Assume the above case is a bi-directional setup; then we want to have
the same data in both nodes. Now, if there are other cases like the
one you mentioned that require detecting the conflict reliably, then I
agree this value could be large and probably not the best way to
achieve it. I think we can mention in the docs that the primary
purpose of this is to achieve data consistency among bi-directional
kinds of setups.

Having said that, even in the above case, the result should be the
same whether the vacuum has removed the row or not. Say the vacuum has
not yet removed the row (due to vacuum_committs_age or otherwise);
even then, because the incoming update has a later timestamp, we will
convert the update to an insert as per the last_update_wins resolution
method, so the conflict will be considered as update_missing. And say
the vacuum has removed the row and the conflict detected is
update_missing; then also we will convert the update to an insert. In
short, if the UPDATE has a lower commit-ts, the DELETE should win, and
if the UPDATE has a higher commit-ts, the UPDATE should win.

So, we can expect data consistency in bidirectional cases and expect
deterministic behavior in other cases (e.g. the final data in a table
does not depend on the order of applying the transactions from other
nodes).

-- 
With Regards,
Amit Kapila.




Re: Using per-transaction memory contexts for storing decoded tuples

2024-09-17 Thread Amit Kapila
On Mon, Sep 16, 2024 at 10:43 PM Masahiko Sawada  wrote:
>
> On Fri, Sep 13, 2024 at 3:58 AM Amit Kapila  wrote:
> >
> > On Thu, Sep 12, 2024 at 4:03 AM Masahiko Sawada  
> > wrote:
> > >
> > > We have several reports that logical decoding uses memory much more
> > > than logical_decoding_work_mem[1][2][3]. For instance in one of the
> > > reports[1], even though users set logical_decoding_work_mem to
> > > '256MB', a walsender process was killed by OOM because of using more
> > > than 4GB memory.
> > >
> > > As per the discussion in these threads so far, what happened was that
> > > there was huge memory fragmentation in rb->tup_context.
> > > rb->tup_context uses GenerationContext with 8MB memory blocks. We
> > > cannot free memory blocks until all memory chunks in the block are
> > > freed. If there is a long-running transaction making changes, its
> > > changes could be spread across many memory blocks and we end up not
> > > being able to free memory blocks unless the long-running transaction
> > > is evicted or completed. Since we don't account fragmentation, block
> > > header size, nor chunk header size into per-transaction memory usage
> > > (i.e. txn->size), rb->size could be less than
> > > logical_decoding_work_mem but the actual allocated memory in the
> > > context is much higher than logical_decoding_work_mem.
> > >
> >
> > It is not clear to me how the fragmentation happens. Is it because of
> > some interleaving transactions which are even ended but the memory
> > corresponding to them is not released?
>
> In a generation context, we can free a memory block only when all
> memory chunks there are freed. Therefore, individual tuple buffers are
> already pfree()'ed but the underlying memory blocks are not freed.
>

I understood this part but didn't understand the cases leading to this
problem. For example, if there is a large (and only) transaction in
the system that allocates many buffers for change records during
decoding, in the end, it should free memory for all the buffers
allocated in the transaction. So, wouldn't that free all the memory
chunks corresponding to the memory blocks allocated? My guess was that
we couldn't free all the chunks because there were small interleaving
transactions that allocated memory but didn't free it before the large
transaction ended.

> > Can we try reducing the size of
> > 8MB memory blocks? The comment atop allocation says: "XXX the
> > allocation sizes used below pre-date generation context's block
> > growing code.  These values should likely be benchmarked and set to
> > more suitable values.", so do we need some tuning here?
>
> Reducing the size of the 8MB memory block would be one solution and
> could be better as it could be back-patchable. It would mitigate the
> problem but would not resolve it. I agree to try reducing it and do
> some benchmark tests. If it reasonably makes the problem less likely
> to happen, it would be a good solution.
>

makes sense.

-- 
With Regards,
Amit Kapila.




Re: Conflict detection for update_deleted in logical replication

2024-09-16 Thread Amit Kapila
On Tue, Sep 17, 2024 at 6:08 AM Masahiko Sawada  wrote:
>
> On Fri, Sep 13, 2024 at 12:56 AM shveta malik  wrote:
> >
> > On Fri, Sep 13, 2024 at 11:38 AM Amit Kapila  
> > wrote:
> > >
> > > > >
> > > > > So in brief, this solution is only for bidrectional setup? For 
> > > > > non-bidirectional,
> > > > > feedback_slots is non-configurable and thus irrelevant.
> > > >
> > > > Right.
> > > >
> > >
> > > One possible idea to address the non-bidirectional case raised by
> > > Shveta is to use a time-based cut-off to remove dead tuples. As
> > > mentioned earlier in my email [1], we can define a new GUC parameter
> > > say vacuum_committs_age which would indicate that we will allow rows
> > > to be removed only if the modified time of the tuple as indicated by
> > > committs module is greater than the vacuum_committs_age. We could keep
> > > this parameter a table-level option without introducing a GUC as this
> > > may not apply to all tables. I checked and found that some other
> > > replication solutions like GoldenGate also allowed similar parameters
> > > (tombstone_deletes) to be specified at table level [2]. The other
> > > advantage of allowing it at table level is that it won't hamper the
> > > performance of hot-pruning or vacuum in general. Note, I am careful
> > > here because to decide whether to remove a dead tuple or not we need
> > > to compare its committs_time both during hot-pruning and vacuum.
> >
> > +1 on the idea,
>
> I agree that this idea is much simpler than the idea originally
> proposed in this thread.
>
> IIUC vacuum_committs_age specifies a time rather than an XID age.
>

Your understanding is correct that vacuum_committs_age specifies a time.

>
> But
> how can we implement it? If it ends up affecting the vacuum cutoff, we
> should be careful not to end up with the same result of
> vacuum_defer_cleanup_age that was discussed before[1]. Also, I think
> the implementation needs not to affect the performance of
> ComputeXidHorizons().
>

I haven't thought about the implementation details yet, but I think
during pruning (for example in heap_prune_satisfies_vacuum()), apart
from checking if the tuple satisfies
HeapTupleSatisfiesVacuumHorizon(), we should also check if the tuple's
committs is greater than the configured vacuum_committs_age (for the
table) to decide whether the tuple can be removed. One thing to
consider is what to do in the case of aggressive vacuum, where we
expect relfrozenxid to be advanced to FreezeLimit (at a minimum). We
may want to just ignore vacuum_committs_age during aggressive vacuum
and LOG if we end up removing some tuple. This will allow users to
retain deleted tuples by respecting the freeze limits, which also
avoids XID wraparound. I think we can't retain tuples forever if the
user misconfigures vacuum_committs_age, and to avoid that we can keep
a maximum limit on this parameter of, say, an hour or so. Also, users
can tune the freeze parameters if they want to retain tuples for
longer.
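
If we go the table-level route, usage could look roughly like this
(vacuum_committs_age is only the proposed name; no such reloption
exists today):

-- hypothetical reloption sketch: dead tuples whose commit timestamp
-- is newer than the configured age are retained by pruning/vacuum
ALTER TABLE t1 SET (vacuum_committs_age = '5min');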

> > but IIUC this value doesn’t need to be significant; it
> > can be limited to just a few minutes. The one which is sufficient to
> > handle replication delays caused by network lag or other factors,
> > assuming clock skew has already been addressed.
>
> I think that in a non-bidirectional case the value could need to be a
> large number. Is that right?
>

As per my understanding, even for non-bidirectional cases, the value
should be small. For example, in the case pointed out by Shveta [1],
where the updates from two nodes are received by a third node, this
setting is expected to be small. This setting primarily deals with
concurrent transactions on multiple nodes, so it should be small, but
I could be missing something.

[1] - 
https://www.postgresql.org/message-id/CAJpy0uAzzOzhXGH-zBc7Zt8ndXRf6r4OnLzgRrHyf8cvd%2Bfpwg%40mail.gmail.com

-- 
With Regards,
Amit Kapila.




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-09-16 Thread Amit Kapila
On Mon, Sep 16, 2024 at 3:31 PM Bharath Rupireddy
 wrote:
>
> Please find the attached v46 patch having changes for the above review
> comments and your test review comments and Shveta's review comments.
>

-ReplicationSlotAcquire(const char *name, bool nowait)
+ReplicationSlotAcquire(const char *name, bool nowait, bool error_if_invalid)
 {
  ReplicationSlot *s;
  int active_pid;
@@ -615,6 +620,22 @@ retry:
  /* We made this slot active, so it's ours now. */
  MyReplicationSlot = s;

+ /*
+ * An error is raised if error_if_invalid is true and the slot has been
+ * previously invalidated due to inactive timeout.
+ */
+ if (error_if_invalid &&
+ s->data.invalidated == RS_INVAL_INACTIVE_TIMEOUT)
+ {
+ Assert(s->inactive_since > 0);
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("can no longer get changes from replication slot \"%s\"",
+ NameStr(s->data.name)),
+ errdetail("This slot has been invalidated because it was inactive
for longer than the amount of time specified by \"%s\".",
+"replication_slot_inactive_timeout")));
+ }

Why raise the ERROR just for timeout invalidation here and why not if
the slot is invalidated for other reasons? This raises the question of
what happens before this patch if the invalid slot is used from places
where we call ReplicationSlotAcquire(). I did a brief code analysis
and found that for StartLogicalReplication(), even if the error won't
occur in ReplicationSlotAcquire(), it would have been caught in
CreateDecodingContext(). I think that is where we should also add this
new error. Similarly, pg_logical_slot_get_changes_guts() and other
logical replication functions should be calling
CreateDecodingContext() which can raise the new ERROR. I am not sure
how invalid slots are handled during physical replication; please
check what that behavior is before this patch.

-- 
With Regards,
Amit Kapila.




Re: Allow logical failover slots to wait on synchronous replication

2024-09-16 Thread Amit Kapila
On Mon, Sep 16, 2024 at 2:55 PM shveta malik  wrote:
>
> On Mon, Sep 16, 2024 at 11:13 AM Amit Kapila  wrote:
> >
>
> > Another question aside from the above point, what if someone has
> > specified logical subscribers in synchronous_standby_names? In the
> > case of synchronized_standby_slots, we won't proceed with such slots.
> >
>
> Yes, it is a possibility. I have missed this point earlier. Now I
> tried a case where I give a mix of logical subscriber and physical
> standby in 'synchronous_standby_names' on pgHead, it even takes that
> 'mix' configuration and starts waiting accordingly.
>
> synchronous_standby_names = 'FIRST 2(logicalsub_1, phy_standby_1,
> phy_standby_2)';
>

This should not happen as we don't support syncing failover slots on
logical subscribers. The other point to consider here is that the user
may not have set 'sync_replication_slots' on all the physical standbys
mentioned in 'synchronous_standby_names' and in that case, it doesn't
make sense to wait for WAL to get flushed on those standbys. What do
you think?
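
For reference, a physical standby that should actually sync failover
slots typically needs at least the following settings (a minimal
sketch; primary_conninfo/primary_slot_name setup is omitted):

-- existing GUCs relevant on the physical standby for failover-slot sync
ALTER SYSTEM SET sync_replication_slots = on;
ALTER SYSTEM SET hot_standby_feedback = on;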

-- 
With Regards,
Amit Kapila.




Re: Allow logical failover slots to wait on synchronous replication

2024-09-15 Thread Amit Kapila
On Fri, Sep 13, 2024 at 3:13 PM shveta malik  wrote:
>
> On Thu, Sep 12, 2024 at 3:04 PM shveta malik  wrote:
> >
> > On Wed, Sep 11, 2024 at 2:40 AM John H  wrote:
> > >
> > > Hi Shveta,
> > >
> > > On Sun, Sep 8, 2024 at 11:16 PM shveta malik  
> > > wrote:
> > >
> > > >
> > > > I was trying to have a look at the patch again, it doesn't apply on
> > > > the head, needs rebase.
> > > >
> > >
> > > Rebased with the latest changes.
> > >
> > > > Regarding 'mode = SyncRepWaitMode', FWIW, SyncRepWaitForLSN() also
> > > > does in a similar way. It gets mode in local var initially and uses it
> > > > later. See [1]. So isn't there a chance too that 'SyncRepWaitMode' can
> > > > change in between.
> > > >
> > > > [1]:
> > > > mode = SyncRepWaitMode;
> > > > .
> > > > 
> > > > if (!WalSndCtl->sync_standbys_defined ||
> > > > lsn <= WalSndCtl->lsn[mode])
> > > > {
> > > > LWLockRelease(SyncRepLock);
> > > > return;
> > > > }
> > >
> > > You are right, thanks for the correction. I tried reproducing with GDB
> > > where SyncRepWaitMode
> > > changes due to pg_ctl reload but was unable to do so. It seems like
> > > SIGHUP only sets ConfigReloadPending = true,
> > > which gets processed in the next loop in WalSndLoop() and that's
> > > probably where I was getting confused.
> >
> > yes, SIGHUP will be processed in the caller of
> > StandbySlotsHaveCaughtup() (see ProcessConfigFile() in
> > WaitForStandbyConfirmation()). So we can use 'SyncRepWaitMode'
> > directly as it is not going to change in StandbySlotsHaveCaughtup()
> > even if user triggers the change. And thus it was okay to use it even
> > in the local variable too similar to how SyncRepWaitForLSN() does it.
> >
> > > In the latest patch, I've added:
> > >
> > > Assert(SyncRepWaitMode >= 0);
> > >
> > > which should be true since we call SyncRepConfigured() at the
> > > beginning of StandbySlotsHaveCaughtup(),
> > > and used SyncRepWaitMode directly.
> >
> > Yes, it should be okay I think. As SyncRepRequested() in the beginning
> > makes sure synchronous_commit > SYNCHRONOUS_COMMIT_LOCAL_FLUSH and
> > thus SyncRepWaitMode should be mapped to either of
> > WAIT_WRITE/FLUSH/APPLY etc. Will review further.
> >
>
> I was wondering if we need somethign similar to SyncRepWaitForLSN() here:
>
> /* Cap the level for anything other than commit to remote flush only. 
> */
> if (commit)
> mode = SyncRepWaitMode;
> else
> mode = Min(SyncRepWaitMode, SYNC_REP_WAIT_FLUSH);
>
> The header comment says:
>  * 'lsn' represents the LSN to wait for.  'commit' indicates whether this LSN
>  * represents a commit record.  If it doesn't, then we wait only for the WAL
>  * to be flushed if synchronous_commit is set to the higher level of
>  * remote_apply, because only commit records provide apply feedback.
>
> If we don't do something similar, then aren't there chances that we
> keep on waiting on the wrong lsn[mode] for the case when mode =
> SYNC_REP_WAIT_APPLY while sync-rep-wait infrastructure is updating
> different mode's lsn. Is my understanding correct?
>

I think here we always need the lsn value corresponding to
SYNC_REP_WAIT_FLUSH, as we want to ensure that the WAL is flushed on
the physical standby before the changes are sent to the logical
subscriber. See ProcessStandbyReplyMessage(), where we always use
flushPtr to advance the physical slot via
PhysicalConfirmReceivedLocation().

Another question aside from the above point, what if someone has
specified logical subscribers in synchronous_standby_names? In the
case of synchronized_standby_slots, we won't proceed with such slots.

-- 
With Regards,
Amit Kapila.




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-09-15 Thread Amit Kapila
On Tue, Sep 10, 2024 at 12:13 AM Bharath Rupireddy
 wrote:
>
> On Mon, Sep 9, 2024 at 3:04 PM Amit Kapila  wrote:
> >
> > > > > > We should not allow the invalid replication slot to be altered
> > > > > > irrespective of the reason unless there is any benefit.
> > > > >
> > > > > Okay, then I think we need to change the existing behaviour of the
> > > > > other invalidation causes which still allow alter-slot.
> > > >
> > > > +1. Perhaps, track it in a separate thread?
> > >
> > > I think so. It does not come under the scope of this thread.
> >
> > It makes sense to me as well. But let's go ahead and get that sorted out 
> > first.
>
> Moved the discussion to new thread -
> https://www.postgresql.org/message-id/CALj2ACW4fSOMiKjQ3%3D2NVBMTZRTG8Ujg6jsK9z3EvOtvA4vzKQ%40mail.gmail.com.
> Please have a look.
>

That is pushed now. Please send the rebased patch after addressing the
pending comments.

-- 
With Regards,
Amit Kapila.




Re: Using per-transaction memory contexts for storing decoded tuples

2024-09-13 Thread Amit Kapila
r top-level xact?

> We always evict sub-transactions and its top-level
> transaction at the same time, I think it would not be a problem.
> Checking performance degradation due to using many memory contexts
> would have to be done.
>

The generation context was introduced in commit a4ccc1ce, which
claims that it significantly reduced logical decoding's memory
usage and improved its performance. Won't changing it require us to
validate all the cases that led us to use the generation context in
the first place?

-- 
With Regards,
Amit Kapila.




Re: Disallow altering invalidated replication slots

2024-09-12 Thread Amit Kapila
On Thu, Sep 12, 2024 at 4:24 PM Amit Kapila  wrote:
>
> On Wed, Sep 11, 2024 at 8:41 AM shveta malik  wrote:
> >
> > On Tue, Sep 10, 2024 at 11:24 PM Bharath Rupireddy
> >  wrote:
> > >
> > >
> > > Please find the attached v2 patch also having Shveta's review comments
> > > addressed.
> >
> > The v2 patch looks good to me.
> >
>
> LGTM as well. I'll push this tomorrow morning unless there are more
> comments or suggestions.
>

Pushed.

-- 
With Regards,
Amit Kapila.




Re: Conflict detection for update_deleted in logical replication

2024-09-12 Thread Amit Kapila
On Wed, Sep 11, 2024 at 11:07 AM Zhijie Hou (Fujitsu)
 wrote:
>
> On Wednesday, September 11, 2024 1:03 PM shveta malik 
>  wrote:
> >
> > > >
> > > > Another query is about 3 node setup. I couldn't figure out what
> > > > would be feedback_slots setting when it is not bidirectional, as in
> > > > consider the case where there are three nodes A,B,C. Node C is
> > > > subscribing to both Node A and Node B. Node A and Node B are the
> > > > ones doing concurrent "update" and "delete" which will both be
> > > > replicated to Node C. In this case what will be the feedback_slots
> > > > setting on Node C? We don't have any slots here which will be
> > > > replicating changes from Node C to Node A and Node C to Node B. This
> > > > is given in [3] in your first email ([1])
> > >
> > > Thanks for pointing this, the link was a bit misleading. I think the
> > > solution proposed in this thread is only used to allow detecting
> > > update_deleted reliably in a bidirectional cluster.  For non-
> > > bidirectional cases, it would be more tricky to predict the timing till 
> > > when
> > should we retain the dead tuples.
> > >
> >
> > So in brief, this solution is only for bidrectional setup? For 
> > non-bidirectional,
> > feedback_slots is non-configurable and thus irrelevant.
>
> Right.
>

One possible idea to address the non-bidirectional case raised by
Shveta is to use a time-based cut-off to remove dead tuples. As
mentioned earlier in my email [1], we can define a new GUC parameter,
say vacuum_committs_age, which would indicate that a row is allowed to
be removed only if the time elapsed since its modification (as
indicated by the commit timestamp module) is greater than
vacuum_committs_age. We could make this parameter a table-level option
without introducing a GUC, as it may not apply to all tables. I
checked and found that some other replication solutions like
GoldenGate also allow similar parameters (tombstone_deletes) to be
specified at the table level [2]. The other advantage of allowing it
at the table level is that it won't hamper the performance of
hot-pruning or vacuum in general. Note that I am being careful here
because, to decide whether to remove a dead tuple, we need to compare
its commit timestamp during both hot-pruning and vacuum.
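
To make the table-level idea concrete, a hypothetical sketch (the
option name, value format, and the use of reloptions are all
assumptions at this stage, not a settled design):

ALTER TABLE conf_tab SET (vacuum_committs_age = 300);
-- or at creation time:
CREATE TABLE conf_tab (id int primary key, value int)
    WITH (vacuum_committs_age = 300);

Here, dead tuples in conf_tab would become removable by hot-pruning or
vacuum only once at least 300 seconds have elapsed since their commit
timestamp.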

Note that tombstone_deletes is a general concept used by replication
solutions to detect the update_deleted conflict, and time-based
purging is recommended. See [3][4]. We previously discussed having
tombstone tables to keep information about deleted records, but it was
suggested that preventing vacuum from removing the required dead
tuples would be simpler than inventing a new kind of table/store for
tombstone_deletes [5]. So, we came up with the idea of feedback slots
discussed in this email, but that didn't work out in all cases and
appears difficult to configure, as pointed out by Shveta. So, now, we
are back to one of the other ideas [1] discussed previously to solve
this problem.

Thoughts?

[1] - 
https://www.postgresql.org/message-id/CAA4eK1Lj-PWrP789KnKxZydisHajd38rSihWXO8MVBLDwxG1Kg%40mail.gmail.com
[2] -
BEGIN
  DBMS_GOLDENGATE_ADM.ALTER_AUTO_CDR(
schema_name   => 'hr',
table_name=> 'employees',
tombstone_deletes => TRUE);
END;
/
[3] - https://en.wikipedia.org/wiki/Tombstone_(data_store)
[4] - 
https://docs.oracle.com/en/middleware/goldengate/core/19.1/oracle-db/automatic-conflict-detection-and-resolution1.html#GUID-423C6EE8-1C62-4085-899C-8454B8FB9C92
[5] - 
https://www.postgresql.org/message-id/e4cdb849-d647-4acf-aabe-7049ae170fbf%40enterprisedb.com

-- 
With Regards,
Amit Kapila.




Re: Disallow altering invalidated replication slots

2024-09-12 Thread Amit Kapila
On Wed, Sep 11, 2024 at 8:41 AM shveta malik  wrote:
>
> On Tue, Sep 10, 2024 at 11:24 PM Bharath Rupireddy
>  wrote:
> >
> >
> > Please find the attached v2 patch also having Shveta's review comments
> > addressed.
>
> The v2 patch looks good to me.
>

LGTM as well. I'll push this tomorrow morning unless there are more
comments or suggestions.

-- 
With Regards,
Amit Kapila.




Re: Conflict detection for update_deleted in logical replication

2024-09-11 Thread Amit Kapila
On Wed, Sep 11, 2024 at 8:32 AM Zhijie Hou (Fujitsu)
 wrote:
>
> On Tuesday, September 10, 2024 7:25 PM Amit Kapila  
> wrote:
> >
> > One minor comment on 0003
> > ===
> > 1.
> > get_slot_confirmed_flush()
> > {
> > ...
> > + /*
> > + * To prevent concurrent slot dropping and creation while filtering the
> > + * slots, take the ReplicationSlotControlLock outside of the loop.
> > + */
> > + LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
> > +
> > + foreach_ptr(String, name, MySubscription->feedback_slots) { XLogRecPtr
> > + confirmed_flush; ReplicationSlot *slot;
> > +
> > + slot = ValidateAndGetFeedbackSlot(strVal(name));
> >
> > Why do we need to validate slots each time here? Isn't it better to do it 
> > once?
>
> I think it's possible that the slot was correct but changed or dropped later,
> so it could be useful to give a warning in this case to hint user to adjust 
> the
> slots, otherwise, the xmin of the publisher's slot won't be advanced and might
> cause dead tuples accumulation. This is similar to the checks we performed for
> the slots in "synchronized_standby_slots". (E.g. StandbySlotsHaveCaughtup)
>

In the case of "synchronized_standby_slots", we seem to be invoking
such checks via StandbySlotsHaveCaughtup() when we need to wait for
WAL and also we have some optimizations that avoid the frequent
checking for validation checks. OTOH, this patch doesn't have any such
optimizations. We can optimize it by maintaining a local copy of
feedback slots to avoid looping all the slots each time (if this is
required, we can make it a top-up patch so that it can be reviewed
separately). I have also thought of maintaining the updated value of
confirmed_flush_lsn for feedback slots corresponding to a subscription
in shared memory but that seems tricky because then we have to
maintain slot->subscription mapping. Can you think of any other ways?

Having said that, it would be better to profile this in various
scenarios, e.g., by increasing the frequency of keepalive messages
and/or in idle-subscriber cases where we try to send this new feedback
message.

-- 
With Regards,
Amit Kapila.




Re: Add contrib/pg_logicalsnapinspect

2024-09-10 Thread Amit Kapila
On Tue, Sep 10, 2024 at 8:56 PM Bertrand Drouvot
 wrote:
>
> On Mon, Sep 09, 2024 at 04:24:09PM +0530, Amit Kapila wrote:
> > On Fri, Aug 30, 2024 at 5:18 PM Bertrand Drouvot
> >  wrote:
> > > as we decided not to expose the SnapBuildOnDisk and SnapBuild structs to 
> > > public
> > > and to create/expose 2 new functions in snapbuild.c then the functions in 
> > > the
> > > module would do nothing but expose the data coming from the snapbuild.c's
> > > functions (get the tuple and send it to the client). That sounds weird to 
> > > me to
> > > create a module that would "only" do so, that's why I thought that in core
> > > functions taking care of everything make more sense.
> > >
> >
> > I see your point. Does anyone else have an opinion on the need for
> > these functions and whether to expose them from a contrib module or
> > have them as core functions?
>
> I looked at when the SNAPBUILD_VERSION has been changed:
>
> ec5896aed3 (2014)
> a975ff4980 (2021)
> 8bdb1332eb (2021)
> 7f13ac8123 (2022)
> bb19b70081 (2024)
>
> So it's not like we are changing the SnapBuildOnDisk or SnapBuild structs that
> frequently. Furthermore, those structs are serialized and so we have to 
> preserve
> their on-disk compatibility (means we can change them only in a major release
> if we need to).
>
> So, I think it would not be that much of an issue to expose those structs and
> create a new contrib module (as v1 did propose) instead of in core new 
> functions.
>
> If we want to insist that external modules "should" not rely on those structs 
> then
> we could put them into a new internal_snapbuild.h file (instead of snapbuild.h
> as proposed in v1).
>

Adding snapbuild_internal.h sounds like a good idea.

> At the end, I think that creating a contrib module and exposing those structs 
> in
> internal_snapbuild.h make more sense (as compared to in core functions).
>

Fair enough. We can keep the module name as logicalinspect so that we
can extend it for other logical decoding/replication-related files in
the future.
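
For instance, usage could end up looking roughly like this (a sketch
only; the extension and function names follow the proposal in this
thread and may differ in what finally gets committed):

CREATE EXTENSION pg_logicalinspect;
SELECT * FROM pg_get_logical_snapshot_meta('0-40796E18.snap');
SELECT * FROM pg_get_logical_snapshot_info('0-40796E18.snap');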


-- 
With Regards,
Amit Kapila.




Re: Invalid Assert while validating REPLICA IDENTITY?

2024-09-10 Thread Amit Kapila
On Tue, Sep 10, 2024 at 2:16 PM Amit Kapila  wrote:
>
> On Tue, Sep 10, 2024 at 11:36 AM vignesh C  wrote:
> >
> > On Mon, 9 Sept 2024 at 13:12, Amit Kapila  wrote:
> > >
> > > The second part of the assertion is incomplete. The
> > > IsIndexUsableForReplicaIdentityFull() should be used only when the
> > > remote relation has REPLICA_IDENTITY_FULL set. I haven't tested all
> > > possible cases yet but I think the attached should be a better way to
> > > write this assertion.
> >
> > The changes look good to me.
> >
>
> Thanks, I'll push this tomorrow unless Dilip or anyone else has any
> comments. BTW, as the current code doesn't lead to any bug or
> assertion failure, I am planning to push this change to HEAD only, let
> me know if you think otherwise.
>

Pushed.

-- 
With Regards,
Amit Kapila.




Re: Conflict detection for update_deleted in logical replication

2024-09-10 Thread Amit Kapila
On Thu, Sep 5, 2024 at 5:07 PM Zhijie Hou (Fujitsu)
 wrote:
>
> ---
> ISSUES and SOLUTION
> ---
>
> To detect update_deleted conflicts, we need to search for dead tuples in the
> table. However, dead tuples can be removed by VACUUM at any time. Therefore, 
> to
> ensure consistent and accurate conflict detection, tuples deleted by other
> origins must not be removed by VACUUM before the conflict detection process. 
> If
> the tuples are removed prematurely, it might lead to incorrect conflict
> identification and resolution, causing data divergence between nodes.
>
> Here is an example of how VACUUM could affect conflict detection and how to
> prevent this issue. Assume we have a bidirectional cluster with two nodes, A
> and B.
>
> Node A:
>   T1: INSERT INTO t (id, value) VALUES (1,1);
>   T2: DELETE FROM t WHERE id = 1;
>
> Node B:
>   T3: UPDATE t SET value = 2 WHERE id = 1;
>
> To retain the deleted tuples, the initial idea was that once transaction T2 
> had
> been applied to both nodes, there was no longer a need to preserve the dead
> tuple on Node A. However, a scenario arises where transactions T3 and T2 occur
> concurrently, with T3 committing slightly earlier than T2. In this case, if
> Node B applies T2 and Node A removes the dead tuple (1,1) via VACUUM, and then
> Node A applies T3 after the VACUUM operation, it can only result in an
> update_missing conflict. Given that the default resolution for update_missing
> conflicts is apply_or_skip (e.g. convert update to insert if possible and 
> apply
> the insert), Node A will eventually hold a row (1,2) while Node B becomes
> empty, causing data inconsistency.
>
> Therefore, the strategy needs to be expanded as follows: Node A cannot remove
> the dead tuple until:
> (a) The DELETE operation is replayed on all remote nodes, *AND*
> (b) The transactions on logical standbys occurring before the replay of Node
> A's DELETE are replayed on Node A as well.
>
> ---
> THE DESIGN
> ---
>
> To achieve the above, we plan to allow the logical walsender to maintain and
> advance the slot.xmin to protect the data in the user table and introduce a 
> new
> logical standby feedback message. This message reports the WAL position that
> has been replayed on the logical standby *AND* the changes occurring on the
> logical standby before the WAL position are also replayed to the walsender's
> node (where the walsender is running). After receiving the new feedback
> message, the walsender will advance the slot.xmin based on the flush info,
> similar to the advancement of catalog_xmin. Currently, the effective_xmin/xmin
> of logical slot are unused during logical replication, so I think it's safe 
> and
> won't cause side-effect to reuse the xmin for this feature.
>
> We have introduced a new subscription option (feedback_slots='slot1,...'),
> where these slots will be used to check condition (b): the transactions on
> logical standbys occurring before the replay of Node A's DELETE are replayed 
> on
> Node A as well. Therefore, on Node B, users should specify the slots
> corresponding to Node A in this option. The apply worker will get the oldest
> confirmed flush LSN among the specified slots and send the LSN as a feedback
> message to the walsender. -- I also thought of making it an automaic way, e.g.
> let apply worker select the slots that acquired by the walsenders which 
> connect
> to the same remote server(e.g. if apply worker's connection info or some other
> flags is same as the walsender's connection info). But it seems tricky because
> if some slots are inactive which means the walsenders are not there, the apply
> worker could not find the correct slots to check unless we save the host along
> with the slot's persistence data.
>
> The new feedback message is sent only if feedback_slots is not NULL.
>

Don't you need to deal with versioning stuff for sending this new
message? I mean what if the receiver side of this message is old and
doesn't support this new message.

One minor comment on 0003
===
1.
get_slot_confirmed_flush()
{
...
+ /*
+ * To prevent concurrent slot dropping and creation while filtering the
+ * slots, take the ReplicationSlotControlLock outside of the loop.
+ */
+ LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
+
+ foreach_ptr(String, name, MySubscription->feedback_slots)
+ {
+ XLogRecPtr confirmed_flush;
+ ReplicationSlot *slot;
+
+ slot = ValidateAndGetFeedbackSlot(strVal(name));

Why do we need to validate slots each time here? Isn't it better to do it once?

-- 
With Regards,
Amit Kapila.




Re: Invalid Assert while validating REPLICA IDENTITY?

2024-09-10 Thread Amit Kapila
On Tue, Sep 10, 2024 at 11:36 AM vignesh C  wrote:
>
> On Mon, 9 Sept 2024 at 13:12, Amit Kapila  wrote:
> >
> > The second part of the assertion is incomplete. The
> > IsIndexUsableForReplicaIdentityFull() should be used only when the
> > remote relation has REPLICA_IDENTITY_FULL set. I haven't tested all
> > possible cases yet but I think the attached should be a better way to
> > write this assertion.
>
> The changes look good to me.
>

Thanks, I'll push this tomorrow unless Dilip or anyone else has any
comments. BTW, as the current code doesn't lead to any bug or
assertion failure, I am planning to push this change to HEAD only, let
me know if you think otherwise.

With Regards,
Amit Kapila.




Re: Disallow altering invalidated replication slots

2024-09-09 Thread Amit Kapila
On Tue, Sep 10, 2024 at 8:40 AM Peter Smith  wrote:
>
> Hi, here are some review comments for patch v1.
>
> ==
> Commit message
>
> 1.
> ALTER_REPLICATION_SLOT on invalidated replication slots is unnecessary
> as there is no way...
>
> suggestion:
> ALTER_REPLICATION_SLOT for invalid replication slots should not be
> allowed because there is no way...
>
> ==
> 2. Missing docs update
>
> Should this docs page [1] be updated to say ALTER_REPLICATION_SLOT is
> not allowed for invalid slots?
>
> ==
> src/backend/replication/slot.c
>
> 3.
> + if (MyReplicationSlot->data.invalidated != RS_INVAL_NONE)
> + ereport(ERROR,
> + errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> + errmsg("cannot alter replication slot \"%s\"", name),
> + errdetail("This replication slot was invalidated due to \"%s\".",
> +   SlotInvalidationCauses[MyReplicationSlot->data.invalidated]));
> +
>
> I thought including the reason "invalid" (e.g. "cannot alter invalid
> replication slot \"%s\"") in the message might be better,
>

Agreed. I could see a similar case with a message ("cannot alter
invalid database \"%s\"") in the existing code. Additionally, we
should also incorporate Shveta's suggestion to make the detail message
consistent with other similar messages in logical.c.
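
With that, the failure on the subscriber side would look something
like the following (a sketch of the proposed wording only; the final
committed text may differ):

ALTER SUBSCRIPTION sub1 SET (failover = false);
ERROR:  could not alter replication slot "sub1": ERROR:  cannot alter
invalid replication slot "sub1"
DETAIL:  This replication slot was invalidated due to "wal_removed".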

-- 
With Regards,
Amit Kapila.




Re: Pgoutput not capturing the generated columns

2024-09-09 Thread Amit Kapila
On Tue, Sep 10, 2024 at 2:51 AM Masahiko Sawada  wrote:
>
> On Mon, Sep 9, 2024 at 2:38 AM Shubham Khanna
>  wrote:
> >
>
> Thank you for updating the patches. I have some comments:
>
> Do we really need to add this option to test_decoding?
>

I don't see any reason to have such an option in test_decoding;
otherwise, we would need a separate option for each publication
option. I guess this is a leftover of the previous subscriber-side
approach.

> I think it
> would be good if this improves the test coverage. Otherwise, I'm not
> sure we need this part. If we want to add it, I think it would be
> better to have it in a separate patch.
>

Right.

> ---
> + 
> +  If the publisher-side column is also a generated column
> then this option
> +  has no effect; the publisher column will be filled as normal with 
> the
> +  publisher-side computed or default data.
> + 
>
> I don't understand this description. Why does this option have no
> effect if the publisher-side column is a generated column?
>

Shouldn't it be subscriber-side?

I have one additional comment:
/*
- * If the publication is FOR ALL TABLES then it is treated the same as
- * if there are no column lists (even if other publications have a
- * list).
+ * If the publication is FOR ALL TABLES and include generated columns
+ * then it is treated the same as if there are no column lists (even
+ * if other publications have a list).
  */
- if (!pub->alltables)
+ if (!pub->alltables || !pub->pubgencolumns)

Why do we treat pubgencolumns at the same level as the FOR ALL TABLES
case? I thought that if the user has provided a column list, we only
need to publish the specified columns even when the
publish_generated_columns option is set.
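
To illustrate the question (publish_generated_columns is the option
name proposed in this thread, and the behavior below is exactly what
is being asked about, not something already decided):

CREATE TABLE t1 (a int, b int GENERATED ALWAYS AS (a * 2) STORED);
CREATE PUBLICATION pub1 FOR TABLE t1 (a)
    WITH (publish_generated_columns = true);
-- Shouldn't only column "a" be published here, given the explicit
-- column list, even though the option is set?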

-- 
With Regards,
Amit Kapila.




Re: Add contrib/pg_logicalsnapinspect

2024-09-09 Thread Amit Kapila
On Fri, Aug 30, 2024 at 5:18 PM Bertrand Drouvot
 wrote:
>
> On Fri, Aug 30, 2024 at 03:43:12PM +0530, Amit Kapila wrote:
> > On Thu, Aug 29, 2024 at 6:33 PM Bharath Rupireddy
> >  wrote:
> > >
> > > On Thu, Aug 29, 2024 at 3:44 PM Bertrand Drouvot
> > >  wrote:
> > > >
> > > > Yeah that's fair. And now I'm wondering if we need an extra module. I 
> > > > think
> > > > we could "simply" expose 2 new functions in core, thoughts?
> > > >
> > > > > > What do you think? Did you have something else in mind?
> > > > > >
> > > > >
> > > > > On similar lines, we can also provide a function to get the slot's
> > > > > on-disk data.
> > > >
> > > > Yeah, having a way to expose the data from the disk makes fully sense 
> > > > to me.
> > > >
> > > > > IIRC, Bharath had previously proposed a tool to achieve
> > > > > the same. It is fine if we don't want to add that as part of this
> > > > > patch but I mentioned it because by having that we can have a set of
> > > > > functions to view logical decoding data.
> > > >
> > > > That's right. I think this one would be simply enough to expose one or 
> > > > two
> > > > functions in core too (and probably would not need an extra module).
> > >
> > > +1 for functions in core unless this extra module
> > > pg_logicalsnapinspect works as a tool to be helpful even when the
> > > server is down.
> > >
> >
> > We have an example of pageinspect which provides low-level functions
> > to aid debugging. The proposal for these APIs also seems to fall in
> > the same category,
>
> That's right, but...
>
> >  so why go for the core for these functions?
>
> as we decided not to expose the SnapBuildOnDisk and SnapBuild structs to 
> public
> and to create/expose 2 new functions in snapbuild.c then the functions in the
> module would do nothing but expose the data coming from the snapbuild.c's
> functions (get the tuple and send it to the client). That sounds weird to me 
> to
> create a module that would "only" do so, that's why I thought that in core
> functions taking care of everything make more sense.
>

I see your point. Does anyone else have an opinion on the need for
these functions and whether to expose them from a contrib module or
have them as core functions?

-- 
With Regards,
Amit Kapila.




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-09-09 Thread Amit Kapila
On Mon, Sep 9, 2024 at 10:28 AM shveta malik  wrote:
>
> On Mon, Sep 9, 2024 at 10:26 AM Bharath Rupireddy
>  wrote:
> >
> > Hi,
> >
> > On Mon, Sep 9, 2024 at 9:17 AM shveta malik  wrote:
> > >
> > > > We should not allow the invalid replication slot to be altered
> > > > irrespective of the reason unless there is any benefit.
> > >
> > > Okay, then I think we need to change the existing behaviour of the
> > > other invalidation causes which still allow alter-slot.
> >
> > +1. Perhaps, track it in a separate thread?
>
> I think so. It does not come under the scope of this thread.
>

It makes sense to me as well. But let's go ahead and get that sorted out first.

-- 
With Regards,
Amit Kapila.




Re: Invalid Assert while validating REPLICA IDENTITY?

2024-09-09 Thread Amit Kapila
On Mon, Sep 9, 2024 at 11:44 AM Dilip Kumar  wrote:
>
> On Fri, Sep 6, 2024 at 4:48 PM vignesh C  wrote:
> >
> > On Mon, 2 Sept 2024 at 18:22, Dilip Kumar  wrote:
> > >
> > > On Mon, Sep 2, 2024 at 3:32 PM Amit Kapila  
> > > wrote:
> > > >
> > > > On Mon, Sep 2, 2024 at 11:21 AM Dilip Kumar  
> > > > wrote:
> > > > >
> > > > > While working on some other code I noticed that in
> > > > > FindReplTupleInLocalRel() there is an assert [1] that seems to be
> > > > > passing IndexRelation to GetRelationIdentityOrPK() whereas it should
> > > > > be passing normal relation.
> > > > >
> > > >
> > > > Agreed. But this should lead to assertion failure. Did you try testing 
> > > > it?
> > >
> > > No, I did not test this particular case, it impacted me with my other
> > > addition of the code where I got Index Relation as input to the
> > > RelationGetIndexList() function, and my local changes were impacted by
> > > that.  I will write a test for this stand-alone case so that it hits
> > > the assert.  Thanks for looking into this.
> >
> > The FindReplTupleInLocalRel function can be triggered by both update
> > and delete operations, but this only occurs if the relation has been
> > marked as updatable by the logicalrep_rel_mark_updatable function. If
> > the relation is marked as non-updatable, an error will be thrown by
> > check_relation_updatable. Given this, if a relation is updatable, the
> > IsIndexUsableForReplicaIdentityFull condition might always evaluate to
> > true due to the previous checks in logicalrep_rel_mark_updatable.
> > Therefore, it’s possible that we might not encounter the Assert
> > statement, as IsIndexUsableForReplicaIdentityFull may consistently be
> > true.
> > Thoughts?
>
> With that it seems that the first Assert condition is useless isn't it?
>

The second part of the assertion is incomplete. The
IsIndexUsableForReplicaIdentityFull() should be used only when the
remote relation has REPLICA_IDENTITY_FULL set. I haven't tested all
possible cases yet but I think the attached should be a better way to
write this assertion.
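
For reference, a sketch of the kind of setup that reaches
FindReplTupleInLocalRel() with REPLICA IDENTITY FULL on the remote
side (the connection string and object names are placeholders):

-- Publisher:
CREATE TABLE t (a int, b int);
ALTER TABLE t REPLICA IDENTITY FULL;
CREATE PUBLICATION pub1 FOR TABLE t;
INSERT INTO t VALUES (1, 1);

-- Subscriber (no primary key or replica identity index, only a plain
-- index that IsIndexUsableForReplicaIdentityFull() can pick):
CREATE TABLE t (a int, b int);
CREATE INDEX t_b_idx ON t (b);
CREATE SUBSCRIPTION sub1 CONNECTION 'dbname=postgres' PUBLICATION pub1;

-- Publisher: this drives FindReplTupleInLocalRel() on the subscriber.
UPDATE t SET b = 2 WHERE a = 1;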

-- 
With Regards,
Amit Kapila.


v1_improve_assert_worker.patch
Description: Binary data


Re: long-standing data loss bug in initial sync of logical replication

2024-09-05 Thread Amit Kapila
On Mon, Sep 2, 2024 at 9:19 PM Nitin Motiani  wrote:
>
> I think that the partial data replication for one table is a bigger
> issue than the case of data being sent for a subset of the tables in
> the transaction. This can lead to inconsistent data if the same row is
> updated multiple times or deleted in the same transaction. In such a
> case if only the partial updates from the transaction are sent to the
> subscriber, it might end up with the data which was never visible on
> the publisher side.
>
> Here is an example I tried with the patch v8-001 :
>
> I created following 2 tables on the publisher and the subscriber :
>
> CREATE TABLE delete_test(id int primary key, name varchar(100));
> CREATE TABLE update_test(id int primary key, name varchar(100));
>
> I added both the tables to the publication p on the publisher and
> created a subscription s on the subscriber.
>
> I run 2 sessions on the publisher and do the following :
>
> Session 1 :
> BEGIN;
> INSERT INTO delete_test VALUES(0, 'Nitin');
>
> Session 2 :
> ALTER PUBLICATION p DROP TABLE delete_test;
>
> Session 1 :
> DELETE FROM delete_test WHERE id=0;
> COMMIT;
>
> After the commit there should be no new row created on the publisher.
> But because the partial data was replicated, this is what the select
> on the subscriber shows :
>
> SELECT * FROM delete_test;
>  id |   name
> +---
>   0 | Nitin
> (1 row)
>
> I don't think the above is a common use case. But this is still an
> issue because the subscriber has the data which never existed on the
> publisher.
>

I don't think that is the correct conclusion because the user has
intentionally avoided sending part of the transaction changes. This
can happen in various ways without the patch as well. For example, if
the user has performed the ALTER in the same transaction.

Publisher:
=
BEGIN
postgres=*# Insert into delete_test values(0, 'Nitin');
INSERT 0 1
postgres=*# Alter Publication pub1 drop table delete_test;
ALTER PUBLICATION
postgres=*# Delete from delete_test where id=0;
DELETE 1
postgres=*# commit;
COMMIT
postgres=# select * from delete_test;
 id | name
----+------
(0 rows)

Subscriber:
=
postgres=# select * from delete_test;
 id | name
----+-------
  0 | Nitin
(1 row)

This can also happen when the user has published only 'inserts' but
not 'updates' or 'deletes'.
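
A sketch of that publish-option case, reusing the table from the
example above:

CREATE PUBLICATION pub_ins FOR TABLE delete_test
    WITH (publish = 'insert');
INSERT INTO delete_test VALUES (0, 'Nitin');
DELETE FROM delete_test WHERE id = 0;
-- Only the INSERT is replicated, so the subscriber again ends up with
-- a row that no longer exists on the publisher.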

-- 
With Regards,
Amit Kapila.




Re: Commit Timestamp and LSN Inversion issue

2024-09-04 Thread Amit Kapila
On Wed, Sep 4, 2024 at 6:35 PM Aleksander Alekseev
 wrote:
>
> > > I don't think you can rely on a system clock for conflict resolution.
> > > In a corner case a DBA can move the clock forward or backward between
> > > recordings of Ts1 and Ts2. On top of that there is no guarantee that
> > > 2+ servers have synchronised clocks. It seems to me that what you are
> > > proposing will just hide the problem instead of solving it in the
> > > general case.
> > >
> >
> > It is possible that we can't rely on the system clock for conflict
> > resolution but that is not the specific point of this thread. As
> > mentioned in the subject of this thread, the problem is "Commit
> > Timestamp and LSN Inversion issue". The LSN value and timestamp for a
> > commit are not generated atomically, so two different transactions can
> > have them in different order.
>
> Hm Then I'm having difficulties understanding why this is a
> problem

This is a potential problem pointed out during the discussion of CDR
[1] (please read the point starting from "How is this going to deal
.." and the response by Shveta). The point of this thread is that
though it appears to be a problem, practically there is no scenario
where it can have an impact, even when we implement the
"last_write_wins" strategy, as explained in the initial email. If you
or someone else sees a problem due to the LSN<->timestamp inversion
then we need to explore a solution for it.

>
> and why it was necessary to mention CDR in this context in the
> first place.
>
> OK, let's forget about CDR completely. Who is affected by the current
> behavior and why would it be beneficial changing it?
>

We can't forget CDR completely as this could only be a potential
problem in that context. Right now, we don't have any built-in
resolution strategies, so this can't have an impact, but if this is a
problem then we need to have a solution for it before considering a
strategy like "last_write_wins".

Now, instead of discussing the LSN<->timestamp inversion issue, you
started to discuss the "last_write_wins" strategy itself, which we
have discussed to some extent in the thread [2]. BTW, we are planning
to start a separate thread as well just to discuss the clock-skew
problem w.r.t. resolution strategies like "last_write_wins". So, we
can discuss clock skew in that thread and keep the focus of this
thread on the LSN<->timestamp inversion problem.

[1] - 
https://www.postgresql.org/message-id/CAJpy0uBWBEveM8LO2b7wNZ47raZ9tVJw3D2_WXd8-b6LSqP6HA%40mail.gmail.com
[2] - 
https://www.postgresql.org/message-id/CAJpy0uD0-DpYVMtsxK5R%3DzszXauZBayQMAYET9sWr_w0CNWXxQ%40mail.gmail.com

-- 
With Regards,
Amit Kapila.




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-09-04 Thread Amit Kapila
On Wed, Sep 4, 2024 at 9:17 AM shveta malik  wrote:
>
> On Tue, Sep 3, 2024 at 3:01 PM shveta malik  wrote:
> >
> >
> > 1)
> > I see that ReplicationSlotAlter() will error out if the slot is
> > invalidated due to timeout. I have not tested it myself, but do you
> > know if  slot-alter errors out for other invalidation causes as well?
> > Just wanted to confirm that the behaviour is consistent for all
> > invalidation causes.
>
> I was able to test this and as anticipated behavior is different. When
> slot is invalidated due to say 'wal_removed', I am still able to do
> 'alter' of that slot.
> Please see:
>
> Pub:
>   slot_name  | failover | synced |  inactive_since  |
> invalidation_reason
> -+--++--+-
>  mysubnew1_1 | t| f  | 2024-09-04 08:58:12.802278+05:30 |
> wal_removed
>
> Sub:
> newdb1=# alter subscription mysubnew1_1 disable;
> ALTER SUBSCRIPTION
>
> newdb1=# alter subscription mysubnew1_1 set (failover=false);
> ALTER SUBSCRIPTION
>
> Pub: (failover altered)
>   slot_name  | failover | synced |  inactive_since  |
> invalidation_reason
> -+--++--+-
>  mysubnew1_1 | f| f  | 2024-09-04 08:58:47.824471+05:30 |
> wal_removed
>
>
> while when invalidation_reason is 'inactive_timeout', it fails:
>
> Pub:
>   slot_name  | failover | synced |  inactive_since  |
> invalidation_reason
> -+--++--+-
>  mysubnew1_1 | t| f  | 2024-09-03 14:30:57.532206+05:30 |
> inactive_timeout
>
> Sub:
> newdb1=# alter subscription mysubnew1_1 disable;
> ALTER SUBSCRIPTION
>
> newdb1=# alter subscription mysubnew1_1 set (failover=false);
> ERROR:  could not alter replication slot "mysubnew1_1": ERROR:  can no
> longer get changes from replication slot "mysubnew1_1"
> DETAIL:  The slot became invalid because it was inactive since
> 2024-09-04 08:54:20.308996+05:30, which is more than 0 seconds ago.
> HINT:  You might need to increase "replication_slot_inactive_timeout.".
>
> I think the behavior should be same.
>

We should not allow the invalid replication slot to be altered
irrespective of the reason unless there is any benefit.

-- 
With Regards,
Amit Kapila.




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-09-04 Thread Amit Kapila
On Wed, Sep 4, 2024 at 2:49 PM shveta malik  wrote:
>
> On Wed, Sep 4, 2024 at 9:17 AM shveta malik  wrote:
> >
> > On Tue, Sep 3, 2024 at 3:01 PM shveta malik  wrote:
> > >
> > >
>
>
> 1)
> It is related to one of my previous comments (pt 3 in [1]) where I
> stated that inactive_since should not keep on changing once a slot is
> invalidated.
>

Agreed. Updating the inactive_since for a slot that is already invalid
is misleading.

>
>
> 2)
> One more issue in this message is, once I set
> replication_slot_inactive_timeout to a bigger value, it becomes more
> misleading. This is because invalidation was done in the past using
> previous value while message starts showing new value:
>
> ALTER SYSTEM SET replication_slot_inactive_timeout TO '36h';
>
> --see 129600 secs in DETAIL and the current time.
> postgres=# SELECT * FROM pg_replication_slot_advance('mysubnew1_1',
> pg_current_wal_lsn());
> ERROR:  can no longer get changes from replication slot "mysubnew1_1"
> DETAIL:  The slot became invalid because it was inactive since
> 2024-09-04 10:06:38.980939+05:30, which is more than 129600 seconds
> ago.
> postgres=# select now();
>now
> ------
>  2024-09-04 10:07:35.201894+05:30
>
> I feel we should change this message itself.
>

+1.

-- 
With Regards,
Amit Kapila.




Re: Commit Timestamp and LSN Inversion issue

2024-09-04 Thread Amit Kapila
On Wed, Sep 4, 2024 at 2:05 PM Aleksander Alekseev
 wrote:
>
> > While discussing Logical Replication's Conflict Detection and
> > Resolution (CDR) design in [1] , it came to  our notice that the
> > commit LSN and timestamp may not correlate perfectly i.e. commits may
> > happen with LSN1 < LSN2 but with Ts1 > Ts2. This issue may arise
> > because, during the commit process, the timestamp (xactStopTimestamp)
> > is captured slightly earlier than when space is reserved in the WAL.
> >  [...]
> > There was a suggestion in [3] to acquire the timestamp while reserving
> > the space (because that happens in LSN order). The clock would need to
> > be monotonic (easy enough with CLOCK_MONOTONIC), but also cheap. The
> > main problem why it's being done outside the critical section, because
> > gettimeofday() may be quite expensive. There's a concept of hybrid
> > clock, combining "time" and logical counter, which might be useful
> > independently of CDR.
>
> I don't think you can rely on a system clock for conflict resolution.
> In a corner case a DBA can move the clock forward or backward between
> recordings of Ts1 and Ts2. On top of that there is no guarantee that
> 2+ servers have synchronised clocks. It seems to me that what you are
> proposing will just hide the problem instead of solving it in the
> general case.
>

It is possible that we can't rely on the system clock for conflict
resolution but that is not the specific point of this thread. As
mentioned in the subject of this thread, the problem is "Commit
Timestamp and LSN Inversion issue". The LSN value and timestamp for a
commit are not generated atomically, so two different transactions can
have them in different order.

Your point, as far as I can understand, is that it is not a good idea
in the first place to have a strategy like "last_write_wins" which
relies on the system clock, so even if the LSN->timestamp ordering has
any problem, it won't matter to us. Now, we can discuss whether
"last_write_wins" is a poor strategy or not, but if possible, for the
sake of the point of this thread, let's assume that users using the
resolution feature ("last_write_wins") either ensure that clocks are
synced or won't enable this feature, and then see if we can think of
any problem w.r.t. the current code.

-- 
With Regards,
Amit Kapila.




Re: Collect statistics about conflicts in logical replication

2024-09-03 Thread Amit Kapila
On Wed, Sep 4, 2024 at 9:17 AM Peter Smith  wrote:
>
> On Tue, Sep 3, 2024 at 9:23 PM Zhijie Hou (Fujitsu)
>  wrote:
> >
> > I confirmed that it only increased the testing time by 1 second on my 
> > machine.
> >
>
> It seems a pity to throw away perfectly good test cases just because
> they increase how long the suite takes to run.
>

We could take every possible test, but I worry about the time they
consume without adding much value, and about the maintenance burden. I
feel that for core code we should pay attention to tests as well and
not try to jam in all possible tests that mostly test similar stuff.
Each time before committing or otherwise verifying a patch, we run
make check-world, so we don't want that time to go enormously high.
Having said that, I don't mean the added functionality shouldn't be
tested properly, and I try my best to achieve that.

> This seems like yet another example of where we could have made good
> use of the 'PG_TEST_EXTRA' environment variable. I have been trying to
> propose adding "subscription" support for this in another thread [1].
> By using this variable to make some tests conditional, we could have
> the best of both worlds. e.g.
> - retain all tests, but
> - by default, only run a subset of those tests (to keep default test
> execution time low).
>
> I hope that if the idea to use PG_TEST_EXTRA for "subscription" tests
> gets accepted then later we can revisit this, and put all the removed
> extra test cases back in again.
>

I am not convinced that tests that are less useful than others or that
increase the run time are worth keeping under PG_TEST_EXTRA, but if
more people advocate such an approach then it is worth considering.

-- 
With Regards,
Amit Kapila.




Re: Collect statistics about conflicts in logical replication

2024-09-03 Thread Amit Kapila
On Fri, Aug 30, 2024 at 12:15 PM Zhijie Hou (Fujitsu)
 wrote:
>
> Here is V5 patch which addressed above and Shveta's[1] comments.
>

Testing the stats for all types of conflicts is not required for this
patch, especially because it increases the timings by 3-4s. We can add
tests for one or two types of conflicts.

*
(see
+ * PgStat_StatSubEntry::conflict_count and PgStat_StatSubEntry::conflict_count)

There is a typo in the above comment.

-- 
With Regards,
Amit Kapila.




Re: DOCS - pg_replication_slot . Fix the 'inactive_since' description

2024-09-02 Thread Amit Kapila
On Mon, Sep 2, 2024 at 9:14 AM shveta malik  wrote:
>
> On Mon, Sep 2, 2024 at 5:47 AM Peter Smith  wrote:
> > 
> >
> > To summarize, the current description wrongly describes the field as a
> > time duration:
> > "The time since the slot has become inactive."
> >
> > I suggest replacing it with:
> > "The slot has been inactive since this time."
> >
>
> +1 for the change. If I had read the document without knowing about
> the patch, I too would have interpreted it as a duration.
>

The suggested change looks good to me as well. I'll wait for a day or
two before pushing to see if anyone thinks otherwise.

-- 
With Regards,
Amit Kapila.




Re: Invalid Assert while validating REPLICA IDENTITY?

2024-09-02 Thread Amit Kapila
On Mon, Sep 2, 2024 at 11:21 AM Dilip Kumar  wrote:
>
> While working on some other code I noticed that in
> FindReplTupleInLocalRel() there is an assert [1] that seems to be
> passing IndexRelation to GetRelationIdentityOrPK() whereas it should
> be passing normal relation.
>

Agreed. But this should lead to assertion failure. Did you try testing it?

-- 
With Regards,
Amit Kapila.




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-09-02 Thread Amit Kapila
On Sat, Aug 31, 2024 at 1:45 PM Bharath Rupireddy
 wrote:
>
> Please find the attached v44 patch with the above changes. I will
> include the 0002 xid_age based invalidation patch later.
>

It is better to get 0001 reviewed and committed first. We can discuss
0002 afterwards, as 0001 is in itself a complete and separate patch
that can be committed.

-- 
With Regards,
Amit Kapila.




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-09-01 Thread Amit Kapila
On Thu, Aug 29, 2024 at 11:31 AM Bharath Rupireddy
 wrote:
>
> Thanks for looking into this.
>
> On Mon, Aug 26, 2024 at 4:35 PM Amit Kapila  wrote:
> >
> > Few comments on 0001:
> > 1.
> > @@ -651,6 +651,13 @@ synchronize_one_slot(RemoteSlot *remote_slot, Oid
> >
> > + /*
> > + * Skip the sync if the local slot is already invalidated. We do this
> > + * beforehand to avoid slot acquire and release.
> > + */
> >
> > I was wondering why you have added this new check as part of this
> > patch. If you see the following comments in the related code, you will
> > know why we haven't done this previously.
>
> Removed. Can deal with optimization separately.
>
> > 2.
> > + */
> > + InvalidateObsoleteReplicationSlots(RS_INVAL_INACTIVE_TIMEOUT,
> > +0, InvalidOid, InvalidTransactionId);
> >
> > Why do we want to call this for shutdown case (when is_shutdown is
> > true)? I understand trying to invalidate slots during regular
> > checkpoint but not sure if we need it at the time of shutdown.
>
> Changed it to invalidate only for non-shutdown checkpoints. inactive_timeout 
> invalidation isn't critical for shutdown unlike wal_removed which can help 
> shutdown by freeing up some disk space.
>
> > The
> > other point is can we try to check the performance impact with 100s of
> > slots as mentioned in the code comments?
>
> I first checked how much does the wal_removed invalidation check add to the 
> checkpoint (see 2nd and 3rd column). I then checked how much inactive_timeout 
> invalidation check adds to the checkpoint (see 4th column), it is not more 
> than wal_remove invalidation check. I then checked how much the wal_removed 
> invalidation check adds for replication slots that have already been 
> invalidated due to inactive_timeout (see 5th column), looks like not much.
>
> | # of slots | HEAD (no invalidation) ms | HEAD (wal_removed) ms | PATCHED (inactive_timeout) ms | PATCHED (inactive_timeout+wal_removed) ms |
> |------------|---------------------------|-----------------------|-------------------------------|-------------------------------------------|
> | 100        | 18.591                    | 370.586               | 359.299                       | 373.882                                   |
> | 1000       | 15.722                    | 4834.901              | 5081.751                      | 5072.128                                  |
> | 1          | 19.261                    | 59801.062             | 61270.406                     | 60270.099                                 |
>
> Having said that, I'm okay to implement the optimization specified. Thoughts?
>

The other possibility is to try invalidating due to timeout along with
the wal_removed case during the checkpoint. The idea is that if the
slot can be invalidated due to WAL removal then fine; otherwise, check
whether it can be invalidated due to timeout. This would avoid looping
over the slots and doing similar work multiple times during the
checkpoint.

-- 
With Regards,
Amit Kapila.




Re: long-standing data loss bug in initial sync of logical replication

2024-09-01 Thread Amit Kapila
On Fri, Aug 30, 2024 at 3:06 PM Shlok Kyal  wrote:
>
> Next I am planning to test solely on the logical decoding side and
> will share the results.
>

Thanks, the next set of proposed tests makes sense to me. It will also
be useful to generate some worst-case scenarios where the number of
invalidations is higher, to see the distribution cost in such cases.
For example, truncate or drop a table with 100 or 1000 partitions.
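
Something along these lines could be used to generate such a case (a
sketch using psql's \gexec; the numbers and names are arbitrary):

CREATE TABLE part_tab (id int) PARTITION BY RANGE (id);
SELECT format('CREATE TABLE part_tab_%s PARTITION OF part_tab FOR VALUES FROM (%s) TO (%s);',
              i, i * 1000, (i + 1) * 1000)
FROM generate_series(0, 999) AS i \gexec
-- and then, in a transaction that is being decoded:
TRUNCATE part_tab;   -- or DROP TABLE part_tab;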

-- 
With Regards,
Amit Kapila.




Re: Add contrib/pg_logicalsnapinspect

2024-08-30 Thread Amit Kapila
On Thu, Aug 29, 2024 at 6:33 PM Bharath Rupireddy
 wrote:
>
> On Thu, Aug 29, 2024 at 3:44 PM Bertrand Drouvot
>  wrote:
> >
> > Yeah that's fair. And now I'm wondering if we need an extra module. I think
> > we could "simply" expose 2 new functions in core, thoughts?
> >
> > > > What do you think? Did you have something else in mind?
> > > >
> > >
> > > On similar lines, we can also provide a function to get the slot's
> > > on-disk data.
> >
> > Yeah, having a way to expose the data from the disk makes fully sense to me.
> >
> > > IIRC, Bharath had previously proposed a tool to achieve
> > > the same. It is fine if we don't want to add that as part of this
> > > patch but I mentioned it because by having that we can have a set of
> > > functions to view logical decoding data.
> >
> > That's right. I think this one would be simply enough to expose one or two
> > functions in core too (and probably would not need an extra module).
>
> +1 for functions in core unless this extra module
> pg_logicalsnapinspect works as a tool to be helpful even when the
> server is down.
>

We have an example of pageinspect, which provides low-level functions
to aid debugging. The proposal for these APIs also seems to fall in
the same category, so why put these functions in core?

-- 
With Regards,
Amit Kapila.




Re: Conflict Detection and Resolution

2024-08-29 Thread Amit Kapila
On Wed, Aug 28, 2024 at 4:07 PM shveta malik  wrote:
>
> > On Wed, Aug 28, 2024 at 10:30 AM Ajin Cherian  wrote:
> > >
>
> The review is WIP. Please find a few comments on patch001.
>
> 1)
> logical-repliction.sgmlL
>
> + Additional logging is triggered for specific conflict_resolvers.
> Users can also configure conflict_types while creating the
> subscription. Refer to section CONFLICT RESOLVERS for details on
> conflict_types and conflict_resolvers.
>
> Can we please change it to:
>
> Additional logging is triggered in various conflict scenarios, each
> identified as a conflict type. Users have the option to configure a
> conflict resolver for each conflict type when creating a subscription.
> For more information on the conflict types detected and the supported
> conflict resolvers, refer to the section 
>
> 2)
> SetSubConflictResolver
>
> + for (type = 0; type < resolvers_cnt; type++)
>
> 'type' does not look like the correct name here. The variable does not
> state conflict_type, it is instead a resolver-array-index, so please
> rename accordingly. Maybe idx or res_idx?
>
>  3)
> CreateSubscription():
>
> +if (stmt->resolvers)
> +check_conflict_detection();
>
> 3a) We can have a comment saying warn users if prerequisites  are not met.
>
> 3b) Also, I do not find the name 'check_conflict_detection'
> appropriate. One suggestion could be
> 'conf_detection_check_prerequisites' (similar to
> replorigin_check_prerequisites)
>
> 3c) We can move the below comment after check_conflict_detection() as
> it makes more sense there.
> /*
>  * Parse and check conflict resolvers. Initialize with default values
>  */
>
> 4)
> Should we allow repetition/duplicates of 'conflict_type=..' in CREATE
> and ALTER SUB? As an example:
> ALTER SUBSCRIPTION sub1 CONFLICT RESOLVER (insert_exists =
> 'apply_remote', insert_exists = 'error');
>
> Such a repetition works for Create-Sub but gives some internal error
> for alter-sub. (ERROR:  tuple already updated by self). Behaviour
> should be the same for both. And if we give an error, it should be
> some user understandable one. But I would like to know the opinions of
> others. Shall it give an error or the last one should be accepted as
> valid configuration in case of repetition?
>

I have tried the below statement to check existing behavior:
create subscription sub1 connection 'dbname=postgres' publication pub1
with (streaming = on, streaming=off);
ERROR:  conflicting or redundant options
LINE 1: ...=postgres' publication pub1 with (streaming = on, streaming=...

So duplicate options are not allowed. If we see any challenges in
following the same for resolvers then we can discuss them, but it
seems better to follow the existing behavior of other subscription
options.

Also, the behavior for CREATE/ALTER should be the same.
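
So, with the proposed CONFLICT RESOLVER syntax (still under
discussion, not committed), the expectation would be something like:

CREATE SUBSCRIPTION sub1 CONNECTION 'dbname=postgres' PUBLICATION pub1
    CONFLICT RESOLVER (insert_exists = 'apply_remote', insert_exists = 'error');
ERROR:  conflicting or redundant options

ALTER SUBSCRIPTION sub1
    CONFLICT RESOLVER (insert_exists = 'apply_remote', insert_exists = 'error');
ERROR:  conflicting or redundant options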

-- 
With Regards,
Amit Kapila.




Re: Conflict Detection and Resolution

2024-08-29 Thread Amit Kapila
On Wed, Aug 28, 2024 at 10:58 AM shveta malik  wrote:
>
> On Wed, Aug 28, 2024 at 10:30 AM Ajin Cherian  wrote:
> >
> >> 2)
> >> Currently pg_dump is dumping even the default resolvers configuration.
> >> As an example if I have not changed default configuration for say
> >> sub1, it still dumps all:
> >>
> >> CREATE SUBSCRIPTION sub1 CONNECTION '..' PUBLICATION pub1 WITH ()
> >> CONFLICT RESOLVER (insert_exists = 'error', update_differ =
> >> 'apply_remote', update_exists = 'error', update_missing = 'skip',
> >> delete_differ = 'apply_remote', delete_missing = 'skip');
> >>
> >> I am not sure if we need to dump default resolvers. Would like to know
> >> what others think on this.
> >>

Normally, we don't add defaults in the dumped command. For example,
dumpSubscription won't dump the options where the default is
unchanged. We shouldn't do it unless we have a reason for dumping
defaults.

> >> 3)
> >> Why in 002_pg_dump.pl we have default resolvers set explicitly?
> >>
> > In 003_pg_dump.pl, default resolvers are not set explicitly, that is the 
> > regexp to check the pg_dump generated command for creating subscriptions. 
> > This is again connected to your 2nd question.
>
> Okay so we may not need this change if we plan to *not *dump defaults
> in pg_dump.
>
> Another point about 'defaults' is regarding insertion into the
> pg_subscription_conflict table. We currently do insert default
> resolvers into 'pg_subscription_conflict' even if the user has not
> explicitly configured them.
>

I don't see any problem with it. BTW, if we don't do it, I think that
wherever we refer to the resolvers for a conflict, we would need some
special handling for the default and non-default cases. Am I missing
something?

-- 
With Regards,
Amit Kapila.




Re: Conflict Detection and Resolution

2024-08-29 Thread Amit Kapila
 and delete_differ cannot be detected, and the origin and commit
> timestamp for the local row will not be logged.
>
> Thoughts?
>
> If we emit this WARNING during each resolution, then it may flood our
> log files, thus it seems better to emit it during create or alter
> subscription instead of during resolution.
>

Sounds reasonable.

-- 
With Regards,
Amit Kapila.




Re: Add contrib/pg_logicalsnapinspect

2024-08-29 Thread Amit Kapila
On Wed, Aug 28, 2024 at 1:25 AM Bertrand Drouvot
 wrote:
>
> On Mon, Aug 26, 2024 at 07:05:27PM +0530, Amit Kapila wrote:
> > On Thu, Aug 22, 2024 at 5:56 PM Bertrand Drouvot
> >  wrote:
> > >
> > > 2. The SnapBuildOnDisk and SnapBuild structs are now exposed to public. 
> > > Means
> > > we should now pay much more attention when changing their contents but I 
> > > think
> > > it's worth it.
> > >
> >
> > Is it possible to avoid exposing these structures? Can we expose some
> > function from snapbuild.c that provides the required information?
>
> Yeah, that's an option if we don't want to expose those structs to public.
>
> I think we could 1/ create a function that would return a formed HeapTuple, or
> 2/ we could create multiple functions (about 15) that would return the values
> we are interested in.
>
> I think 2/ is fine as it would give more flexiblity (no need to retrieve a 
> whole
> tuple if one is interested to only one value).
>

True, but OTOH, each time we add a new field to these structures, a
new function has to be exposed. I don't have a strong opinion on this
but seeing current use cases, it seems okay to expose a single
function.

> What do you think? Did you have something else in mind?
>

On similar lines, we can also provide a function to get the slot's
on-disk data. IIRC, Bharath had previously proposed a tool to achieve
the same. It is fine if we don't want to add that as part of this
patch but I mentioned it because by having that we can have a set of
functions to view logical decoding data.

-- 
With Regards,
Amit Kapila.




Re: Pgoutput not capturing the generated columns

2024-08-28 Thread Amit Kapila
On Thu, Aug 29, 2024 at 8:44 AM Masahiko Sawada  wrote:
>
> On Wed, Aug 28, 2024 at 1:06 AM Amit Kapila  wrote:
> >
> > On Mon, May 20, 2024 at 1:49 PM Masahiko Sawada  
> > wrote:
> > >
> > > As Euler mentioned earlier, I think it's a decision not to replicate
> > > generated columns because we don't know the target table on the
> > > subscriber has the same expression and there could be locale issues
> > > even if it looks the same. I can see that a benefit of this proposal
> > > would be to save cost to compute generated column values if the user
> > > wants the target table on the subscriber to have exactly the same data
> > > as the publisher's one. Are there other benefits or use cases?
> > >
> >
> > The cost is one but the other is the user may not want the data to be
> > different based on volatile functions like timeofday()
>
> Shouldn't the generation expression be immutable?
>

Yes, I missed that point.

> > or the table on
> > subscriber won't have the column marked as generated.
>
> Yeah, it would be another use case.
>

Right, apart from that I am not aware of other use cases. If they have
any, I would request Euler or Rajendra to share them.

> >  Now, considering
> > such use cases, is providing a subscription-level option a good idea
> > as the patch is doing? I understand that this can serve the purpose
> > but it could also lead to having the same behavior for all the tables
> > in all the publications for a subscription which may or may not be
> > what the user expects. This could lead to some performance overhead
> > (due to always sending generated columns for all the tables) for cases
> > where the user needs it only for a subset of tables.
>
> Yeah, it's a downside and I think it's less flexible. For example, if
> users want to send both tables with generated columns and tables
> without generated columns, they would have to create at least two
> subscriptions.
>

Agreed and that would consume more resources.

> Also, they would have to include a different set of
> tables to two publications.
>
> >
> > I think we should consider it as a table-level option while defining
> > publication in some way. A few ideas could be: (a) We ask users to
> > explicitly mention the generated column in the columns list while
> > defining publication. This has a drawback such that users need to
> > specify the column list even when all columns need to be replicated.
> > (b) We can have some new syntax to indicate the same like: CREATE
> > PUBLICATION pub1 FOR TABLE t1 INCLUDE GENERATED COLS, t2, t3, t4
> > INCLUDE ..., t5;. I haven't analyzed the feasibility of this, so there
> > could be some challenges but we can at least investigate it.
>
> I think we can create a publication for a single table, so what we can
> do with this feature can be done also by the idea you described below.
>
> > Yet another idea is to keep this as a publication option
> > (include_generated_columns or publish_generated_columns) similar to
> > "publish_via_partition_root". Normally, "publish_via_partition_root"
> > is used when tables on either side have different partition
> > hierarchies which is somewhat the case here.
>
> It sounds more useful to me.
>

Fair enough. Let's see if anyone else has any preference among the
proposed methods or can think of a better way.

-- 
With Regards,
Amit Kapila.




Re: Collect statistics about conflicts in logical replication

2024-08-28 Thread Amit Kapila
On Wed, Aug 28, 2024 at 11:43 AM shveta malik  wrote:
>
> Thanks for the patch. Just thinking out loud, since we have names like
> 'apply_error_count', 'sync_error_count' which tells that they are
> actually error-count, will it be better to have something similar in
> conflict-count cases, like 'insert_exists_conflict_count',
> 'delete_missing_conflict_count' and so on. Thoughts?
>

It would be better to have 'conflict' in the names, but OTOH it will
make the names a bit longer. The other alternatives could be (a)
insert_exists_confl_count, etc., (b) confl_insert_exists_count, etc.,
or (c) confl_insert_exists, etc. These are based on the column names in
the existing view pg_stat_database_conflicts [1]. Option (c) looks
better than the others, but it will make the naming of the
conflict-related columns different from the error-related columns.

Yet another option is to have a different view like
pg_stat_subscription_conflicts but that sounds like going too far.
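
Just to visualize option (c) (the conflict columns below are purely
illustrative and do not exist today), the subscription statistics view
could end up looking like this:

-- Illustrative only: option (c) naming style alongside the existing
-- error counters in pg_stat_subscription_stats.
SELECT subname, apply_error_count, sync_error_count,
       confl_insert_exists, confl_update_exists, confl_delete_missing
FROM pg_stat_subscription_stats;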

[1] - 
https://www.postgresql.org/docs/devel/monitoring-stats.html#MONITORING-PG-STAT-DATABASE-CONFLICTS-VIEW

-- 
With Regards,
Amit Kapila.




Re: Doc: fix the note related to the GUC "synchronized_standby_slots"

2024-08-28 Thread Amit Kapila
On Wed, Aug 28, 2024 at 3:02 PM  wrote:
>
> >
> > The next line related to asynchronous replication is also not required. See 
> > attached.
>
> Thanks, I found another ".. without losing data".
>

I'll push this tomorrow unless there are any other suggestions on this patch.

-- 
With Regards,
Amit Kapila.




Re: Conflict detection and logging in logical replication

2024-08-28 Thread Amit Kapila
On Wed, Aug 28, 2024 at 12:24 PM Peter Smith  wrote:
>
> On Wed, Aug 28, 2024 at 3:53 PM shveta malik  wrote:
> >
> > On Wed, Aug 28, 2024 at 9:44 AM Zhijie Hou (Fujitsu)
> >  wrote:
> > >
> > > > > +1 on 'update_origin_differs' instead of 'update_origins_differ' as
> > > > > the former is somewhat similar to other conflict names 'insert_exists'
> > > > > and 'update_exists'.
> > > >
> > > > Since we reached a consensus on this, I am attaching a small patch to 
> > > > rename
> > > > as suggested.
> > >
> > > Sorry, I attached the wrong patch. Here is correct one.
> > >
> >
> > LGTM.
> >
>
> LGTM.
>

I'll push this patch tomorrow unless there are any suggestions or comments.

-- 
With Regards,
Amit Kapila.




Re: Doc: fix the note related to the GUC "synchronized_standby_slots"

2024-08-28 Thread Amit Kapila
On Wed, Aug 28, 2024 at 6:16 AM  wrote:
>
> > > > So, will it be okay if we just remove ".. without losing data" from
> > > > the sentence? Will that avoid the confusion you have?
> > > Yes. Additionally, it would be better to add notes about data
> > > consistency after failover for example
> > >
> > > Note that data consistency after failover can vary depending on the
> > > configurations. If "synchronized_standby_slots" is not configured,
> > > there may be data that only the subscribers hold, even though the new 
> > > primary does
> > not.
> > >
> >
> > This part can be inferred from the description of 
> > synchronized_standby_slots [1] (See:
> > This guarantees that logical replication failover slots do not consume 
> > changes until those
> > changes are received and flushed to corresponding physical standbys. If a 
> > logical
> > replication connection is meant to switch to a physical standby after the 
> > standby is
> > promoted, the physical replication slot for the standby should be listed 
> > here.)
>
> OK, it's enough for me just remove ".. without losing data".
>

The next line related to asynchronous replication is also not
required. See attached.

-- 
With Regards,
Amit Kapila.


fix_doc_1.patch
Description: Binary data


Re: Pgoutput not capturing the generated columns

2024-08-27 Thread Amit Kapila
On Mon, May 20, 2024 at 1:49 PM Masahiko Sawada  wrote:
>
> On Wed, May 8, 2024 at 4:14 PM Shubham Khanna
>  wrote:
> >
> > On Wed, May 8, 2024 at 11:39 AM Rajendra Kumar Dangwal
> >  wrote:
> > >
> > > Hi PG Hackers.
> > >
> > > We are interested in enhancing the functionality of the pgoutput plugin 
> > > by adding support for generated columns.
> > > Could you please guide us on the necessary steps to achieve this? 
> > > Additionally, do you have a platform for tracking such feature requests? 
> > > Any insights or assistance you can provide on this matter would be 
> > > greatly appreciated.
> >
> > The attached patch has the changes to support capturing generated
> > column data using ‘pgoutput’ and’ test_decoding’ plugin. Now if the
> > ‘include_generated_columns’ option is specified, the generated column
> > information and generated column data also will be sent.
>
> As Euler mentioned earlier, I think it's a decision not to replicate
> generated columns because we don't know the target table on the
> subscriber has the same expression and there could be locale issues
> even if it looks the same. I can see that a benefit of this proposal
> would be to save cost to compute generated column values if the user
> wants the target table on the subscriber to have exactly the same data
> as the publisher's one. Are there other benefits or use cases?
>

The cost is one reason; the other is that the user may not want the
data to be different based on volatile functions like timeofday(), or
the table on the subscriber won't have the column marked as generated.
Now, considering such use cases, is providing a subscription-level
option a good idea, as the patch is doing? I understand that this can
serve the purpose
but it could also lead to having the same behavior for all the tables
in all the publications for a subscription which may or may not be
what the user expects. This could lead to some performance overhead
(due to always sending generated columns for all the tables) for cases
where the user needs it only for a subset of tables.

I think we should consider it as a table-level option while defining
publication in some way. A few ideas could be: (a) We ask users to
explicitly mention the generated column in the column list while
defining the publication. This has the drawback that users need to
specify the column list even when all columns need to be replicated.
(b) We can have some new syntax to indicate the same like: CREATE
PUBLICATION pub1 FOR TABLE t1 INCLUDE GENERATED COLS, t2, t3, t4
INCLUDE ..., t5;. I haven't analyzed the feasibility of this, so there
could be some challenges but we can at least investigate it.

Yet another idea is to keep this as a publication option
(include_generated_columns or publish_generated_columns) similar to
"publish_via_partition_root". Normally, "publish_via_partition_root"
is used when tables on either side have different partition
hierarchies which is somewhat the case here.
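
To make the alternatives concrete (the syntax for (b) and (c) does not
exist today; the option name and column names below are only
illustrative):

-- (a) explicitly list the generated column in the publication's column list
CREATE PUBLICATION pub1 FOR TABLE t1 (id, val, gen_col);

-- (b) hypothetical per-table syntax
CREATE PUBLICATION pub1 FOR TABLE t1 INCLUDE GENERATED COLS, t2, t3;

-- (c) hypothetical publication-level option, akin to publish_via_partition_root
CREATE PUBLICATION pub1 FOR TABLE t1, t2, t3
  WITH (publish_generated_columns = true);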

Thoughts?

-- 
With Regards,
Amit Kapila.




Re: Doc: fix the note related to the GUC "synchronized_standby_slots"

2024-08-27 Thread Amit Kapila
On Tue, Aug 27, 2024 at 3:05 PM  wrote:
>
> > So, will it be okay if we just remove ".. without losing data" from the 
> > sentence? Will that
> > avoid the confusion you have?
> Yes. Additionally, it would be better to add notes about data consistency 
> after failover for example
>
> Note that data consistency after failover can vary depending on the 
> configurations. If
> "synchronized_standby_slots" is not configured, there may be data that only 
> the subscribers hold,
> even though the new primary does not.
>

This part can be inferred from the description of
synchronized_standby_slots [1] (See: This guarantees that logical
replication failover slots do not consume changes until those changes
are received and flushed to corresponding physical standbys. If a
logical replication connection is meant to switch to a physical
standby after the standby is promoted, the physical replication slot
for the standby should be listed here.)

>
> Additionally, in the case of asynchronous physical replication,
> there remains a risk of data loss for transactions committed on the former 
> primary server
> but have yet to be replicated to the new primary server.
>

This has nothing to do with failover slots. This is a known behavior
of asynchronous replication, so adding it here doesn't make much sense.

In general, adding more information unrelated to failover slots can
confuse users.

[1] - 
https://www.postgresql.org/docs/17/runtime-config-replication.html#GUC-SYNCHRONIZED-STANDBY-SLOTS

-- 
With Regards,
Amit Kapila.




Re: Doc: fix the note related to the GUC "synchronized_standby_slots"

2024-08-27 Thread Amit Kapila
On Tue, Aug 27, 2024 at 10:18 AM  wrote:
>
> > I think you see such a behavior because you have disabled 
> > 'synchronized_standby_slots'
> > in your script (# disable "synchronized_standby_slots"). You need to enable 
> > that to
> > avoid data loss. Considering that, I don't think your proposed text is an 
> > improvement.
> Yes, I know.
>
> As David said, "without losing data" makes me confused because there are 
> three patterns that users
> think the data was lost though there may be other cases.
>

So, will it be okay if we just remove ".. without losing data" from
the sentence? Will that avoid the confusion you have?

With Regards,
Amit Kapila.




Re: Allow logical failover slots to wait on synchronous replication

2024-08-26 Thread Amit Kapila
On Tue, Aug 27, 2024 at 12:58 AM John H  wrote:
>
> For instance, in Shveta's suggestion of
>
> > > > We can perform this test with both of the below settings and say make
> > > > D and E slow in sending responses:
> > > > 1) synchronous_standby_names = 'ANY 3 (A,B,C,D,E)'
> > > > 2) standby_slot_names = A_slot, B_slot, C_slot, D_slot, E_slot.
>
> if the server associated with E_slot is just down or undergoing
> some sort of maintenance, then all logical consumers would start lagging until
> the server is back up. I could also mimic a network lag of 20 seconds
> and it's guaranteed
> that this patch will perform better.
>

I wanted a simple test where in the first case you use
synchronous_standby_names = 'ANY 3 (A,B,C,D,E)' and in the second case
you use standby_slot_names = A_slot, B_slot, C_slot, D_slot, E_slot.
You can try some variations of it as well. The idea is that even if the
performance is lower for the synchronous_standby_names configuration,
we should be able to document it. This will help users decide what is
best for them.
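
As a minimal sketch of the two setups (using the current GUC name
synchronized_standby_slots for what is referred to as standby_slot_names
above; the standby and slot names are just the ones from the example):

-- Case 1 (behavior proposed by the patch): logical failover slots wait
-- on synchronous physical replication.
ALTER SYSTEM SET synchronous_standby_names = 'ANY 3 (A, B, C, D, E)';

-- Case 2 (existing mechanism): logical walsenders wait for the listed
-- physical slots.
ALTER SYSTEM SET synchronized_standby_slots = 'A_slot, B_slot, C_slot, D_slot, E_slot';

SELECT pg_reload_conf();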

> I re-ran the benchmarks with a longer run time of 3 hours, and testing
> a new shared cache
> for walsenders to check the value before obtaining the SyncRepLock.
>
> I also saw I was being throttled on storage in my previous benchmarks
> so I moved to a new setup.
> I benchmarked a new test case with an additional shared cache between
> all the walsenders to
> reduce potential contention on SyncRepLock, and have attached said patch.
>
> Database: Writer on it's own disk, 5 RRs on the other disk together
> Client: 10 logical clients, pgbench running from here as well
>
> 'pgbench -c 32 -j 4 -T 10800 -U "ec2-user" -d postgres -r -P 1'
>
> # Test failover_slots with synchronized_standby_slots = 'rr_1, rr_2,
> rr_3, rr_4, rr_5'
> latency average = 10.683 ms
> latency stddev = 11.851 ms
> initial connection time = 145.876 ms
> tps = 2994.595673 (without initial connection time)
>
> # Test failover_slots waiting on sync_rep no new shared cache
> latency average = 10.684 ms
> latency stddev = 12.247 ms
> initial connection time = 142.561 ms
> tps = 2994.160136 (without initial connection time)
> statement latencies in milliseconds and failures:
>
> # Test failover slots with additional shared cache
> latency average = 10.674 ms
> latency stddev = 11.917 ms
> initial connection time = 142.486 ms
> tps = 2997.315874 (without initial connection time)
>
> The tps improvement between no cache and shared_cache seems marginal, but we 
> do
> see the slight improvement  in stddev which makes sense from a
> contention perspective.
>

What is the difference between "Test failover_slots with
synchronized_standby_slots = 'rr_1, rr_2, rr_3, rr_4, rr_5'" and "Test
failover_slots waiting on sync_rep no new shared cache"? I want to know
what configuration you used for synchronous_standby_names in the latter
case.

> I think the cache would demonstrate a lot more improvement if we had
> say 1000 logical slots
> and all of them are trying to obtain SyncRepLock for updating its values.
>
> I've attached the patch but don't feel particularly strongly about the
> new shared LSN values.
>

I am also not sure, especially as the test results didn't show much
improvement and the code also becomes a bit complicated. BTW, in the
0003 version in the below code:
+ /* Cache values to reduce contention */
+ LWLockAcquire(SyncRepLock, LW_SHARED);
+ memcpy((XLogRecPtr *) walsndctl->cached_lsn, (XLogRecPtr *)
walsndctl->lsn, sizeof(lsn));
+ LWLockRelease(SyncRepLock);

Which mode's LSN is being copied? I am not sure I understood this
part of the code.

In the 0002 version, in the following code [1], you are referring to
the LSN mode that is enabled for the logical walsender, irrespective of
the mode used by the physical walsender. It is possible that they are
always the same but that is not evident from the code or comments in
the patch.
[1] :
+ /* Cache values to reduce contention on lock */
+ for (i = 0; i < NUM_SYNC_REP_WAIT_MODE; i++)
+ {
+ lsn[i] = walsndctl->lsn[i];
+ }

- ss_oldest_flush_lsn = min_restart_lsn;
+ LWLockRelease(SyncRepLock);

- return true;
+ if (lsn[mode] >= wait_for_lsn)
+ return true;

-- 
With Regards,
Amit Kapila.




Re: Doc: fix the note related to the GUC "synchronized_standby_slots"

2024-08-26 Thread Amit Kapila
On Mon, Aug 26, 2024 at 6:38 PM Zhijie Hou (Fujitsu)
 wrote:
>
> On Monday, August 26, 2024 5:37 PM Amit Kapila  
> wrote:
> >
> > On Mon, Aug 26, 2024 at 1:30 PM  wrote:
> > >
> > > When I read the following documentation related to the
> > "synchronized_standby_slots", I misunderstood that data loss would not occur
> > in the case of synchronous physical replication. However, this is incorrect 
> > (see
> > reproduce.txt).
> > >
> > > > Note that in the case of asynchronous replication, there remains a risk 
> > > > of
> > data loss for transactions committed on the former primary server but have 
> > yet
> > to be replicated to the new primary server.
> > > https://www.postgresql.org/docs/17/logical-replication-failover.html
> > >
> > > Am I missing something?
> > >
> >
> > It seems part of the paragraph: "Note that in the case of asynchronous
> > replication, there remains a risk of data loss for transactions committed 
> > on the
> > former primary server but have yet to be replicated to the new primary 
> > server." is
> > a bit confusing. Will it make things clear to me if we remove that part?
>
> I think the intention is to address a complaint[1] that the date inserted on
> primary after the primary disconnects with the standby is still lost after
> failover. But after rethinking, maybe it's doesn't directly belong to the 
> topic in
> the logical failover section because it's a general fact for async 
> replication.
> If we think it matters, maybe we can remove this part and slightly modify
> another part:
>
>parameter ensures a seamless transition of those subscriptions after the
>standby is promoted. They can continue subscribing to publications on the
> -   new primary server without losing data.
> +   new primary server without losing that has already been replicated and
> +flushed on the standby server.
>

Yeah, we can change it that way, but I am not sure that satisfies the
OP's concern. I am waiting for his response.

-- 
With Regards,
Amit Kapila.




Re: Doc: fix the note related to the GUC "synchronized_standby_slots"

2024-08-26 Thread Amit Kapila
On Mon, Aug 26, 2024 at 1:30 PM  wrote:
>
> When I read the following documentation related to the 
> "synchronized_standby_slots", I misunderstood that data loss would not occur 
> in the case of synchronous physical replication. However, this is incorrect 
> (see reproduce.txt).
>

I think you see such a behavior because you have disabled
'synchronized_standby_slots' in your script (# disable
"synchronized_standby_slots"). You need to enable that to avoid data
loss. Considering that, I don't think your proposed text is an
improvement.

-- 
With Regards,
Amit Kapila.




Re: Add contrib/pg_logicalsnapinspect

2024-08-26 Thread Amit Kapila
On Thu, Aug 22, 2024 at 5:56 PM Bertrand Drouvot
 wrote:
>
> Please find attached a patch to $SUBJECT.
>
> This module provides SQL functions to inspect the contents of serialized 
> logical
> snapshots of a running database cluster, which I think could be useful for
> debugging or educational purposes.
>

+1. I see it could be a good debugging aid.

>
> 2. The SnapBuildOnDisk and SnapBuild structs are now exposed to public. Means
> we should now pay much more attention when changing their contents but I think
> it's worth it.
>

Is it possible to avoid exposing these structures? Can we expose some
function from snapbuild.c that provides the required information?

-- 
With Regards,
Amit Kapila.




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-08-26 Thread Amit Kapila
On Mon, Aug 26, 2024 at 11:44 AM Bharath Rupireddy
 wrote:
>

Few comments on 0001:
1.
@@ -651,6 +651,13 @@ synchronize_one_slot(RemoteSlot *remote_slot, Oid
remote_dbid)
 " name slot \"%s\" already exists on the standby",
 remote_slot->name));

+ /*
+ * Skip the sync if the local slot is already invalidated. We do this
+ * beforehand to avoid slot acquire and release.
+ */
+ if (slot->data.invalidated != RS_INVAL_NONE)
+ return false;
+
  /*
  * The slot has been synchronized before.

I was wondering why you have added this new check as part of this
patch. If you see the following comments in the related code, you will
know why we haven't done this previously.

/*
* The slot has been synchronized before.
*
* It is important to acquire the slot here before checking
* invalidation. If we don't acquire the slot first, there could be a
* race condition that the local slot could be invalidated just after
* checking the 'invalidated' flag here and we could end up
* overwriting 'invalidated' flag to remote_slot's value. See
* InvalidatePossiblyObsoleteSlot() where it invalidates slot directly
* if the slot is not acquired by other processes.
*
* XXX: If it ever turns out that slot acquire/release is costly for
* cases when none of the slot properties is changed then we can do a
* pre-check to ensure that at least one of the slot properties is
* changed before acquiring the slot.
*/
ReplicationSlotAcquire(remote_slot->name, true);

We need some modifications in these comments if you want to add a
pre-check here.

2.
@@ -1907,6 +2033,31 @@ CheckPointReplicationSlots(bool is_shutdown)
  SaveSlotToPath(s, path, LOG);
  }
  LWLockRelease(ReplicationSlotAllocationLock);
+
+ elog(DEBUG1, "performing replication slot invalidation checks");
+
+ /*
+ * Note that we will make another pass over replication slots for
+ * invalidations to keep the code simple. The assumption here is that the
+ * traversal over replication slots isn't that costly even with hundreds
+ * of replication slots. If it ever turns out that this assumption is
+ * wrong, we might have to put the invalidation check logic in the above
+ * loop, for that we might have to do the following:
+ *
+ * - Acqure ControlLock lock once before the loop.
+ *
+ * - Call InvalidatePossiblyObsoleteSlot for each slot.
+ *
+ * - Handle the cases in which ControlLock gets released just like
+ * InvalidateObsoleteReplicationSlots does.
+ *
+ * - Avoid saving slot info to disk two times for each invalidated slot.
+ *
+ * XXX: Should we move inactive_timeout inavalidation check closer to
+ * wal_removed in CreateCheckPoint and CreateRestartPoint?
+ */
+ InvalidateObsoleteReplicationSlots(RS_INVAL_INACTIVE_TIMEOUT,
+0, InvalidOid, InvalidTransactionId);

Why do we want to call this for the shutdown case (when is_shutdown is
true)? I understand trying to invalidate slots during a regular
checkpoint, but I am not sure we need it at the time of shutdown. The
other point is: can we try to check the performance impact with 100s of
slots, as mentioned in the code comments?

-- 
With Regards,
Amit Kapila.




Re: Conflict detection and logging in logical replication

2024-08-26 Thread Amit Kapila
On Thu, Aug 22, 2024 at 2:21 PM Amit Kapila  wrote:
>
> On Thu, Aug 22, 2024 at 1:33 PM Peter Smith  wrote:
> >
> > Do you think the documentation for the 'column_value' parameter of the
> > conflict logging should say that the displayed value might be
> > truncated?
> >
>
> I updated the patch to mention this and pushed it.
>

Peter Smith mentioned to me off-list that the names of conflict types
'update_differ' and 'delete_differ' are not intuitive as compared to
all other conflict types like insert_exists, update_missing, etc. The
other alternative that comes to mind for those conflicts is to name
them as 'update_origin_differ'/'delete_origin_differ'.

The description in docs for 'update_differ' is as follows: Updating a
row that was previously modified by another origin. Note that this
conflict can only be detected when track_commit_timestamp is enabled
on the subscriber. Currently, the update is always applied regardless
of the origin of the local row.

Does anyone else have any thoughts on the naming of these conflicts?

-- 
With Regards,
Amit Kapila.




Re: Doc: fix the note related to the GUC "synchronized_standby_slots"

2024-08-26 Thread Amit Kapila
On Mon, Aug 26, 2024 at 1:30 PM  wrote:
>
> When I read the following documentation related to the 
> "synchronized_standby_slots", I misunderstood that data loss would not occur 
> in the case of synchronous physical replication. However, this is incorrect 
> (see reproduce.txt).
>
> > Note that in the case of asynchronous replication, there remains a risk of 
> > data loss for transactions committed on the former primary server but have 
> > yet to be replicated to the new primary server.
> https://www.postgresql.org/docs/17/logical-replication-failover.html
>
> Am I missing something?
>

It seems part of the paragraph: "Note that in the case of asynchronous
replication, there remains a risk of data loss for transactions
committed on the former primary server but have yet to be replicated
to the new primary server." is a bit confusing. Will it make things
clear to me if we remove that part?

I am keeping a few others involved in this feature development in Cc.

-- 
With Regards,
Amit Kapila.




Re: Conflict Detection and Resolution

2024-08-26 Thread Amit Kapila
On Mon, Aug 26, 2024 at 7:28 AM Peter Smith  wrote:
>
> On Thu, Aug 22, 2024 at 8:15 PM shveta malik  wrote:
> >
>
> Hi Shveta,
>
> I felt it would be better to keep the syntax similar to the existing
> INSERT ... ON CONFLICT [1].
>
> I'd suggest a syntax like this:
>
> ... ON CONFLICT ['conflict_type'] DO { 'conflict_action' | DEFAULT }
>
> ~~~
>
> e.g.
>
> To configure conflict resolvers for the SUBSCRIPTION:
>
> CREATE SUBSCRIPTION subname CONNECTION coninfo PUBLICATION pubname
> ON CONFLICT 'conflict_type1' DO 'conflict_action1',
> ON CONFLICT 'conflict_type2' DO 'conflict_action2';
>

One thing that looks odd to me about this is the resolution part of
it. For example, ON CONFLICT 'insert_exists' DO 'keep_local'. The
action part doesn't go well without being explicit that it is a
resolution method. Another variant could be ON CONFLICT
'insert_exists' USE RESOLUTION [METHOD] 'keep_local'.
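
As an illustration only (this syntax is not implemented; 'keep_local'
appears above, while 'apply_remote' is a made-up resolver name here),
that variant could look like:

CREATE SUBSCRIPTION subname CONNECTION 'coninfo' PUBLICATION pubname
    ON CONFLICT 'insert_exists' USE RESOLUTION METHOD 'keep_local',
    ON CONFLICT 'update_origin_differ' USE RESOLUTION METHOD 'apply_remote';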

I think we can keep all these syntax alternatives either in the form
of comments or in the commit message and discuss more on these once we
agree on the solutions to the key design issues pointed out by Shveta.

-- 
With Regards,
Amit Kapila.




Re: Conflict Detection and Resolution

2024-08-26 Thread Amit Kapila
The analysis sounds reasonable to me.

> ~~
>
> 5)
> Currently we only capture update_missing conflict i.e. we are not
> distinguishing between the missing row and the deleted row. We had
> discussed this in the past a couple of times. If we plan to target it
> in draft 1, I can dig up all old emails and resume discussion on this.
>

This is a separate conflict detection project in itself. I am thinking
about the solution to this problem. We will talk about this in a
separate thread.

> ~~
>
> 6)
> Table-level resolves. There was a suggestion earlier to implement
> table-level resolvers. The patch has been implemented to some extent,
> it can be completed and posted when we are done reviewing subscription
> level resolvers.
>

Yeah, it makes sense to do it after the subscription-level resolution
patch is ready.

-- 
With Regards,
Amit Kapila.




Re: Fix memory counter update in reorderbuffer

2024-08-25 Thread Amit Kapila
On Fri, Aug 23, 2024 at 3:44 PM Masahiko Sawada  wrote:
>
> On Tue, Aug 20, 2024 at 9:29 PM Amit Kapila  wrote:
> >
>
> I've updated the updated patch with regression tests.
>

LGTM.

-- 
With Regards,
Amit Kapila.




Re: Conflict detection and logging in logical replication

2024-08-22 Thread Amit Kapila
On Thu, Aug 22, 2024 at 1:33 PM Peter Smith  wrote:
>
> Do you think the documentation for the 'column_value' parameter of the
> conflict logging should say that the displayed value might be
> truncated?
>

I updated the patch to mention this and pushed it.

-- 
With Regards,
Amit Kapila.




Re: CREATE SUBSCRIPTION - add missing test case

2024-08-21 Thread Amit Kapila
On Fri, Aug 16, 2024 at 9:45 AM vignesh C  wrote:
>
> On Thu, 15 Aug 2024 at 12:55, Peter Smith  wrote:
> >
> > Hi Hackers,
> >
> > While reviewing another logical replication thread [1], I found an
> > ERROR scenario that seems to be untested.
> >
> > TEST CASE: Attempt CREATE SUBSCRIPTION where the subscriber table is
> > missing some expected column(s).
> >
> > Attached is a patch to add the missing test for this error message.
>
> I agree currently there is no test to hit this code.
>

I also don't see a test for this error condition. However, it is not
clear to me how important it is to cover this error code path. This
code has existed for a long time and I didn't notice any bugs related
to it. There is a possibility that in the future we might break
something because of the lack of this test, but I am not sure we want
to cover every code path via tests, as each additional test also has
some cost. OTOH, if others think it is important or a good idea to have
this test, then I don't have any objection to it.

-- 
With Regards,
Amit Kapila.




Re: Conflict detection and logging in logical replication

2024-08-21 Thread Amit Kapila
On Wed, Aug 21, 2024 at 8:35 AM Zhijie Hou (Fujitsu)
 wrote:
>
> On Wednesday, August 21, 2024 9:33 AM Jonathan S. Katz  
> wrote:
> > On 8/6/24 4:15 AM, Zhijie Hou (Fujitsu) wrote:
> >
> > > Thanks for the idea! I thought about few styles based on the suggested
> > > format, what do you think about the following ?
> >
> > Thanks for proposing formats. Before commenting on the specifics, I do want 
> > to
> > ensure that we're thinking about the following for the log formats:
> >
> > 1. For the PostgreSQL logs, we'll want to ensure we do it in a way that's as
> > convenient as possible for people to parse the context from scripts.
>
> Yeah. And I personally think the current log format is OK for parsing 
> purposes.
>
> >
> > 2. Semi-related, I still think the simplest way to surface this info to a 
> > user is
> > through a "pg_stat_..." view or similar catalog mechanism (I'm less 
> > opinionated
> > on the how outside of we should make it available via SQL).
>
> We have a patch(v19-0002) in this thread to collect conflict stats and display
> them in the view, and the patch is under review.
>

IIUC, Jonathan is asking to store the conflict information (the one we
display in LOGs). We can do that separately as that is useful.

> Storing it into a catalog needs more analysis as we may need to add addition
> logic to clean up old conflict data in that catalog table. I think we can
> consider it as a future improvement.
>

Agreed. The cleanup part needs more consideration.

> >
> > 3. We should ensure we're able to convey to the user these details about the
> > conflict:
> >
> > * What time it occurred on the local server (which we'd have in the logs)
> > * What kind of conflict it is
> > * What table the conflict occurred on
> > * What action caused the conflict
> > * How the conflict was resolved (ability to include source/origin info)
>
> I think all above are already covered in the current conflict log. Except that
> we have not support resolving the conflict, so we don't log the resolution.
>
> >
> >
> > I think outputting the remote/local tuple value may be a parameter we need 
> > to
> > think about (with the desired outcome of trying to avoid another 
> > parameter). I
> > have a concern about unintentionally leaking data (and I understand that
> > someone with access to the logs does have a broad ability to view data); I'm
> > less concerned about the size of the logs, as conflicts in a well-designed
> > system should be rare (though a conflict storm could fill up the logs, 
> > likely there
> > are other issues to content with at that point).
>
> We could use an option to control, but the tuple value is already output in 
> some
> existing cases (e.g. partition check, table constraints check, view with check
> constraints, unique violation), and it would test the current user's
> privileges to decide whether to output the tuple or not. So, I think it's OK
> to display the tuple for conflicts.
>

The current information is displayed keeping in mind that users should
be able to use it to manually resolve conflicts if required. If we
think there is a leak of information (either from a security angle or
otherwise), such as tuple data, then we can reconsider. However, as we
already display tuple information in other places, as pointed out by
Hou-San, we thought it is okay to display it in this case as well.

-- 
With Regards,
Amit Kapila.




Re: [bug fix] prepared transaction might be lost when max_prepared_transactions is zero on the subscriber

2024-08-21 Thread Amit Kapila
On Tue, Aug 20, 2024 at 2:01 PM shveta malik  wrote:
>
> On Tue, Aug 20, 2024 at 11:36 AM Amit Kapila  wrote:
> >
> > On Wed, Aug 14, 2024 at 10:26 AM shveta malik  
> > wrote:
> > >
> > > >
> > > > Thanks for the detailed analysis. I agree with your analysis that we
> > > > need to reset the origin information for the shutdown path to avoid it
> > > > being advanced incorrectly. However, the patch doesn't have sufficient
> > > > comments to explain why we need to reset it for both the ERROR and
> > > > Shutdown paths. Can we improve the comments in the patch?
> > > >
> > > > Also, for the ERROR path, can we reset the origin information in
> > > > apply_error_callback()?
> > >
> > > Please find v4 attached. Addressed comments in that.
> > >
> >
> > The patch looks mostly good to me. I have slightly changed a few of
> > the comments in the attached. What do you think of the attached?
> >
>
> Looks good to me. Please find backported patches attached.
>

Pushed.

-- 
With Regards,
Amit Kapila.




Re: Taking into account syncrep position in flush_lsn reported by apply worker

2024-08-21 Thread Amit Kapila
On Wed, Aug 21, 2024 at 12:40 PM Heikki Linnakangas  wrote:
>
> On 21/08/2024 09:25, Amit Kapila wrote:
> >>
> >> I think this patch makes sense. I'm not sure we've actually made any
> >> promises on it, but it feels wrong that the slot's LSN might be advanced
> >> past the LSN that's been has been acknowledged by the replica, if
> >> synchronous replication is configured. I see little downside in making
> >> that promise.
> >
> > One possible downside of such a promise could be that the publisher
> > may slow down for sync replication because it has to wait for all the
> > configured sync_standbys of subscribers to acknowledge the LSN. I
> > don't know how many applications can be impacted due to this if we do
> > it by default but if we feel there won't be any such cases or they
> > will be in the minority then it is okay to proceed with this.
>
> It only slows down updating the flush LSN on the publisher, which is
> updated quite lazily anyway.
>

But doesn't that also mean that if the logical subscriber is
configured in synchronous_standby_names, then the publisher's
transactions also need to wait for such an update? We do update it
lazily, but as soon as the operation is applied on the subscriber, the
transaction on the publisher will be released; however, IIUC, the same
won't be true after the patch.

> A more serious scenario is if the sync replica crashes or is not
> responding at all. In that case, the flush LSN on the publisher cannot
> advance, and WAL starts to accumulate. However, if a sync replica is not
> responding, that's very painful for the (subscribing) server anyway: all
> commits will hang waiting for the replica. Holding back the flush LSN on
> the publisher seems like a minor problem compared to that.
>

Yeah, but as per my understanding, that also means holding back all the
active work/transactions on the publisher, which doesn't sound like a
minor problem.

> It would be good to have some kind of an escape hatch though. If you get
> into that situation, is there a way to advance the publisher's flush LSN
> even though the synchronous replica has crashed? You can temporarily
> turn off synchronous replication on the subscriber. That will release
> any COMMITs on the server too. In theory you might not want that, but in
> practice stuck COMMITs are so painful that if you are taking manual
> action, you probably do want to release them as well.
>

This will work in the scenario you mentioned.

If the above understanding is correct and you agree that it is not a
good idea to hold back transactions on the publisher, then we can think
of a new subscription option that allows the apply worker to wait for
synchronous replicas.
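
For example (the option name below is invented purely for illustration;
nothing like it exists today), this could surface as a per-subscription
knob:

-- Hypothetical sketch only: an opt-in option that makes the apply worker
-- factor its own synchronous standbys into the flush_lsn it reports.
CREATE SUBSCRIPTION sub1 CONNECTION 'dbname=postgres host=publisher'
    PUBLICATION pub1 WITH (wait_for_synchronous_standby = true);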

-- 
With Regards,
Amit Kapila.




Re: Taking into account syncrep position in flush_lsn reported by apply worker

2024-08-20 Thread Amit Kapila
On Wed, Aug 21, 2024 at 2:25 AM Heikki Linnakangas  wrote:
>
> On 14/08/2024 16:54, Arseny Sher wrote:
> > On 8/13/24 06:35, Amit Kapila wrote:
> >> On Mon, Aug 12, 2024 at 3:43 PM Arseny Sher  wrote:
> >>>
> >>> Sorry for the poor formatting of the message above, this should be
> >>> better:
> >>>
> >>> Hey. Currently synchronous_commit is disabled for logical apply worker
> >>> on the ground that reported flush_lsn includes only locally flushed data
> >>> so slot (publisher) preserves everything higher than this, and so in
> >>> case of subscriber restart no data is lost. However, imagine that
> >>> subscriber is made highly available by standby to which synchronous
> >>> replication is enabled. Then reported flush_lsn is ignorant of this
> >>> synchronous replication progress, and in case of failover data loss may
> >>> occur if subscriber managed to ack flush_lsn ahead of syncrep.
> >>
> >> Won't the same can be achieved by enabling the synchronous_commit
> >> parameter for a subscription?
> >
> > Nope, because it would force WAL flush and wait for replication to the
> > standby in the apply worker, slowing down it. The logic missing
> > currently is not to wait for the synchronous commit, but still mind its
> > progress in the flush_lsn reporting.
>
> I think this patch makes sense. I'm not sure we've actually made any
> promises on it, but it feels wrong that the slot's LSN might be advanced
> past the LSN that's been has been acknowledged by the replica, if
> synchronous replication is configured. I see little downside in making
> that promise.
>

One possible downside of such a promise could be that the publisher
may slow down for sync replication because it has to wait for all the
configured sync_standbys of subscribers to acknowledge the LSN. I
don't know how many applications can be impacted due to this if we do
it by default but if we feel there won't be any such cases or they
will be in the minority then it is okay to proceed with this.

-- 
With Regards,
Amit Kapila.




Re: Fix memory counter update in reorderbuffer

2024-08-20 Thread Amit Kapila
On Tue, Aug 20, 2024 at 5:55 PM Masahiko Sawada  wrote:
>
> On Fri, Aug 16, 2024 at 12:22 AM Shlok Kyal  wrote:
> >
>
> Thank you for testing the patch!
>
> I'm slightly hesitant to add a test under src/test/subscription since
> it's a bug in ReorderBuffer and not specific to logical replication.
> If we reasonably cannot add a test under contrib/test_decoding, I'm
> okay with adding it under src/test/subscription.
>

Sounds like a reasonable approach.

> I've attached the updated patch with the commit message (but without a
> test case for now).
>

The patch LGTM except for one minor comment.

+ /* All changes must be returned */
+ Assert(txn->size == 0);

Isn't it better to say: "All changes must be deallocated." in the above comment?

-- 
With Regards,
Amit Kapila.




Re: long-standing data loss bug in initial sync of logical replication

2024-08-20 Thread Amit Kapila
On Thu, Aug 15, 2024 at 9:31 PM vignesh C  wrote:
>
> On Thu, 8 Aug 2024 at 16:24, Shlok Kyal  wrote:
> >
> > On Wed, 31 Jul 2024 at 11:17, Shlok Kyal  wrote:
> > >
> >
> > Created a patch for distributing invalidations.
> > Here we collect the invalidation messages for the current transaction
> > and distribute it to all the inprogress transactions, whenever we are
> > distributing the snapshots..Thoughts?
>
> Since we are applying invalidations to all in-progress transactions,
> the publisher will only replicate half of the transaction data up to
> the point of invalidation, while the remaining half will not be
> replicated.
> Ex:
> Session1:
> BEGIN;
> INSERT INTO tab_conc VALUES (1);
>
> Session2:
> ALTER PUBLICATION regress_pub1 DROP TABLE tab_conc;
>
> Session1:
> INSERT INTO tab_conc VALUES (2);
> INSERT INTO tab_conc VALUES (3);
> COMMIT;
>
> After the above the subscriber data looks like:
> postgres=# select * from tab_conc ;
>  a
> ---
>  1
> (1 row)
>
> You can reproduce the issue using the attached test.
> I'm not sure if this behavior is ok. At present, we’ve replicated the
> first record within the same transaction, but the second and third
> records are being skipped.
>

This can happen even without a concurrent DDL if some of the tables in
the database are part of the publication and others are not. In such a
case, inserts into the published tables will be replicated but other
inserts won't. Sending the partial data of the transaction isn't a
problem to me. Do you have any other concerns that I am missing?
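
For example (table names are made up here, and only t_pub is part of
the publication), the following transaction is already replicated only
partially today, without any DDL involved:

BEGIN;
INSERT INTO t_pub VALUES (1);   -- replicated, t_pub is published
INSERT INTO t_priv VALUES (1);  -- not replicated, t_priv is not published
COMMIT;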

> Would it be better to apply invalidations
> after the transaction is underway?
>

But that won't fix the problem reported by Sawada-san in an email [1].

BTW, we should do some performance testing by having a mix of DML and
DDLs to see the performance impact of this patch.

[1] - 
https://www.postgresql.org/message-id/CAD21AoAenVqiMjpN-PvGHL1N9DWnHSq673bfgr6phmBUzx=k...@mail.gmail.com

-- 
With Regards,
Amit Kapila.




Re: Adding clarification to description of IPC wait events XactGroupUpdate and ProcArrayGroupUpdate

2024-08-20 Thread Amit Kapila
On Thu, Aug 15, 2024 at 8:11 PM Nathan Bossart  wrote:
>
> On Thu, Aug 15, 2024 at 11:25:25AM +0800, SAMEER KUMAR wrote:
> > I think it is important to indicate that the group leader is responsible
> > for clearing the transaction ID/transaction status of other backends
> > (including this one).
>
> Your proposal is
>
> Waiting for the group leader process to clear the transaction ID of
> this backend at the end of a transaction.
>
> Waiting for the group leader process to update the transaction status
> for this backend.
>
> Mine is
>
> Waiting for the group leader to clear the transaction ID at 
> transaction
> end.
>
> Waiting for the group leader to update transaction status at
> transaction end.
>
> IMHO the latter doesn't convey substantially less information, and it fits
> a little better with the terse style of the other wait events nearby.
>

+1 for Nathan's version. It is quite close to the previous version,
for which we haven't heard any complaints since they were introduced.

-- 
With Regards,
Amit Kapila.




Re: [bug fix] prepared transaction might be lost when max_prepared_transactions is zero on the subscriber

2024-08-19 Thread Amit Kapila
On Wed, Aug 14, 2024 at 10:26 AM shveta malik  wrote:
>
> >
> > Thanks for the detailed analysis. I agree with your analysis that we
> > need to reset the origin information for the shutdown path to avoid it
> > being advanced incorrectly. However, the patch doesn't have sufficient
> > comments to explain why we need to reset it for both the ERROR and
> > Shutdown paths. Can we improve the comments in the patch?
> >
> > Also, for the ERROR path, can we reset the origin information in
> > apply_error_callback()?
>
> Please find v4 attached. Addressed comments in that.
>

The patch looks mostly good to me. I have slightly changed a few of
the comments in the attached. What do you think of the attached?

-- 
With Regards,
Amit Kapila.


v5-0001-Don-t-advance-origin-during-apply-failure.patch
Description: Binary data


Re: Conflict detection and logging in logical replication

2024-08-19 Thread Amit Kapila
On Mon, Aug 19, 2024 at 4:16 PM Amit Kapila  wrote:
>
> > Rest looks good.
> >
>
> Thanks for the review and testing.
>

Pushed.

-- 
With Regards,
Amit Kapila.




Re: Conflict detection and logging in logical replication

2024-08-19 Thread Amit Kapila
On Mon, Aug 19, 2024 at 3:03 PM shveta malik  wrote:
>
> On Mon, Aug 19, 2024 at 12:32 PM Zhijie Hou (Fujitsu)
>  wrote:
> >
> >
> > Thanks for reporting the bug. I have fixed it and ran pgindent in V17 patch.
> > I also adjusted few comments and fixed a typo.
> >
>
> Thanks for the patch. Re-tested it, all scenarios seem to work well now.
>
> I see that this version has new header inclusion in conflict.c, due to
> which I think "catalog/index.h" inclusion is now redundant. Please
> recheck and remove if so.
>

This is an extra include, so it has been removed in the attached.
Additionally, I have modified a few comments and the commit message.

> Also, there are few long lines in conflict.c (see line 408, 410).
>

I have left these as they are because pgindent doesn't complain about them.

> Rest looks good.
>

Thanks for the review and testing.

-- 
With Regards,
Amit Kapila.


v18-0001-Log-the-conflicts-while-applying-changes-in-logi.patch
Description: Binary data


Re: Conflict detection and logging in logical replication

2024-08-18 Thread Amit Kapila
On Mon, Aug 19, 2024 at 11:54 AM shveta malik  wrote:
>
> On Mon, Aug 19, 2024 at 11:37 AM Amit Kapila  wrote:
> >
> > On Mon, Aug 19, 2024 at 9:08 AM shveta malik  wrote:
> > >
> > > On Sun, Aug 18, 2024 at 2:27 PM Zhijie Hou (Fujitsu)
> > >  wrote:
> > > >
> > > > Attach the V16 patch which addressed the comments we agreed on.
> > > > I will add a doc patch to explain the log format after the 0001 is RFC.
> > > >
> > >
> > > Thank You for addressing comments. Please see this scenario:
> > >
> > > create table tab1(pk int primary key, val1 int unique, val2 int);
> > >
> > > pub: insert into tab1 values(1,1,1);
> > > sub: insert into tab1 values(2,2,3);
> > > pub: update tab1 set val1=2 where pk=1;
> > >
> > > Wrong 'replica identity' column logged? shouldn't it be pk?
> > >
> > > ERROR:  conflict detected on relation "public.tab1": 
> > > conflict=update_exists
> > > DETAIL:  Key already exists in unique index "tab1_val1_key", modified
> > > locally in transaction 801 at 2024-08-19 08:50:47.974815+05:30.
> > > Key (val1)=(2); existing local tuple (2, 2, 3); remote tuple (1, 2,
> > > 1); replica identity (val1)=(1).
> > >
> >
> > The docs say that by default replica identity is primary_key [1] (see
> > REPLICA IDENTITY),
>
> yes, I agree. But here the importance of dumping it was to know the
> value of RI as well which is being used as an identification of row
> being updated rather than row being conflicted. Value is logged
> correctly.
>

Agreed, sorry, I misunderstood the problem reported. I thought the
suggestion was to use 'primary key' instead of 'replica identity' but
you pointed out that the column used in 'replica identity' was wrong.
We should fix this one.

-- 
With Regards,
Amit Kapila.




Re: Conflict detection and logging in logical replication

2024-08-18 Thread Amit Kapila
On Mon, Aug 19, 2024 at 9:08 AM shveta malik  wrote:
>
> On Sun, Aug 18, 2024 at 2:27 PM Zhijie Hou (Fujitsu)
>  wrote:
> >
> > Attach the V16 patch which addressed the comments we agreed on.
> > I will add a doc patch to explain the log format after the 0001 is RFC.
> >
>
> Thank You for addressing comments. Please see this scenario:
>
> create table tab1(pk int primary key, val1 int unique, val2 int);
>
> pub: insert into tab1 values(1,1,1);
> sub: insert into tab1 values(2,2,3);
> pub: update tab1 set val1=2 where pk=1;
>
> Wrong 'replica identity' column logged? shouldn't it be pk?
>
> ERROR:  conflict detected on relation "public.tab1": conflict=update_exists
> DETAIL:  Key already exists in unique index "tab1_val1_key", modified
> locally in transaction 801 at 2024-08-19 08:50:47.974815+05:30.
> Key (val1)=(2); existing local tuple (2, 2, 3); remote tuple (1, 2,
> 1); replica identity (val1)=(1).
>

The docs say that by default replica identity is primary_key [1] (see
REPLICA IDENTITY), [2] (see pg_class.relreplident). So, using the same
format to display PK seems reasonable. I don't think adding additional
code to distinguish these two cases in the LOG message is worth it. We
can always change such things later if that is what users and/or
others prefer.

[1] - https://www.postgresql.org/docs/devel/sql-altertable.html
[2] - https://www.postgresql.org/docs/devel/catalog-pg-class.html
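
For reference, the replica identity can also be changed from the
default primary key, for example (using the table from the example
above):

-- Default is the primary key; these switch it to a unique index or the full row.
ALTER TABLE tab1 REPLICA IDENTITY USING INDEX tab1_val1_key;
ALTER TABLE tab1 REPLICA IDENTITY FULL;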

-- 
With Regards,
Amit Kapila.




Re: Conflict detection and logging in logical replication

2024-08-15 Thread Amit Kapila
On Fri, Aug 16, 2024 at 11:48 AM shveta malik  wrote:
>
> On Fri, Aug 16, 2024 at 10:46 AM shveta malik  wrote:
> >
> > 3)
> > For update_exists(), we dump:
> > Key (a, b)=(2, 1)
> >
> > For delete_missing, update_missing, update_differ, we dump:
> > Replica identity (a, b)=(2, 1).
> >
> > For update_exists as well, shouldn't we dump 'Replica identity'? Only
> > for insert case, it should be referred as 'Key'.
> >
>
> On rethinking, is it because for update_exists case 'Key' dumped is
> not the one used to search the row to be updated? Instead it is the
> one used to search the conflicting row. Unlike update_differ, the row
> to be updated and the row currently conflicting will be different for
> update_exists case. I earlier thought that 'KEY' and 'Existing local
> tuple' dumped always belong to the row currently being
> updated/deleted/inserted. But for 'update_eixsts', that is not the
> case. We are dumping 'Existing local tuple' and 'Key' for the row
> which is conflicting and not the one being updated.  Example:
>
> ERROR:  conflict detected on relation "public.tab_1": conflict=update_exists
> Key (a, b)=(2, 1); existing local tuple (2, 1); remote tuple (2, 1).
>
> Operations performed were:
> Pub: insert into tab values (1,1);
> Sub: insert into tab values (2,1);
> Pub: update tab set a=2 where a=1;
>
> Here Key and local tuple are both 2,1 instead of 1,1. While replica
> identity value (used to search original row) will be 1,1 only.
>
> It may be slightly confusing or say tricky to understand when compared
> to other conflicts' LOGs. But not sure what better we can do here.
>

The update_exists conflict behaves more like insert_exists, as we
detect it only while inserting into the index. It is also not clear to
me if we can do better than to clarify this in the docs.

> 
>
> One more comment:
>
> 5)
> For insert/update_exists, the sequence is:
> Key .. ; existing local tuple .. ; remote tuple ...
>
> For rest of the conflicts, sequence is:
>  Existing local tuple .. ; remote tuple .. ; replica identity ..
>
> Is it intentional? Shall the 'Key' or 'Replica Identity' be the first
> one to come in all conflicts?
>

This is worth considering, but the Replica Identity signifies the old
tuple values, which is why it is probably kept at the end. But let's
see what Hou-San or others think about this.

-- 
With Regards,
Amit Kapila.




Re: Conflict detection and logging in logical replication

2024-08-15 Thread Amit Kapila
On Fri, Aug 16, 2024 at 10:46 AM shveta malik  wrote:
>
> On Thu, Aug 15, 2024 at 12:47 PM Zhijie Hou (Fujitsu)
>  wrote:
> >
> > Thanks. I have checked and merged the changes. Here is the V15 patch
> > which addressed above comments.
>
> Thanks for the patch. Please find few comments and queries:
>
> 1)
> For various conflicts , we have these in Logs:
> Replica identity (val1)=(30).(for RI on 1 column)
> Replica identity (pk, val1)=(200, 20).  (for RI on  2 columns)
> Replica identity (40, 40, 11).(for RI full)
>
> Shall we have have column list in last case as well, or can simply
> have *full* keyword i.e. Replica identity full (40, 40, 11)
>

I would prefer 'full' instead of the entire column list, as the
complete column list could be long and may not make much sense.

>
> 2)
> For toast column, we dump null in remote-tuple. I know that the toast
> column is not sent in  new-tuple from the publisher and thus the
> behaviour, but it could be misleading for users. Perhaps document
> this?
>

Agreed that we should document this. I suggest that we can have a doc
patch that explains the conflict logging format and in that, we can
mention this behavior as well.

> 3)
> For update_exists(), we dump:
> Key (a, b)=(2, 1)
>
> For delete_missing, update_missing, update_differ, we dump:
> Replica identity (a, b)=(2, 1).
>
> For update_exists as well, shouldn't we dump 'Replica identity'? Only
> for insert case, it should be referred as 'Key'.
>

I think update_exists is quite similar to insert_exists and both
happen due to unique key violation. So, it seems okay to display the
Key for update_exists.

>
> 4)
> Why delete_missing is not having remote_tuple. Is it because we dump
> new tuple as 'remote tuple', which is not relevant for delete_missing?
>

Right.

-- 
With Regards,
Amit Kapila.




Re: Conflict detection and logging in logical replication

2024-08-15 Thread Amit Kapila
On Wed, Aug 14, 2024 at 7:45 PM Michail Nikolaev
 wrote:
>
> > This is as expected, and we have documented this in the code comments. We 
> > don't
> > need to report a conflict if the conflicting tuple has been removed or 
> > updated
> > due to concurrent transaction. The same is true if the transaction that
> > inserted the conflicting tuple is rolled back before 
> > CheckAndReportConflict().
> > We don't consider such cases as a conflict.
>
> That seems a little bit strange to me.
>
> From the perspective of a user, I expect that if a change from publisher is 
> not applied - I need to know about it from the logs.
>

In the above conditions where a concurrent tuple insertion is removed
or rolled back before CheckAndReportConflict, the tuple inserted by
apply will remain. There is no need to report anything in such cases
as apply was successful.

> But in that case, I will not see any information about conflict in the logs 
> in SOME cases. But in OTHER cases I will see it.
> However, in both cases the change from publisher was not applied.
> And these cases are just random and depend on the timing of race conditions. 
> It is not something I am expecting from the database.
>
> Maybe it is better to report about the fact that event from publisher was not 
> applied because of conflict and then try to
> provide additional information about the conflict itself?
>
> Or possibly in case we were unable to apply the event and not able to find 
> the conflict, we should retry the event processing?
>

Per my understanding, either the change will be applied or the conflict
will be logged and retried where required (unique key violation).

-- 
With Regards,
Amit Kapila.




Re: Conflict detection and logging in logical replication

2024-08-14 Thread Amit Kapila
On Wed, Aug 14, 2024 at 8:05 AM Zhijie Hou (Fujitsu)
 wrote:
>
> Here is the V14 patch.
>

Review comments:
1.
ReportApplyConflict()
{
...
+ ereport(elevel,
+ errcode(ERRCODE_INTEGRITY_CONSTRAINT_VIOLATION),
+ errmsg("conflict detected on relation \"%s.%s\": conflict=%s",
+get_namespace_name(RelationGetNamespace(localrel)),
...

Is it a good idea to use ERRCODE_INTEGRITY_CONSTRAINT_VIOLATION for
all conflicts? I think it is okay to use for insert_exists and
update_exists. The other error codes to consider for conflicts other
than insert_exists and update_exists are
ERRCODE_T_R_SERIALIZATION_FAILURE, ERRCODE_CARDINALITY_VIOLATION,
ERRCODE_NO_DATA, ERRCODE_NO_DATA_FOUND,
ERRCODE_TRIGGERED_DATA_CHANGE_VIOLATION,
ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE.

BTW, even for insert/update_exists, won't it be better to use
ERRCODE_UNIQUE_VIOLATION?
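
For context (the table is the one from the example earlier in this
thread; SQLSTATE values are from the standard errcodes list), a plain
unique-key violation is reported to clients as 23505 (unique_violation),
whereas ERRCODE_INTEGRITY_CONSTRAINT_VIOLATION is the more generic
23000:

INSERT INTO tab1 VALUES (1, 1, 1);
INSERT INTO tab1 VALUES (1, 2, 2);
-- ERROR:  duplicate key value violates unique constraint "tab1_pkey"
-- SQLSTATE: 23505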

2.
+build_tuple_value_details()
{
...
+ if (searchslot)
+ {
+ /*
+ * If a valid index OID is provided, build the replica identity key
+ * value string. Otherwise, construct the full tuple value for REPLICA
+ * IDENTITY FULL cases.
+ */

AFAICU, this can't happen for insert/update_exists. If so, we should
add an assert for those two conflict types.

3.
+build_tuple_value_details()
{
...
+/*
+ * Although logical replication doesn't maintain the bitmap for the
+ * columns being inserted, we still use it to create 'modifiedCols'
+ * for consistency with other calls to ExecBuildSlotValueDescription.
+ */
+modifiedCols = bms_union(ExecGetInsertedCols(relinfo, estate),
+ ExecGetUpdatedCols(relinfo, estate));
+desc = ExecBuildSlotValueDescription(relid, remoteslot, tupdesc,
+   modifiedCols, 64);

Can we mention in the comments the reason for not including generated columns?

Apart from the above, the attached contains some cosmetic changes.

-- 
With Regards,
Amit Kapila.
diff --git a/src/backend/replication/logical/conflict.c 
b/src/backend/replication/logical/conflict.c
index 0f51564aa2..c56814cf50 100644
--- a/src/backend/replication/logical/conflict.c
+++ b/src/backend/replication/logical/conflict.c
@@ -1,14 +1,14 @@
 /*-
  * conflict.c
- *Functionality for detecting and logging conflicts.
+ *Support routines for logging conflicts.
  *
  * Copyright (c) 2024, PostgreSQL Global Development Group
  *
  * IDENTIFICATION
  *   src/backend/replication/logical/conflict.c
  *
- * This file contains the code for detecting and logging conflicts on
- * the subscriber during logical replication.
+ * This file contains the code for logging conflicts on the subscriber during
+ * logical replication.
  *-
  */
 
@@ -105,7 +105,7 @@ GetTupleTransactionInfo(TupleTableSlot *localslot, 
TransactionId *xmin,
  * OIDs should not be passed here.
  *
  * The caller must ensure that the index with the OID 'indexoid' is locked so
- * that we can display the conflicting key value.
+ * that we can fetch and display the conflicting key value.
  */
 void
 ReportApplyConflict(EState *estate, ResultRelInfo *relinfo, int elevel,
@@ -403,7 +403,8 @@ build_tuple_value_details(EState *estate, ResultRelInfo 
*relinfo,
  * Helper functions to construct a string describing the contents of an index
  * entry. See BuildIndexValueDescription for details.
  *
- * The caller should ensure that the index with the OID 'indexoid' is locked.
+ * The caller must ensure that the index with the OID 'indexoid' is locked so
+ * that we can fetch and display the conflicting key value.
  */
 static char *
 build_index_value_desc(EState *estate, Relation localrel, TupleTableSlot *slot,
diff --git a/src/include/replication/conflict.h 
b/src/include/replication/conflict.h
index 971dfa98dc..02cb84da7e 100644
--- a/src/include/replication/conflict.h
+++ b/src/include/replication/conflict.h
@@ -1,6 +1,6 @@
 /*-
  * conflict.h
- *Exports for conflict detection and log
+ *Exports for conflicts logging.
  *
  * Copyright (c) 2024, PostgreSQL Global Development Group
  *
@@ -13,7 +13,7 @@
 #include "utils/timestamp.h"
 
 /*
- * Conflict types that could be encountered when applying remote changes.
+ * Conflict types that could occur while applying remote changes.
  */
 typedef enum
 {
@@ -36,9 +36,9 @@ typedef enum
CT_DELETE_MISSING,
 
/*
-* Other conflicts, such as exclusion constraint violations, involve rules
-* that are more complex than simple equality checks. These conflicts are
-* left for future improvements.
+* Other conflicts, such as exclusion constraint violations, involve more
+* complex rules than simple equality checks. These conflicts are left for
+* future improvements.
 */
 } ConflictType;
 


Re: Conflict detection and logging in logical replication

2024-08-13 Thread Amit Kapila
On Tue, Aug 13, 2024 at 10:09 AM Zhijie Hou (Fujitsu)
 wrote:
>
> Here is V13 patch set which addressed above comments.
>

1.
+ReportApplyConflict(int elevel, ConflictType type, EState *estate,
+ ResultRelInfo *relinfo,

The change looks better, but it would still be preferable to keep elevel
and type after relinfo. The same applies to other places as well.
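
In other words, something like this ordering (sketch only; the remaining
parameters are elided):

void ReportApplyConflict(EState *estate, ResultRelInfo *relinfo,
						 int elevel, ConflictType type,
						 TupleTableSlot *searchslot, ...);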

2.
+ * The caller should ensure that the index with the OID 'indexoid' is locked.
+ *
+ * Refer to errdetail_apply_conflict for the content that will be included in
+ * the DETAIL line.
+ */
+void
+ReportApplyConflict(int elevel, ConflictType type, EState *estate,

Is it possible to add an assert to ensure that the index is locked by
the caller?
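
For example, something like the below, assuming we are okay to rely on
CheckRelationOidLockedByMe() in this code path (the lock mode here is a
guess):

Assert(!OidIsValid(indexoid) ||
	   CheckRelationOidLockedByMe(indexoid, RowExclusiveLock, true));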

3.
+static char *
+build_tuple_value_details(EState *estate, ResultRelInfo *relinfo,
+   TupleTableSlot *searchslot,
+   TupleTableSlot *localslot,
+   TupleTableSlot *remoteslot,
+   Oid indexoid)
{
...
...
+ /*
+ * If 'searchslot' is NULL and 'indexoid' is valid, it indicates that we
+ * are reporting the unique constraint violation conflict, in which case
+ * the conflicting key values will be reported.
+ */
+ if (OidIsValid(indexoid) && !searchslot)
+ {
...
...
}

This indirect way of inferring the constraint violation looks fragile.
The caller should pass the required information explicitly and then
you can have the required assertions here.
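
For instance, passing the conflict type itself would allow something
like the following (a sketch only; whether to pass the type or a
separate flag is up to you):

static char *
build_tuple_value_details(EState *estate, ResultRelInfo *relinfo,
						  ConflictType type,
						  TupleTableSlot *searchslot,
						  TupleTableSlot *localslot,
						  TupleTableSlot *remoteslot,
						  Oid indexoid)
{
...
	if (type == CT_INSERT_EXISTS || type == CT_UPDATE_EXISTS)
	{
		Assert(OidIsValid(indexoid) && !searchslot);

		/* report the conflicting key values */
		...
	}
...
}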

Apart from the above, I have made quite a few changes in the code
comments and LOG messages in the attached.

-- 
With Regards,
Amit Kapila.
diff --git a/src/backend/replication/logical/conflict.c b/src/backend/replication/logical/conflict.c
index 8f7e5bfdd4..4995651644 100644
--- a/src/backend/replication/logical/conflict.c
+++ b/src/backend/replication/logical/conflict.c
@@ -79,10 +79,11 @@ GetTupleTransactionInfo(TupleTableSlot *localslot, TransactionId *xmin,
 }
 
 /*
- * Report a conflict when applying remote changes.
+ * This function is used to report a conflict while applying replication
+ * changes.
  *
- * 'searchslot' should contain the tuple used to search for the local tuple to
- * be updated or deleted.
+ * 'searchslot' should contain the tuple used to search the local tuple to be
+ * updated or deleted.
  *
  * 'localslot' should contain the existing local tuple, if any, that conflicts
  * with the remote tuple. 'localxmin', 'localorigin', and 'localts' provide the
@@ -91,17 +92,17 @@ GetTupleTransactionInfo(TupleTableSlot *localslot, TransactionId *xmin,
  * 'remoteslot' should contain the remote new tuple, if any.
  *
  * The 'indexoid' represents the OID of the replica identity index or the OID
- * of the unique index that triggered the constraint violation error. Note that
- * while other indexes may also be used (see
+ * of the unique index that triggered the constraint violation error. We use
+ * this to report the key values for conflicting tuple.
+ *
+ * Note that while other indexes may also be used (see
  * IsIndexUsableForReplicaIdentityFull for details) to find the tuple when
  * applying update or delete, such an index scan may not result in a unique
  * tuple and we still compare the complete tuple in such cases, thus such index
  * OIDs should not be passed here.
  *
- * The caller should ensure that the index with the OID 'indexoid' is locked.
- *
- * Refer to errdetail_apply_conflict for the content that will be included in
- * the DETAIL line.
+ * The caller must ensure that the index with the OID 'indexoid' is locked so
+ * that we can display the conflicting key value.
  */
 void
 ReportApplyConflict(int elevel, ConflictType type, EState *estate,
@@ -157,10 +158,10 @@ InitConflictIndexes(ResultRelInfo *relInfo)
 /*
  * Add an errdetail() line showing conflict detail.
  *
- * The DETAIL line comprises two parts:
+ * The DETAIL line comprises of two parts:
  * 1. Explanation of the conflict type, including the origin and commit
  *timestamp of the existing local tuple.
- * 2. Display of conflicting key, existing local tuple, remote new tuple and
+ * 2. Display of conflicting key, existing local tuple, remote new tuple, and
  *replica identity columns, if any. The remote old tuple is excluded as its
  *information is covered in the replica identity columns.
  */
@@ -196,11 +197,11 @@ errdetail_apply_conflict(ConflictType type, EState *estate,
 
localxmin, timestamptz_to_str(localts));
 
/*
-* The origin which modified the row has been dropped. This
-* situation may occur if the origin was created by a
-* different apply worker, but its associated subscription and
-* origin were dropped after updating the row, or if the
-

Re: [bug fix] prepared transaction might be lost when max_prepared_transactions is zero on the subscriber

2024-08-12 Thread Amit Kapila
On Mon, Aug 12, 2024 at 3:37 PM shveta malik  wrote:
>
> On Fri, Aug 9, 2024 at 2:39 PM Hayato Kuroda (Fujitsu)
>  wrote:
> >
> >
> > > + /*
> > > + * Register a callback to reset the origin state before aborting the
> > > + * transaction in ShutdownPostgres(). This is to prevent the advancement
> > > + * of origin progress if the transaction failed to apply.
> > > + */
> > > + before_shmem_exit(replorigin_reset, (Datum) 0);
> > >
> > > I think we need this despite resetting the origin-related variables in
> > > PG_CATCH block to handle FATAL error cases, right? If so, can we use
> > > PG_ENSURE_ERROR_CLEANUP() instead of PG_CATCH()?
> >
> > There are two reasons to add a shmem-exit callback. One is to support a 
> > FATAL,
> > another one is to support the case that user does the shutdown request while
> > applying changes. In this case, I think ShutdownPostgres() can be called so 
> > that
> > the session origin may advance.
>
> Agree that we need the 'reset' during shutdown flow as well. Details at [1]
>

Thanks for the detailed analysis. I agree with your analysis that we
need to reset the origin information for the shutdown path to avoid it
being advanced incorrectly. However, the patch doesn't have sufficient
comments to explain why we need to reset it for both the ERROR and
Shutdown paths. Can we improve the comments in the patch?

Also, for the ERROR path, can we reset the origin information in
apply_error_callback()?
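
To be clear, the reset I have in mind for the ERROR path is something
like the below (a sketch only; presumably similar to what the patch's
replorigin_reset() already does, using the session-origin state from
origin.c):

/* Reset the session origin state so that it is not advanced further. */
static void
replorigin_reset(int code, Datum arg)
{
	replorigin_session_origin = InvalidRepOriginId;
	replorigin_session_origin_lsn = InvalidXLogRecPtr;
	replorigin_session_origin_timestamp = 0;
}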

BTW, this needs to be backpatched till 16 when it has been introduced
by the parallel apply feature as part of commit 216a7848. So, can we
test this patch in back-branches as well?

-- 
With Regards,
Amit Kapila.




Re: Taking into account syncrep position in flush_lsn reported by apply worker

2024-08-12 Thread Amit Kapila
On Mon, Aug 12, 2024 at 3:43 PM Arseny Sher  wrote:
>
> Sorry for the poor formatting of the message above, this should be better:
>
> Hey. Currently synchronous_commit is disabled for logical apply worker
> on the ground that reported flush_lsn includes only locally flushed data
> so slot (publisher) preserves everything higher than this, and so in
> case of subscriber restart no data is lost. However, imagine that
> subscriber is made highly available by standby to which synchronous
> replication is enabled. Then reported flush_lsn is ignorant of this
> synchronous replication progress, and in case of failover data loss may
> occur if subscriber managed to ack flush_lsn ahead of syncrep.
>

Won't the same be achieved by enabling the synchronous_commit
parameter for a subscription?

-- 
With Regards,
Amit Kapila.




Re: Conflict detection and logging in logical replication

2024-08-12 Thread Amit Kapila
On Fri, Aug 9, 2024 at 12:29 PM Zhijie Hou (Fujitsu)
 wrote:
>
> Here is the V12 patch that improved the log format as discussed.
>

Review comments:
===
1. The patch doesn't display the remote tuple for the delete_differ case.
However, it shows the remote tuple correctly for update_differ. Is
there a reason for the same? See below messages:

update_differ:
--
LOG:  conflict detected on relation "public.t1": conflict=update_differ
DETAIL:  Updating the row containing (c1)=(1) that was modified
locally in transaction 806 at 2024-08-12 11:48:14.970002+05:30.
Existing local tuple (1, 3, arun  ); remote tuple (1, 3,
ajay  ).
...

delete_differ
--
LOG:  conflict detected on relation "public.t1": conflict=delete_differ
DETAIL:  Deleting the row containing (c1)=(1) that was modified by
locally in transaction 809 at 2024-08-12 14:15:41.966467+05:30.
Existing local tuple (1, 3, arun  ).

Note this happens when the publisher table has a REPLICA IDENTITY FULL
and the subscriber table has a primary key. It would be better to keep
the messages consistent. One possibility is that we remove
key/old_tuple from the first line of the DETAIL message and display it
in the second line as Existing local tuple ; remote tuple
<..>; key <...>

2. Similar to the above, the remote tuple is not displayed for
delete_missing but is displayed for the update_missing type of conflict. If
we follow the style mentioned in the previous point then the DETAIL
message: "DETAIL:  Did not find the row containing (c1)=(1) to be
updated." can also be changed to: "DETAIL:  Could not find the row to
be updated." followed by other detail.

3. The DETAIL message for insert_exists is confusing.

ERROR:  conflict detected on relation "public.t1": conflict=insert_exists
DETAIL:  Key (c1)=(1) already exists in unique index "t1_pkey", which
was modified locally in transaction 802 at 2024-08-12
11:11:31.252148+05:30.

It sounds like the key value "(c1)=(1)" in the index is modified. How
about changing slightly as: "Key (c1)=(1) already exists in unique
index "t1_pkey", modified locally in transaction 802 at 2024-08-12
11:11:31.252148+05:30."? Feel free to propose if anything better comes
to your mind.

4.
if (localorigin == InvalidRepOriginId)
+ appendStringInfo(&err_detail, _("Deleting the row containing %s that
was modified by locally in transaction %u at %s."),
+ val_desc, localxmin, timestamptz_to_str(localts));

Typo in the above message. /modified by locally/modified locally

5.
@@ -2661,6 +2662,29 @@ apply_handle_update_internal(ApplyExecutionData *edata,
{
...
found = FindReplTupleInLocalRel(edata, localrel,
&relmapentry->remoterel,
localindexoid,
remoteslot, &localslot);
...
...
+
+ ReportApplyConflict(LOG, CT_UPDATE_DIFFER, relinfo,
+ GetRelationIdentityOrPK(localrel),

To find the tuple, we may have used an index other than Replica
Identity or PK (see IsIndexUsableForReplicaIdentityFull), but while
reporting conflict we don't consider such an index. I think the reason
is that such an index scan wouldn't have resulted in a unique tuple
and that is why we always compare the complete tuple in such cases. Is
that the reason? Can we write a comment to make it clear?
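
Perhaps something along these lines (just a suggested wording):

/*
 * To locate the local tuple, an index other than the replica identity
 * or primary key may have been used (see
 * IsIndexUsableForReplicaIdentityFull).  Such an index scan is not
 * guaranteed to identify a unique tuple, which is why we always compare
 * the complete tuple in those cases and report the conflict using the
 * replica identity or primary key index here.
 */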

6.
void ReportApplyConflict(int elevel, ConflictType type,
+ ResultRelInfo *relinfo, Oid indexoid,
+ TransactionId localxmin,
+ RepOriginId localorigin,
+ TimestampTz localts,
+ TupleTableSlot *searchslot,
+ TupleTableSlot *localslot,
+ TupleTableSlot *remoteslot,
+ EState *estate);

The prototype looks odd with pointers and non-pointer variables in
mixed order. How about arranging parameters in the following order:
Estate, ResultRelInfo, TupleTableSlot *searchslot, TupleTableSlot
*localslot, TupleTableSlot *remoteslot, Oid indexoid, TransactionId
localxmin, RepOriginId localorigin, TimestampTz localts?

7. Like above, check the parameters of other functions like
errdetail_apply_conflict, build_index_value_desc,
build_tuple_value_details, etc.

-- 
With Regards,
Amit Kapila.




Re: Conflict detection and logging in logical replication

2024-08-11 Thread Amit Kapila
On Fri, Aug 9, 2024 at 12:29 PM Zhijie Hou (Fujitsu)
 wrote:
>
> Here is the V12 patch that improved the log format as discussed.
>

*
diff --git a/src/test/subscription/out b/src/test/subscription/out
new file mode 100644
index 00..2b68e9264a
--- /dev/null
+++ b/src/test/subscription/out
@@ -0,0 +1,29 @@
+make -C ../../../src/backend generated-headers
+make[1]: Entering directory '/home/houzj/postgresql/src/backend'
+make -C ../include/catalog generated-headers
+make[2]: Entering directory '/home/houzj/postgresql/src/include/catalog'
+make[2]: Nothing to be done for 'generated-headers'.
+make[2]: Leaving directory '/home/houzj/postgresql/src/include/catalog'
+make -C nodes generated-header-symlinks
+make[2]: Entering directory '/home/houzj/postgresql/src/backend/nodes'
+make[2]: Nothing to be done for 'generated-header-symlinks'.
+make[2]: Leaving directory '/home/houzj/postgresql/src/backend/nodes'
+make -C utils generated-header-symlinks
+make[2]: Entering directory '/home/houzj/postgresql/src/backend/utils'
+make -C adt jsonpath_gram.h
+make[3]: Entering directory '/home/houzj/postgresql/src/backend/utils/adt'
+make[3]: 'jsonpath_gram.h' is up to date.
+make[3]: Leaving directory '/home/houzj/postgresql/src/backend/utils/adt'
+make[2]: Leaving directory '/home/houzj/postgresql/src/backend/utils'
+make[1]: Leaving directory '/home/houzj/postgresql/src/backend'
+rm -rf '/home/houzj/postgresql'/tmp_install
+/usr/bin/mkdir -p '/home/houzj/postgresql'/tmp_install/log
+make -C '../../..' DESTDIR='/home/houzj/postgresql'/tmp_install
install >'/home/houzj/postgresql'/tmp_install/log/install.log 2>&1
+make -j1  checkprep >>'/home/houzj/postgresql'/tmp_install/log/install.log 2>&1
+PATH="/home/houzj/postgresql/tmp_install/home/houzj/pgsql/bin:/home/houzj/postgresql/src/test/subscription:$PATH"
LD_LIBRARY_PATH="/home/houzj/postgresql/tmp_install/home/houzj/pgsql/lib"
INITDB_TEMPLATE='/home/houzj/postgresql'/tmp_install/initdb-template
initdb --auth trust --no-sync --no-instructions --lc-messages=C
--no-clean '/home/houzj/postgresql'/tmp_install/initdb-template
>>'/home/houzj/postgresql'/tmp_install/log/initdb-template.log 2>&1
+echo "# +++ tap check in src/test/subscription +++" && rm -rf
'/home/houzj/postgresql/src/test/subscription'/tmp_check &&
/usr/bin/mkdir -p
'/home/houzj/postgresql/src/test/subscription'/tmp_check && cd . &&
TESTLOGDIR='/home/houzj/postgresql/src/test/subscription/tmp_check/log'
TESTDATADIR='/home/houzj/postgresql/src/test/subscription/tmp_check'
PATH="/home/houzj/postgresql/tmp_install/home/houzj/pgsql/bin:/home/houzj/postgresql/src/test/subscription:$PATH"
LD_LIBRARY_PATH="/home/houzj/postgresql/tmp_install/home/houzj/pgsql/lib"
INITDB_TEMPLATE='/home/houzj/postgresql'/tmp_install/initdb-template
PGPORT='65432' 
top_builddir='/home/houzj/postgresql/src/test/subscription/../../..'
PG_REGRESS='/home/houzj/postgresql/src/test/subscription/../../../src/test/regress/pg_regress'
/usr/bin/prove -I ../../../src/test/perl/ -I .  t/013_partition.pl
+# +++ tap check in src/test/subscription +++
+t/013_partition.pl .. ok
+All tests successful.
+Files=1, Tests=73,  4 wallclock secs ( 0.02 usr  0.00 sys +  0.59
cusr  0.21 csys =  0.82 CPU)
+Result: PASS

The above was added to the patch by mistake. Can you please remove it
from the patch unless there is a reason?

-- 
With Regards,
Amit Kapila.




Re: Fix memory counter update in reorderbuffer

2024-08-08 Thread Amit Kapila
On Thu, Aug 8, 2024 at 9:43 PM Masahiko Sawada  wrote:
>
> On Wed, Aug 7, 2024 at 3:17 PM Amit Kapila  wrote:
> >
> > On Wed, Aug 7, 2024 at 7:42 AM Masahiko Sawada  
> > wrote:
>
> >
> > >
> > > > BTW, commit 5bec1d6bc5e also introduced
> > > > ReorderBufferChangeMemoryUpdate() in ReorderBufferTruncateTXN() which
> > > > is also worth considering while fixing the reported problem. It may
> > > > not have the same problem as you have reported but we can consider
> > > > whether setting txn size as zero explicitly is required or not.
> > >
> > > The reason why it introduced ReorderBufferChangeMemoryUpdate() is the
> > > same as I mentioned above. And yes, as you mentioned, it doesn't have
> > > the same problem that I reported here.
> > >
> >
> > I checked again and found that ReorderBufferResetTXN() first calls
> > ReorderBufferTruncateTXN() and then ReorderBufferToastReset(). After
> > that, it also tries to free spec_insert change which should have the
> > same problem. So, what saves this path from the same problem?
>
> Good catch. I've not created a test case for that but it seems to be
> possible to happen.
>
> I think that subtracting txn->size to reduce the memory counter to
> zero seems to be a wrong idea in the first place. If we want to save
> updating memory counter and max-heap, we should use the exact memory
> size that we freed. In other words, just change the memory usage
> update to a batch operation.
>

Sounds reasonable, but how would you find the size for a batch update
operation? Are you planning to track it while freeing the individual
changes?
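
Just so that we are on the same page, is it something like the below,
say in ReorderBufferTruncateTXN(), that you have in mind (a sketch only;
the helper names are taken from reorderbuffer.c as far as I remember,
and the details may differ)?

dlist_mutable_iter iter;
Size		freed = 0;

dlist_foreach_modify(iter, &txn->changes)
{
	ReorderBufferChange *change;

	change = dlist_container(ReorderBufferChange, node, iter.cur);

	/* track the exact size of what we are about to free */
	freed += ReorderBufferChangeSize(change);

	/* skip the per-change memory accounting update */
	ReorderBufferReturnChange(rb, change, false);
}

/* then a single batch update of the counter and the max-heap */
ReorderBufferChangeMemoryUpdate(rb, NULL, txn, false, freed);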

-- 
With Regards,
Amit Kapila.




Re: [bug fix] prepared transaction might be lost when max_prepared_transactions is zero on the subscriber

2024-08-08 Thread Amit Kapila
On Fri, Aug 9, 2024 at 9:35 AM shveta malik  wrote:
>
> On Thu, Aug 8, 2024 at 6:08 PM Amit Kapila  wrote:
> >
> > >
> > > I think this is a general issue that can occur not only due to 2PC. IIUC, 
> > > this
> > > problem should arise if any ERROR arises after setting the
> > > replorigin_session_origin_lsn but before the CommitTransactionCommand is
> > > completed. If so, I think we should register it for tablesync worker as 
> > > well.
> > >
> >
> > As pointed out earlier, won't using PG_ENSURE_ERROR_CLEANUP() instead
> > of PG_CATCH() be enough?
>
> Yes, I think it should suffice. IIUC, we are going to change
> 'replorigin_session_origin_lsn' only in start_apply() and not before
> that, and thus ensuring its reset during any ERROR or FATAL in
> start_apply() is good enough.
>

Right, I also think so.

> And I guess we don't want this
> origin-reset to be called during regular shutdown, isn't it?
>

Agreed. OTOH, there would be no harm even if such a reset function is invoked.

> But
> registering it through before_shmem_exit() will make the
> reset-function to be called during normal shutdown as well.
>

True, and unless I am missing something, we won't need it. I would like
to hear the opinions of Hou-San and Kuroda-San on the same.

> And to answer my previous question (as Hou-San also  pointed out), we
> do need it in table-sync worker as well. So a change in start_apply
> will make sure the fix is valid for both apply and tablesync worker.
>

As table-sync workers can also apply transactions after the initial
copy, we need it for table-sync during its apply phase.
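
To be specific, what I understand is something like the below in
start_apply() (a sketch that ignores the existing error-context handling
there), which would then cover both the apply and table-sync workers:

static void
start_apply(XLogRecPtr origin_startpos)
{
	PG_ENSURE_ERROR_CLEANUP(replorigin_reset, (Datum) 0);
	{
		LogicalRepApplyLoop(origin_startpos);
	}
	PG_END_ENSURE_ERROR_CLEANUP(replorigin_reset, (Datum) 0);
}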

-- 
With Regards,
Amit Kapila.




Re: Found issues related with logical replication and 2PC

2024-08-08 Thread Amit Kapila
On Fri, Aug 9, 2024 at 10:34 AM Hayato Kuroda (Fujitsu)
 wrote:
>
> >
> > The code changes look mostly good to me. I have changed/added a few
> > comments in the attached modified version.
> >
>
> Thanks for updating the patch! It LGTM. I've tested your patch and confirmed
> it did not cause the data loss.
>

Thanks for the additional testing. I have pushed this patch.

-- 
With Regards,
Amit Kapila.



