On Thursday, April 11, 2024 12:11 PM Amit Kapila <amit.kapil...@gmail.com> wrote: > > On Wed, Apr 10, 2024 at 5:28 PM Zhijie Hou (Fujitsu) <houzj.f...@fujitsu.com> > wrote: > > > > On Thursday, April 4, 2024 5:37 PM Amit Kapila <amit.kapil...@gmail.com> > wrote: > > > > > > BTW, while thinking on this one, I > > > noticed that in the function LogicalConfirmReceivedLocation(), we > > > first update the disk copy, see comment [1] and then in-memory > > > whereas the same is not true in > > > update_local_synced_slot() for the case when snapshot exists. Now, > > > do we have the same risk here in case of standby? Because I think we > > > will use these xmins while sending the feedback message (in > XLogWalRcvSendHSFeedback()). > > > > > > * We have to write the changed xmin to disk *before* we change > > > * the in-memory value, otherwise after a crash we wouldn't know > > > * that some catalog tuples might have been removed already. > > > > Yes, I think we have the risk on the standby, I can reproduce the case > > that if the server crashes after updating the in-memory value and > > before saving them to disk, the synced slot could be invalidated after > > restarting from crash, because the necessary rows have been removed on > > the primary. The steps can be found in [1]. > > > > I think we'd better fix the order in update_local_synced_slot() as > > well. I tried to make the fix in 0002, 0001 is Shveta's patch to fix > > another issue in this thread. Since they are touching the same function, so > attach them together for review. > > > > Few comments: > =============== > 1. > + > + /* Sanity check */ > + if (slot->data.confirmed_flush != remote_slot->confirmed_lsn) > + ereport(LOG, errmsg("synchronized confirmed_flush for slot \"%s\" > + differs from > remote slot", > + remote_slot->name), > > Is there a reason to use elevel as LOG instead of ERROR? I think it should be > elog(ERROR, .. as this is an unexpected case.
Agreed. > > 2. > - if (remote_slot->restart_lsn < slot->data.restart_lsn) > + if (remote_slot->confirmed_lsn < slot->data.confirmed_flush) > elog(ERROR, > "cannot synchronize local slot \"%s\" LSN(%X/%X)" > > Can we be more specific in this message? How about splitting it into > error_message as "cannot synchronize local slot \"%s\"" and then errdetail as > "Local slot's start streaming location LSN(%X/%X) is ahead of remote slot's > LSN(%X/%X)"? Your version looks better. Since the above two messages all have errdetail, I used the style of ereport(ERROR, errmsg_internal(), errdetail_internal()... in the patch which is equal to the elog(ERROR but has an additional detail message. Here is V5 patch set. Best Regards, Hou zj
v5-0002-write-the-changed-xmin-to-disk-before-computing-g.patch
Description: v5-0002-write-the-changed-xmin-to-disk-before-computing-g.patch
v5-0001-Correct-sanity-check-to-compare-confirmed_lsn.patch
Description: v5-0001-Correct-sanity-check-to-compare-confirmed_lsn.patch