Re: Synchronizing slots from primary to standby

Bertrand Drouvot Sat, 06 Apr 2024 09:47:56 -0700

Hi,

On Sat, Apr 06, 2024 at 10:13:00AM +0530, Amit Kapila wrote:
> On Fri, Apr 5, 2024 at 8:05 PM Bertrand Drouvot
> <bertranddrouvot...@gmail.com> wrote:
> 
> I think the new LSN can be visible only when the corresponding WAL is
> written by XLogWrite(). I don't know what in XLogSetAsyncXactLSN() can
> make it visible. In your experiment below, isn't it possible that in
> the meantime WAL writer has written that WAL due to which you are
> seeing an updated location?


What I did is:

session 1:  select pg_current_wal_lsn();\watch 1
session 2:  select pg_backend_pid();

terminal 1: tail -f logfile | grep -i snap 
terminal 2 : gdb -p <backendpid session 2) -ex 'b LogCurrentRunningXacts' + 
"continue" once in gdb

session 2: SELECT pg_log_standby_snapshot();

That produces a break in the gdb session, then:

Breakpoint 1, LogCurrentRunningXacts (CurrRunningXacts=0x5774f92f8da0 
<CurrentRunningXactsData.13>) at standby.c:1346
1346    {
(gdb) n
1350
 
Then next, next until the DEBUG message is emitted (confirmed in terminal 1).

At this stage the DEBUG message shows the new LSN while session 1 still displays
the previous LSN.

Then once XLogSetAsyncXactLSN() is done in the gdb session (terminal 2) then
session 1 displays the new LSN.

This is reproducible as desired.

With more debugging I can see that when the spinlock is released in 
XLogSetAsyncXactLSN()
then XLogWrite() is doing its job and then session 1 does see the new value 
(that
happens in this order, and as you said that's expected).

My point is that while the DEBUG message is emitted session 1 still 
see the old LSN (until the new LSN is vsible). I think that we should emit the
DEBUG message once session 1 can see the new value (If not, I think the 
timestamp
of the DEBUG message can be missleading during debugging purpose).

> I think I am missing how exactly moving DEBUG2 can confirm the above theory.

I meant to say that instead of seeing:

2024-04-05 04:37:05.074 UTC [3854278][background writer][:0] DEBUG:  snapshot 
of 0+0 running transaction ids (lsn 0/3000098 oldest xid 740 latest complete 
739 next xid 740)
2024-04-05 04:37:05.197 UTC [3866475][client backend][2/4:0] LOG:  statement: 
SELECT '0/3000060' <= replay_lsn AND state = 'streaming'

We would probably see something like:

2024-04-05 04:37:05.<something> UTC [3866475][client backend][2/4:0] LOG:  
statement: SELECT '0/3000060' <= replay_lsn AND state = 'streaming'
2024-04-05 04:37:05.<something>+xx UTC [3854278][background writer][:0] DEBUG:  
snapshot of 0+0 running transaction ids (lsn 0/3000098 oldest xid 740 latest 
complete 739 next xid 740)

And then it would be clear that the query has ran before the new LSN is visible.

> > If the theory is proven then I'm not sure we need the extra complexity of
> > injection point here, maybe just relying on the slots created via SQL API 
> > could
> > be enough.
> >
> 
> Yeah, that could be the first step. We can probably add an injection
> point to control the bgwrite behavior and then add tests involving
> walsender performing the decoding. But I think it is important to have
> sufficient tests in this area as I see they are quite helpful in
> uncovering the issues.
>

Yeah agree. 

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Synchronizing slots from primary to standby

Reply via email to