Hi,

On 12/5/23 6:08 AM, shveta malik wrote:
On Mon, Dec 4, 2023 at 10:07 PM Drouvot, Bertrand
<bertranddrouvot...@gmail.com> wrote:
Maybe another option could be to have the walreceiver a way to let the slot sync
worker knows that it (the walreceiver) was not able to start due to non existing
replication slot on the primary? (that way we'd avoid the slot sync worker 
having
to talk to the primary).

Few points:
1) I think if we do it, we should do it in generic way i.e. slotsync
worker should go to no-op if walreceiver is not able to start due to
any reason and not only due to invalid primary_slot_name.

Agree.

2) Secondly, slotsync worker needs to make sure it has synced the
slots so far i.e. worker should not go to no-op immediately on seeing
missing WalRcv process if there are pending slots to be synced.

Agree.

So the generic way I see to have this optimization is:
1) Slotsync worker can use 'WalRcv->pid' to figure out if WalReceiver
is running or not.

Not sure that would work because the walreceiver keeps try re-starting
and so get a pid before reaching the "could not start WAL streaming: ERROR:  replication slot 
"XXXX" does not exist"
error.

We may want to add an extra check on walrcv->walRcvState (or should/could be 
enough by its own).
But walrcv->walRcvState is set to WALRCV_STREAMING way before 
walrcv_startstreaming().

Wouldn't that make sense to move it once we are sure that
walrcv_startstreaming() returns true and first_stream is true, here?

"
                        if (first_stream)
+                       {
                                ereport(LOG,
                                                (errmsg("started streaming WAL from 
primary at %X/%X on timeline %u",
                                                                
LSN_FORMAT_ARGS(startpoint), startpointTLI)));
+                               SpinLockAcquire(&walrcv->mutex);
+                               walrcv->walRcvState = WALRCV_STREAMING;
+                               SpinLockRelease(&walrcv->mutex);
+                       }
"

2) Slotsync worker should check null 'WalRcv->pid' only when
no-activity is observed for threshold time i.e. it can do it during
existing logic of increasing naptime.
3) On finding null  'WalRcv->pid', worker can mark a flag to go to
no-op unless WalRcv->pid becomes valid again. Marking this flag during
increasing naptime will guarantee that the worker has taken all the
changes so far i.e. standby is not lagging in terms of slots.


2) and 3) looks good to me but with a check on walrcv->walRcvState
looking for WALRCV_STREAMING state instead of looking for a non null
WalRcv->pid.

And only if it makes sense to move the walrcv->walRcvState = WALRCV_STREAMING as
mentioned above (I think it does).

Thoughts?

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


Reply via email to