On Thu, 10 Aug 2023 at 10:16, Amit Kapila <amit.kapil...@gmail.com> wrote:
>
> On Wed, Aug 9, 2023 at 8:28 AM Zhijie Hou (Fujitsu)
> <houzj.f...@fujitsu.com> wrote:
> >
> > On Thursday, August 3, 2023 7:30 PM Melih Mutlu <m.melihmu...@gmail.com> wrote:
> > >
> > > Right. I attached the v26 as you asked.
> >
> > Thanks for posting the patches.
> >
> > While reviewing the patch, I noticed one rare case where it's possible
> > that there are two table sync workers for the same table at the same time.
> >
> > The patch relies on LogicalRepWorkerLock to prevent concurrent access, but
> > the apply worker will start a new worker after releasing the lock. So, at
> > the point[1] where the lock has been released and the new table sync worker
> > has not yet been started, it seems possible that another old table sync
> > worker will be reused for the same table.
> >
> >         /* Now safe to release the LWLock */
> >         LWLockRelease(LogicalRepWorkerLock);
> > *[1]
> >         /*
> >          * If there are free sync worker slot(s), start a new sync
> >          * worker for the table.
> >          */
> >         if (nsyncworkers < max_sync_workers_per_subscription)
> >         ...
> >             logicalrep_worker_launch(MyLogicalRepWorker->dbid,
>
> Yeah, this is a problem. I think one idea to solve this is by extending the
> lock duration till we launch the tablesync worker, but we should also
> consider changing this locking scheme such that there is a better way to
> indicate that, for a particular rel, tablesync is in progress. Currently,
> the code in TablesyncWorkerMain() also acquires the lock in exclusive mode
> even though the tablesync for a rel is in progress, which I guess could
> easily hurt us for larger values of max_logical_replication_workers. So,
> that could be another motivation to think of a different locking scheme.
There are a couple of ways in which this issue can be solved:

Approach #1) In logicalrep_worker_launch, while holding LogicalRepWorkerLock,
check that a reused worker has not already picked up this table for table
sync. If a reused worker has already picked it up for processing, simply
ignore the request and return; nothing has to be done by the launcher in this
case. (A rough sketch of this check is at the end of this mail.)

Approach #2)
a) The apply worker creates a shared memory list of all the relations that
need to be synced.
b) A tablesync worker takes a lock on this shared memory and picks the next
table to be processed (so the tablesync worker need not fetch the
subscription relations again and again).
c) The tablesync worker updates the status of the relation in shared memory
(since the lock is held, there will be no concurrency issues) and also
records the start time there; this helps avoid restarting a failed table
before wal_retrieve_retry_interval has expired.
d) The tablesync worker syncs the table.
e) When the subscription relation is marked as ready, the tablesync worker
removes the entry from shared memory.
f) The apply worker periodically synchronizes the shared memory entries to
keep them in sync with the fetched subscription relation tables.
g) When the apply worker exits, the shared memory is cleared.
(A rough sketch of the shared memory layout is also at the end of this mail.)

Approach #2 will also help in solving the other issue reported by Amit at [1].

I felt we can use Approach #2 to solve the problem, as it solves both of the
reported issues and has the added advantage that the reused tablesync worker
need not scan pg_subscription_rel to find a non-ready table on every run;
instead it can use the list prepared by the apply worker.

Thoughts?

[1] - https://www.postgresql.org/message-id/CAA4eK1KyHfVOVeio28p8CHDnuyKuej78cj_7U9mHAB4ictVQwQ%40mail.gmail.com

Regards,
Vignesh
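
For Approach #1, a minimal sketch of the extra check follows, assuming it is
placed inside logicalrep_worker_launch() (launcher.c) right after the
function has taken LogicalRepWorkerLock in exclusive mode; the exact
placement, and the return value (the function may return a bool in the
patched code), would of course have to match the patch:

    /*
     * The launcher may be asked to start a tablesync worker for a
     * relation that a reused tablesync worker has picked up in the
     * meantime.  We already hold LogicalRepWorkerLock here, so re-check
     * before registering a new worker and bail out if the table is
     * already taken.
     */
    if (OidIsValid(relid) &&
        logicalrep_worker_find(subid, relid, false) != NULL)
    {
        LWLockRelease(LogicalRepWorkerLock);
        return;                 /* the table is already being handled */
    }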
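
For Approach #2, the shared memory could look something like the sketch
below. All of the names here (SyncingRelsCtl, SyncingRelState,
pick_next_syncing_rel) are made up only for illustration, and whether the
area lives in static shared memory or in a DSM segment set up by the apply
worker, and which lock protects it, can be decided separately:

    #include "postgres.h"

    #include "access/xlog.h"            /* wal_retrieve_retry_interval */
    #include "storage/lwlock.h"
    #include "utils/timestamp.h"

    /* Per-relation state maintained by the apply worker (steps a and f). */
    typedef enum SyncingRelStatus
    {
        SYNC_REL_UNCLAIMED,             /* waiting for a tablesync worker */
        SYNC_REL_IN_PROGRESS,           /* a tablesync worker is copying it */
        SYNC_REL_DONE                   /* reached READY; entry can be removed */
    } SyncingRelStatus;

    typedef struct SyncingRelState
    {
        Oid         relid;              /* relation to be synced */
        SyncingRelStatus status;
        TimestampTz last_start_time;    /* honors wal_retrieve_retry_interval */
    } SyncingRelState;

    typedef struct SyncingRelsCtl
    {
        int         nrels;              /* number of valid entries */
        SyncingRelState rels[FLEXIBLE_ARRAY_MEMBER];
    } SyncingRelsCtl;

    /*
     * Steps b and c: a tablesync worker picks the next unclaimed relation
     * and marks it in-progress while holding the lock, so two workers can
     * never claim the same table.  Returns InvalidOid if there is nothing
     * to do right now.
     */
    static Oid
    pick_next_syncing_rel(SyncingRelsCtl *ctl, LWLock *lock)
    {
        Oid         relid = InvalidOid;
        TimestampTz now = GetCurrentTimestamp();

        LWLockAcquire(lock, LW_EXCLUSIVE);
        for (int i = 0; i < ctl->nrels; i++)
        {
            SyncingRelState *rel = &ctl->rels[i];

            if (rel->status == SYNC_REL_UNCLAIMED &&
                TimestampDifferenceExceeds(rel->last_start_time, now,
                                           wal_retrieve_retry_interval))
            {
                rel->status = SYNC_REL_IN_PROGRESS;
                rel->last_start_time = now;
                relid = rel->relid;
                break;
            }
        }
        LWLockRelease(lock);

        return relid;
    }

If a tablesync worker fails, its entry could simply be flipped back to
SYNC_REL_UNCLAIMED; the recorded last_start_time then ensures the table is
not retried before wal_retrieve_retry_interval has expired.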