On Thu, Jun 1, 2023 6:54 PM Melih Mutlu <m.melihmu...@gmail.com> wrote:
> 
> Hi,
> 
> I rebased the patch and addressed the following reviews.
> 

Thanks for updating the patch. Here are some comments on 0001 patch.

1.
-       ereport(LOG,
-                       (errmsg("logical replication table synchronization 
worker for subscription \"%s\", table \"%s\" has finished",
-                                       MySubscription->name,
-                                       
get_rel_name(MyLogicalRepWorker->relid))));

Could we move this to somewhere else instead of removing it?

2.
+       if (!OidIsValid(originid))
+               originid = replorigin_create(originname);
+       replorigin_session_setup(originid, 0);
+       replorigin_session_origin = originid;
+       *origin_startpos = replorigin_session_get_progress(false);
+       CommitTransactionCommand();
+
+       /* Is the use of a password mandatory? */
+       must_use_password = MySubscription->passwordrequired &&
+               !superuser_arg(MySubscription->owner);
+       LogRepWorkerWalRcvConn = walrcv_connect(MySubscription->conninfo, true,
+                                                                               
        must_use_password,
+                                                                               
        MySubscription->name, &err);

It seems that there is a problem when refactoring.
See commit e7e7da2f8d5.

3.
+       /* Set this to false for safety, in case we're already reusing the 
worker */
+       MyLogicalRepWorker->ready_to_reuse = false;
+

I am not sure do we need to lock when setting it.

4.
+       /*
+        * Allocate the origin name in long-lived context for error context
+        * message.
+        */
+       StartTransactionCommand();
+       ReplicationOriginNameForLogicalRep(MySubscription->oid,
+                                                                          
MyLogicalRepWorker->relid,
+                                                                          
originname,
+                                                                          
originname_size);
+       CommitTransactionCommand();

Do we need the call to StartTransactionCommand() and CommitTransactionCommand()
here? Besides, the comment here is the same as the comment atop
set_apply_error_context_origin(), do we need it?

5.
I saw a segmentation fault when debugging.

It happened when calling sync_worker_exit() called (see the code below in
LogicalRepSyncTableStart()). In the case that this is not the first table the
worker synchronizes, clean_sync_worker() has been called before (in
TablesyncWorkerMain()), and LogRepWorkerWalRcvConn has been set to NULL. Then, a
segmentation fault happened because LogRepWorkerWalRcvConn is a null pointer.

        switch (relstate)
        {
                case SUBREL_STATE_SYNCDONE:
                case SUBREL_STATE_READY:
                case SUBREL_STATE_UNKNOWN:
                        sync_worker_exit();     /* doesn't return */
        }

Here is the backtrace.

#0  0x00007fc8a8ce4c95 in libpqrcv_disconnect (conn=0x0) at 
libpqwalreceiver.c:757
#1  0x000000000092b8c0 in clean_sync_worker () at tablesync.c:150
#2  0x000000000092b8ed in sync_worker_exit () at tablesync.c:164
#3  0x000000000092d8f6 in LogicalRepSyncTableStart 
(origin_startpos=0x7ffd50f30f08) at tablesync.c:1293
#4  0x0000000000934f76 in start_table_sync (origin_startpos=0x7ffd50f30f08, 
myslotname=0x7ffd50f30e80) at worker.c:4457
#5  0x000000000093513b in run_tablesync_worker (options=0x7ffd50f30ec0, 
slotname=0x0, originname=0x7ffd50f30f10 "pg_16394_16395",
    originname_size=64, origin_startpos=0x7ffd50f30f08) at worker.c:4532
#6  0x0000000000935a3a in TablesyncWorkerMain (main_arg=1) at worker.c:4853
#7  0x00000000008e97f6 in StartBackgroundWorker () at bgworker.c:864
#8  0x00000000008f350b in do_start_bgworker (rw=0x12fc1a0) at postmaster.c:5762
#9  0x00000000008f38b7 in maybe_start_bgworkers () at postmaster.c:5986
#10 0x00000000008f2975 in process_pm_pmsignal () at postmaster.c:5149
#11 0x00000000008ee98a in ServerLoop () at postmaster.c:1770
#12 0x00000000008ee3bb in PostmasterMain (argc=3, argv=0x12c4af0) at 
postmaster.c:1463
#13 0x00000000007b6d3a in main (argc=3, argv=0x12c4af0) at main.c:198


The steps to reproduce: 
Worker1, in TablesyncWorkerMain(), the relstate of new table to sync (obtained
by GetSubscriptionRelations) is SUBREL_STATE_INIT, and in the foreach loop,
before the following Check (it needs a breakpoint before locking),

                        LWLockAcquire(LogicalRepWorkerLock, LW_EXCLUSIVE);
                        if (rstate->state != SUBREL_STATE_SYNCDONE &&
                                !logicalrep_worker_find(MySubscription->oid, 
rstate->relid, false))
                        {
                                /* Update worker state for the next table */
                                MyLogicalRepWorker->relid = rstate->relid;
                                MyLogicalRepWorker->relstate = rstate->state;
                                MyLogicalRepWorker->relstate_lsn = rstate->lsn;
                                LWLockRelease(LogicalRepWorkerLock);
                                break;
                        }
                        LWLockRelease(LogicalRepWorkerLock);

let this table to be synchronized by another table sync worker (Worker2), and
Worker2 has finished before logicalrep_worker_find was called(). Then Worker1
tried to sync a table whose state is SUBREL_STATE_READY and the segmentation
fault happened.

Regards,
Shi Yu

Reply via email to