On Thu, Jun 1, 2023 6:54 PM Melih Mutlu <m.melihmu...@gmail.com> wrote: > > Hi, > > I rebased the patch and addressed the following reviews. >
Thanks for updating the patch. Here are some comments on 0001 patch. 1. - ereport(LOG, - (errmsg("logical replication table synchronization worker for subscription \"%s\", table \"%s\" has finished", - MySubscription->name, - get_rel_name(MyLogicalRepWorker->relid)))); Could we move this to somewhere else instead of removing it? 2. + if (!OidIsValid(originid)) + originid = replorigin_create(originname); + replorigin_session_setup(originid, 0); + replorigin_session_origin = originid; + *origin_startpos = replorigin_session_get_progress(false); + CommitTransactionCommand(); + + /* Is the use of a password mandatory? */ + must_use_password = MySubscription->passwordrequired && + !superuser_arg(MySubscription->owner); + LogRepWorkerWalRcvConn = walrcv_connect(MySubscription->conninfo, true, + must_use_password, + MySubscription->name, &err); It seems that there is a problem when refactoring. See commit e7e7da2f8d5. 3. + /* Set this to false for safety, in case we're already reusing the worker */ + MyLogicalRepWorker->ready_to_reuse = false; + I am not sure do we need to lock when setting it. 4. + /* + * Allocate the origin name in long-lived context for error context + * message. + */ + StartTransactionCommand(); + ReplicationOriginNameForLogicalRep(MySubscription->oid, + MyLogicalRepWorker->relid, + originname, + originname_size); + CommitTransactionCommand(); Do we need the call to StartTransactionCommand() and CommitTransactionCommand() here? Besides, the comment here is the same as the comment atop set_apply_error_context_origin(), do we need it? 5. I saw a segmentation fault when debugging. It happened when calling sync_worker_exit() called (see the code below in LogicalRepSyncTableStart()). In the case that this is not the first table the worker synchronizes, clean_sync_worker() has been called before (in TablesyncWorkerMain()), and LogRepWorkerWalRcvConn has been set to NULL. Then, a segmentation fault happened because LogRepWorkerWalRcvConn is a null pointer. switch (relstate) { case SUBREL_STATE_SYNCDONE: case SUBREL_STATE_READY: case SUBREL_STATE_UNKNOWN: sync_worker_exit(); /* doesn't return */ } Here is the backtrace. #0 0x00007fc8a8ce4c95 in libpqrcv_disconnect (conn=0x0) at libpqwalreceiver.c:757 #1 0x000000000092b8c0 in clean_sync_worker () at tablesync.c:150 #2 0x000000000092b8ed in sync_worker_exit () at tablesync.c:164 #3 0x000000000092d8f6 in LogicalRepSyncTableStart (origin_startpos=0x7ffd50f30f08) at tablesync.c:1293 #4 0x0000000000934f76 in start_table_sync (origin_startpos=0x7ffd50f30f08, myslotname=0x7ffd50f30e80) at worker.c:4457 #5 0x000000000093513b in run_tablesync_worker (options=0x7ffd50f30ec0, slotname=0x0, originname=0x7ffd50f30f10 "pg_16394_16395", originname_size=64, origin_startpos=0x7ffd50f30f08) at worker.c:4532 #6 0x0000000000935a3a in TablesyncWorkerMain (main_arg=1) at worker.c:4853 #7 0x00000000008e97f6 in StartBackgroundWorker () at bgworker.c:864 #8 0x00000000008f350b in do_start_bgworker (rw=0x12fc1a0) at postmaster.c:5762 #9 0x00000000008f38b7 in maybe_start_bgworkers () at postmaster.c:5986 #10 0x00000000008f2975 in process_pm_pmsignal () at postmaster.c:5149 #11 0x00000000008ee98a in ServerLoop () at postmaster.c:1770 #12 0x00000000008ee3bb in PostmasterMain (argc=3, argv=0x12c4af0) at postmaster.c:1463 #13 0x00000000007b6d3a in main (argc=3, argv=0x12c4af0) at main.c:198 The steps to reproduce: Worker1, in TablesyncWorkerMain(), the relstate of new table to sync (obtained by GetSubscriptionRelations) is SUBREL_STATE_INIT, and in the foreach loop, before the following Check (it needs a breakpoint before locking), LWLockAcquire(LogicalRepWorkerLock, LW_EXCLUSIVE); if (rstate->state != SUBREL_STATE_SYNCDONE && !logicalrep_worker_find(MySubscription->oid, rstate->relid, false)) { /* Update worker state for the next table */ MyLogicalRepWorker->relid = rstate->relid; MyLogicalRepWorker->relstate = rstate->state; MyLogicalRepWorker->relstate_lsn = rstate->lsn; LWLockRelease(LogicalRepWorkerLock); break; } LWLockRelease(LogicalRepWorkerLock); let this table to be synchronized by another table sync worker (Worker2), and Worker2 has finished before logicalrep_worker_find was called(). Then Worker1 tried to sync a table whose state is SUBREL_STATE_READY and the segmentation fault happened. Regards, Shi Yu