To reproduce the subscription-startup hang that Thomas Munro observed,
I changed src/backend/replication/logical/launcher.c like this:

@@ -427,7 +427,8 @@ retry:
        bgw.bgw_notify_pid = MyProcPid;
        bgw.bgw_main_arg = Int32GetDatum(slot);
 
-       if (!RegisterDynamicBackgroundWorker(&bgw, &bgw_handle))
+       if (random() < 1000000000 ||
+               !RegisterDynamicBackgroundWorker(&bgw, &bgw_handle))
        {
                /* Failed to start worker, so clean up the worker slot. */
                LWLockAcquire(LogicalRepWorkerLock, LW_EXCLUSIVE);

This causes about 50% of worker launch requests to fail.
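
As a rough check on that figure: assuming random() is the POSIX function, which returns values in [0, 2^31 - 1], the threshold of 1000000000 should trip a little under half the time. A minimal sketch (the helper names here are made up for illustration and are not part of launcher.c):

```c
#define _DEFAULT_SOURCE			/* expose random() under strict -std modes */
#include <assert.h>
#include <stdlib.h>

/* Assumption: POSIX random() is uniform over [0, 2^31 - 1]. */
#define RANDOM_RANGE 2147483648.0

/* Expected fraction of calls for which "random() < threshold" is true. */
static double
forced_failure_fraction(long threshold)
{
	assert(threshold >= 0 && threshold <= 2147483647L);
	return (double) threshold / RANDOM_RANGE;
}

/* Empirical version: count how often the injected condition fires. */
static double
measured_failure_fraction(long threshold, int trials)
{
	int			failures = 0;

	assert(trials > 0);
	for (int i = 0; i < trials; i++)
	{
		if (random() < threshold)
			failures++;
	}
	return (double) failures / trials;
}
```

Under that assumption, forced_failure_fraction(1000000000L) comes out near 0.47, which is close enough to "about 50%" for the purpose of provoking the failure path.
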
With the fix I just committed, 002_types.pl gets through fine,
but 005_encoding.pl does not; it sometimes fails like this:

t/005_encoding.pl ..... 1/1 
#   Failed test 'data replicated to subscriber'
#   at t/005_encoding.pl line 49.
#          got: ''
#     expected: '1'
# Looks like you failed 1 test of 1.
t/005_encoding.pl ..... Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/1 subtests 

The reason seems to be that its method of waiting for replication
to happen is completely inapposite.  It's watching for the master
to say that the slave has received all the WAL, but that does not
ensure that the logicalrep apply workers have caught up, does it?

                        regards, tom lane

