On Tue, Jan 24, 2023 7:41 PM Amit Kapila <[email protected]> wrote: > > On Tue, Jan 24, 2023 at 8:53 AM Andres Freund <[email protected]> wrote: > > > > cfbot, the buildfarm and locally I have seen 100_bugs.pl fail > > occasionally. Just rarely enough that I never got around to looking into it > > for real. > > > ... > > > > We see t2 added to the publication: > > 2023-01-24 00:57:30.099 UTC [73654][client backend] [100_bugs.pl][7/5:0] > LOG: statement: ALTER PUBLICATION testpub ADD TABLE t2 > > > > And that *then* "t" was synchronized: > > 2023-01-24 00:57:30.102 UTC [73640][logical replication worker] LOG: > > logical > replication table synchronization worker for subscription "testsub", table > "t" has > finished > > > > and then that the refresh was issued: > > 2023-01-24 00:57:30.128 UTC [73657][client backend] [100_bugs.pl][5/10:0] > LOG: statement: ALTER SUBSCRIPTION testsub REFRESH PUBLICATION > > > > And we see a walsender starting and the query to get the new tables being > executed: > > 2023-01-24 00:57:30.139 UTC [73660][walsender] [testsub][6/8:0] LOG: > statement: SELECT DISTINCT t.schemaname, t.tablename > > , t.attnames > > FROM pg_catalog.pg_publication_tables t > > WHERE t.pubname IN ('testpub') > > > > > > And that's it, the rest of the time is just polling. > > > > > > Perhaps wait_for_subscription_sync() should dump the set of rel states to > make > > something like this more debuggable? > > > > > > The fact that the synchronization for t finished just before the refresh > > makes > > me wonder if a wakeup or a cache invalidation got lost? > > > > From the LOGs, the only thing one could draw is lost invalidation > because the nap time of the apply worker is 1s, so it should process > invalidation during the time we are polling. Also, the rel should be > added to pg_subscription_rel because the test is still polling for > rels to be in 'ready' or 'done' state. > > I think we can do three things to debug (a) as you suggest dump the > rel state in wait_for_subscription_sync; (b) add some DEBUG log in > invalidate_syncing_table_states() to ensure that invalidation has been > processed; (c) print rel states and relids from table_states_not_ready > in process_syncing_tables_for_apply() to see if t2 has ever appeared > in that list. >
There was a similar buildfarm failure on francolin recently[1]. I think the
problem is that after REFRESH PUBLICATION, the table sync worker for the new
table test_mismatching_types was not started. So, the test timed out while
waiting for an ERROR message that should have been reported by the table sync
worker.
--
regress_log_014_binary:
timed out waiting for match: (?^:ERROR: ( [A-Z0-9]+:)? incorrect binary data
format) at
/home/bf/bf-build/francolin/HEAD/pgsql/src/test/subscription/t/014_binary.pl
line 269.
014_binary_subscriber.log:
2023-04-16 18:18:38.455 UTC [3079482] 014_binary.pl LOG: statement: ALTER
SUBSCRIPTION tsub REFRESH PUBLICATION;
2023-04-16 18:21:39.219 UTC [3079474] ERROR: could not receive data from WAL
stream: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
--
I wrote a patch to dump rel state in wait_for_subscription_sync() as suggested.
Please see the attached patch.
I will try to add some debug logs in code later.
[1]https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=francolin&dt=2023-04-16%2018%3A17%3A09
Regards,
Shi Yu
v1-0001-dump-rel-state-in-wait_for_subscription_sync.patch
Description: v1-0001-dump-rel-state-in-wait_for_subscription_sync.patch
