On Fri, 26 Jan 2024 at 11:30, Alexander Lakhin <exclus...@gmail.com> wrote: > > Hello hackers, > > After determining a possible cause for intermittent failures of the test > subscription/031_column_list [1], I was wondering what makes another > subscription test (014_binary) fail on the buildfarm: > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=snakefly&dt=2024-01-22%2001%3A19%3A03 > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon&dt=2024-01-14%2018%3A19%3A20 > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=piculet&dt=2023-12-21%2001%3A11%3A52 > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon&dt=2023-11-27%2001%3A42%3A39 > > All those failures caused by a timeout when waiting for a message expected > in _subscriber.log. For example, in the snakefly's case: > [01:14:46.158](1.937s) ok 7 - check synced data on subscriber with custom type > timed out waiting for match: (?^:ERROR: ( [A-Z0-9]+:)? incorrect binary data > format) at > /home/bf/bf-build/piculet/HEAD/pgsql/src/test/subscription/t/014_binary.pl > line 269. > > _subscriber.log contains: > 2023-12-21 01:14:46.215 UTC [410039] 014_binary.pl LOG: statement: ALTER > SUBSCRIPTION tsub REFRESH PUBLICATION; > 2023-12-21 01:17:46.756 UTC [409999] ERROR: could not receive data from WAL > stream: server closed the connection > unexpectedly > This probably means the server terminated abnormally > before or while processing the request. > 2023-12-21 01:17:46.760 UTC [405057] LOG: background worker "logical > replication apply worker" (PID 409999) exited with > exit code 1 > 2023-12-21 01:17:46.779 UTC [532857] LOG: logical replication apply worker > for subscription "tsub" has started > ... > > While _subscriber.log from a successful test run contains: > 2024-01-26 03:49:07.065 UTC [9726:5] 014_binary.pl LOG: statement: ALTER > SUBSCRIPTION tsub REFRESH PUBLICATION; > 2024-01-26 03:49:07.075 UTC [9726:6] 014_binary.pl LOG: disconnection: > session time: 0:00:00.014 user=postgres > database=postgres host=[local] > 2024-01-26 03:49:07.558 UTC [9729:1] LOG: logical replication apply worker > for subscription "tsub" has started > 2024-01-26 03:49:07.563 UTC [9731:1] LOG: logical replication table > synchronization worker for subscription "tsub", > table "test_mismatching_types" has started > 2024-01-26 03:49:07.585 UTC [9731:2] ERROR: incorrect binary data format > 2024-01-26 03:49:07.585 UTC [9731:3] CONTEXT: COPY test_mismatching_types, > line 1, column a > > In this case, "logical replication apply worker for subscription "tsub" has > started" appears just after "ALTER SUBSCRIPTION", not 3 minutes later. > > I've managed to reproduce this failure locally by running multiple tests in > parallel, and my analysis shows that it is caused by a race condition when > accessing variable table_states_valid inside tablesync.c. > > tablesync.c does the following with table_states_valid: > /* > * Callback from syscache invalidation. > */ > void > invalidate_syncing_table_states(Datum arg, int cacheid, uint32 hashvalue) > { > table_states_valid = false; > } > ... > static bool > FetchTableStates(bool *started_tx) > { > ... > if (!table_states_valid) > { > ... > /* Fetch all non-ready tables. */ > rstates = GetSubscriptionRelations(MySubscription->oid, true); > ... > table_states_valid = true; > } > > So, when syscache invalidation occurs inside the code block > "if (!table_states_valid)", that invalidation is effectively ignored. > > In a failed case I observe the following events: > 1. logical replication apply worker performs > LogicalRepApplyLoop() -> process_syncing_tables() -> > process_syncing_tables_for_apply() -> FetchTableStates() periodically. > > 2. ALTER SUBSCRIPTION tsub REFRESH PUBLICATION invalidates syscache > for SUBSCRIPTIONRELMAP, and that leads to calling > invalidate_syncing_table_states(). > > 3. If the apply worker manages to fetch no non-ready tables in > FetchTableStates() and ignore "table_states_valid = false" from > invalidate_syncing_table_states(), then it just misses the invalidation > event, so it keeps working without noticing non-ready tables appeared as > the result of ALTER SUBSCRIPTION (process_syncing_tables_for_apply() skips > a loop "foreach(lc, table_states_not_ready) ..." until some other event > occurs). > > pg_usleep(100000) added just below GetSubscriptionRelations(...) proves my > analysis -- without it, I need tens of iterations (with 50 tests running in > parallel) to catch the failure, but with it, I get the failure on the first > iteration.
Thanks for the analysis, I was able to reproduce the issue with the steps you had shared. I agree with your analysis. I added some logs to verify that the invalidation was getting missed. I felt that this invalidation is getting ignored because we have used a boolean variable here, how about changing it slightly so that table_states_invalid gets incremented for every invalidation and then decrementing table_states_invalid after getting the non-ready tables like in the attached patch. I was able to verify that the test passes with the attached patch. Regards, Vignesh
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c index 4207b9356c..92eb86a249 100644 --- a/src/backend/replication/logical/tablesync.c +++ b/src/backend/replication/logical/tablesync.c @@ -123,7 +123,7 @@ #include "utils/syscache.h" #include "utils/usercontext.h" -static bool table_states_valid = false; +static unsigned int table_states_invalid = 1; static List *table_states_not_ready = NIL; static bool FetchTableStates(bool *started_tx); @@ -273,7 +273,7 @@ wait_for_worker_state_change(char expected_state) void invalidate_syncing_table_states(Datum arg, int cacheid, uint32 hashvalue) { - table_states_valid = false; + table_states_invalid++; } /* @@ -1568,7 +1568,7 @@ FetchTableStates(bool *started_tx) *started_tx = false; - if (!table_states_valid) + if (table_states_invalid) { MemoryContext oldctx; List *rstates; @@ -1608,7 +1608,7 @@ FetchTableStates(bool *started_tx) has_subrels = (table_states_not_ready != NIL) || HasSubscriptionRelations(MySubscription->oid); - table_states_valid = true; + table_states_invalid--; } return has_subrels;