On Tues, Nov 30, 2021 9:39 PM Masahiko Sawada <sawada.m...@gmail.com> wrote: > On Tue, Nov 30, 2021 at 8:41 PM Masahiko Sawada <sawada.m...@gmail.com> > wrote: > > > > On Tue, Nov 30, 2021 at 6:28 PM Amit Kapila <amit.kapil...@gmail.com> > wrote: > > > > > > On Mon, Nov 29, 2021 at 11:38 AM vignesh C <vignes...@gmail.com> > wrote: > > > > > > > > > > I have pushed this patch and there is a buildfarm failure for it. See: > > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sidewinder&d > > > t=2021-11-30%2005%3A05%3A25 > > > > > > Sawada-San has shared his initial analysis on pgsql-committers [1] > > > and I am responding here as the fix requires some more discussion. > > > > > > > Looking at the result the test actually got, we had two error > > > > entries for test_tab1 instead of one: > > > > > > > > # Failed test 'check the error reported by the apply worker' > > > > # at t/026_worker_stats.pl line 33. > > > > # got: 'tap_sub|INSERT|test_tab1|t > > > > # tap_sub||test_tab1|t' > > > > # expected: 'tap_sub|INSERT|test_tab1|t' > > > > > > > > The possible scenarios are: > > > > > > > > The table sync worker for test_tab1 failed due to an error > > > > unrelated to apply changes: > > > > > > > > 2021-11-30 06:24:02.137 CET [18990:2] ERROR: replication origin > > > > with OID 2 is already active for PID 23706 > > > > > > > > At this time, the view had one error entry for the table sync worker. > > > > After retrying table sync, it succeeded: > > > > > > > > 2021-11-30 06:24:04.202 CET [28117:2] LOG: logical replication > > > > table synchronization worker for subscription "tap_sub", table > "test_tab1" > > > > has finished > > > > > > > > Then after inserting a row on the publisher, the apply worker > > > > inserted the row but failed due to violating a unique key > > > > violation, which is > > > > expected: > > > > > > > > 2021-11-30 06:24:04.307 CET [4806:2] ERROR: duplicate key value > > > > violates unique constraint "test_tab1_pkey" > > > > 2021-11-30 06:24:04.307 CET [4806:3] DETAIL: Key (a)=(1) already > > > > exists. > > > > 2021-11-30 06:24:04.307 CET [4806:4] CONTEXT: processing remote > > > > data during "INSERT" for replication target relation > > > > "public.test_tab1" in transaction 721 at 2021-11-30 > > > > 06:24:04.305096+01 > > > > > > > > As a result, we had two error entries for test_tab1: the table > > > > sync worker error and the apply worker error. I didn't expect that > > > > the table sync worker for test_tab1 failed due to "replication > > > > origin with OID 2 is already active for PID 23706” error. > > > > > > > > Looking at test_subscription_error() in 026_worker_stats.pl, we > > > > have two checks; in the first check, we wait for the view to show > > > > the error entry for the given relation name and xid. This check > > > > was passed since we had the second error (i.g., apply worker > > > > error). In the second check, we get error entries from > > > > pg_stat_subscription_workers by specifying only the relation name. > > > > Therefore, we ended up getting two entries and failed the tests. > > > > > > > > To fix this issue, I think that in the second check, we can get > > > > the error from pg_stat_subscription_workers by specifying the > > > > relation name *and* xid like the first check does. I've attached the > > > > patch. > > > > What do you think? > > > > > > > > > > I think this will fix the reported failure but there is another race > > > condition in the test. Isn't it possible that for table test_tab2, > > > we get an error "replication origin with OID ..." or some other > > > error before copy, in that case also, we will proceed from the > > > second call of test_subscription_error() which is not what we expect in > > > the > test? > > > > Right. > > > > > Shouldn't we someway check that the error message also starts with > > > "duplicate key value violates ..."? > > > > Yeah, I think it's a good idea to make the checks more specific. That > > is, probably we can specify the prefix of the error message and > > subrelid in addition to the current conditions: relid and xid. That > > way, we can check what error was reported by which workers (tablesync > > or apply) for which relations. And both check queries in > > test_subscription_error() can have the same WHERE clause. > > I've attached a patch that fixes this issue. Please review it. >
I have a question about the testcase (I could be wrong here). Is it possible that the race condition happen between apply worker(test_tab1) and table sync worker(test_tab2) ? If so, it seems the error("replication origin with OID") could happen randomly until we resolve the conflict. Based on this, for the following code: ----- # Wait for the error statistics to be updated. my $check_sql = qq[SELECT count(1) > 0 ] . $part_sql; $node->poll_query_until( 'postgres', $check_sql, ) or die "Timed out while waiting for statistics to be updated"; * [1] * $check_sql = qq[ SELECT subname, last_error_command, last_error_relid::regclass, last_error_count > 0 ] . $part_sql; my $result = $node->safe_psql('postgres', $check_sql); is($result, $expected, $msg); ----- Is it possible that the error("replication origin with OID") happen again at the place [1]. In this case, the error message we have checked could be replaced by another error("replication origin ...") and then the test fail ? Best regards, Hou zj