On Wed, Dec 1, 2021 11:22 AM Amit Kapila <amit.kapil...@gmail.com> wrote: > On Wed, Dec 1, 2021 at 8:24 AM houzj.f...@fujitsu.com > <houzj.f...@fujitsu.com> wrote: > > > > On Tues, Nov 30, 2021 9:39 PM Masahiko Sawada > <sawada.m...@gmail.com> wrote: > > > > > > > > > Shouldn't we someway check that the error message also starts with > > > > > "duplicate key value violates ..."? > > > > > > > > Yeah, I think it's a good idea to make the checks more specific. That > > > > is, probably we can specify the prefix of the error message and > > > > subrelid in addition to the current conditions: relid and xid. That > > > > way, we can check what error was reported by which workers (tablesync > > > > or apply) for which relations. And both check queries in > > > > test_subscription_error() can have the same WHERE clause. > > > > > > I've attached a patch that fixes this issue. Please review it. > > > > > > > I have a question about the testcase (I could be wrong here). > > > > Is it possible that the race condition happen between apply > worker(test_tab1) > > and table sync worker(test_tab2) ? If so, it seems the error("replication > > origin with OID") could happen randomly until we resolve the conflict. > > Based on this, for the following code: > > ----- > > # Wait for the error statistics to be updated. > > my $check_sql = qq[SELECT count(1) > 0 ] . $part_sql; > > $node->poll_query_until( > > 'postgres', $check_sql, > > ) or die "Timed out while waiting for statistics to be updated"; > > > > * [1] * > > > > $check_sql = > > qq[ > > SELECT subname, last_error_command, last_error_relid::regclass, > > last_error_count > 0 ] . $part_sql; > > my $result = $node->safe_psql('postgres', $check_sql); > > is($result, $expected, $msg); > > ----- > > > > Is it possible that the error("replication origin with OID") happen again > > at the > > place [1]. In this case, the error message we have checked could be replaced > by > > another error("replication origin ...") and then the test fail ? > > > > Once we get the "duplicate key violation ..." error before * [1] * via > apply_worker then we shouldn't get replication origin-specific error > because the origin set up is done before starting to apply changes. > Also, even if that or some other happens after * [1] * because of > errmsg_prefix check it should still succeed. Does that make sense?
Oh, I missed the point that the origin set up is done once we get the expected error. Thanks for the explanation, and I think the patch looks good. Best regards, Hou zj