On Wed, Mar 2, 2022 at 6:38 PM osumi.takami...@fujitsu.com <osumi.takami...@fujitsu.com> wrote: > > On Wednesday, March 2, 2022 12:47 PM Masahiko Sawada <sawada.m...@gmail.com> > wrote: > > After more thoughts, should we do both AbortOutOfAnyTransaction() and error > > message handling while holding interrupts? That is, > > > > HOLD_INTERRUPTS(); > > EmitErrorReport(); > > FlushErrorState(); > > AbortOutOfAny Transaction(); > > RESUME_INTERRUPTS(); > > pgstat_report_subscription_error(MySubscription->oid, !am_tablesync_work > > er()); > > > > I think it's better that we do clean up first and then do other works such > > as > > sending the message to the stats collector and updating the catalog. > I agree. Fixed. Along with this change, I corrected the header comment of > DisableSubscriptionOnError, too. > > > > Here are some comments on v24 patch: > > > > + /* Look up our subscription in the catalogs */ > > + tup = SearchSysCacheCopy2(SUBSCRIPTIONNAME, MyDatabaseId, > > + > > CStringGetDatum(MySubscription->name)); > > > > s/catalogs/catalog/ > > > > Why don't we use SUBSCRIPTIONOID with MySubscription->oid? > Changed. > > > > --- > > + if (!HeapTupleIsValid(tup)) > > + ereport(ERROR, > > + errcode(ERRCODE_UNDEFINED_OBJECT), > > + errmsg("subscription \"%s\" does not > > exist", > > + MySubscription->name)); > > > > I think we should use elog() here rather than ereport() since it's a > > should-not-happen error. > Fixed > > > > --- > > + /* Notify the subscription will be no longer valid */ > > > > I'd suggest rephrasing it to like "Notify the subscription will be > > disabled". (the > > subscription is still valid actually, but just disabled). > Fixed. Also, I've made this sentence past one, because of the code place > change below. > > > > --- > > + /* Notify the subscription will be no longer valid */ > > + ereport(LOG, > > + errmsg("logical replication subscription > > \"%s\" will be disabled due to an error", > > + MySubscription->name)); > > + > > > > I think we can report the log at the end of this function rather than > > during the > > transaction. > Fixed. In this case, I needed to adjust the comment to indicate the processing > to disable the sub has *completed* as well. > > > --- > > +my $cmd = qq( > > +CREATE TABLE tbl (i INT); > > +ALTER TABLE tbl REPLICA IDENTITY FULL; > > +CREATE INDEX tbl_idx ON tbl(i)); > > > > I think we don't need to set REPLICA IDENTITY FULL to this table since > > there is > > notupdate/delete. > > > > Do we need tbl_idx? > Removed both the replica identity and tbl_idx; > > > > --- > > +$cmd = qq( > > +SELECT COUNT(1) = 1 FROM pg_catalog.pg_subscription_rel sr WHERE > > +sr.srsubstate IN ('s', 'r')); > > +$node_subscriber->poll_query_until('postgres', $cmd); > > > > It seems better to add a condition of srrelid just in case. > Makes sense. Fixed. > > > > --- > > +$cmd = qq( > > +SELECT count(1) = 1 FROM pg_catalog.pg_subscription s WHERE > > s.subname = > > +'sub' AND s.subenabled IS FALSE); > > +$node_subscriber->poll_query_until('postgres', $cmd) > > + or die "Timed out while waiting for subscriber to be disabled"; > > > > I think that it's more natural to directly check the subscription's > > subenabled. > > For example: > > > > SELECT subenabled = false FROM pg_subscription WHERE subname = 'sub'; > Fixed. I modified another code similar to this for tablesync error as well. > > > > --- > > +$cmd = q(ALTER SUBSCRIPTION sub ENABLE); > > +$node_subscriber->safe_psql('postgres', $cmd); $cmd = q(SELECT > > COUNT(1) > > += 3 FROM tbl WHERE i = 3); > > +$node_subscriber->poll_query_until('postgres', $cmd) > > + or die "Timed out while waiting for applying"; > > > > I think it's better to wait for the subscriber to catch up and check the > > query > > result instead of using poll_query_until() so that we can check the query > > result > > in case where the test fails. > I modified the code to wait for the subscriber and deleted poll_query_until. > Also, when I consider the test failure for this test as you mentioned, > expecting and checking the number of return value that equals 3 > would be better. So, I expressed this point in my test as well, > according to your advice. > > > > --- > > +$cmd = qq(DROP INDEX tbl_unique); > > +$node_subscriber->safe_psql('postgres', $cmd); > > > > In the newly added tap tests, all queries are consistently assigned to $cmd > > and > > executed even when the query is used only once. It seems a different style > > from > > the one in other tap tests. Is there any reason why we do this style for > > all queries > > in the tap tests? > Fixed. I removed the 'cmd' variable itself. > > > Attached an updated patch v26.
Thank you for updating the patch. Here are some comments on v26 patch: +/* + * Disable the current subscription. + */ +static void +DisableSubscriptionOnError(void) This function now just updates the pg_subscription catalog so can we move it to pg_subscritpion.c while having this function accept the subscription OID to disable? If you agree, the function comment will also need to be updated. --- + /* + * First, ensure that we log the error message so that it won't be + * lost if some other internal error occurs in the following code. + * Then, abort the current transaction and send the stats message of + * the table synchronization failure in an idle state. + */ + HOLD_INTERRUPTS(); + EmitErrorReport(); + FlushErrorState(); + AbortOutOfAnyTransaction(); + RESUME_INTERRUPTS(); + pgstat_report_subscription_error(MySubscription->oid, false); + + if (MySubscription->disableonerr) + { + DisableSubscriptionOnError(); + proc_exit(0); + } + + PG_RE_THROW(); If the disableonerr is false, the same error is reported twice. Also, the code in PG_CATCH() in both start_apply() and start_table_sync() are almost the same. Can we create a common function to do post-error processing? The worker should exit with return code 1. I've attached a fixup patch for changes to worker.c for your reference. Feel free to adopt the changes. --- + +# Confirm that we have finished the table sync. +is( $node_subscriber->safe_psql( + 'postgres', qq(SELECT MAX(i), COUNT(*) FROM tbl)), + "1|3", + "subscription sub replicated data"); + Can we store the result to a local variable to check? I think it's more consistent with other tap tests. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
0001-fixup-Optionally-disable-subscriptions-on-error.patch
Description: Binary data