On Fri, Jun 18, 2021 at 6:18 AM Mark Dilger
<[email protected]> wrote:
>
> Hackers,
>
> Logical replication apply workers for a subscription can easily get stuck in
> an infinite loop of attempting to apply a change, triggering an error (such
> as a constraint violation), exiting with an error written to the subscription
> worker log, and restarting.
>
> As things currently stand, only superusers can create subscriptions. Ongoing
> work to delegate superuser tasks to non-superusers creates the potential for
> even more errors to be triggered, specifically, errors where the apply worker
> does not have permission to make changes to the target table.
>
> The attached patch makes it possible to create a subscription using a new
> subscription_parameter, "disable_on_error", such that rather than going into
> an infinite loop, the apply worker will catch errors and automatically
> disable the subscription, breaking the loop. The new parameter defaults to
> false. When false, the PG_TRY/PG_CATCH overhead is avoided, so for
> subscriptions which do not use the feature, there shouldn't be any change.
> Users can manually clear the error after fixing the underlying issue with an
> ALTER SUBSCRIPTION .. ENABLE command.
>
> In addition to helping on production systems, this makes writing TAP tests
> involving error conditions simpler. I originally ran into the motivation to
> write this patch when frustrated that TAP tests needed to parse the apply
> worker log file to determine whether permission failures were occurring and
> what they were. It was also obnoxiously easy to have a test get stuck
> waiting for a permanently stuck subscription to catch up. This helps with
> both issues.
>
> I don't think this is quite ready for commit, but I'd like feedback if folks
> like this idea or want to suggest design changes.
I tried your patch.
It applied OK (albeit with whitespace warnings).
The code build and TAP tests are all OK.
Below are a few comments and observations.
COMMENTS
========
(1) PG Docs catalogs.sgml
Documented new column "suberrmsg" but did not document the other new
columns ("disable_on_error", "disabled_by_error")?
------
(2) New column "disabled_by_error".
I wondered if there was actually any need for this column. Isn't the
same information conveyed by just having "subenabled" = false, at same
time as as non-empty "suberrmsg"? This would remove any confusion for
having 2 booleans which both indicate disabled.
------
(3) New columns "disabled_by_error", "disabled_on_error".
All other columns of the pg_subscription have a "sub" prefix.
------
(4) errhint member used?
@@ -91,12 +100,16 @@ typedef struct Subscription
char *name; /* Name of the subscription */
Oid owner; /* Oid of the subscription owner */
bool enabled; /* Indicates if the subscription is enabled */
+ bool disable_on_error; /* Whether errors automatically disable */
+ bool disabled_by_error; /* Whether an error has disabled */
bool binary; /* Indicates if the subscription wants data in
* binary format */
bool stream; /* Allow streaming in-progress transactions. */
char *conninfo; /* Connection string to the publisher */
char *slotname; /* Name of the replication slot */
char *synccommit; /* Synchronous commit setting for worker */
+ char *errmsg; /* Message from error which disabled */
+ char *errhint; /* Hint from error which disabled */
List *publications; /* List of publication names to subscribe to */
} Subscription;
I did not find any code using that newly added member "errhint".
------
(5) dump.c
i. No mention of new columns "disabled_on_error" and
"disabled_by_error". Is that right?
ii. Shouldn't the code for the "suberrmsg" be qualified with some PG
version number checks?
------
(6) Patch only handles errors only from the Apply worker.
Tablesync can give similar errors (e.g. PK violation during DATASYNC
phase) which will trigger re-launch forever regardless of the setting
of "disabled_on_error".
(confirmed by observations below)
------
(7) TAP test code
+$node_subscriber->init(allows_streaming => 'logical');
AFAIK that "logical" configuration is not necessary for the subscriber side. So,
$node_subscriber->init();
////////////
Some Experiments/Observations
==============================
In general, I found this functionality is useful and it works as
advertised by your patch comment.
======
Test: Display pg_subscription with the new columns
Observation: As expected. But some new colnames are not prefixed like
their peers.
test_sub=# \pset x
Expanded display is on.
test_sub=# select * from pg_subscription;
-[ RECORD 1 ]-----+--------------------------------------------------------
oid | 16394
subdbid | 16384
subname | tap_sub
subowner | 10
subenabled | t
disable_on_error | t
disabled_by_error | f
subbinary | f
substream | f
subconninfo | host=localhost dbname=test_pub application_name=tap_sub
subslotname | tap_sub
subsynccommit | off
suberrmsg |
subpublications | {tap_pub}
======
Test: Cause a PK violation during normal Apply replication (when
"disabled_on_error=true")
Observation: Apply worker stops. Subscription is disabled. Error
message is in the catalog.
2021-06-18 15:12:45.905 AEST [25904] LOG: edata is true for
subscription 'tap_sub': message = "duplicate key value violates unique
constraint "test_tab_pkey"", hint = "<NONE>"
2021-06-18 15:12:45.905 AEST [25904] LOG: logical replication apply
worker for subscription "tap_sub" will stop because the subscription
was disabled due to error
2021-06-18 15:12:45.905 AEST [25904] ERROR: duplicate key value
violates unique constraint "test_tab_pkey"
2021-06-18 15:12:45.905 AEST [25904] DETAIL: Key (a)=(1) already exists.
2021-06-18 15:12:45.908 AEST [19924] LOG: background worker "logical
replication worker" (PID 25904) exited with exit code 1
test_sub=# select * from pg_subscription;
-[ RECORD 1
]-----+---------------------------------------------------------------
oid | 16394
subdbid | 16384
subname | tap_sub
subowner | 10
subenabled | f
disable_on_error | t
disabled_by_error | t
subbinary | f
substream | f
subconninfo | host=localhost dbname=test_pub application_name=tap_sub
subslotname | tap_sub
subsynccommit | off
suberrmsg | duplicate key value violates unique constraint
"test_tab_pkey"
subpublications | {tap_pub}
======
Test: Try to enable subscription (without fixing the PK violation problem).
Observation. OK. It just stops again
test_sub=# alter subscription tap_sub enable;
ALTER SUBSCRIPTION
test_sub=# 2021-06-18 15:17:18.067 AEST [10228] LOG: logical
replication apply worker for subscription "tap_sub" has started
2021-06-18 15:17:18.078 AEST [10228] LOG: edata is true for
subscription 'tap_sub': message = "duplicate key value violates unique
constraint "test_tab_pkey"", hint = "<NONE>"
2021-06-18 15:17:18.078 AEST [10228] LOG: logical replication apply
worker for subscription "tap_sub" will stop because the subscription
was disabled due to error
2021-06-18 15:17:18.078 AEST [10228] ERROR: duplicate key value
violates unique constraint "test_tab_pkey"
2021-06-18 15:17:18.078 AEST [10228] DETAIL: Key (a)=(1) already exists.
2021-06-18 15:17:18.079 AEST [19924] LOG: background worker "logical
replication worker" (PID 10228) exited with exit code 1
======
Test: Manually disable the subscription (which had previously already
been disabled due to error)
Observation: OK. The suberrmsg gets reset to an empty string.
alter subscription tap_sub disable;
=====
Test: Turn off the disable_on_error
Observation: As expected, now the Apply worker goes into re-launch
forever loop every time it hits PK violation
test_sub=# alter subscription tap_sub set (disable_on_error=false);
ALTER SUBSCRIPTION
...
======
Test: Cause a PK violation in the Tablesync copy (DATASYNC) phase.
(when disable_on_error = true)
Observation: This patch changes nothing for this case. The Tablesyn
re-launchs in a forever loop the same as current functionality.
test_sub=# CREATE SUBSCRIPTION tap_sub CONNECTION 'host=localhost
dbname=test_pub application_name=tap_sub' PUBLICATION tap_pub WITH
(disable_on_error=false);
NOTICE: created replication slot "tap_sub" on publisher
CREATE SUBSCRIPTION
test_sub=# 2021-06-18 15:38:19.547 AEST [18205] LOG: logical
replication apply worker for subscription "tap_sub" has started
2021-06-18 15:38:19.557 AEST [18207] LOG: logical replication table
synchronization worker for subscription "tap_sub", table "test_tab"
has started
2021-06-18 15:38:19.610 AEST [18207] ERROR: duplicate key value
violates unique constraint "test_tab_pkey"
2021-06-18 15:38:19.610 AEST [18207] DETAIL: Key (a)=(1) already exists.
2021-06-18 15:38:19.610 AEST [18207] CONTEXT: COPY test_tab, line 1
2021-06-18 15:38:19.611 AEST [19924] LOG: background worker "logical
replication worker" (PID 18207) exited with exit code 1
2021-06-18 15:38:24.634 AEST [18369] LOG: logical replication table
synchronization worker for subscription "tap_sub", table "test_tab"
has started
2021-06-18 15:38:24.689 AEST [18369] ERROR: duplicate key value
violates unique constraint "test_tab_pkey"
2021-06-18 15:38:24.689 AEST [18369] DETAIL: Key (a)=(1) already exists.
2021-06-18 15:38:24.689 AEST [18369] CONTEXT: COPY test_tab, line 1
2021-06-18 15:38:24.690 AEST [19924] LOG: background worker "logical
replication worker" (PID 18369) exited with exit code 1
2021-06-18 15:38:29.701 AEST [18521] LOG: logical replication table
synchronization worker for subscription "tap_sub", table "test_tab"
has started
2021-06-18 15:38:29.765 AEST [18521] ERROR: duplicate key value
violates unique constraint "test_tab_pkey"
2021-06-18 15:38:29.765 AEST [18521] DETAIL: Key (a)=(1) already exists.
2021-06-18 15:38:29.765 AEST [18521] CONTEXT: COPY test_tab, line 1
2021-06-18 15:38:29.766 AEST [19924] LOG: background worker "logical
replication worker" (PID 18521) exited with exit code 1
etc...
-[ RECORD 1 ]-----+--------------------------------------------------------
oid | 16399
subdbid | 16384
subname | tap_sub
subowner | 10
subenabled | t
disable_on_error | f
disabled_by_error | f
subbinary | f
substream | f
subconninfo | host=localhost dbname=test_pub application_name=tap_sub
subslotname | tap_sub
subsynccommit | off
suberrmsg |
subpublications | {tap_pub}
------
Kind Regards,
Peter Smith.
Fujitsu Australia