On Tue, 31 Dec 2024 at 02:48, Peter Smith <[email protected]> wrote:
>
> On Thu, Dec 26, 2024 at 1:37 AM vignesh C <[email protected]> wrote:
> >
> > Hi,
> >
> > Currently, we restart the table synchronization worker after the
> > duration specified by wal_retrieve_retry_interval following the last
> > failure. While this behavior is documented for apply workers, it is
> > not mentioned for table synchronization workers. I believe this detail
> > should be included in the documentation for table synchronization
> > workers as well. Attached is a patch to address this omission.
> >
> > Regards,
> > Vignesh
>
> Hi Vignesh,
>
> Here are some review comments for your v1 patch.
>
> +1 to enhance the documentation.
>
> ======
>
> 1.
> <para>
> In logical replication, this parameter also limits how often a
> failing
> - replication apply worker will be respawned.
> + replication apply worker, and table synchronization worker will be
> + respawned.
> </para>
>
> /, and/or/
>
>
> SUGGESTION
> In logical replication, this parameter also limits how often a failing
> replication apply worker or table synchronization worker will be
> respawned.
Modified
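
As an aside, a minimal illustration of the GUC being documented here
(assuming the 5s default; it is reloadable, so no restart is needed):

SHOW wal_retrieve_retry_interval;              -- '5s' by default
ALTER SYSTEM SET wal_retrieve_retry_interval = '10s';
SELECT pg_reload_conf();                       -- SIGHUP is enough
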
> ======
>
> 2.
> I think the reader might never be aware of any of this (throttled
> relaunch) behaviour unless they accidentally stumble across the docs
> for this GUC, so IMO this information should be mentioned elsewhere --
> wherever the tablesync worker errors are documented. But, TBH, I can't
> find anywhere in the PostgreSQL docs where it even mentions
> re-launching failed tablesync workers!
>
> Anyway, I think it might be good to include such information in some
> suitable place (maybe in the CREATE SUBSCRIPTION notes? or maybe in
> Chapter 29?) to say something like...
>
> SUGGESTION:
> In practice, if a table synchronization worker fails during logical
> replication, the apply worker detects the failure and attempts to
> respawn the table synchronization worker to continue the
> synchronization process. This behaviour ensures that transient errors
> do not permanently disrupt the replication setup. See also
> wal_retrieve_retry_interval.
Yes, adding it to the "Initial Snapshot" section of the logical
replication chapter seemed more appropriate to me.
The attached v2 patch contains these changes.
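
In case it helps with review, the per-table sync states and the workers
themselves can be watched from the subscriber using the standard
catalogs (a sketch, nothing patch-specific):

-- per-table sync state: 'i' = initialize, 'd' = data copy,
-- 'f' = finished copy, 's' = synchronized, 'r' = ready
SELECT srrelid::regclass AS rel, srsubstate
FROM pg_subscription_rel;

-- the apply worker (relid IS NULL) and any tablesync workers
SELECT subname, pid, relid::regclass AS rel
FROM pg_stat_subscription;
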
Regards,
Vignesh
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index fbdd6ce574..b58c7f25f7 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -5094,7 +5094,8 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</para>
<para>
In logical replication, this parameter also limits how often a failing
- replication apply worker will be respawned.
+ replication apply worker or table synchronization worker will be
+ respawned.
</para>
</listitem>
</varlistentry>
diff --git a/doc/src/sgml/logical-replication.sgml b/doc/src/sgml/logical-replication.sgml
index 8290cd1a08..925e0dd101 100644
--- a/doc/src/sgml/logical-replication.sgml
+++ b/doc/src/sgml/logical-replication.sgml
@@ -1993,18 +1993,17 @@ CONTEXT: processing remote data for replication origin "pg_16395" during "INSER
<title>Initial Snapshot</title>
<para>
The initial data in existing subscribed tables are snapshotted and
- copied in a parallel instance of a special kind of apply process.
- This process will create its own replication slot and copy the existing
- data. As soon as the copy is finished the table contents will become
- visible to other backends. Once existing data is copied, the worker
- enters synchronization mode, which ensures that the table is brought
- up to a synchronized state with the main apply process by streaming
- any changes that happened during the initial data copy using standard
- logical replication. During this synchronization phase, the changes
- are applied and committed in the same order as they happened on the
- publisher. Once synchronization is done, control of the
- replication of the table is given back to the main apply process where
- replication continues as normal.
+ copied in a parallel instance of a special kind of table synchronization
+ worker process. This process will create its own replication slot and copy
+ the existing data. As soon as the copy is finished the table contents will
+ become visible to other backends. Once existing data is copied, the worker
+ enters synchronization mode, which ensures that the table is brought up to
+ a synchronized state with the main apply process by streaming any changes
+ that happened during the initial data copy using standard logical
+ replication. During this synchronization phase, the changes are applied
+ and committed in the same order as they happened on the publisher. Once
+ synchronization is done, control of the replication of the table is given
+ back to the main apply process where replication continues as normal.
</para>
<note>
<para>
@@ -2015,6 +2014,15 @@ CONTEXT: processing remote data for replication origin "pg_16395" during "INSER
when copying the existing table data.
</para>
</note>
+ <note>
+ <para>
+ If a table synchronization worker fails during the copy, the apply
+ worker detects the failure and respawns the table synchronization
+ worker to continue the synchronization process. This behavior ensures
+ that transient errors do not permanently disrupt the replication setup.
+ See also <link linkend="guc-wal-retrieve-retry-interval"><varname>wal_retrieve_retry_interval</varname></link>.
+ </para>
+ </note>
</sect2>
</sect1>
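
P.S. For anyone who wants to see the throttled respawn behavior that
this note documents, a hypothetical repro is to pre-load a conflicting
row on the subscriber so that the initial copy keeps failing (the
connection string is a placeholder):

-- on the publisher
CREATE TABLE t (id int PRIMARY KEY);
INSERT INTO t VALUES (1);
CREATE PUBLICATION pub FOR TABLE t;

-- on the subscriber
CREATE TABLE t (id int PRIMARY KEY);
INSERT INTO t VALUES (1);  -- collides with the row copied from the publisher
CREATE SUBSCRIPTION sub CONNECTION 'host=... dbname=...' PUBLICATION pub;

-- the tablesync worker's COPY now fails with a duplicate-key error and
-- the apply worker respawns it no sooner than wal_retrieve_retry_interval
-- after the last failure (visible as repeated ERRORs in the subscriber log)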