On Tue, 31 Dec 2024 at 02:48, Peter Smith <[email protected]> wrote:
>
> On Thu, Dec 26, 2024 at 1:37 AM vignesh C <[email protected]> wrote:
> >
> > Hi,
> >
> > Currently, we restart the table synchronization worker after the
> > duration specified by wal_retrieve_retry_interval following the last
> > failure. While this behavior is documented for apply workers, it is
> > not mentioned for table synchronization workers. I believe this detail
> > should be included in the documentation for table synchronization
> > workers as well. Attached is a patch to address this omission.
> >
> > Regards,
> > Vignesh
>
> Hi Vignesh,
>
> Here are some review comments for your v1 patch.
>
> +1 to enhance the documentation.
>
> ======
>
> 1.
> <para>
> In logical replication, this parameter also limits how often a
> failing
> - replication apply worker will be respawned.
> + replication apply worker, and table synchronization worker will be
> + respawned.
> </para>
>
> /, and/or/
>
>
> SUGGESTION
> In logical replication, this parameter also limits how often a failing
> replication apply worker or table synchronization worker will be
> respawned.
Modified
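
As an aside, a minimal illustration of the GUC being documented here
(assuming the 5s default; it is reloadable, so no restart is needed):

SHOW wal_retrieve_retry_interval;              -- '5s' by default
ALTER SYSTEM SET wal_retrieve_retry_interval = '10s';
SELECT pg_reload_conf();                       -- SIGHUP is enough
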
> ======
>
> 2.
> I think the reader might never be aware of any of this (throttled
> relaunch) behaviour unless they accidentally stumble across the docs
> for this GUC, so IMO this information should be mentioned elsewhere --
> wherever the tablesync worker errors are documented. But, TBH, I can't
> find anywhere in the PostgreSQL docs where it even mentions
> re-launching failed tablesync workers!
>
> Anyway, I think it might be good to include such information in some
> suitable place (maybe in the CREATE SUBSCRIPTION notes? or maybe in
> Chapter 29?) to say something like...
>
> SUGGESTION:
> In practice, if a table synchronization worker fails during logical
> replication, the apply worker detects the failure and attempts to
> respawn the table synchronization worker to continue the
> synchronization process. This behaviour ensures that transient errors
> do not permanently disrupt the replication setup. See also
> wal_retrieve_retry_interval.
Yes, adding it to the "Initial Snapshot" section of the logical
replication chapter seemed more appropriate to me.
The attached v2 patch contains these changes.
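
In case it helps with review, the per-table sync states and the workers
themselves can be watched from the subscriber using the standard
catalogs (a sketch, nothing patch-specific):

-- per-table sync state: 'i' = initialize, 'd' = data copy,
-- 'f' = finished copy, 's' = synchronized, 'r' = ready
SELECT srrelid::regclass AS rel, srsubstate
FROM pg_subscription_rel;

-- the apply worker (relid IS NULL) and any tablesync workers
SELECT subname, pid, relid::regclass AS rel
FROM pg_stat_subscription;
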
Regards,
Vignesh
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index fbdd6ce574..b58c7f25f7 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -5094,7 +5094,8 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</para>
<para>
In logical replication, this parameter also limits how often a failing
- replication apply worker will be respawned.
+ replication apply worker or table synchronization worker will be
+ respawned.
</para>
</listitem>
</varlistentry>
diff --git a/doc/src/sgml/logical-replication.sgml b/doc/src/sgml/logical-replication.sgml
index 8290cd1a08..925e0dd101 100644
--- a/doc/src/sgml/logical-replication.sgml
+++ b/doc/src/sgml/logical-replication.sgml
@@ -1993,18 +1993,17 @@ CONTEXT: processing remote data for replication origin "pg_16395" during "INSER
<title>Initial Snapshot</title>
<para>
The initial data in existing subscribed tables are snapshotted and
- copied in a parallel instance of a special kind of apply process.
- This process will create its own replication slot and copy the existing
- data. As soon as the copy is finished the table contents will become
- visible to other backends. Once existing data is copied, the worker
- enters synchronization mode, which ensures that the table is brought
- up to a synchronized state with the main apply process by streaming
- any changes that happened during the initial data copy using standard
- logical replication. During this synchronization phase, the changes
- are applied and committed in the same order as they happened on the
- publisher. Once synchronization is done, control of the
- replication of the table is given back to the main apply process where
- replication continues as normal.
+ copied in a parallel instance of a special kind of table synchronization
+ worker process. This process will create its own replication slot and copy
+ the existing data. As soon as the copy is finished the table contents will
+ become visible to other backends. Once existing data is copied, the worker
+ enters synchronization mode, which ensures that the table is brought up to
+ a synchronized state with the main apply process by streaming any changes
+ that happened during the initial data copy using standard logical
+ replication. During this synchronization phase, the changes are applied
+ and committed in the same order as they happened on the publisher. Once
+ synchronization is done, control of the replication of the table is given
+ back to the main apply process where replication continues as normal.
</para>
<note>
<para>
@@ -2015,6 +2014,15 @@ CONTEXT: processing remote data for replication origin "pg_16395" during "INSER
when copying the existing table data.
</para>
</note>
+ <note>
+ <para>
+ If a table synchronization worker fails during the copy, the apply
+ worker detects the failure and respawns the table synchronization
+ worker to continue the synchronization process. This behavior ensures
+ that transient errors do not permanently disrupt the replication setup.
+ See also <link linkend="guc-wal-retrieve-retry-interval"><varname>wal_retrieve_retry_interval</varname></link>.
+ </para>
+ </note>
</sect2>
</sect1>
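
P.S. For anyone who wants to see the throttled respawn behavior that
this note documents, a hypothetical repro is to pre-load a conflicting
row on the subscriber so that the initial copy keeps failing (the
connection string is a placeholder):

-- on the publisher
CREATE TABLE t (id int PRIMARY KEY);
INSERT INTO t VALUES (1);
CREATE PUBLICATION pub FOR TABLE t;

-- on the subscriber
CREATE TABLE t (id int PRIMARY KEY);
INSERT INTO t VALUES (1);  -- collides with the row copied from the publisher
CREATE SUBSCRIPTION sub CONNECTION 'host=... dbname=...' PUBLICATION pub;

-- the tablesync worker's COPY now fails with a duplicate-key error and
-- the apply worker respawns it no sooner than wal_retrieve_retry_interval
-- after the last failure (visible as repeated ERRORs in the subscriber log)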