Subscription sometimes loses txns after initial table sync

Pritam Baral Mon, 09 Dec 2024 05:21:02 -0800

This was discovered when testing the plan for a major version upgrade via
logical replication. Said plan requires that some tables be synced before
others. So I implemented it using ALTER PUBLICATION ... ADD TABLE ... followed
by ALTER SUBSCRIPTION ... REFRESH PUBLICATION. A test for correctness revealed
that sometimes, for some tables added this way, txns after the initial data copy
are lost by the subscription.
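
For readers who want to follow along, here is a minimal sketch of one step of that sequence. It is not taken from the attached script: the publication name "pub", the subscription name "sub", and the table name "t1" are placeholders, and the functions only print the SQL so that it can be piped into psql on the appropriate side.

```shell
#!/bin/sh
# Sketch only: "pub", "sub" and the table name are hypothetical placeholders,
# not names used by the attached reproducer.

# SQL to run on the publisher: add one more table to the existing publication.
publisher_sql() {
    # $1 = name of the table to publish next
    printf 'ALTER PUBLICATION pub ADD TABLE %s;\n' "$1"
}

# SQL to run on the subscriber: pick up newly published tables and
# start their initial sync.
subscriber_sql() {
    printf 'ALTER SUBSCRIPTION sub REFRESH PUBLICATION;\n'
}

# Print one full step for table t1.
publisher_sql t1
subscriber_sql
```

Assuming connection strings in PUB_DSN and SUB_DSN (also placeholders), a step could be applied with `publisher_sql t1 | psql "$PUB_DSN"` followed by `subscriber_sql | psql "$SUB_DSN"`.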


A reproducer script is attached. It has been tested with PG 17.2, 14.15, and
even 12.22 (on either side of the replication setup). By default the script
runs with 100 tables and 10k inserts each, which is enough to show data loss
in 1% to 9% of the tables on my modest laptop.

While analysing why this happens, I observed that the sender sometimes does
not pick up a published table, even when the receiver that started the sender
process has seen the table as available (as returned by
pg_get_publication_tables()) and has thus begun COPYing its data. When the COPY
finishes (and the tablesync worker exits), the apply loop on the receiver
expects to receive (and apply) subsequent changes for such tables, but simply
isn't sent any. This was observed by dumping every CopyData message sent over
the wire.
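
The state described above can be inspected from both sides with a few catalog queries. The sketch below is an assumption about how one might check it, not part of the attached script: it uses the pg_publication_tables view (which sits on top of pg_get_publication_tables()) on the publisher and the pg_subscription_rel catalog on the subscriber, and the names "pub" and "t1" are placeholders. An affected table would be expected to show up as published and as synced (srsubstate 'r', i.e. ready) while its row counts on the two sides still differ.

```shell
#!/bin/sh
# Sketch only: prints diagnostic SQL; publication and table names are
# hypothetical placeholders.

# Publisher side: which tables does the publication currently contain?
published_tables_sql() {
    # $1 = publication name
    printf "SELECT schemaname, tablename FROM pg_publication_tables WHERE pubname = '%s';\n" "$1"
}

# Subscriber side: per-table sync state of all subscriptions
# ('r' means the table is ready, i.e. initial sync is done).
sync_state_sql() {
    printf 'SELECT srrelid::regclass AS tbl, srsubstate FROM pg_subscription_rel ORDER BY 1;\n'
}

# Either side: row count of one table, to compare publisher and subscriber.
row_count_sql() {
    # $1 = table name
    printf 'SELECT count(*) FROM %s;\n' "$1"
}

published_tables_sql pub
sync_state_sql
row_count_sql t1
```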

The attached script (like the original migration plan) uses a single publication
and adds tables to it successively. Curiously, when the script was changed to
use a dedicated publication per table (and thus, ALTER SUBSCRIPTION ... ADD
PUBLICATION instead of ALTER SUBSCRIPTION ... REFRESH PUBLICATION), the number of
tables with data loss jumped to 100%.
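
A rough sketch of that per-table variant, for comparison with the single-publication approach. The naming scheme "pub_<table>" and the subscription name "sub" are placeholders, not what the modified script used. Note that ALTER SUBSCRIPTION ... ADD PUBLICATION is, to my knowledge, only available from PG 14 onwards, so this variant would need a subscriber of at least that version.

```shell
#!/bin/sh
# Sketch only: "sub" and the "pub_<table>" naming scheme are hypothetical.

# SQL to run on the publisher: a dedicated publication for one table.
per_table_publisher_sql() {
    # $1 = table name
    printf 'CREATE PUBLICATION pub_%s FOR TABLE %s;\n' "$1" "$1"
}

# SQL to run on the subscriber: attach that publication to the
# existing subscription.
per_table_subscriber_sql() {
    # $1 = table name
    printf 'ALTER SUBSCRIPTION sub ADD PUBLICATION pub_%s;\n' "$1"
}

per_table_publisher_sql t1
per_table_subscriber_sql t1
```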

-- 
#!/usr/bin/env regards
Chhatoi Pritam Baral

sub-loss-repro.sh
Description: application/shellscript
