This was discovered while testing a plan for a major version upgrade via logical replication. The plan requires that some tables be synced before others, so I implemented it using ALTER PUBLICATION ... ADD TABLE ... followed by ALTER SUBSCRIPTION ... REFRESH PUBLICATION. A correctness test revealed that sometimes, for some tables added this way, transactions after the initial data copy are lost by the subscription.
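For readers who want to see the shape of the staged approach described above, here is a minimal sketch. It is not the attached reproducer. The publication name, subscription name, and helper function names are all hypothetical, and it assumes table names are plain identifiers that need no quoting. It only prints the SQL; nothing is executed.

```shell
#!/bin/sh
# Hypothetical names, chosen for illustration only.
PUB=${PUB:-migration_pub}
SUB=${SUB:-migration_sub}

# SQL for the publisher: add one table to the existing publication.
publisher_sql() {
    printf 'ALTER PUBLICATION %s ADD TABLE %s;\n' "$PUB" "$1"
}

# SQL for the subscriber: pick up tables newly added to the publication.
subscriber_sql() {
    printf 'ALTER SUBSCRIPTION %s REFRESH PUBLICATION;\n' "$SUB"
}

# Print the statements for each table, in the order the tables are given.
# Assumption: table names are simple identifiers that need no quoting.
staged_sync_sql() {
    for tbl in "$@"; do
        publisher_sql "$tbl"
        subscriber_sql
    done
}
```

Each statement would be sent to the matching server, for example `publisher_sql t1 | psql "$PUBLISHER_DSN"` followed by `subscriber_sql | psql "$SUBSCRIBER_DSN"`, where both connection strings are placeholders.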
A reproducer script is attached. It has been tested with PG 17.2, 14.15, and even 12.22 (on either side of the replication setup). The script runs at a default scale of 100 tables with 10k inserts each. This scale is enough to demonstrate a failure rate of 1% to 9% of tables on my modest laptop.

While trying to analyse why this happens, I observed that the sender sometimes does not pick up a published table, even when the receiver that started the sender process has seen the table as available (as returned by pg_get_publication_tables()) and has thus begun COPYing its data. When the COPY finishes (and the tablesync worker is done), the apply loop on the receiver expects to receive (and apply) subsequent changes for such tables, but simply isn't sent any. This was observed by dumping every CopyData message sent over the wire.

The attached script (like the original migration plan) uses a single publication and adds tables to it successively. Curiously, when the script was changed to use a dedicated publication per table (and thus ALTER SUBSCRIPTION ... ADD PUBLICATION instead of ALTER SUBSCRIPTION ... REFRESH PUBLICATION), the number of tables with data loss jumped to 100%.

--
#!/usr/bin/env regards
Chhatoi Pritam Baral
sub-loss-repro.sh
Description: application/shellscript