On Monday, December 9, 2024 9:21 PM Pritam Baral <pri...@pritambaral.com> wrote:
> To: pgsql-hackers <pgsql-hack...@postgresql.org>
> Subject: Subscription sometimes loses txns after initial table sync
>
> This was discovered when testing the plan for a major version upgrade via
> logical replication. Said plan requires that some tables be synced before
> others. So I implemented it using ALTER PUBLICATION ... ADD TABLE ...
> followed by ALTER SUBSCRIPTION ... REFRESH PUBLICATION. A test for
> correctness revealed that sometimes, for some tables added this way, txns
> after the initial data copy are lost by the subscription.
>
> A reproducer script is attached. It has been tested with PG 17.2, 14.15, and
> even 12.22 (on either side of the replication setup). The script runs at a
> default scale of 100 tables with 10k inserts each. This scale is enough to
> demonstrate a failure rate of 1% to 9% of tables on my modest laptop.
>
> In attempts to analyse why this happens, it has been observed that the
> sender sometimes does not pick up a published table, even when the receiver
> that started the sender process has seen the table as available (as
> returned by pg_get_publication_tables()) and has thus begun COPYing its
> data. When the COPY finishes (and the tablesync worker is finished), the
> apply loop on the receiver expects to receive (and apply) subsequent changes
> for such tables, but simply isn't sent any. This was observed by dumping
> every CopyData message sent over the wire.
>
> The attached script (like the original migration plan) uses a single
> publication and adds tables to it successively. Curiously, when the script
> was changed to use a dedicated publication per table (and thus, ALTER
> SUBSCRIPTION ... ADD PUBLICATION instead of ALTER SUBSCRIPTION ... REFRESH
> PUBLICATION), the no. of tables with data loss jumped to 100%.
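
For readers without the attached reproducer, the staged workflow described in the quoted report (add tables to a single publication in batches, refreshing the subscription after each batch) can be sketched roughly as below. This is not the reporter's script: the function names, the publication name "pub", the subscription name "sub", and the table names are all hypothetical. The sketch only builds the SQL text; it does not connect to a server, and it does not wait for one batch to finish syncing before emitting the next.

```python
from typing import Iterator, List, Tuple


def quote_ident(name: str) -> str:
    """Double-quote a single SQL identifier, doubling any embedded quotes.

    Note: schema-qualified names (schema.table) are not handled here.
    """
    return '"' + name.replace('"', '""') + '"'


def staged_sync_statements(
    publication: str, subscription: str, batches: List[List[str]]
) -> Iterator[Tuple[str, str]]:
    """Yield (node, sql) pairs for adding tables to one publication in stages.

    For each batch, the publisher adds the tables to the publication, then
    the subscriber refreshes so that it starts the initial copy of them.
    """
    pub = quote_ident(publication)
    sub = quote_ident(subscription)
    for batch in batches:
        tables = ", ".join(quote_ident(t) for t in batch)
        yield ("publisher", f"ALTER PUBLICATION {pub} ADD TABLE {tables};")
        yield ("subscriber", f"ALTER SUBSCRIPTION {sub} REFRESH PUBLICATION;")


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    for node, sql in staged_sync_statements("pub", "sub", [["t1"], ["t2", "t3"]]):
        print(f"-- on {node}")
        print(sql)
```

A real migration that needs some tables synced before others would presumably also check between batches that the previous tables have reached the ready state on the subscriber (for example via srsubstate in pg_subscription_rel) before continuing.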

Thanks for reporting the issue.

The described behavior looks similar to another bug discussed in [1]. If
possible, could you please check whether the latest patch in that thread
fixes the bug you reported? If it does, it would be helpful to share the
feedback in that thread.

[1] https://www.postgresql.org/message-id/flat/de52b282-1166-1180-45a2-8d8917ca74c6%40enterprisedb.com

Best Regards,
Hou zj