On Thu, May 19, 2022 at 3:16 PM Amit Kapila <amit.kapil...@gmail.com> wrote:
>
> On Thu, May 19, 2022 at 12:28 PM Kyotaro Horiguchi
> <horikyota....@gmail.com> wrote:
> >
> > At Thu, 19 May 2022 14:26:56 +1000, Peter Smith <smithpb2...@gmail.com> 
> > wrote in
> > > Hi hackers.
> > >
> > > FYI, I saw that there was a recent Build-farm error on the "grison" 
> > > machine [1]
> > > [1] 
> > > https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=grison&br=HEAD
> > >
> > > The error happened during "subscriptionCheck" phase in the TAP test
> > > t/031_column_list.pl
> > > This test file was added by this [2] commit.
> > > [2] 
> > > https://github.com/postgres/postgres/commit/923def9a533a7d986acfb524139d8b9e5466d0a5
> >
> > What seems to be happening in all of these failures is that a
> > publication created by CREATE PUBLICATION, which reported no failure,
> > is not visible to a walsender that starts later. It seems that either
> > CREATE PUBLICATION can silently fail to create a publication, or the
> > walsender somehow failed to find the existing one.
> >
>
> Do you see anything in LOGS which indicates CREATE SUBSCRIPTION has failed?
>
> >
> > > ~~
> > >
> >
> > 2022-04-17 00:16:04.278 CEST [293659][client 
> > backend][4/270:0][031_column_list.pl] LOG:  statement: CREATE PUBLICATION 
> > pub9 FOR TABLE test_part_d (a) WITH (publish_via_partition_root = true);
> > 2022-04-17 00:16:04.279 CEST [293659][client 
> > backend][:0][031_column_list.pl] LOG:  disconnection: session time: 
> > 0:00:00.002 user=bf database=postgres host=[local]
> >
> > "CREATE PUBLICATION pub9" is executed at 00:16:04.278 in backend
> > 293659, and then the session disconnects. But the following request
> > for the same publication fails due to the absence of the publication.
> >
> > 2022-04-17 00:16:08.147 CEST [293856][walsender][3/0:0][sub1] STATEMENT:  
> > START_REPLICATION SLOT "sub1" LOGICAL 0/153DB88 (proto_version '3', 
> > publication_names '"pub9"')
> > 2022-04-17 00:16:08.148 CEST [293856][walsender][3/0:0][sub1] ERROR:  
> > publication "pub9" does not exist
> >
>
> This happens after "ALTER SUBSCRIPTION sub1 SET PUBLICATION pub9". The
> probable theory is that ALTER SUBSCRIPTION leads to a restart of the
> apply worker (which we can see in the logs as well), and after the
> restart the apply worker uses the existing slot and replication origin
> corresponding to the subscription. Now, it is possible that before the
> restart the origin had not been updated, so the WAL start location
> points to a location prior to where PUBLICATION pub9 exists, which can
> lead to such an error. Once this error occurs, the apply worker will
> never be able to proceed and will always return the same error. Does
> this make sense?
>
> Unless you or others see a different theory, this seems to be an
> existing problem in logical replication which is manifested by this
> test. If we just want to fix these test failures, we can create a new
> subscription instead of altering the existing subscription to point to
> the new publication.
>

If the above theory is correct, then I think waiting for the subscriber
to catch up with "$node_publisher->wait_for_catchup('sub1');" before
ALTER SUBSCRIPTION should fix this problem. If the publisher and
subscriber are in sync before the ALTER, then the new publication
should be visible to the walsender.
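
To make the theory concrete, here is a toy model in Python. It is not
PostgreSQL code: the WAL is a plain list, an LSN is a list index, and
the function name start_replication and the record kinds are made up
for illustration. It assumes the walsender only sees the publications
that existed as of the WAL position it is decoding.

```python
def start_replication(wal, start_lsn, publication):
    """Toy walsender: decode `wal` from index `start_lsn` onward.

    Assumption for this sketch: a publication is visible to the decoder
    only once its "create_publication" record has been reached.
    """
    # Publications created before the start position are already visible.
    known = {name for kind, name in wal[:start_lsn]
             if kind == "create_publication"}
    for kind, name in wal[start_lsn:]:
        if kind == "create_publication":
            known.add(name)
        elif kind == "change" and publication not in known:
            # Mirrors the error text seen in the buildfarm log.
            raise LookupError(f'publication "{publication}" does not exist')
    return "streaming"


# Hypothetical WAL: one change made before pub9 exists, then pub9, then more.
wal = [
    ("change", "test_part_d"),
    ("create_publication", "pub9"),
    ("change", "test_part_d"),
]

# Origin not yet advanced: decoding restarts before pub9 was created.
try:
    start_replication(wal, 0, "pub9")
    stale_result = "streaming"
except LookupError as err:
    stale_result = str(err)

# Subscriber caught up past the first change before the ALTER.
synced_result = start_replication(wal, 1, "pub9")
```

In this model the stale start position hits the earlier change while
pub9 is still unknown and fails every time it retries, whereas a start
position at or after the point the subscriber had already applied does
not.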

-- 
With Regards,
Amit Kapila.