On Tue, Feb 7, 2023 at 8:18 AM shiy.f...@fujitsu.com <shiy.f...@fujitsu.com> wrote:
>
> On Thu, Feb 2, 2023 11:48 AM shveta malik <shveta.ma...@gmail.com> wrote:
> >
> > So to fix this, I think either we update origin and slot entries in
> > the system catalog after the creation has passed or we clean-up the
> > system catalog in case of failure. What do you think?
> >
>
> I think the first way seems better.

Yes, I agree.

>
> I reproduced the problem I reported before with latest patch (v7-0001,
> v10-0002), and looked into this problem. It is caused by a similar reason.
> Here is some analysis for the problem I reported [1].#6.
>
> First, a tablesync worker (worker-1) started for "tbl1", its originname is
> "pg_16398_1". And it exited because of unique constraint. In
> LogicalRepSyncTableStart(), originname in pg_subscription_rel is updated
> when updating table state to DATASYNC, and the origin is created when
> updating table state to FINISHEDCOPY. So when it exited with state
> DATASYNC, the origin is not created but the originname has been updated
> in pg_subscription_rel.
>
> Then a tablesync worker (worker-2) started for "tbl2", its originname is
> "pg_16398_2". After tablesync of "tbl2" finished, this worker moved to sync
> table "tbl1". In LogicalRepSyncTableStart(), it got the originname of
> "tbl1" - "pg_16398_1", by calling ReplicationOriginNameForLogicalRep(), and
> tried to drop the origin (although it is not actually created before).
> After that, it called replorigin_by_name to get the originid whose name is
> "pg_16398_1" and the result is InvalidOid. Origin won't be created in this
> case because the sync worker has created a replication slot (when it
> synced tbl2), so the originid was still invalid and it caused an assertion
> failure when calling replorigin_advance().
>
> It seems we don't need to drop previous origin in worker-2 because the
> previous origin was not created in worker-1. I think one way to fix it is
> to not update originname of pg_subscription_rel when setting state to
> DATASYNC, and only do that when setting state to FINISHEDCOPY. If so, the
> originname in pg_subscription_rel will be set at the same time the origin
> is created.

+1. Update of system-catalog needs to be done carefully and only when
origin is created.

thanks
Shveta
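
For illustration only, the ordering issue analysed above can be modelled
with a small toy sketch. This is not PostgreSQL code: ToySubscription,
sync_buggy, sync_fixed, dangling_names and dangling_after_failed_copy are
hypothetical names, the dict merely stands in for the originname column of
pg_subscription_rel, and the set stands in for the origins that actually
exist. It only shows why recording the name before the origin is created
can leave a name that points at nothing when the copy fails.

```python
# Toy model of the ordering problem; every name here is hypothetical
# and nothing below is taken from the PostgreSQL sources.


class ToySubscription:
    def __init__(self):
        # Stands in for the originname stored in pg_subscription_rel.
        self.catalog_originname = {}
        # Stands in for the replication origins that really exist.
        self.origins = set()


def sync_buggy(sub, table, originname, fail_during_copy):
    # Name is recorded on the transition to DATASYNC,
    # before the origin has been created.
    sub.catalog_originname[table] = originname
    if fail_during_copy:
        raise RuntimeError("copy failed, e.g. on a unique constraint")
    # Origin is created only on the transition to FINISHEDCOPY.
    sub.origins.add(originname)


def sync_fixed(sub, table, originname, fail_during_copy):
    if fail_during_copy:
        raise RuntimeError("copy failed, e.g. on a unique constraint")
    # Create the origin and record its name together at FINISHEDCOPY.
    sub.origins.add(originname)
    sub.catalog_originname[table] = originname


def dangling_names(sub):
    # Names recorded in the catalog for which no origin exists.
    return {
        table: name
        for table, name in sub.catalog_originname.items()
        if name not in sub.origins
    }


def dangling_after_failed_copy(sync_fn):
    # Simulate worker-1 failing while syncing "tbl1".
    sub = ToySubscription()
    try:
        sync_fn(sub, "tbl1", "pg_16398_1", fail_during_copy=True)
    except RuntimeError:
        pass
    return dangling_names(sub)
```

With the buggy ordering, a failed copy leaves "pg_16398_1" recorded for
"tbl1" even though no such origin exists, which is the state worker-2 later
trips over. With the fixed ordering, nothing is recorded until the origin
has been created.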