On Sat, Jan 23, 2021 at 5:56 PM Amit Kapila <amit.kapil...@gmail.com> wrote:
>
> On Sat, Jan 23, 2021 at 4:55 AM Peter Smith <smithpb2...@gmail.com> wrote:
> >
> > PSA the v18 patch for the Tablesync Solution1.
>
> 7. Have you tested with the new patch the scenario where we crash
> after FINISHEDCOPY and before SYNCDONE? Is it able to pick up the
> replication using the new temporary slot? Here, we need to test the
> case where during the catchup phase we have received a few commits
> and then the tablesync worker has crashed/errored out. Basically,
> check whether the replication is continued from the same point.
>
I have tested this and it didn't work; see the below example.

Publisher-side
================
CREATE TABLE mytbl1(id SERIAL PRIMARY KEY, somedata int, text varchar(120));

BEGIN;
INSERT INTO mytbl1(somedata, text) VALUES (1, 1);
INSERT INTO mytbl1(somedata, text) VALUES (1, 2);
COMMIT;

CREATE PUBLICATION mypublication FOR TABLE mytbl1;

Subscriber-side
================
- Have a while(1) loop in LogicalRepSyncTableStart so that the
tablesync worker stops there.

CREATE TABLE mytbl1(id SERIAL PRIMARY KEY, somedata int, text varchar(120));

CREATE SUBSCRIPTION mysub
  CONNECTION 'host=localhost port=5432 dbname=postgres'
  PUBLICATION mypublication;

During debug, stop after we mark the FINISHEDCOPY state.

Publisher-side
================
INSERT INTO mytbl1(somedata, text) VALUES (1, 3);
INSERT INTO mytbl1(somedata, text) VALUES (1, 4);

Subscriber-side
================
- Have a breakpoint in apply_dispatch.
- Continue in the debugger.
- After we replay the first commit (which will be for values (1, 3)),
note down the origin position in apply_handle_commit_internal and
somehow error out. I forced the debugger to jump to the last line in
apply_dispatch where the error is raised.
- After the error, the tablesync worker is restarted and it starts
from the position noted in the previous step.
- It exits without replaying the WAL for (1, 4).

So, on the subscriber side, you will see 3 records; the fourth is
missing. If you insert more records on the publisher, those will
still be replayed, but the fourth one stays missing.

The temporary slots don't work here because after the crash we create
a new temporary slot and ask it to start decoding from the point we
noted in origin_lsn. The publisher no longer had the required WAL,
because our slot was temporary, so it started sending from some later
point. We retain WAL based on the slots' restart_lsn positions and
wal_keep_size. For our case, the slot positions are what matter, and
as we have created temporary slots, there is no way for the publisher
to keep that WAL around (see the illustrative queries at the end of
this mail). In this particular case, even if the WAL had still been
there, we only pass the start_decoding_at position and don't pass
restart_lsn, so it picked an arbitrary location (the current insert
position in the WAL) which is ahead of the start_decoding_at point,
and hence it never sent the required fourth record. I don't think it
would work even if we somehow sent the correct restart_lsn, because,
as noted above, there is no guarantee that the earlier WAL has been
retained.

At this point, I can't think of any way to fix this problem except
going back to the previous approach of permanent slots, but let me
know if you have any ideas to salvage this approach.

--
With Regards,
Amit Kapila.
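
PS: To make the WAL-retention point above concrete, here is a rough
illustration of what one could observe while reproducing this. These
queries are not part of the original steps; the catalog columns are
standard, but the exact slot name and state values depend on the patch.

-- On the publisher, while the tablesync worker is still alive, its slot
-- shows up as a temporary logical slot; its restart_lsn is the only thing
-- pinning the WAL that the worker could need again after a restart.
SELECT slot_name, slot_type, temporary, restart_lsn, confirmed_flush_lsn
  FROM pg_replication_slots;

-- On the subscriber, the per-table sync state can be checked here; after
-- the crash the table remains in the FINISHEDCOPY state described above.
SELECT srrelid::regclass, srsubstate, srsublsn
  FROM pg_subscription_rel;

Once the tablesync worker errors out, its temporary slot is dropped
automatically, so nothing holds back the WAL between origin_lsn and the
current insert position, which is why the restarted worker can no longer
receive the missing record.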