On Mon, Dec 21, 2020 at 11:36 PM Amit Kapila <amit.kapil...@gmail.com> wrote: > > On Mon, Dec 21, 2020 at 3:17 PM Peter Smith <smithpb2...@gmail.com> wrote: > > > > On Mon, Dec 21, 2020 at 4:23 PM Amit Kapila <amit.kapil...@gmail.com> wrote: > > > > > Few other comments: > > > ================== > > > > Thanks for your feedback. > > > > > 1. > > > * FIXME 3 - Crashed tablesync workers may also have remaining slots > > > because I don't think > > > + * such workers are even iterated by this loop, and nobody else is > > > removing them. > > > + */ > > > + if (slotname) > > > + { > > > > > > The above FIXME is not clear to me. Actually, the crashed workers > > > should restart, finish their work, and drop the slots. So not sure > > > what exactly this FIXME refers to? > > > > Yes, normally if the tablesync can complete it should behave like that. > > But I think there are other scenarios where it may be unable to > > clean-up after itself. For example: > > > > i) Maybe the crashed tablesync worker cannot finish. e.g. A row insert > > handled by tablesync can give a PK violation which also will crash > > again and again for each re-launched/replacement tablesync worker. > > This can be reproduced in the debugger. If the DropSubscription > > doesn't clean-up the tablesync's slot then nobody will. > > > > ii) Also DROP SUBSCRIPTION code has locking (see code commit) "to > > ensure that the launcher doesn't restart new worker during dropping > > the subscription". > > > > Yeah, I have also read that comment but do you know how it is > preventing relaunch? How does the subscription lock help?
Hmmm. I did see there is a matching lock in get_subscription_list of launcher.c, which may be what that code comment was referring to. But I also am currently unsure how this lock prevents anybody (e.g. process_syncing_tables_for_apply) from executing another logicalrep_worker_launch. > > > So executing DROP SUBSCRIPTION will prevent a newly > > crashed tablesync from re-launching, so it won’t be able to take care > > of its own slot. If the DropSubscription doesn't clean-up that > > tablesync's slot then nobody will. > > > > > > > > > > 2. > > > DropSubscription() > > > { > > > .. > > > ReplicationSlotDropAtPubNode( > > > + NULL, > > > + conninfo, /* use conninfo to make a new connection. */ > > > + subname, > > > + syncslotname); > > > .. > > > } > > > > > > With the above call, it will form a connection with the publisher and > > > drop the required slots. I think we need to save the connection info > > > so that we don't need to connect/disconnect for each slot to be > > > dropped. Later in this function, we again connect and drop the apply > > > worker slot. I think we should connect just once drop the apply and > > > table sync slots if any. > > > > OK. IIUC this is a suggestion for more efficient connection usage, > > rather than actual bug right? > > > > Yes, it is for effective connection usage. > I have addressed this in the latest patch [v6] > > > > > > > > 3. > > > ReplicationSlotDropAtPubNode(WalReceiverConn *wrconn_given, char > > > *conninfo, char *subname, char *slotname) > > > { > > > .. > > > + PG_TRY(); > > > .. > > > + PG_CATCH(); > > > + { > > > + /* NOP. Just gobble any ERROR. */ > > > + } > > > + PG_END_TRY(); > > > > > > Why are we suppressing the error instead of handling it the error in > > > the same way as we do while dropping the apply worker slot in > > > DropSubscription? > > > > This function is common - it is also called from the tablesync > > finish_sync_worker. But in the finish_sync_worker case I wanted to > > avoid throwing an ERROR which would cause the tablesync to crash and > > relaunch (and crash/relaunch/repeat...) when all it was trying to do > > in the first place was just cleanup and exit the process. Perhaps the > > error suppression should be conditional depending where this function > > is called from? > > > > Yeah, that could be one way and if you follow my previous suggestion > this function might change a bit more. I have addressed this in the latest patch [v6] --- [v6] https://www.postgresql.org/message-id/CAHut%2BPuCLty2HGNT6neyOcUmBNxOLo%3DybQ2Yv-nTR4kFY-8QLw%40mail.gmail.com Kind Regards, Peter Smith. Fujitsu Australia.