I wrote:
> More to the point, aren't these proposals just band-aids that
> would stabilize the test without fixing the actual problem?
> The same thing is likely to happen to people in the field,
> unless we do something drastic like removing ALTER SUBSCRIPTION.

I've been able to make the 031_column_list.pl failure pretty reproducible by adding a delay in walsender, as attached.

While I'm not too familiar with this code, it definitely does appear that the new walsender is told to start up at an LSN before the creation of the publication, and then if it needs to decide whether to stream a particular data change before it's reached that creation, kaboom!

I read and understood the upthread worries about it not being a great idea to ignore publication lookup failures, but I really don't see that we have much choice. As an example, if a subscriber is humming along reading publication pub1, and then someone drops and then recreates pub1 on the publisher, I don't think that the subscriber will be able to advance through that gap if there are any operations within it that require deciding if they should be streamed. (That is, contrary to Amit's expectation that DROP/CREATE would mask the problem, I suspect it will instead turn it into a hard failure. I've not experimented though.)

BTW, this same change breaks two other subscription tests: 015_stream.pl and 022_twophase_cascade.pl. The symptoms are different (no "publication does not exist" errors), so maybe these are just test problems not fundamental weaknesses. But "replication falls over if the walsender is slow" isn't something I'd call acceptable.

			regards, tom lane

diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 77c8baa32a..c897ef4c60 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -2699,6 +2699,8 @@ WalSndLoop(WalSndSendDataCallback send_data)
 			!pq_is_send_pending())
 			break;
 
+		pg_usleep(10000);
+
 		/*
 		 * If we don't have any pending data in the output buffer, try to send
 		 * some more. If there is some, we don't bother to call send_data