On 16.08.2011 16:25, Simon Riggs wrote:
> On Tue, Aug 16, 2011 at 9:55 AM, Fujii Masao <[email protected]> wrote:
>> When I tested PITR on git master with max_wal_senders > 0, I found that the following inappropriate log message was always output even though cascading replication was not in progress. The attached patch fixes this problem.
>>
>> LOG: terminating all walsender processes to force cascaded standby(s) to update timeline and reconnect
>>
>> When making the patch, I found another problem with cascading replication: when promoting a cascading standby, postmaster sends SIGUSR2 to any cascading walsenders to kill them. But there is a corner case where such a walsender fails to receive SIGUSR2 and unexpectedly survives a standby promotion. This happens when postmaster sends SIGUSR2 before the walsender marks itself as a WAL sender, because postmaster sends SIGUSR2 only to the processes marked as WAL senders.
>>
>> To avoid the corner case, I changed walsender so that it checks again whether recovery is in progress after marking itself as a WAL sender. If recovery is not in progress even though the walsender is a cascading one, it does the same thing as the SIGUSR2 signal handler does, and then exits later. The attached patch also includes this fix.
>
> Looks like valid problems and appropriate fixes to me. Will commit.

I think there's a race condition here. If a walsender is just starting up, it might not have registered itself as a walsender yet. The race was actually already there before this patch to suppress the log message.
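
To make the ordering concrete, here is a rough toy model of the window described above, plus the re-check after registration described in the quoted mail. It is only a sketch in Python, not PostgreSQL code: Walsender, Postmaster, promote(), start_walsender() and simulate() are all names made up for illustration, and the "stopping" flag merely stands in for whatever the SIGUSR2 handler does.

```python
# Toy model only: every name here is hypothetical, not a PostgreSQL internal.
from dataclasses import dataclass, field


@dataclass
class Walsender:
    name: str
    cascading: bool = True     # started while the server was in recovery
    registered: bool = False   # "marked itself as a WAL sender"
    stopping: bool = False     # stand-in for the effect of the SIGUSR2 handler


@dataclass
class Postmaster:
    in_recovery: bool = True
    children: list = field(default_factory=list)

    def promote(self):
        """End recovery and signal only the walsenders registered so far."""
        self.in_recovery = False
        for ws in self.children:
            if ws.registered:
                ws.stopping = True


def start_walsender(pm, ws, recheck):
    """Register the walsender, then optionally re-check the recovery state."""
    ws.registered = True
    if recheck and ws.cascading and not pm.in_recovery:
        # Promotion already happened: do what the signal would have done.
        ws.stopping = True


def simulate(recheck):
    pm = Postmaster()
    early = Walsender("early")
    late = Walsender("late")
    pm.children += [early, late]
    start_walsender(pm, early, recheck)  # registers before promotion
    pm.promote()                         # 'late' is not registered yet
    start_walsender(pm, late, recheck)   # registers after the signal went out
    return {ws.name: ws.stopping for ws in pm.children}


if __name__ == "__main__":
    print("without re-check:", simulate(recheck=False))
    print("with re-check:   ", simulate(recheck=True))
```

In this model the walsender that registers late is never told to stop unless it re-checks the recovery state itself after registering. Whether the real code has any further window is not something this sketch can show.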
-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
