> The first problem I noticed is that the slave never seems to realize
> that the master has gone away.  Every time I crashed the master, I had
> to kill the wal receiver process on the slave to get it to reconnect;
> otherwise it just sat there waiting, either forever or at least for
> longer than I was willing to wait.
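For what it's worth, the hung-walreceiver symptom described above can sometimes be narrowed at the TCP level: primary_conninfo is a libpq connection string, so the standby's connection can request aggressive keepalives.  A hedged sketch (host and values are illustrative, and this assumes libpq's keepalive parameters are available in the build in question):

```
# recovery.conf on the standby -- keepalive values are illustrative only
primary_conninfo = 'host=master.example.com port=5432 user=replication keepalives=1 keepalives_idle=30 keepalives_interval=10 keepalives_count=3'
```

With settings like these the standby's kernel would declare the connection dead after roughly a minute of silence, rather than waiting indefinitely.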
Yes, I've noticed this.  That was the reason for forcing walreceiver to
shut down on a restart, per prior discussion and patches.  This needs to
be on the open items list ... possibly it'll be fixed by Simon's
keepalive patch?  Or is it just a tcp_keepalive issue?

> More seriously, I was able to demonstrate that the problem linked in
> the thread above is real: if the master crashes after streaming WAL
> that it hasn't yet fsync'd, then on recovery the slave's xlog position
> is ahead of the master.  So far I've only been able to reproduce this
> with fsync=off, but I believe it's possible anyway,

... and some users will turn fsync off.  This is, in fact, one of the
primary uses for streaming replication: durability via replicas.

> and this just makes it more likely.  After the most recent crash, the
> master thought pg_current_xlog_location() was 1/86CD4000; the slave
> thought pg_last_xlog_receive_location() was 1/8733C000.  After
> reconnecting to the master, the slave then thought that
> pg_last_xlog_receive_location() was 1/87000000.

So, *in this case*, detecting out-of-sequence xlogs (and PANICing)
would have actually prevented the slave from being corrupted.

My question, though, is: is detecting out-of-sequence xlogs *enough*?
Are there any crash conditions on the master which would cause the
master to reuse the same locations for different records, for example?
I don't think so, but I'd like to be certain.

--
Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
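For what it's worth, the out-of-sequence condition is easy to state mechanically: an xlog location of the form X/YYYYYYYY is just a 64-bit position split into two hex fields, so "slave ahead of master" is a plain numeric comparison.  A rough sketch using the positions quoted above (parse_lsn is a hypothetical helper, not PostgreSQL code):

```python
# Sketch: compare xlog locations of the form 'logid/offset' (both hex).
# parse_lsn is a made-up helper for illustration, not part of PostgreSQL.

def parse_lsn(lsn: str) -> int:
    """Convert 'X/YYYYYYYY' into a single comparable 64-bit integer."""
    logid, offset = lsn.split("/")
    return (int(logid, 16) << 32) | int(offset, 16)

# Positions reported after the crash described above:
master = parse_lsn("1/86CD4000")  # pg_current_xlog_location() on the master
slave = parse_lsn("1/8733C000")   # pg_last_xlog_receive_location() on the slave

# The slave being numerically ahead of the master is exactly the
# out-of-sequence condition a PANIC check would have to catch:
slave_is_ahead = slave > master
```

The comparison itself is cheap; the open question above is whether it is a sufficient test, i.e. whether a crashed master could ever legitimately reuse a location for a different record.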