On 23 February 2015 at 15:38, Andres Freund <and...@2ndquadrant.com> wrote:
> Hi,
>
> On 2015-02-23 15:25:57 +0000, Thom Brown wrote:
> > I've noticed that if the primary is started and then a base backup is
> > immediately taken from it and started as a synchronous standby, it
> > doesn't replicate and the primary hangs indefinitely when trying to run
> > any WAL-generating statements. It only recovers when either the primary
> > is restarted (which has to use a fast shutdown otherwise it also hangs
> > forever), or the standby is restarted.
> >
> > Here's a way of reproducing it:
> > ...
> > Note that if you run the commands one by one, there isn't a problem. If
> > you run it as a script, the standby doesn't connect to the primary.
> > There aren't any errors reported by either the standby or the primary.
> > The primary's wal sender process reports the following:
> >
> > wal sender process rep_user 127.0.0.1(45243) startup waiting for 0/3000158
> >
> > Anyone know why this would be happening? And if this could be a problem
> > in other scenarios?
>
> Given that normally a walsender doesn't wait for syncrep I guess this is
> the above backend just did authentication. If you gdb into the
> walsender, what's the backtrace?
>

#0  0x00007f66d1725940 in poll () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x0000000000617faa in WaitLatchOrSocket ()
#2  0x000000000064741b in SyncRepWaitForLSN ()
#3  0x00000000004bbf8f in CommitTransaction ()
#4  0x00000000004be135 in CommitTransactionCommand ()
#5  0x0000000000757679 in InitPostgres ()
#6  0x0000000000675032 in PostgresMain ()
#7  0x00000000004617ef in ServerLoop ()
#8  0x0000000000627c9c in PostmasterMain ()
#9  0x000000000046223d in main ()

--
Thom