On Apr 14, 2013, at 17:56, Fujii Masao <masao.fu...@gmail.com> wrote:
> At fast shutdown, after walsender sends the checkpoint record and
> closes the replication connection, walreceiver can detect the close
> of connection before receiving all WAL records. This means that,
> even if walsender sends all WAL records, walreceiver cannot always
> receive all of them.

That sounds like a bug in walreceiver to me. The following code in
walreceiver's main loop looks suspicious:

    /*
     * Process the received data, and any subsequent data we
     * can read without blocking.
     */
    for (;;)
    {
        if (len > 0)
        {
            /* Something was received from master, so reset timeout */
            ...
            XLogWalRcvProcessMsg(buf[0], &buf[1], len - 1);
        }
        else if (len == 0)
            break;
        else if (len < 0)
        {
            ereport(LOG,
                    (errmsg("replication terminated by primary server"),
                     errdetail("End of WAL reached on timeline %u at %X/%X",
                               startpointTLI,
                               (uint32) (LogstreamResult.Write >> 32),
                               (uint32) LogstreamResult.Write)));
            ...
        }
        len = walrcv_receive(0, &buf);
    }

    /* Let the master know that we received some data. */
    XLogWalRcvSendReply(false, false);

    /*
     * If we've written some records, flush them to disk and
     * let the startup process and primary server know about
     * them.
     */
    XLogWalRcvFlush(false);

The loop at the top looks fine - it specifically avoids throwing an
error on EOF. But the code then proceeds to XLogWalRcvSendReply(),
which doesn't seem to have the same smarts - it simply does

    if (PQputCopyData(streamConn, buffer, nbytes) <= 0 ||
        PQflush(streamConn))
        ereport(ERROR,
                (errmsg("could not send data to WAL stream: %s",
                        PQerrorMessage(streamConn))));

Unless I'm missing something, that certainly seems to explain
how a standby can lag behind even after a controlled shutdown of
the master.

best regards,
Florian Pflug
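
For illustration only, here is a minimal standalone sketch of the pattern described above. It is not walreceiver code: MockConn, mock_receive(), mock_send_reply(), run_cycle() and the skip_reply_on_eof switch are all made-up names, and the sketch assumes that a failed send aborts the cycle before the flush step is reached, the way an ereport(ERROR) would. It only shows why a receive loop that tolerates EOF can still lose its flush if the reply that follows does not.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a replication connection (not a libpq type). */
typedef struct MockConn
{
    const int  *msgs;        /* lengths of messages still queued locally */
    size_t      nmsgs;       /* number of entries in msgs */
    size_t      next;        /* index of the next message to hand out */
    bool        peer_closed; /* peer has already closed the connection */
} MockConn;

/* What one pass through the receive/reply/flush cycle achieved. */
typedef struct CycleResult
{
    int         processed;   /* messages handed to the processing step */
    bool        flushed;     /* was the flush step reached? */
    bool        errored;     /* did the cycle abort on a send failure? */
} CycleResult;

/* Returns > 0 for a message, 0 for "nothing available now", < 0 on EOF. */
static int
mock_receive(MockConn *conn)
{
    if (conn->next < conn->nmsgs)
        return conn->msgs[conn->next++];
    return conn->peer_closed ? -1 : 0;
}

/* Sending to a peer that has closed the connection fails. */
static bool
mock_send_reply(MockConn *conn)
{
    return !conn->peer_closed;
}

/*
 * One cycle: drain everything readable, reply to the peer, then flush.
 * With skip_reply_on_eof = false the reply is attempted unconditionally,
 * so a closed peer aborts the cycle before the flush.  With
 * skip_reply_on_eof = true (a hypothetical guard) the reply is skipped
 * once EOF has been seen, and the flush still happens.
 */
static CycleResult
run_cycle(MockConn *conn, bool skip_reply_on_eof)
{
    CycleResult res = {0, false, false};
    bool        eof = false;

    assert(conn != NULL);

    for (;;)
    {
        int len = mock_receive(conn);

        if (len > 0)
            res.processed++;     /* stands in for processing the message */
        else if (len == 0)
            break;               /* nothing more to read right now */
        else
        {
            eof = true;          /* EOF is noted, not treated as an error */
            break;
        }
    }

    if (!(eof && skip_reply_on_eof))
    {
        if (!mock_send_reply(conn))
        {
            res.errored = true;  /* assumed analogue of ereport(ERROR) */
            return res;          /* the flush below is never reached */
        }
    }

    res.flushed = true;          /* stands in for flushing to disk */
    return res;
}
```

Calling run_cycle() with skip_reply_on_eof = false on a connection whose peer has already closed processes every queued message but returns with errored set and flushed unset. With skip_reply_on_eof = true the same input ends with flushed set. Whether skipping the reply is the right fix for walreceiver itself is a separate question that this sketch does not answer.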