While testing Fujii-san's patch to flush any already-written WAL on error in walreceiver, I added a debugging elog to XLogWalRcvFlush() to print out how far it has written and flushed.

I saw an unexpected pattern while the standby catches up with the master:

LOG:  streaming replication successfully connected to primary
LOG:   flushing flush=0/0 write=0/1E020000
LOG:   flushing flush=0/1E020000 write=0/1E040000
LOG:   flushing flush=0/1E040000 write=0/1E060000
LOG:   flushing flush=0/1E060000 write=0/1E080000
LOG:   flushing flush=0/1E080000 write=0/1E0A0000
LOG:   flushing flush=0/1E0A0000 write=0/1E0C0000
LOG:   flushing flush=0/1E0C0000 write=0/1E0E0000
LOG:   flushing flush=0/1E0E0000 write=0/1E100000
LOG:   flushing flush=0/1E100000 write=0/1E120000
LOG:   flushing flush=0/1E120000 write=0/1E140000

The master sends the WAL at full-speed, but walreceiver always fsyncs it in 128 kB chunks. That's excessive, it should be able to read and write to disk the whole WAL segment before flushing.

There's a little flaw in the walreceiver logic that tries to read all the available WAL before flushing and sleeping. The way libpqrcv_receive is written, when it's called with timeout==0 it will not call PQconsumeInput. So what happens is that when walreceiver main loop calls libpqrcv_receive() in a loop to fetch all WAL that's available without blocking, it actually only reads the WAL that's in the libpq receive buffer - it will not read the WAL that's in the TCP read buffer.

Attached patch fixes libpqrcv_receive() so that it calls PQconsumeInput() before concluding that there's no data available. The excessive fsyncing can lead to very bad performance, so this needs to be appled to 9.0 too.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 5aac85d..9e8504b 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -351,28 +351,33 @@ libpqrcv_receive(int timeout, unsigned char *type, char **buffer, int *len)
 		PQfreemem(recvBuf);
 	recvBuf = NULL;
 
-	/*
-	 * If the caller requested to block, wait for data to arrive. But if this
-	 * is the first call after connecting, don't wait, because there might
-	 * already be some data in libpq buffer that we haven't returned to
-	 * caller.
-	 */
-	if (timeout > 0 && !justconnected)
+	/* Try to receive a CopyData message */
+	rawlen = PQgetCopyData(streamConn, &recvBuf, 1);
+	if (rawlen == 0)
 	{
-		if (!libpq_select(timeout))
-			return false;
+		/*
+		 * No data available yet. If the caller requested to block, wait for
+		 * more data to arrive. But if this is the first call after connecting,
+		 * don't wait, because there might already be some data in libpq buffer
+		 * that we haven't returned to caller.
+		 */
+		if (timeout > 0 && !justconnected)
+		{
+			if (!libpq_select(timeout))
+				return false;
+		}
+		justconnected = false;
 
 		if (PQconsumeInput(streamConn) == 0)
 			ereport(ERROR,
 					(errmsg("could not receive data from WAL stream: %s",
 							PQerrorMessage(streamConn))));
-	}
-	justconnected = false;
 
-	/* Receive CopyData message */
-	rawlen = PQgetCopyData(streamConn, &recvBuf, 1);
-	if (rawlen == 0)			/* no data available yet, then return */
-		return false;
+		/* Now that we've consumed some input, try again */
+		rawlen = PQgetCopyData(streamConn, &recvBuf, 1);
+		if (rawlen == 0)
+			return false;
+	}
 	if (rawlen == -1)			/* end-of-streaming or error */
 	{
 		PGresult   *res;
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to