On Sun, Mar 18, 2018 at 08:49:01AM +0900, Michael Paquier wrote:
> On Fri, Mar 16, 2018 at 06:02:25AM +0000, Tsunakawa, Takayuki wrote:
>> Ouch, you're right.  If memory allocation fails, the startup process
>> would emit a LOG message and continue to fetch new WAL records.  Then,
>> I'm completely happy with your patch.
> 
> Thanks for double-checking, Tsunakawa-san.

As this is one of those small bug fixes we can actually do something
about, attached is an updated patch with a commit description and all
the rest.  In the end, I have moved the size check into
allocate_recordbuf().  Even if the size calculated there is never
going to be higher than total_len, this feels safer in the long term
for two reasons:
- The allocation size calculation could be changed, or say made
smarter.
- New callers of allocate_recordbuf() would otherwise not get the
protection in the backend, which could become a problem if the WAL
reader facility is extended.
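
For reference, here is a minimal sketch of how a caller can treat the
new soft failure; it follows the existing report_invalid_record()
pattern in xlogreader.c, though the exact call site may differ:

    if (total_len > state->readRecordBufSize &&
        !allocate_recordbuf(state, total_len))
    {
        /* Treat an oversized length as bogus data, not a hard failure */
        report_invalid_record(state, "record length %u at %X/%X too long",
                              total_len,
                              (uint32) (RecPtr >> 32), (uint32) RecPtr);
        goto err;
    }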

Thoughts?
--
Michael
From 7688dceee5590b8430a90cfae2b98917f710e824 Mon Sep 17 00:00:00 2001
From: Michael Paquier <mich...@paquier.xyz>
Date: Tue, 12 Jun 2018 14:52:35 +0900
Subject: [PATCH] Prevent hard failures of standbys caused by recycled WAL
 segments

When a standby stops reading WAL from a WAL stream, it writes data to
the current WAL segment without first zeroing the page written to,
which can cause the WAL reader to read junk data from a past recycled
segment and then try to build a record from it.  While the sanity
checks in place provide most of the protection needed, in some rare
circumstances, more likely when a record header crosses a page
boundary, the startup process could fail hard on an allocation
failure.  This is unhelpful, as in the worst case it would require a
manual restart of the instance, potentially impacting the availability
of the cluster.

The chances of seeing failures are higher if the connection between
the standby and its upstream node is flaky, causing WAL pages to be
only partially written.  A couple of approaches have been discussed,
like zeroing new WAL pages within the WAL receiver itself, but this
has the disadvantage of impacting the performance of existing
instances, as it breaks the sequential writes done by the WAL
receiver.  This commit deals with the problem using a simpler
approach, which has no performance impact and does not reduce the
detection of the problem: if a record is found with a length higher
than 1GB in the backend, then do not attempt any allocation and report
a soft failure instead.  It is possible for the allocation call to
succeed and for an unnecessarily large amount of memory to be
allocated, but follow-up checks on the record would then fail, making
the allocation short-lived anyway.
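
For reference, the backend-side check relies on the existing
AllocSizeIsValid() and MaxAllocSize definitions from
src/include/utils/memutils.h, which cap a single allocation at 1GB:

    /* Existing backend definitions, quoted here for context */
    #define MaxAllocSize	((Size) 0x3fffffff) /* 1 gigabyte - 1 */
    #define AllocSizeIsValid(size)	((Size) (size) <= MaxAllocSize)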

This patch owes a great deal to Tsunakawa Takayuki, who first reported
the failure and then discussed a couple of potential approaches to the
problem.

Reported-by: Tsunakawa Takayuki
Reviewed-by: Tsunakawa Takayuki
Author: Michael Paquier
Discussion: https://postgr.es/m/0A3221C70F24FB45833433255569204D1F8B57AD@G01JPEXMBYT05
---
 src/backend/access/transam/xlogreader.c | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index a7953f323b..5929ce30af 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -25,6 +25,10 @@
 #include "common/pg_lzcompress.h"
 #include "replication/origin.h"
 
+#ifndef FRONTEND
+#include "utils/memutils.h"
+#endif
+
 static bool allocate_recordbuf(XLogReaderState *state, uint32 reclength);
 
 static bool ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
@@ -160,6 +164,27 @@ allocate_recordbuf(XLogReaderState *state, uint32 reclength)
 	newSize += XLOG_BLCKSZ - (newSize % XLOG_BLCKSZ);
 	newSize = Max(newSize, 5 * Max(BLCKSZ, XLOG_BLCKSZ));
 
+#ifndef FRONTEND
+
+	/*
+	 * Note that in very unlucky circumstances, the random data read from a
+	 * recycled segment can cause this routine to be called with a size
+	 * large enough to make the allocation fail hard.  While for frontend
+	 * utilities this likely means that the end of WAL has been reached,
+	 * for a standby this would stop the instance suddenly with a hard
+	 * failure, preventing it from retrying to fetch WAL from one of its
+	 * sources, which could let it move on with replay without a manual
+	 * restart.  If the data comes from a past recycled segment and is
+	 * still valid, then the allocation would succeed but follow-up record
+	 * checks would fail, so the allocation would be short-lived anyway.
+	 * If the allocation fails because of a memory shortage, then this is
+	 * not a hard failure either, per the guarantee of MCXT_ALLOC_NO_OOM.
+	 */
+	if (!AllocSizeIsValid(newSize))
+		return false;
+
+#endif
+
 	if (state->readRecordBuf)
 		pfree(state->readRecordBuf);
 	state->readRecordBuf =
-- 
2.17.1
