On Sun, Mar 18, 2018 at 08:49:01AM +0900, Michael Paquier wrote:
> On Fri, Mar 16, 2018 at 06:02:25AM +0000, Tsunakawa, Takayuki wrote:
>> Ouch, you're right.  If memory allocation fails, the startup process
>> would emit a LOG message and continue to fetch new WAL records.  Then,
>> I'm completely happy with your patch.
>
> Thanks for double-checking, Tsunakawa-san.
As this is one of those small bug fixes we can act on, attached is an
updated patch with a commit description and tutti quanti.  In the end, I
have moved the size check within allocate_recordbuf().  Even if the
calculated size is not going to be higher than total_len, it feels safer
in the long term for two reasons:
- The allocation size calculation could be changed, or let's say made
  smarter.
- New callers of allocate_recordbuf() would not be able to see the
  problem for the backend, which could become an issue if the WAL reader
  facility is extended.

Thoughts?
--
Michael
From 7688dceee5590b8430a90cfae2b98917f710e824 Mon Sep 17 00:00:00 2001
From: Michael Paquier <mich...@paquier.xyz>
Date: Tue, 12 Jun 2018 14:52:35 +0900
Subject: [PATCH] Prevent hard failures of standbys caused by recycled WAL
 segments

When a standby stops reading WAL from a WAL stream, it writes data to
the current WAL segment without having previously zeroed the page
written to, which can cause the WAL reader to read junk data from a
past recycled segment and then try to fetch a record from it.  While
the sanity checks in place provide most of the protection needed, in
some rare circumstances, with chances increasing when a record header
crosses a page boundary, the startup process could fail violently on
an allocation failure.  This is unhelpful, as in the worst case it
would require a manual restart of the instance, potentially impacting
the availability of the cluster.  The chances of seeing such failures
are higher if the connection between the standby and its upstream node
is flaky, causing WAL pages to be written halfway through.

A couple of approaches have been discussed, like zeroing new WAL pages
within the WAL receiver itself, but this has the disadvantage of
impacting the performance of existing instances, as it breaks the
sequential writes done by the WAL receiver.

This commit deals with the problem with a simpler approach, which has
no performance impact and does not reduce the detection of the
problem: if a record is found with a length higher than 1GB in the
backend, then do not try any allocation and report a soft failure
instead.  It could be possible that the allocation call passes and
that an unnecessary amount of memory is allocated, but follow-up
checks on the record would just fail, making the allocation
short-lived anyway.

This patch owes a great deal to Tsunakawa Takayuki for reporting the
failure first, and then discussing a couple of potential approaches to
the problem.
Reported-by: Tsunakawa Takayuki
Reviewed-by: Tsunakawa Takayuki
Author: Michael Paquier
Discussion: https://postgr.es/m/0A3221C70F24FB45833433255569204D1F8B57AD@G01JPEXMBYT05
---
 src/backend/access/transam/xlogreader.c | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index a7953f323b..5929ce30af 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -25,6 +25,10 @@
 #include "common/pg_lzcompress.h"
 #include "replication/origin.h"
 
+#ifndef FRONTEND
+#include "utils/memutils.h"
+#endif
+
 static bool allocate_recordbuf(XLogReaderState *state, uint32 reclength);
 
 static bool ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
@@ -160,6 +164,27 @@ allocate_recordbuf(XLogReaderState *state, uint32 reclength)
 	newSize += XLOG_BLCKSZ - (newSize % XLOG_BLCKSZ);
 	newSize = Max(newSize, 5 * Max(BLCKSZ, XLOG_BLCKSZ));
 
+#ifndef FRONTEND
+
+	/*
+	 * Note that in very unlucky circumstances, the random data read from a
+	 * recycled segment can cause this routine to be called with a size
+	 * causing a hard failure at allocation.  While for frontend utilities
+	 * this likely means that the end of WAL has been reached, for a standby
+	 * this would cause the instance to stop suddenly with a hard failure,
+	 * preventing it from retrying to fetch WAL from one of its sources,
+	 * which could allow it to move on with replay without a manual restart.
+	 * If the data comes from a past recycled segment and is still valid,
+	 * then the allocation would succeed but record checks are going to
+	 * fail, so this would be short-lived.  If the allocation fails because
+	 * of a memory shortage, then this is not a hard failure either per the
+	 * guarantee given by MCXT_ALLOC_NO_OOM.
+	 */
+	if (!AllocSizeIsValid(newSize))
+		return false;
+
+#endif
+
 	if (state->readRecordBuf)
 		pfree(state->readRecordBuf);
 	state->readRecordBuf =
-- 
2.17.1