Hello,

Our customer encountered a rare bug of PostgreSQL which prevents a cascaded 
standby from starting up.  The attached patch is a fix for it.  I hope this 
will be back-patched.  I'll add this to the next CF.


PROBLEM
==============================

The PostgreSQL version is 9.5.  The cluster consists of a master, a cascading 
standby (SB1), and a cascaded standby (SB2).  The WAL flows like this: master 
-> SB1 -> SB2.

The user shut down SB2 and tried to restart it, but failed with the following 
message:

    FATAL:  invalid memory alloc request size 3075129344

The master and SB1 continued to run.


CAUSE
==============================

total_len in the code below was about 3GB, so palloc() rejected the memory 
allocation request.

[xlogreader.c]
        record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ);
        total_len = record->xl_tot_len;
...
        /*
         * Enlarge readRecordBuf as needed.
         */
        if (total_len > state->readRecordBufSize &&
                !allocate_recordbuf(state, total_len))
        {

Why was XLogRecord->xl_tot_len so big?  That's because the XLOG reader read the 
garbage portion in a WAL file, which is immediately after the end of valid WAL 
records.

Why was there the garbage?  Because the cascading standby sends just the valid 
WAL records, not the whole WAL blocks, to the cascaded standby.  When the 
cascaded standby receives those WAL records and write them in a recycled WAL 
file, the last WAL block in the file contains the mix of valid WAL records and 
the garbage left over.

Why did the XLOG reader try to allocate memory for a garbage?  Doesn't it stop 
reading the WAL?  As the following code shows, when the WAL record header 
crosses a WAL block boundary, the XLOG reader first allocates memory for the 
whole record, and then checks the validity of the record header as soon as it 
reads the entire header.

[xlogreader.c]
        /*
         * If the whole record header is on this page, validate it immediately.
         * Otherwise do just a basic sanity check on xl_tot_len, and validate 
the
         * rest of the header after reading it from the next page.  The 
xl_tot_len
         * check is necessary here to ensure that we enter the "Need to 
reassemble
         * record" code path below; otherwise we might fail to apply
         * ValidXLogRecordHeader at all.
         */
        if (targetRecOff <= XLOG_BLCKSZ - SizeOfXLogRecord)
        {
                if (!ValidXLogRecordHeader(state, RecPtr, state->ReadRecPtr, 
record,
                                                                   randAccess))
                        goto err;
                gotheader = true;
        }
        else
        {
                /* XXX: more validation should be done here */
                if (total_len < SizeOfXLogRecord)
                {
                        report_invalid_record(state,
                                                                  "invalid 
record length at %X/%X: wanted %u, got %u",
                                                                  (uint32) 
(RecPtr >> 32), (uint32) RecPtr,
                                                                  (uint32) 
SizeOfXLogRecord, total_len);
                        goto err;
                }
                gotheader = false;
        }


FIX
==============================

One idea is to defer the memory allocation until the entire record header is 
read and ValidXLogRecordHeader() is called.  However, ValidXLogRecordHeader() 
might misjudge the validity if the garbage contains xl_prev seeming correct.

So, I modified walreceiver to zero-fill the last partial WAL block.  This is 
the guard that the master does in AdvanceXLInsertBuffer() as below:

                /*
                 * Be sure to re-zero the buffer so that bytes beyond what we've
                 * written will look like zeroes and not valid XLOG records...
                 */
                MemSet((char *) NewPage, 0, XLOG_BLCKSZ);

FYI, the following unsolved problem may be solved, too.

https://www.postgresql.org/message-id/CAE2gYzzVZNsGn%3D-E6grO4sVQs04J02zNKQofQEO8gu8%3DqCFR%2BQ%40mail.gmail.com


Regards
Takayuki Tsunakawa

Attachment: zerofill_walblock_on_standby.patch
Description: zerofill_walblock_on_standby.patch

Reply via email to