On 2013-04-04 02:58:43 +0200, Andres Freund wrote:
> On 2013-04-03 20:45:51 -0400, Tom Lane wrote:
> > and...@anarazel.de (Andres Freund) writes:
> > > Looking at the page lsn's with dd I noticed something peculiar:
> >
> > > page 0:
> > > 01 00 00 00 18 c2 00 31 => 1/3100C218
> > > page 1:
> > > 01 00 00 00 80 44 01 31 => 1/31014480
> > > page 10:
> > > 01 00 00 00 60 ce 05 31 => 1/3105ce60
> > > page 43:
> > > 01 00 00 00 58 7a 16 31 => 1/31167a58
> > > page 44:
> > > 01 00 00 00 f0 99 16 31 => 1/311699f0
> > > page 45:
> > > 00 00 00 00 00 00 00 00 => 0/0
> > > page 90:
> > > 01 00 00 00 90 17 1d 32 => 1/321d1790
> > > page 91:
> > > 01 00 00 00 38 ef 1b 32 => 1/321bef38
> >
> > > So we have written out pages that are after pages without a LSN that
> > > have an LSN thats *beyond* the point XLOG has successfully been written
> > > to disk (1/31169A38)?
> >
> > If you're looking into the FPIs, those would contain the page's older
> > LSN, not the one assigned by the current WAL record.
>
> Nope, thats from the heap, and the LSNs are *newer* than what startup
> recovered to. I am pretty sure by now we are missing out on valid WAL, I
> am just not sure why.

Ok, I think I see the bug. And I think its been introduced in the
checkpoints patch.

Remember, as explained in other mails, I am pretty sure the end of wal
was errorneously detected, resulting in the following error:
27692  2013-04-03 10:09:15.647 PDT:LOG:  incorrect resource manager data 
checksum in record at 1/31169A68

Not that without a *hardware* crash end of wal is normally found first
by the checks for wrong LSNs or wrong record lengths. Not here.

My theory is that a page that was full page written changed while it was
inserted into the WAL which caused the error. A possibly scenario for
that would e.g. two concurrent calls to MarkBufferDirtyHint().
Note:
 * 2. The caller might have only share-lock instead of exclusive-lock on the
 *    buffer's content lock.

        if ((bufHdr->flags & (BM_DIRTY | BM_JUST_DIRTIED)) !=
                (BM_DIRTY | BM_JUST_DIRTIED))
        {
                if (DataChecksumsEnabled() && (bufHdr->flags & BM_PERMANENT))
                {
                        MyPgXact->delayChkpt = delayChkpt = true;
                        lsn = XLogSaveBufferForHint(buffer); /* PART 1 */
                }
                }

                LockBufHdr(bufHdr);
                Assert(bufHdr->refcount > 0);

                if (!(bufHdr->flags & BM_DIRTY))
                {
                        dirtied = true;         /* Means "will be dirtied by 
this action" */

                        /*
                         * Set the page LSN if we wrote a backup block. We 
aren't
                         * supposed to set this when only holding a share lock 
but
                         * as long as we serialise it somehow we're OK. We 
choose to
                         * set LSN while holding the buffer header lock, which 
causes
                         * any reader of an LSN who holds only a share lock to 
also
                         * obtain a buffer header lock before using 
PageGetLSN().
                         * Fortunately, thats not too many places.
                         *
                         * If checksums are enabled, you might think we should 
reset the
                         * checksum here. That will happen when the page is 
written
                         * sometime later in this checkpoint cycle.
                         */
                        if (!XLogRecPtrIsInvalid(lsn))
                                PageSetLSN(page, lsn); /* PART 2 */
                }

So XLogSaveBufferForHint(buffer) is called for all buffers which are not
yet dirty. Without any locks. It just does:

XLogSaveBufferForHint(Buffer buffer)
{
        rdata[0].data = (char *) (&watermark);
        rdata[0].len = sizeof(int);
        rdata[0].next = &(rdata[1]);

        rdata[1].data = NULL;
        rdata[1].len = 0;
        rdata[1].buffer = buffer;
        rdata[1].buffer_std = true;
        rdata[1].next = NULL;

        return XLogInsert(RM_XLOG_ID, XLOG_HINT, rdata);
}

I think what happens is that that one backend grabs WALInsertLock first,
it computes a full page write for the buffer, including a CRC. And
returns a valid LSN. Then it continues towards the PageSetLSN() after
LockBufHdr().
Allthewhile the second backend notices that WALInsertLock is free again
and continues towards the WALInsertLock protected parts of XLogInsert().

Only that it already has done that part:
        /*
         * Calculate CRC of the data, including all the backup blocks
         *
         * Note that the record header isn't added into the CRC initially since 
we
         * don't know the prev-link yet.  Thus, the CRC will represent the CRC 
of
         * the whole record in the order: rdata, then backup blocks, then record
         * header.
         */
        INIT_CRC32(rdata_crc);
        for (rdt = rdata; rdt != NULL; rdt = rdt->next)
                COMP_CRC32(rdata_crc, rdt->data, rdt->len);

It goes on to write all the data to the WAL after successfully getting
WALInsertLock:
        while (write_len)
        {
                while (rdata->data == NULL)
                        rdata = rdata->next;

                if (freespace > 0)
                {
                        if (rdata->len > freespace)
                        {
                                memcpy(Insert->currpos, rdata->data, freespace);
                                rdata->data += freespace;
                                rdata->len -= freespace;
                                write_len -= freespace;
                        }
                        else
                        {
                                memcpy(Insert->currpos, rdata->data, 
rdata->len);
                                freespace -= rdata->len;
                                write_len -= rdata->len;
                                Insert->currpos += rdata->len;
                                rdata = rdata->next;
                                continue;
                        }
                }

                /* Use next buffer */
                updrqst = AdvanceXLInsertBuffer(false);
                curridx = Insert->curridx;
                /* Mark page header to indicate this record continues on the 
page */
                Insert->currpage->xlp_info |= XLP_FIRST_IS_CONTRECORD;
                Insert->currpage->xlp_rem_len = write_len;
                freespace = INSERT_FREESPACE(Insert);
        }

If by now the first backend has proceeded to PageSetLSN() we are writing
different data to disk than the one we computed the checksum of
before. Boom.

I think the whole locking interactions in MarkBufferDirtyHint() need to
be thought over pretty carefully.

Does this explanation make sense?

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to