On Sun, Jan 8, 2012 at 2:03 PM, Simon Riggs <[email protected]> wrote:
> On Sat, Jan 7, 2012 at 11:09 AM, Simon Riggs <[email protected]> wrote:
>> On Sat, Jan 7, 2012 at 10:55 AM, Simon Riggs <[email protected]> wrote:
>>
>>> So there isn't any problem with there being incorrect checksums on
>>> blocks and you can turn the parameter on and off as often and as
>>> easily as you want. I think it can be USERSET but I wouldn't want to
>>> encourage users to see turning it off as a performance tuning feature.
>>> If the admin turns it on for the server, it's on, so it's SIGHUP.
>>>
>>> Any holes in that I haven't noticed?
>>
>> And of course, as soon as I wrote that I thought of the problem. We
>> mustn't make a write that hasn't been covered by a FPW, so we must
>> know ahead of time whether to WAL log hints or not. We can't simply
>> turn it on/off any longer, now that we have to WAL log hint bits also.
>> So thanks for making me think of that.
>>
>> We *could* make it turn on/off at each checkpoint, but it's easier just
>> to say that it can be turned on/off at server start.
>
> Attached patch v6 now handles hint bits and checksums correctly,
> following Heikki's comments.
>
> In recovery, setting a hint doesn't dirty a block if it wasn't already
> dirty. So we can write some hints, and we can set others but not write
> them.
>
> Lots of comments in the code.
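For anyone skimming the thread, the hint-bit rule quoted above can be modelled roughly as below. This is only a sketch of the intended decision, not the patch's actual control flow: SketchHintAction and sketch_hint_action() are hypothetical names, and the real code leaves the "is a full-page image actually needed" test to XLogInsert().

```c
#include <stdbool.h>

/* Hypothetical outcome of setting a hint bit on a buffer. */
typedef enum
{
    HINT_SET_ONLY,       /* set the hint in memory, leave the buffer clean */
    HINT_DIRTY_NO_WAL,   /* mark the buffer dirty, no WAL record needed */
    HINT_DIRTY_WITH_FPW  /* ask WAL for a full-page image, then mark dirty */
} SketchHintAction;

/*
 * Simplified model (an assumption, not the patch itself) of what may
 * happen when a hint bit is set on a page.
 */
SketchHintAction
sketch_hint_action(bool page_checksums, bool full_page_writes,
                   bool in_recovery, bool already_dirty)
{
    /* A torn hint-only write only matters if it would break a checksum. */
    bool torn_page_matters = page_checksums && full_page_writes;

    /* Already dirty: a full-page image exists since the last checkpoint. */
    if (already_dirty)
        return HINT_DIRTY_NO_WAL;

    if (!torn_page_matters)
        return HINT_DIRTY_NO_WAL;

    /* No new WAL can be written in recovery, so the page must stay clean. */
    if (in_recovery)
        return HINT_SET_ONLY;

    return HINT_DIRTY_WITH_FPW;
}
```

With checksums and full_page_writes both on, a hint set on a clean buffer during recovery comes out as HINT_SET_ONLY, which is the "set others but not write them" case.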
v7
* Fixes merge conflict
* Minor patch cleanups
* Adds checksum of complete page including hole
* Calculates checksum in mdwrite() so we also pick up all non-shared buffer writes
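
To make the "complete page including hole" item concrete, here is a minimal standalone sketch of a Fletcher-16 style checksum over a whole block that skips only the 2-byte verification field. It is not the code from the patch: the SKETCH_* constants and function names are hypothetical, the offset of 18 assumes the current PageHeaderData layout, and unlike the patch it reduces modulo 255 on every byte and treats bytes as unsigned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_BLCKSZ        8192 /* assumed block size */
#define SKETCH_VERIFY_OFFSET 18   /* assumed offset of the verification field */
#define SKETCH_VERIFY_LEN    2    /* the field is a uint16 */

/* Fold bytes [from, to) of buf into the running Fletcher sums. */
static void
sketch_fletcher16_range(const unsigned char *buf, size_t from, size_t to,
                        uint32_t *sum1, uint32_t *sum2)
{
    for (size_t i = from; i < to; i++)
    {
        *sum1 = (*sum1 + buf[i]) % 255;
        *sum2 = (*sum2 + *sum1) % 255;
    }
}

/*
 * Checksum a full block, hole included, leaving out only the
 * verification field where the result itself will be stored.
 */
uint16_t
sketch_page_checksum16(const unsigned char *page)
{
    uint32_t sum1 = 0;
    uint32_t sum2 = 0;

    assert(page != NULL);

    /* Header bytes before the verification field. */
    sketch_fletcher16_range(page, 0, SKETCH_VERIFY_OFFSET, &sum1, &sum2);

    /* Everything after it, through the last byte of the block. */
    sketch_fletcher16_range(page, SKETCH_VERIFY_OFFSET + SKETCH_VERIFY_LEN,
                            SKETCH_BLCKSZ, &sum1, &sum2);

    return (uint16_t) ((sum2 << 8) | sum1);
}
```

Because the verification field is excluded, the same function can be used both to produce the value before a write and to check it after a read.
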
Robert mentioned to me there were outstanding concerns on this patch.
I know of none, and have double checked the thread to confirm all
concerns are fully addressed. Adding to CF.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 0cc3296..3cb8d2a 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1701,6 +1701,47 @@ SET ENABLE_SEQSCAN TO OFF;
</listitem>
</varlistentry>
+ <varlistentry id="guc-page-checksums" xreflabel="page_checksums">
+ <indexterm>
+ <primary><varname>page_checksums</> configuration parameter</primary>
+ </indexterm>
+ <term><varname>page_checksums</varname> (<type>boolean</type>)</term>
+ <listitem>
+ <para>
+ When this parameter is on, the <productname>PostgreSQL</> server
+ calculates checksums when it writes main database pages to disk,
+ flagging the page as checksum protected. When this parameter is off,
+ no checksum is written, only a standard watermark in the page header.
+ The database may thus contain a mix of pages with checksums and pages
+ without checksums.
+ </para>
+
+ <para>
+ When pages are read into shared buffers any page flagged with a
+ checksum has the checksum re-calculated and compared against the
+ stored value to provide greatly improved validation of page contents.
+ </para>
+
+ <para>
+ Writes via temp_buffers are not checksummed.
+ </para>
+
+ <para>
+ Turning this parameter off speeds normal operation, but
+ might allow data corruption to go unnoticed. Each page carries a
+ 16-bit checksum, calculated with the fast Fletcher-16 algorithm. With this
+ parameter enabled there is still a non-zero probability that an error
+ could go undetected, as well as a non-zero probability of false
+ positives.
+ </para>
+
+ <para>
+ This parameter can only be set at server start.
+ The default is <literal>off</>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-wal-buffers" xreflabel="wal_buffers">
<term><varname>wal_buffers</varname> (<type>integer</type>)</term>
<indexterm>
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 19ef66b..7c7b20e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -709,6 +709,7 @@ XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)
bool updrqst;
bool doPageWrites;
bool isLogSwitch = (rmid == RM_XLOG_ID && info == XLOG_SWITCH);
+ bool IsHint = (rmid == RM_SMGR_ID && info == XLOG_SMGR_HINT);
uint8 info_orig = info;
/* cross-check on whether we should be here or not */
@@ -955,6 +956,18 @@ begin:;
}
/*
+ * If this is a hint record and we don't need a backup block then
+ * we have no more work to do and can exit quickly without inserting
+ * a WAL record at all. In that case return InvalidXLogRecPtr.
+ */
+ if (IsHint && !(info & XLR_BKP_BLOCK_MASK))
+ {
+ LWLockRelease(WALInsertLock);
+ END_CRIT_SECTION();
+ return InvalidXLogRecPtr;
+ }
+
+ /*
* If there isn't enough space on the current XLOG page for a record
* header, advance to the next page (leaving the unused space as zeroes).
*/
@@ -3650,6 +3663,13 @@ RestoreBkpBlocks(XLogRecPtr lsn, XLogRecord *record, bool cleanup)
BLCKSZ - (bkpb.hole_offset + bkpb.hole_length));
}
+ /*
+ * Any checksum set on this page will be invalid. We don't need
+ * to reset it here since it will be reset before being written
+ * but it seems worth doing this for general sanity and hygiene.
+ */
+ PageSetPageSizeAndVersion(page, BLCKSZ, PG_PAGE_LAYOUT_VERSION);
+
PageSetLSN(page, lsn);
PageSetTLI(page, ThisTimeLineID);
MarkBufferDirty(buffer);
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index a017101..618c8f9 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -20,6 +20,7 @@
#include "postgres.h"
#include "access/visibilitymap.h"
+#include "access/transam.h"
#include "access/xact.h"
#include "access/xlogutils.h"
#include "catalog/catalog.h"
@@ -70,6 +71,7 @@ static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
/* XLOG gives us high 4 bits */
#define XLOG_SMGR_CREATE 0x10
#define XLOG_SMGR_TRUNCATE 0x20
+#define XLOG_SMGR_HINT 0x40
typedef struct xl_smgr_create
{
@@ -477,19 +479,74 @@ AtSubAbort_smgr(void)
smgrDoPendingDeletes(false);
}
+/*
+ * Write a backup block if needed when we are setting a hint.
+ *
+ * Deciding the "if needed" bit is delicate and requires us to either
+ * grab WALInsertLock or check the info_lck spinlock. If we check the
+ * spinlock and it says Yes then we will need to get WALInsertLock as well,
+ * so the design choice here is to just go straight for the WALInsertLock
+ * and trust that calls to this function are minimised elsewhere.
+ *
+ * Callable while holding share lock on the buffer content.
+ *
+ * Possible that multiple concurrent backends could attempt to write
+ * WAL records. In that case, more than one backup block may be recorded
+ * though that isn't important to the outcome and the backup blocks are
+ * likely to be identical anyway.
+ */
+#define SMGR_HINT_WATERMARK 13579
+void
+smgr_buffer_hint(Buffer buffer)
+{
+ /*
+ * Make an XLOG entry reporting the hint
+ */
+ XLogRecPtr lsn;
+ XLogRecData rdata[2];
+ int watermark = SMGR_HINT_WATERMARK;
+
+ /*
+ * Not allowed to have zero-length records, so use a small watermark
+ */
+ rdata[0].data = (char *) (&watermark);
+ rdata[0].len = sizeof(int);
+ rdata[0].buffer = InvalidBuffer;
+ rdata[0].buffer_std = false;
+ rdata[0].next = &(rdata[1]);
+
+ rdata[1].data = NULL;
+ rdata[1].len = 0;
+ rdata[1].buffer = buffer;
+ rdata[1].buffer_std = true;
+ rdata[1].next = NULL;
+
+ lsn = XLogInsert(RM_SMGR_ID, XLOG_SMGR_HINT, rdata);
+
+ /*
+ * Set the page LSN if we wrote a backup block.
+ */
+ if (!XLByteEQ(InvalidXLogRecPtr, lsn))
+ {
+ Page page = BufferGetPage(buffer);
+ PageSetLSN(page, lsn);
+ elog(LOG, "inserted backup block for hint bit");
+ }
+}
+
void
smgr_redo(XLogRecPtr lsn, XLogRecord *record)
{
uint8 info = record->xl_info & ~XLR_INFO_MASK;
- /* Backup blocks are not used in smgr records */
- Assert(!(record->xl_info & XLR_BKP_BLOCK_MASK));
-
if (info == XLOG_SMGR_CREATE)
{
xl_smgr_create *xlrec = (xl_smgr_create *) XLogRecGetData(record);
SMgrRelation reln;
+ /* Backup blocks are not used in smgr create records */
+ Assert(!(record->xl_info & XLR_BKP_BLOCK_MASK));
+
reln = smgropen(xlrec->rnode, InvalidBackendId);
smgrcreate(reln, xlrec->forkNum, true);
}
@@ -499,6 +556,9 @@ smgr_redo(XLogRecPtr lsn, XLogRecord *record)
SMgrRelation reln;
Relation rel;
+ /* Backup blocks are not used in smgr truncate records */
+ Assert(!(record->xl_info & XLR_BKP_BLOCK_MASK));
+
reln = smgropen(xlrec->rnode, InvalidBackendId);
/*
@@ -524,6 +584,28 @@ smgr_redo(XLogRecPtr lsn, XLogRecord *record)
FreeFakeRelcacheEntry(rel);
}
+ else if (info == XLOG_SMGR_HINT)
+ {
+ int *watermark = (int *) XLogRecGetData(record);
+
+ /* Check the watermark is correct for the hint record */
+ Assert(*watermark == SMGR_HINT_WATERMARK);
+
+ /* Backup blocks must be present for smgr hint records */
+ Assert(record->xl_info & XLR_BKP_BLOCK_MASK);
+
+ /*
+ * Hint records have no information that needs to be replayed.
+ * The sole purpose of them is to ensure that a hint bit does
+ * not cause a checksum invalidation if a hint bit write should
+ * cause a torn page. So the body of the record is empty but
+ * there can be one backup block.
+ *
+ * Since the only change in the backup block is a hint bit,
+ * there is no conflict with Hot Standby.
+ */
+ RestoreBkpBlocks(lsn, record, false);
+ }
else
elog(PANIC, "smgr_redo: unknown op code %u", info);
}
@@ -550,6 +632,8 @@ smgr_desc(StringInfo buf, uint8 xl_info, char *rec)
xlrec->blkno);
pfree(path);
}
+ else if (info == XLOG_SMGR_HINT)
+ appendStringInfo(buf, "buffer hint");
else
appendStringInfo(buf, "UNKNOWN");
}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 8f68bcc..dad34f2 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -34,6 +34,7 @@
#include <unistd.h>
#include "catalog/catalog.h"
+#include "catalog/storage.h"
#include "executor/instrument.h"
#include "miscadmin.h"
#include "pg_trace.h"
@@ -440,7 +441,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
smgrread(smgr, forkNum, blockNum, (char *) bufBlock);
/* check for garbage data */
- if (!PageHeaderIsValid((PageHeader) bufBlock))
+ if (!PageIsVerified((Page) bufBlock))
{
if (mode == RBM_ZERO_ON_ERROR || zero_damaged_pages)
{
@@ -1860,6 +1861,7 @@ FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
{
XLogRecPtr recptr;
ErrorContextCallback errcontext;
+ Block bufBlock;
/*
* Acquire the buffer's io_in_progress lock. If StartBufferIO returns
@@ -1907,10 +1909,15 @@ FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
buf->flags &= ~BM_JUST_DIRTIED;
UnlockBufHdr(buf);
+ bufBlock = BufHdrGetBlock(buf);
+
+ /*
+ * bufBlock is the shared buffer; any copy needed for checksumming is made in mdwrite().
+ */
smgrwrite(reln,
buf->tag.forkNum,
buf->tag.blockNum,
- (char *) BufHdrGetBlock(buf),
+ (char *) bufBlock,
false);
pgBufferUsage.shared_blks_written++;
@@ -1921,6 +1928,8 @@ FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
*/
TerminateBufferIO(buf, true, 0);
+ /* XXX Assert(buf is not BM_JUST_DIRTIED) */
+
TRACE_POSTGRESQL_BUFFER_FLUSH_DONE(buf->tag.forkNum,
buf->tag.blockNum,
reln->smgr_rnode.node.spcNode,
@@ -2341,6 +2350,41 @@ SetBufferCommitInfoNeedsSave(Buffer buffer)
if ((bufHdr->flags & (BM_DIRTY | BM_JUST_DIRTIED)) !=
(BM_DIRTY | BM_JUST_DIRTIED))
{
+ /*
+ * If we're writing checksums and we care about torn pages then we
+ * cannot dirty a page during recovery as a result of a hint.
+ * We can set the hint, just not dirty the page as a result.
+ *
+ * See long discussion in bufpage.c
+ */
+ if (HintsMustNotDirtyPage())
+ return;
+
+ /*
+ * Write a full page into WAL iff this is the first change on the
+ * block since the last checkpoint. That will never be the case
+ * if the block is already dirty because we either made a change
+ * or set a hint already. Note that aggressive cleaning of blocks
+ * dirtied by hint bit setting would increase the call rate.
+ * Bulk setting of hint bits would reduce the call rate...
+ *
+ * We must issue the WAL record before we mark the buffer dirty.
+ * Otherwise we might write the page before we write the WAL.
+ * That causes a race condition, since a checkpoint might
+ * occur between writing the WAL record and marking the buffer dirty.
+ * We solve that with a kluge, but one that is already in use
+ * during transaction commit to prevent race conditions.
+ * Basically, we simply prevent the checkpoint WAL record from
+ * being written until we have marked the buffer dirty. We don't
+ * start the checkpoint flush until we have marked dirty, so our
+ * checkpoint must flush the change to disk successfully or the
+ * checkpoint never gets written, so crash recovery will set us right.
+ *
+ * XXX rename PGPROC variable later; keep it same now for clarity
+ */
+ MyPgXact->inCommit = true;
+ smgr_buffer_hint(buffer);
+
LockBufHdr(bufHdr);
Assert(bufHdr->refcount > 0);
if (!(bufHdr->flags & BM_DIRTY))
@@ -2351,6 +2395,7 @@ SetBufferCommitInfoNeedsSave(Buffer buffer)
}
bufHdr->flags |= (BM_DIRTY | BM_JUST_DIRTIED);
UnlockBufHdr(bufHdr);
+ MyPgXact->inCommit = false;
}
}
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 096d36a..a220310 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -200,6 +200,8 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
/* Find smgr relation for buffer */
oreln = smgropen(bufHdr->tag.rnode, MyBackendId);
+ /* XXX do we want to write checksums for local buffers? An option? */
+
/* And write... */
smgrwrite(oreln,
bufHdr->tag.forkNum,
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index 90a731c..c8d15bd 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -16,6 +16,12 @@
#include "access/htup.h"
+bool page_checksums = false;
+
+static char pageCopy[BLCKSZ]; /* temporary buffer to allow checksum calculation */
+
+static bool PageVerificationInfoOK(Page page);
+static uint16 PageCalcChecksum16(Page page);
/* ----------------------------------------------------------------
* Page support functions
@@ -25,6 +31,10 @@
/*
* PageInit
* Initializes the contents of a page.
+ * Note that we don't automatically add a checksum, or flag that the
+ * page has a checksum field. We start with a normal page layout and defer
+ * the decision on what page verification will be written just before
+ * we write the block to disk.
*/
void
PageInit(Page page, Size pageSize, Size specialSize)
@@ -67,20 +77,20 @@ PageInit(Page page, Size pageSize, Size specialSize)
* will clean up such a page and make it usable.
*/
bool
-PageHeaderIsValid(PageHeader page)
+PageIsVerified(Page page)
{
+ PageHeader p = (PageHeader) page;
char *pagebytes;
int i;
/* Check normal case */
- if (PageGetPageSize(page) == BLCKSZ &&
- PageGetPageLayoutVersion(page) == PG_PAGE_LAYOUT_VERSION &&
- (page->pd_flags & ~PD_VALID_FLAG_BITS) == 0 &&
- page->pd_lower >= SizeOfPageHeaderData &&
- page->pd_lower <= page->pd_upper &&
- page->pd_upper <= page->pd_special &&
- page->pd_special <= BLCKSZ &&
- page->pd_special == MAXALIGN(page->pd_special))
+ if (PageVerificationInfoOK(page) &&
+ (p->pd_flags & ~PD_VALID_FLAG_BITS) == 0 &&
+ p->pd_lower >= SizeOfPageHeaderData &&
+ p->pd_lower <= p->pd_upper &&
+ p->pd_upper <= p->pd_special &&
+ p->pd_special <= BLCKSZ &&
+ p->pd_special == MAXALIGN(p->pd_special))
return true;
/* Check all-zeroes case */
@@ -93,7 +103,6 @@ PageHeaderIsValid(PageHeader page)
return true;
}
-
/*
* PageAddItem
*
@@ -827,3 +836,266 @@ PageIndexMultiDelete(Page page, OffsetNumber *itemnos, int nitems)
pfree(itemidbase);
}
+
+/*
+ * Test whether the page verification information is correct or not.
+ *
+ * IMPORTANT NOTE -
+ * Verification info is not valid at all times on a data page. We set
+ * verification info before we flush page/buffer, and implicitly invalidate
+ * verification info when we write to the page. A heavily accessed buffer
+ * might then spend most of its life with invalid page verification info,
+ * so testing verification info on random pages in the buffer pool will tell
+ * you nothing. The reason for this is that page verification info protects
+ * Postgres data from errors on the filesystems on which we rely. We do not
+ * protect buffers against uncorrectable memory errors, since these have a
+ * very low measured incidence according to research on large server farms,
+ * http://www.google.com/research/pubs/archive/35162.pdf, discussed 2010/12/22.
+ *
+ * To confirm your understanding: that means WAL-logged changes to a page
+ * do NOT update the page verification info, so full page images may not have
+ * correct verification information on them. But those page images have the
+ * WAL CRC covering them and so are verified separately from this mechanism.
+ *
+ * Any write of a data block can cause a torn page if the write is unsuccessful.
+ * Full page writes, which are stored in WAL, protect us from that. Setting
+ * hint bits when a page is already dirty is OK because a full page write
+ * must already have been written for that since the last checkpoint.
+ * Setting hint bits on an otherwise clean page can allow torn pages; this
+ * doesn't normally matter since they are just hints. When the page has
+ * checksums, losing a few bits would cause the checksum to be invalid.
+ * So if we have full_page_writes = on and page_checksums = on then we must
+ * write a WAL record specifically so that we record a full page image in WAL.
+ * New WAL records cannot be written during recovery, so hint bits set
+ * during recovery must not dirty the page if the buffer is not already dirty,
+ * when page_checksums = on. Enforced by checking HintsMustNotDirtyPage()
+ *
+ * So we cannot enable/disable page_checksums except at a checkpoint if
+ * full_page_writes is enabled. We choose to only allow changes at server start.
+ *
+ * WAL replay ignores page verification info unless it writes out or reads in
+ * blocks from disk; restoring full page writes does not check verification
+ * info via this function. So we reset the checksum when restoring backup blocks.
+ * In recovery, since we only dirty a block when we have a full page image
+ * available if we crash, we are fully OK to use page verification.
+ *
+ * The best way to understand this is that WAL CRCs protect records entering
+ * the WAL stream, and page verification protects blocks entering and leaving
+ * the buffer pool. They are similar in purpose, yet completely separate.
+ * Together they ensure we are able to detect errors in data leaving and
+ * re-entering PostgreSQL controlled memory.
+ *
+ * Note also that the verification mechanism can vary from page to page.
+ * All we do here is look at what the page itself says is the verification
+ * mechanism and then apply that test. This allows us to run without the CPU
+ * cost of verification if we choose, as well as to provide an upgrade path
+ * for anyone doing direct upgrades using pg_upgrade.
+ *
+ * There is some concern that trusting page data to say how to check page
+ * data is dangerously self-referential. To ensure no mistakes we set two
+ * non-adjacent bits to signify that the page has a checksum and
+ * should be verified when that block is read back into a buffer.
+ * We use two bits in case a multiple bit error removes one of the checksum
+ * flags *and* destroys data, which would lead to skipping the checksum check
+ * and silently accepting bad data.
+ *
+ * Note also that this returns a boolean, not a full damage assessment.
+ */
+static bool
+PageVerificationInfoOK(Page page)
+{
+ PageHeader p = (PageHeader) page;
+
+ /*
+ * We set two non-adjacent bits to signify that the page has a checksum and
+ * should be verified when that block is read back into a buffer.
+ * We use two bits in case a multiple bit error removes one of the checksum
+ * flags and destroys data, which would lead to skipping the checksum check
+ * and silently accepting bad data.
+ */
+ if (PageHasChecksumFlag1(p) && PageHasChecksumFlag2(p))
+ {
+ uint16 checksum = PageCalcChecksum16(page);
+
+ if (checksum == p->pd_verify.pd_checksum16)
+ {
+#ifdef CHECK_HOLE
+ /* Also check page hole is all-zeroes */
+ char *pagebytes;
+ bool empty = true;
+ int i;
+
+ pagebytes = (char *) page;
+ for (i = p->pd_lower; i < p->pd_upper; i++)
+ {
+ if (pagebytes[i] != 0)
+ {
+ empty = false;
+ break;
+ }
+ }
+
+ if (!empty)
+ elog(LOG, "hole was not empty at byte %d pd_lower %d pd_upper %d",
+ i, p->pd_lower, p->pd_upper);
+#endif
+ return true;
+ }
+
+ elog(LOG, "page verification failed - checksum was %u page checksum field is %u",
+ checksum, p->pd_verify.pd_checksum16);
+ }
+ else if (!PageHasChecksumFlag1(p) && !PageHasChecksumFlag2(p))
+ {
+ if (PageGetPageLayoutVersion(p) == PG_PAGE_LAYOUT_VERSION &&
+ PageGetPageSize(p) == BLCKSZ)
+ return true;
+ }
+ else
+ elog(LOG, "page verification failed - page has one checksum flag set");
+
+ return false;
+}
+
+/*
+ * Set verification info for page.
+ *
+ * Either we set a new checksum, or we set the standard watermark. We must
+ * not leave an invalid checksum in place. Note that the verification info is
+ * not WAL logged, whereas the data changes to pages are, so data is safe
+ * whether or not we have page_checksums enabled. The purpose of checksums
+ * is to detect page corruption to allow replacement from backup.
+ *
+ * Returns a pointer to the block-sized data that needs to be written. That
+ * allows us to either copy, or not, depending upon whether we checksum.
+ */
+char *
+PageSetVerificationInfo(Page page)
+{
+ PageHeader p;
+
+ if (PageIsNew(page))
+ return (char *) page;
+
+ if (page_checksums)
+ {
+ /*
+ * We make a copy iff we need to calculate a checksum because other
+ * backends may set hint bits on this page while we write, which
+ * would mean the checksum differs from the page contents. It doesn't
+ * matter if we include or exclude hints during the copy, as long
+ * as we write a valid page and associated checksum.
+ */
+ memcpy(&pageCopy, page, BLCKSZ);
+
+ p = (PageHeader) &pageCopy;
+ p->pd_flags |= PD_CHECKSUM;
+ p->pd_verify.pd_checksum16 = PageCalcChecksum16((Page) &pageCopy);
+
+ return (char *) &pageCopy;
+ }
+
+ p = (PageHeader) page;
+
+ if (PageHasChecksumFlag1(p) || PageHasChecksumFlag2(p))
+ {
+ /* ensure any older checksum info is overwritten with watermark */
+ p->pd_flags &= ~PD_CHECKSUM;
+ PageSetPageSizeAndVersion(p, BLCKSZ, PG_PAGE_LAYOUT_VERSION);
+ }
+
+ return (char *) page;
+}
+
+/*
+ * Calculate checksum for a PostgreSQL Page. First we checksum the header,
+ * stopping before the verification info, which is added afterwards. Then,
+ * by default, we checksum everything from pd_prune_xid to the end of the
+ * block, hole included. With IGNORE_PAGE_HOLE defined we instead add the line
+ * pointers up to pd_lower, then the tail of the page from pd_upper onwards.
+ */
+static uint16
+PageCalcChecksum16(Page page)
+{
+#define PAGE_VERIFICATION_USES_FLETCHER16 (true)
+#ifdef PAGE_VERIFICATION_USES_FLETCHER16
+ /*
+ * Following calculation is a Fletcher-16 checksum. The calc is isolated
+ * here and tuning and/or replacement algorithms are possible.
+ */
+ PageHeader p = (PageHeader) page;
+ uint page_header_stop = (uint)(offsetof(PageHeaderData, pd_special) + sizeof(LocationIndex));
+ uint page_lower_start = (uint)(offsetof(PageHeaderData, pd_prune_xid));
+ uint page_lower_stop;
+ uint sum1 = 0;
+ uint64 sum2 = 0;
+ int i;
+
+ /*
+ * Avoid calculating checksum if page is new, just return a value that
+ * will cause the check to fail. We may still pass the all-zeroes check.
+ */
+ if (PageIsNew(page))
+ return 1;
+
+ /*
+ * Just add in the pd_prune_xid if there are no line pointers yet.
+ */
+ page_lower_stop = p->pd_lower;
+ if (page_lower_stop == 0)
+ page_lower_stop = page_lower_start + sizeof(TransactionId);
+
+ Assert(p->pd_upper != 0);
+
+#ifdef DEBUG_CHECKSUM
+ elog(LOG, "calculating checksum for %u-%u %u-%u %u-%u",
+ 0, /* page_header_start */
+ page_header_stop,
+ page_lower_start,
+ page_lower_stop,
+ p->pd_upper,
+ BLCKSZ
+ );
+#endif
+
+#define COMP_F16(from, to) \
+do { \
+ for (i = from; i < to; i++) \
+ { \
+ sum1 = sum1 + page[i]; \
+ sum2 = sum1 + sum2; \
+ } \
+ sum1 %= 255; \
+ sum2 %= 255; \
+} while (0)
+
+#ifdef IGNORE_PAGE_HOLE
+ COMP_F16(0,
+ page_header_stop);
+
+ /* ignore the checksum field since not done yet... */
+
+ COMP_F16(page_lower_start,
+ page_lower_stop);
+
+ /* ignore the hole in the middle of the block */
+
+ COMP_F16(p->pd_upper,
+ BLCKSZ);
+#else
+ COMP_F16(0,
+ page_header_stop);
+
+ /* ignore the checksum field since not done yet... */
+
+ COMP_F16(page_lower_start,
+ BLCKSZ);
+#endif
+
+#ifdef DEBUG_CHECKSUM
+ elog(LOG, "checksum %u", ((sum2 << 8) | sum1));
+#endif
+
+ return ((sum2 << 8) | sum1);
+#endif
+}
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index bfc9f06..8897a9b 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -689,6 +689,7 @@ mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
off_t seekpos;
int nbytes;
MdfdVec *v;
+ char *bufCopy;
/* This assert is too expensive to have on normally ... */
#ifdef CHECK_WRITE_VS_EXTEND
@@ -701,6 +702,16 @@ mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
reln->smgr_rnode.node.relNode,
reln->smgr_rnode.backend);
+ /*
+ * Set page verification info immediately before we write the buffer to disk.
+ * Once we have flushed, the buffer is marked clean again, meaning it can
+ * be replaced quickly and silently with another data block, so we must
+ * write verification info now. For efficiency, the process of cleaning
+ * and page replacement is asynchronous, so we can't do this *only* when
+ * we are about to replace the buffer, we need to do this for every flush.
+ */
+ bufCopy = PageSetVerificationInfo((Page) buffer);
+
v = _mdfd_getseg(reln, forknum, blocknum, skipFsync, EXTENSION_FAIL);
seekpos = (off_t) BLCKSZ *(blocknum % ((BlockNumber) RELSEG_SIZE));
@@ -713,7 +724,7 @@ mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
errmsg("could not seek to block %u in file \"%s\": %m",
blocknum, FilePathName(v->mdfd_vfd))));
- nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ);
+ nbytes = FileWrite(v->mdfd_vfd, bufCopy, BLCKSZ);
TRACE_POSTGRESQL_SMGR_MD_WRITE_DONE(forknum, blocknum,
reln->smgr_rnode.node.spcNode,
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 5c910dd..9a76bc8 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -830,6 +830,20 @@ static struct config_bool ConfigureNamesBool[] =
NULL, NULL, NULL
},
{
+ {"page_checksums", PGC_POSTMASTER, WAL_SETTINGS,
+ gettext_noop("Marks database blocks with a checksum before writing them to disk. "),
+ gettext_noop("When enabled all database blocks will be marked with a checksum before writing to disk. "
+ "When we read a database block from disk the checksum is checked, if it exists. "
+ "If there is no checksum marked yet then no check is performed, though a "
+ "checksum will be added later when we re-write the database block. "
+ "When disabled checksums will not be added to database blocks, though any "
+ "block already marked with a checksum is still checked when read.")
+ },
+ &page_checksums,
+ false,
+ NULL, NULL, NULL
+ },
+ {
{"full_page_writes", PGC_SIGHUP, WAL_SETTINGS,
gettext_noop("Writes full pages to WAL when first modified after a checkpoint."),
gettext_noop("A page write in process during an operating system crash might be "
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 315db46..6f81023 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -150,15 +150,21 @@
#------------------------------------------------------------------------------
-# WRITE AHEAD LOG
+# WRITE AHEAD LOG & RELIABILITY
#------------------------------------------------------------------------------
-# - Settings -
+# - Reliability -
-#wal_level = minimal # minimal, archive, or hot_standby
- # (change requires restart)
+#page_checksums = off # calculate checksum before database I/O
+#full_page_writes = on # recover from partial page writes
#fsync = on # turns forced synchronization on or off
+
#synchronous_commit = on # synchronization level; on, off, or local
+
+# - Write Ahead Log -
+
+#wal_level = minimal # minimal, archive, or hot_standby
+ # (change requires restart)
#wal_sync_method = fsync # the default is the first option
# supported by the operating system:
# open_datasync
@@ -166,7 +172,6 @@
# fsync
# fsync_writethrough
# open_sync
-#full_page_writes = on # recover from partial page writes
#wal_buffers = -1 # min 32kB, -1 sets based on shared_buffers
# (change requires restart)
#wal_writer_delay = 200ms # 1-10000 milliseconds
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index db6380f..eb32856 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -114,6 +114,8 @@ typedef XLogLongPageHeaderData *XLogLongPageHeader;
#define XLogPageHeaderSize(hdr) \
(((hdr)->xlp_info & XLP_LONG_HEADER) ? SizeOfXLogLongPHD : SizeOfXLogShortPHD)
+#define XLOG_SMGR_HINT 0x40
+
/*
* We break each logical log file (xlogid value) into segment files of the
* size indicated by XLOG_SEG_SIZE. One possible segment at the end of each
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index d5103a8..48a728c 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -36,6 +36,7 @@ extern void PostPrepare_smgr(void);
extern void log_smgrcreate(RelFileNode *rnode, ForkNumber forkNum);
+extern void smgr_buffer_hint(Buffer buffer);
extern void smgr_redo(XLogRecPtr lsn, XLogRecord *record);
extern void smgr_desc(StringInfo buf, uint8 xl_info, char *rec);
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index 1ab64e0..38708c0 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -18,6 +18,8 @@
#include "storage/item.h"
#include "storage/off.h"
+extern bool page_checksums;
+
/*
* A postgres disk page is an abstraction layered on top of a postgres
* disk block (which is simply a unit of i/o, see block.h).
@@ -93,7 +95,7 @@ typedef uint16 LocationIndex;
* pd_lower - offset to start of free space.
* pd_upper - offset to end of free space.
* pd_special - offset to start of special space.
- * pd_pagesize_version - size in bytes and page layout version number.
+ * pd_verify - page verification information of different kinds
* pd_prune_xid - oldest XID among potentially prunable tuples on page.
*
* The LSN is used by the buffer manager to enforce the basic rule of WAL:
@@ -106,7 +108,8 @@ typedef uint16 LocationIndex;
* pd_prune_xid is a hint field that helps determine whether pruning will be
* useful. It is currently unused in index pages.
*
- * The page version number and page size are packed together into a single
+ * For verification we store either a 16 bit checksum or a watermark of
+ * the page version number and page size packed together into a single
* uint16 field. This is for historical reasons: before PostgreSQL 7.3,
* there was no concept of a page version number, and doing it this way
* lets us pretend that pre-7.3 databases have page version number zero.
@@ -130,7 +133,13 @@ typedef struct PageHeaderData
LocationIndex pd_lower; /* offset to start of free space */
LocationIndex pd_upper; /* offset to end of free space */
LocationIndex pd_special; /* offset to start of special space */
- uint16 pd_pagesize_version;
+
+ union
+ {
+ uint16 pd_pagesize_version;
+ uint16 pd_checksum16;
+ } pd_verify; /* page verification data */
+
TransactionId pd_prune_xid; /* oldest prunable XID, or zero if none */
ItemIdData pd_linp[1]; /* beginning of line pointer array */
} PageHeaderData;
@@ -155,7 +164,16 @@ typedef PageHeaderData *PageHeader;
#define PD_ALL_VISIBLE 0x0004 /* all tuples on page are visible to
* everyone */
-#define PD_VALID_FLAG_BITS 0x0007 /* OR of all valid pd_flags bits */
+#define PD_VALID_FLAG_BITS 0x800F /* OR of all valid pd_flags bits, checksum flags included */
+
+#define PD_CHECKSUM1 0x0008 /* First checksum bit */
+#define PD_CHECKSUM2 0x8000 /* Second checksum bit */
+#define PD_CHECKSUM 0x8008 /* OR of both checksum flags */
+
+#define PageHasChecksumFlag1(page) \
+ ((((PageHeader) (page))->pd_flags & PD_CHECKSUM1) == PD_CHECKSUM1)
+#define PageHasChecksumFlag2(page) \
+ ((((PageHeader) (page))->pd_flags & PD_CHECKSUM2) == PD_CHECKSUM2)
/*
* Page layout version number 0 is for pre-7.3 Postgres releases.
@@ -165,6 +183,8 @@ typedef PageHeaderData *PageHeader;
* Release 8.3 uses 4; it changed the HeapTupleHeader layout again, and
* added the pd_flags field (by stealing some bits from pd_tli),
* as well as adding the pd_prune_xid field (which enlarges the header).
+ * Release 9.2 uses 4 as well, though with changed meaning of verification bits.
+ * We deliberately don't bump the page version for that, to allow upgrades.
*/
#define PG_PAGE_LAYOUT_VERSION 4
@@ -231,19 +251,22 @@ typedef PageHeaderData *PageHeader;
* PageGetPageSize
* Returns the page size of a page.
*
- * this can only be called on a formatted page (unlike
- * BufferGetPageSize, which can be called on an unformatted page).
- * however, it can be called on a page that is not stored in a buffer.
+ * Since PageSizeIsValid() holds only when pagesize == BLCKSZ, just return BLCKSZ.
+ * This can be called on any page, initialised or not, in or out of buffers.
+ * You might think this can vary at runtime but you'd be wrong, since pages
+ * frequently need to occupy buffers and pages are copied from one to another
+ * so there are many hidden assumptions that this simple definition is true.
*/
-#define PageGetPageSize(page) \
- ((Size) (((PageHeader) (page))->pd_pagesize_version & (uint16) 0xFF00))
+#define PageGetPageSize(page) (BLCKSZ)
/*
* PageGetPageLayoutVersion
* Returns the page layout version of a page.
+ *
+ * Must not be used on a page that is flagged for checksums.
*/
#define PageGetPageLayoutVersion(page) \
- (((PageHeader) (page))->pd_pagesize_version & 0x00FF)
+ (((PageHeader) (page))->pd_verify.pd_pagesize_version & 0x00FF)
/*
* PageSetPageSizeAndVersion
@@ -251,14 +274,24 @@ typedef PageHeaderData *PageHeader;
*
* We could support setting these two values separately, but there's
* no real need for it at the moment.
+ *
+ * Must not be used on a page that is flagged for checksums.
*/
#define PageSetPageSizeAndVersion(page, size, version) \
( \
AssertMacro(((size) & 0xFF00) == (size)), \
AssertMacro(((version) & 0x00FF) == (version)), \
- ((PageHeader) (page))->pd_pagesize_version = (size) | (version) \
+ ((PageHeader) (page))->pd_verify.pd_pagesize_version = (size) | (version) \
)
+/*
+ * HintsMustNotDirtyPage
+ * See discussion for PageVerificationInfoOK()
+ */
+#define HintsMustNotDirtyPage() \
+ (page_checksums && fullPageWrites && RecoveryInProgress())
+extern bool fullPageWrites;
+
/* ----------------
* page special data macros
* ----------------
@@ -368,7 +401,7 @@ do { \
*/
extern void PageInit(Page page, Size pageSize, Size specialSize);
-extern bool PageHeaderIsValid(PageHeader page);
+extern bool PageIsVerified(Page page);
extern OffsetNumber PageAddItem(Page page, Item item, Size size,
OffsetNumber offsetNumber, bool overwrite, bool is_heap);
extern Page PageGetTempPage(Page page);
@@ -381,5 +414,6 @@ extern Size PageGetExactFreeSpace(Page page);
extern Size PageGetHeapFreeSpace(Page page);
extern void PageIndexTupleDelete(Page page, OffsetNumber offset);
extern void PageIndexMultiDelete(Page page, OffsetNumber *itemnos, int nitems);
+extern char *PageSetVerificationInfo(Page page);
#endif /* BUFPAGE_H */