On 2013-12-03 15:46:09 -0500, Tom Lane wrote:
> Noah Misch <n...@leadboat.com> writes:
> > I'd rather have a readily-verifiable fix that changes WAL format than a
> > tricky fix that avoids doing so.  So, modulo not having seen the change, +1.
>
> Yeah, same here.

I am afraid it won't be *that* simple. We still need code to look into
multis, check whether all their members are ok wrt. cutoff_xid, and
replace them, either with the contained xid or with a new multi
containing only the still-living members. Ugly.

There's currently also the issue that heap_freeze_tuple() modifies the
tuple in place without a critical section. We execute
HeapTupleSatisfiesVacuum() before we get to WAL-logging things, and that
has plenty of rope to hang itself with. So that doesn't really seem ok
to me?

Attached is a pre-pre-alpha patch for this. To fix the issue with the
missing critical section it splits freezing into
heap_prepare_freeze_tuple() and heap_execute_freeze_tuple(). The former
doesn't touch the tuple and is executed only on the primary; the latter
actually performs the modifications and is executed both during normal
processing and during recovery.
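To make the intended division of labor concrete, here is a minimal, self-contained sketch of the prepare/execute split. The types and names below (FakeTuple, FreezePlan, prepare_freeze, execute_freeze) are simplified hypothetical stand-ins, not the patch's actual heapam API:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the real heapam types. */
typedef uint32_t TransactionId;
#define FrozenTransactionId   ((TransactionId) 2)
#define InvalidTransactionId  ((TransactionId) 0)

typedef struct
{
	TransactionId xmin;
	TransactionId xmax;
} FakeTuple;

/* The "freeze plan": everything the execute step needs, and nothing else. */
typedef struct
{
	bool		freeze_xmin;
	TransactionId xmax;
} FreezePlan;

/*
 * Prepare step: read-only with respect to the tuple.  It may perform
 * expensive or failure-prone checks (clog/multixact lookups in the real
 * code) outside any critical section, because nothing has been modified
 * yet.  Returns true if the tuple needs freezing.
 */
static bool
prepare_freeze(const FakeTuple *tup, TransactionId cutoff_xid,
			   FreezePlan *plan)
{
	bool		changed = false;

	plan->freeze_xmin = false;
	plan->xmax = tup->xmax;

	if (tup->xmin != InvalidTransactionId && tup->xmin < cutoff_xid)
	{
		plan->freeze_xmin = true;
		changed = true;
	}
	if (tup->xmax != InvalidTransactionId && tup->xmax < cutoff_xid)
	{
		plan->xmax = InvalidTransactionId;
		changed = true;
	}
	return changed;
}

/*
 * Execute step: a pure, deterministic application of the plan.  It is safe
 * to run inside a critical section on the primary and again during WAL
 * replay, since it depends only on the plan's contents.
 */
static void
execute_freeze(FakeTuple *tup, const FreezePlan *plan)
{
	if (plan->freeze_xmin)
		tup->xmin = FrozenTransactionId;
	tup->xmax = plan->xmax;
}
```

The point of the split is that the plan struct is exactly what gets WAL-logged, so the identical execute step runs on the primary and on standbys, with no external state lookups during replay.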

Needs a fair bit of work:
* Should move parts of the multixact processing into multixact.c,
  specifically it shouldn't require CreateMultiXactId() to be exported.
* it relies on forward-declaring a struct in heapam.h that's actually
  defined in heapam_xlog.h - that's pretty ugly.
* any form of testing beyond make check/isolationcheck across SR.
* lots of the surrounding comments need to be added/reworked
* contains a simpler version of Alvaro's patch to HTSV

Greetings,

Andres Freund

-- 
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
>From 330d128665fcf8633e60a42e8e4a497e2975dac0 Mon Sep 17 00:00:00 2001
From: Andres Freund <and...@anarazel.de>
Date: Wed, 4 Dec 2013 00:11:20 +0100
Subject: [PATCH] WIP: new freezing format

---
 src/backend/access/heap/heapam.c       | 476 +++++++++++++++++++++++++--------
 src/backend/access/rmgrdesc/heapdesc.c |   9 +
 src/backend/access/transam/multixact.c |   3 +-
 src/backend/commands/vacuumlazy.c      |  28 +-
 src/backend/utils/time/tqual.c         |   6 +-
 src/include/access/heapam.h            |   7 +-
 src/include/access/heapam_xlog.h       |  32 ++-
 src/include/access/multixact.h         |   1 +
 8 files changed, 443 insertions(+), 119 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index c13f87c..b80fa5b 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -5242,14 +5242,17 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 		CacheInvalidateHeapTuple(relation, tuple, NULL);
 }
 
-
 /*
- * heap_freeze_tuple
+ * heap_prepare_freeze_tuple
  *
  * Check to see whether any of the XID fields of a tuple (xmin, xmax, xvac)
- * are older than the specified cutoff XID.  If so, replace them with
- * FrozenTransactionId or InvalidTransactionId as appropriate, and return
- * TRUE.  Return FALSE if nothing was changed.
+ * are older than the specified cutoff XID.  If so, return enough state to
+ * later execute and WAL-log replacing them with FrozenTransactionId or
+ * InvalidTransactionId as appropriate, and return TRUE.  Return FALSE if
+ * nothing was changed.
+ *
+ * The 'off' field of the freeze state has to be set by the caller, not here,
+ * if required.
  *
  * It is assumed that the caller has checked the tuple with
  * HeapTupleSatisfiesVacuum() and determined that it is not HEAPTUPLE_DEAD
@@ -5258,54 +5261,44 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
  * NB: cutoff_xid *must* be <= the current global xmin, to ensure that any
  * XID older than it could neither be running nor seen as running by any
  * open transaction.  This ensures that the replacement will not change
- * anyone's idea of the tuple state.  Also, since we assume the tuple is
- * not HEAPTUPLE_DEAD, the fact that an XID is not still running allows us
- * to assume that it is either committed good or aborted, as appropriate;
- * so we need no external state checks to decide what to do.  (This is good
- * because this function is applied during WAL recovery, when we don't have
- * access to any such state, and can't depend on the hint bits to be set.)
- * There is an exception we make which is to assume GetMultiXactIdMembers can
- * be called during recovery.
- *
+ * anyone's idea of the tuple state.
  * Similarly, cutoff_multi must be less than or equal to the smallest
  * MultiXactId used by any transaction currently open.
  *
  * If the tuple is in a shared buffer, caller must hold an exclusive lock on
  * that buffer.
  *
- * Note: it might seem we could make the changes without exclusive lock, since
- * TransactionId read/write is assumed atomic anyway.  However there is a race
- * condition: someone who just fetched an old XID that we overwrite here could
- * conceivably not finish checking the XID against pg_clog before we finish
- * the VACUUM and perhaps truncate off the part of pg_clog he needs.  Getting
- * exclusive lock ensures no other backend is in process of checking the
- * tuple status.  Also, getting exclusive lock makes it safe to adjust the
- * infomask bits.
- *
- * NB: Cannot rely on hint bits here, they might not be set after a crash or
- * on a standby.
+ * NB: It is not enough that hint bits indicate something is committed/invalid
+ * - they might not be set on a standby/after crash recovery. So we really
+ * need to remove old xids.
  */
 bool
-heap_freeze_tuple(HeapTupleHeader tuple, TransactionId cutoff_xid,
-				  MultiXactId cutoff_multi)
+heap_prepare_freeze_tuple(HeapTupleHeader tuple, TransactionId cutoff_xid,
+				  TransactionId cutoff_multi, xl_heap_freeze_tuple *frz)
+
 {
 	bool		changed = false;
 	bool		freeze_xmax = false;
 	TransactionId xid;
 
+	frz->freeze_xmin = false;
+	frz->invalid_xvac = false;
+	frz->freeze_xvac = false;
+	frz->t_infomask2 = tuple->t_infomask2;
+	frz->t_infomask = tuple->t_infomask;
+	frz->xmax = HeapTupleHeaderGetRawXmax(tuple);
+
 	/* Process xmin */
 	xid = HeapTupleHeaderGetXmin(tuple);
 	if (TransactionIdIsNormal(xid) &&
 		TransactionIdPrecedes(xid, cutoff_xid))
 	{
-		HeapTupleHeaderSetXmin(tuple, FrozenTransactionId);
-
+		frz->freeze_xmin = true;
 		/*
 		 * Might as well fix the hint bits too; usually XMIN_COMMITTED will
 		 * already be set here, but there's a small chance not.
 		 */
-		Assert(!(tuple->t_infomask & HEAP_XMIN_INVALID));
-		tuple->t_infomask |= HEAP_XMIN_COMMITTED;
+		frz->t_infomask |= HEAP_XMIN_COMMITTED;
 		changed = true;
 	}
 
@@ -5332,81 +5325,139 @@ heap_freeze_tuple(HeapTupleHeader tuple, TransactionId cutoff_xid,
 			/*
 			 * This old multi cannot possibly be running.  If it was a locker
 			 * only, it can be removed without much further thought; but if it
-			 * contained an update, we need to preserve it.
+			 * contained an update, we might need to preserve it.
 			 */
 			if (HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask))
+			{
 				freeze_xmax = true;
+			}
 			else
 			{
-				TransactionId update_xid;
+				/* replace multi by update xid */
+				frz->xmax = HeapTupleGetUpdateXid(tuple);
+				frz->t_infomask &= ~HEAP_XMAX_BITS;
+				frz->t_infomask &= ~HEAP_XMAX_IS_MULTI;
 
-				update_xid = HeapTupleGetUpdateXid(tuple);
+				/* wasn't only a lock, xid needs to be valid */
+				Assert(TransactionIdIsValid(frz->xmax));
 
 				/*
-				 * The multixact has an update hidden within.  Get rid of it.
-				 *
-				 * If the update_xid is below the cutoff_xid, it necessarily
-				 * must be an aborted transaction.  In a primary server, such
-				 * an Xmax would have gotten marked invalid by
-				 * HeapTupleSatisfiesVacuum, but in a replica that is not
-				 * called before we are, so deal with it in the same way.
-				 *
-				 * If not below the cutoff_xid, then the tuple would have been
-				 * pruned by vacuum, if the update committed long enough ago,
-				 * and we wouldn't be freezing it; so it's either recently
-				 * committed, or in-progress.  Deal with this by setting the
-				 * Xmax to the update Xid directly and remove the IS_MULTI
-				 * bit.  (We know there cannot be running lockers in this
-				 * multi, because it's below the cutoff_multi value.)
+				 * If the xid is older than the cutoff, it has to have
+				 * aborted, otherwise it would have gotten pruned away.
 				 */
-
-				if (TransactionIdPrecedes(update_xid, cutoff_xid))
+				if (TransactionIdPrecedes(frz->xmax, cutoff_xid))
 				{
-					Assert(InRecovery || TransactionIdDidAbort(update_xid));
+					Assert(!TransactionIdDidCommit(frz->xmax));
 					freeze_xmax = true;
 				}
 				else
 				{
-					Assert(InRecovery || !TransactionIdIsInProgress(update_xid));
-					tuple->t_infomask &= ~HEAP_XMAX_BITS;
-					HeapTupleHeaderSetXmax(tuple, update_xid);
-					changed = true;
+					/* preserve xmax */
 				}
+				changed = true;
 			}
 		}
-		else if (HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask))
+		else if (MultiXactIdIsRunning(xid))
 		{
-			/* newer than the cutoff, so don't touch it */
-			;
+			/* cannot be below cutoff */
 		}
 		else
 		{
-			TransactionId	update_xid;
+			TransactionId	update_xid = InvalidTransactionId;
+			MultiXactMember *members = NULL;
+			MultiXactMember *newmembers = NULL;
+			int			nmembers;
+			int			nnewmembers = 0;
+			bool		has_live_members = false;
+			bool		mxact_needs_freeze = false;
+			int			i;
 
 			/*
-			 * This is a multixact which is not marked LOCK_ONLY, but which
-			 * is newer than the cutoff_multi.  If the update_xid is below the
-			 * cutoff_xid point, then we can just freeze the Xmax in the
-			 * tuple, removing it altogether.  This seems simple, but there
-			 * are several underlying assumptions:
-			 *
-			 * 1. A tuple marked by an multixact containing a very old
-			 * committed update Xid would have been pruned away by vacuum; we
-			 * wouldn't be freezing this tuple at all.
-			 *
-			 * 2. There cannot possibly be any live locking members remaining
-			 * in the multixact.  This is because if they were alive, the
-			 * update's Xid would had been considered, via the lockers'
-			 * snapshot's Xmin, as part the cutoff_xid.
-			 *
-			 * 3. We don't create new MultiXacts via MultiXactIdExpand() that
-			 * include a very old aborted update Xid: in that function we only
-			 * include update Xids corresponding to transactions that are
-			 * committed or in-progress.
+			 * For MultiXacts that are not below the cutoff, we need to check
+			 * whether any of the members are too old.
 			 */
-			update_xid = HeapTupleGetUpdateXid(tuple);
-			if (TransactionIdPrecedes(update_xid, cutoff_xid))
+			nmembers = GetMultiXactIdMembers(xid, &members, false);
+
+			if (nmembers <= 0)
+			{
+				/* pg_upgrade'd multi, just freeze away */
+				freeze_xmax = true;
+			}
+			else
+			{
+				newmembers = (MultiXactMember *)
+					palloc(sizeof(MultiXactMember) * (nmembers + 1));
+
+				for (i = 0; i < nmembers; i++)
+				{
+					bool keep = false;
+					bool isupdate = ISUPDATE_from_mxstatus(members[i].status);
+
+					if (TransactionIdPrecedes(members[i].xid, cutoff_xid))
+					{
+						/*
+						 * A potential updater could not have committed, tuple
+						 * would have gotten vacuumed away already.
+						 */
+						Assert(!isupdate || TransactionIdDidAbort(members[i].xid));
+						mxact_needs_freeze = true;
+					}
+					else if (TransactionIdIsInProgress(members[i].xid))
+					{
+						keep = true;
+						if (isupdate)
+							update_xid = members[i].xid;
+					}
+					else if (TransactionIdDidCommit(members[i].xid) && isupdate)
+					{
+						/*
+						 * Only updates need to be preserved when they have
+						 * committed, locks aren't interesting anymore.
+						 */
+						keep = true;
+						update_xid = members[i].xid;
+					}
+
+					if (keep)
+					{
+						newmembers[nnewmembers++] = members[i];
+						has_live_members = true;
+					}
+				}
+			}
+
+			if (!mxact_needs_freeze)
+			{
+				/* nothing to do */;
+			}
+			else if (has_live_members &&
+					 TransactionIdIsValid(update_xid) &&
+					 nnewmembers == 1)
+			{
+				/* only the updater is still alive, replace multixact by xid */
+				frz->xmax = update_xid;
+				frz->t_infomask &= ~HEAP_XMAX_BITS;
+				frz->t_infomask |= HEAP_XMAX_COMMITTED;
+				changed = true;
+				elog(LOG, "replace multi2");
+				/* do not clear HEAP_HOT_UPDATED, HEAP_KEYS_UPDATED just yet */
+			}
+			else if (has_live_members)
+			{
+				frz->xmax = CreateMultiXactId(nnewmembers, newmembers);
+				changed = true;
+				elog(LOG, "recreating multi");
+			}
+			else
+			{
 				freeze_xmax = true;
+			}
+
+			/* cleanup memory we might have allocated */
+			if (nmembers > 0)
+				pfree(members);
+			if (newmembers != NULL)
+				pfree(newmembers);
 		}
 	}
 	else if (TransactionIdIsNormal(xid) &&
@@ -5417,20 +5468,21 @@ heap_freeze_tuple(HeapTupleHeader tuple, TransactionId cutoff_xid,
 
 	if (freeze_xmax)
 	{
-		HeapTupleHeaderSetXmax(tuple, InvalidTransactionId);
+		frz->xmax = InvalidTransactionId;
 
 		/*
 		 * The tuple might be marked either XMAX_INVALID or XMAX_COMMITTED +
 		 * LOCKED.	Normalize to INVALID just to be sure no one gets confused.
 		 * Also get rid of the HEAP_KEYS_UPDATED bit.
 		 */
-		tuple->t_infomask &= ~HEAP_XMAX_BITS;
-		tuple->t_infomask |= HEAP_XMAX_INVALID;
-		HeapTupleHeaderClearHotUpdated(tuple);
-		tuple->t_infomask2 &= ~HEAP_KEYS_UPDATED;
+		frz->t_infomask &= ~HEAP_XMAX_BITS;
+		frz->t_infomask |= HEAP_XMAX_INVALID;
+		frz->t_infomask2 &= ~HEAP_HOT_UPDATED;
+		frz->t_infomask2 &= ~HEAP_KEYS_UPDATED;
 		changed = true;
 	}
 
+
 	/*
 	 * Old-style VACUUM FULL is gone, but we have to keep this code as long as
 	 * we support having MOVED_OFF/MOVED_IN tuples in the database.
@@ -5447,16 +5499,16 @@ heap_freeze_tuple(HeapTupleHeader tuple, TransactionId cutoff_xid,
 			 * xvac transaction succeeded.
 			 */
 			if (tuple->t_infomask & HEAP_MOVED_OFF)
-				HeapTupleHeaderSetXvac(tuple, InvalidTransactionId);
+				frz->freeze_xvac = true;
 			else
-				HeapTupleHeaderSetXvac(tuple, FrozenTransactionId);
+				frz->invalid_xvac = true;
 
 			/*
 			 * Might as well fix the hint bits too; usually XMIN_COMMITTED
 			 * will already be set here, but there's a small chance not.
 			 */
 			Assert(!(tuple->t_infomask & HEAP_XMIN_INVALID));
-			tuple->t_infomask |= HEAP_XMIN_COMMITTED;
+			frz->t_infomask |= HEAP_XMIN_COMMITTED;
 			changed = true;
 		}
 	}
@@ -5464,6 +5516,59 @@ heap_freeze_tuple(HeapTupleHeader tuple, TransactionId cutoff_xid,
 	return changed;
 }
 
+
+/*
+ * heap_freeze_tuple - freeze tuple inplace without WAL logging.
+ *
+ * Useful for callers like CLUSTER that perform their own WAL logging.
+ */
+bool
+heap_freeze_tuple(HeapTupleHeader tuple, TransactionId cutoff_xid,
+				  TransactionId cutoff_multi)
+{
+	xl_heap_freeze_tuple frz;
+	bool do_freeze;
+
+	do_freeze = heap_prepare_freeze_tuple(tuple, cutoff_xid, cutoff_multi, &frz);
+	if (do_freeze)
+		heap_execute_freeze_tuple(tuple, &frz);
+	return do_freeze;
+}
+
+/*
+ * heap_execute_freeze_tuple
+ *
+ * Execute the prepared freezing of a tuple.
+ *
+ * Note: it might seem we could make the changes without exclusive lock, since
+ * TransactionId read/write is assumed atomic anyway.  However there is a race
+ * condition: someone who just fetched an old XID that we overwrite here could
+ * conceivably not finish checking the XID against pg_clog before we finish
+ * the VACUUM and perhaps truncate off the part of pg_clog he needs.  Getting
+ * exclusive lock ensures no other backend is in process of checking the
+ * tuple status.  Also, getting exclusive lock makes it safe to adjust the
+ * infomask bits.
+ *
+ * NB: All code in here must be safe to execute during crash recovery!
+ */
+void
+heap_execute_freeze_tuple(HeapTupleHeader tuple, xl_heap_freeze_tuple *frz)
+{
+	if (frz->freeze_xmin)
+		HeapTupleHeaderSetXmin(tuple, FrozenTransactionId);
+
+	HeapTupleHeaderSetXmax(tuple, frz->xmax);
+
+	if (frz->freeze_xvac)
+		HeapTupleHeaderSetXvac(tuple, FrozenTransactionId);
+
+	if (frz->invalid_xvac)
+		HeapTupleHeaderSetXvac(tuple, InvalidTransactionId);
+
+	tuple->t_infomask = frz->t_infomask;
+	tuple->t_infomask2 = frz->t_infomask2;
+}
+
 /*
  * For a given MultiXactId, return the hint bits that should be set in the
  * tuple's infomask.
@@ -5767,16 +5872,26 @@ heap_tuple_needs_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
 		}
 		else if (MultiXactIdPrecedes(multi, cutoff_multi))
 			return true;
-		else if (HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask))
-		{
-			/* only-locker multis don't need internal examination */
-			;
-		}
 		else
 		{
-			if (TransactionIdPrecedes(HeapTupleGetUpdateXid(tuple),
-									  cutoff_xid))
-				return true;
+			MultiXactMember *members;
+			int			nmembers;
+			int		i;
+
+			/* need to check whether any member of the mxact is too old */
+
+			nmembers = GetMultiXactIdMembers(multi, &members, false);
+
+			for (i = 0; i < nmembers; i++)
+			{
+				if (TransactionIdPrecedes(members[i].xid, cutoff_xid))
+				{
+					pfree(members);
+					return true;
+				}
+			}
+			if (nmembers > 0)
+				pfree(members);
 		}
 	}
 	else
@@ -6031,22 +6146,22 @@ log_heap_clean(Relation reln, Buffer buffer,
  */
 XLogRecPtr
 log_heap_freeze(Relation reln, Buffer buffer,
-				TransactionId cutoff_xid, MultiXactId cutoff_multi,
-				OffsetNumber *offsets, int offcnt)
+				TransactionId cutoff_xid,
+				xl_heap_freeze_tuple *tuples, int ntuples)
 {
-	xl_heap_freeze xlrec;
+	xl_heap_freeze_page xlrec;
 	XLogRecPtr	recptr;
 	XLogRecData rdata[2];
 
 	/* Caller should not call me on a non-WAL-logged relation */
 	Assert(RelationNeedsWAL(reln));
 	/* nor when there are no tuples to freeze */
-	Assert(offcnt > 0);
+	Assert(ntuples > 0);
 
 	xlrec.node = reln->rd_node;
 	xlrec.block = BufferGetBlockNumber(buffer);
 	xlrec.cutoff_xid = cutoff_xid;
-	xlrec.cutoff_multi = cutoff_multi;
+	xlrec.ntuples = ntuples;
 
 	rdata[0].data = (char *) &xlrec;
 	rdata[0].len = SizeOfHeapFreeze;
@@ -6058,13 +6173,13 @@ log_heap_freeze(Relation reln, Buffer buffer,
 	 * it is.  When XLogInsert stores the whole buffer, the offsets array need
 	 * not be stored too.
 	 */
-	rdata[1].data = (char *) offsets;
-	rdata[1].len = offcnt * sizeof(OffsetNumber);
+	rdata[1].data = (char *) tuples;
+	rdata[1].len = ntuples * SizeOfHeapFreezeTuple;
 	rdata[1].buffer = buffer;
 	rdata[1].buffer_std = true;
 	rdata[1].next = NULL;
 
-	recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_FREEZE, rdata);
+	recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_FREEZE_PAGE, rdata);
 
 	return recptr;
 }
@@ -6406,6 +6521,99 @@ heap_xlog_clean(XLogRecPtr lsn, XLogRecord *record)
 	XLogRecordPageWithFreeSpace(xlrec->node, xlrec->block, freespace);
 }
 
+/*
+ * Freeze a single tuple for XLOG_HEAP2_FREEZE
+ *
+ * NB: Records of this type aren't generated anymore, since bugs around
+ * multixacts couldn't be fixed without a more robust type of freezing. This
+ * is kept around to be able to perform PITR.
+ */
+static bool
+heap_xlog_freeze_tuple(HeapTupleHeader tuple, TransactionId cutoff_xid,
+				  MultiXactId cutoff_multi)
+{
+	bool		changed = false;
+	TransactionId xid;
+
+	xid = HeapTupleHeaderGetXmin(tuple);
+	if (TransactionIdIsNormal(xid) &&
+		TransactionIdPrecedes(xid, cutoff_xid))
+	{
+		HeapTupleHeaderSetXmin(tuple, FrozenTransactionId);
+
+		/*
+		 * Might as well fix the hint bits too; usually XMIN_COMMITTED will
+		 * already be set here, but there's a small chance not.
+		 */
+		Assert(!(tuple->t_infomask & HEAP_XMIN_INVALID));
+		tuple->t_infomask |= HEAP_XMIN_COMMITTED;
+		changed = true;
+	}
+
+	/*
+	 * Note that this code handles IS_MULTI Xmax values, too, but only to mark
+	 * the tuple as not updated if the multixact is below the cutoff Multixact
+	 * given; it doesn't remove dead members of a very old multixact.
+	 */
+	xid = HeapTupleHeaderGetRawXmax(tuple);
+	if ((tuple->t_infomask & HEAP_XMAX_IS_MULTI) ?
+		(MultiXactIdIsValid(xid) &&
+		 MultiXactIdPrecedes(xid, cutoff_multi)) :
+		(TransactionIdIsNormal(xid) &&
+		 TransactionIdPrecedes(xid, cutoff_xid)))
+	{
+		HeapTupleHeaderSetXmax(tuple, InvalidTransactionId);
+
+		/*
+		 * The tuple might be marked either XMAX_INVALID or XMAX_COMMITTED +
+		 * LOCKED.	Normalize to INVALID just to be sure no one gets confused.
+		 * Also get rid of the HEAP_KEYS_UPDATED bit.
+		 */
+		tuple->t_infomask &= ~HEAP_XMAX_BITS;
+		tuple->t_infomask |= HEAP_XMAX_INVALID;
+		HeapTupleHeaderClearHotUpdated(tuple);
+		tuple->t_infomask2 &= ~HEAP_KEYS_UPDATED;
+		changed = true;
+	}
+
+	/*
+	 * Old-style VACUUM FULL is gone, but we have to keep this code as long as
+	 * we support having MOVED_OFF/MOVED_IN tuples in the database.
+	 */
+	if (tuple->t_infomask & HEAP_MOVED)
+	{
+		xid = HeapTupleHeaderGetXvac(tuple);
+		if (TransactionIdIsNormal(xid) &&
+			TransactionIdPrecedes(xid, cutoff_xid))
+		{
+			/*
+			 * If a MOVED_OFF tuple is not dead, the xvac transaction must
+			 * have failed; whereas a non-dead MOVED_IN tuple must mean the
+			 * xvac transaction succeeded.
+			 */
+			if (tuple->t_infomask & HEAP_MOVED_OFF)
+				HeapTupleHeaderSetXvac(tuple, InvalidTransactionId);
+			else
+				HeapTupleHeaderSetXvac(tuple, FrozenTransactionId);
+
+			/*
+			 * Might as well fix the hint bits too; usually XMIN_COMMITTED
+			 * will already be set here, but there's a small chance not.
+			 */
+			Assert(!(tuple->t_infomask & HEAP_XMIN_INVALID));
+			tuple->t_infomask |= HEAP_XMIN_COMMITTED;
+			changed = true;
+		}
+	}
+
+	return changed;
+}
+
+/*
+ * NB: Records of this type aren't generated anymore, since bugs around
+ * multixacts couldn't be fixed without a more robust type of freezing. This
+ * is kept around to be able to perform PITR.
+ */
 static void
 heap_xlog_freeze(XLogRecPtr lsn, XLogRecord *record)
 {
@@ -6454,7 +6662,7 @@ heap_xlog_freeze(XLogRecPtr lsn, XLogRecord *record)
 			ItemId		lp = PageGetItemId(page, *offsets);
 			HeapTupleHeader tuple = (HeapTupleHeader) PageGetItem(page, lp);
 
-			(void) heap_freeze_tuple(tuple, cutoff_xid, cutoff_multi);
+			(void) heap_xlog_freeze_tuple(tuple, cutoff_xid, cutoff_multi);
 			offsets++;
 		}
 	}
@@ -6578,6 +6786,59 @@ heap_xlog_visible(XLogRecPtr lsn, XLogRecord *record)
 	}
 }
 
+/*
+ * Replay XLOG_HEAP2_FREEZE_PAGE records
+ */
+static void
+heap_xlog_freeze_page(XLogRecPtr lsn, XLogRecord *record)
+{
+	xl_heap_freeze_page *xlrec = (xl_heap_freeze_page *) XLogRecGetData(record);
+	TransactionId cutoff_xid = xlrec->cutoff_xid;
+	Buffer		buffer;
+	Page		page;
+	int			ntup;
+
+	/*
+	 * In Hot Standby mode, ensure that there's no queries running which still
+	 * consider the frozen xids as running.
+	 */
+	if (InHotStandby)
+		ResolveRecoveryConflictWithSnapshot(cutoff_xid, xlrec->node);
+
+	/* If we have a full-page image, restore it and we're done */
+	if (record->xl_info & XLR_BKP_BLOCK(0))
+	{
+		(void) RestoreBackupBlock(lsn, record, 0, false, false);
+		return;
+	}
+
+	buffer = XLogReadBuffer(xlrec->node, xlrec->block, false);
+	if (!BufferIsValid(buffer))
+		return;
+
+	page = (Page) BufferGetPage(buffer);
+
+	if (lsn <= PageGetLSN(page))
+	{
+		UnlockReleaseBuffer(buffer);
+		return;
+	}
+
+	/* now execute freeze plan for each frozen tuple */
+	for (ntup = 0; ntup < xlrec->ntuples; ntup++)
+	{
+		xl_heap_freeze_tuple *xlrec_tp = &xlrec->tuples[ntup];
+		/* offsets are one-based */
+		ItemId			lp = PageGetItemId(page, xlrec_tp->off);
+		HeapTupleHeader tuple = (HeapTupleHeader) PageGetItem(page, lp);
+		heap_execute_freeze_tuple(tuple, xlrec_tp);
+	}
+
+	PageSetLSN(page, lsn);
+	MarkBufferDirty(buffer);
+	UnlockReleaseBuffer(buffer);
+}
+
 static void
 heap_xlog_newpage(XLogRecPtr lsn, XLogRecord *record)
 {
@@ -7433,6 +7694,9 @@ heap2_redo(XLogRecPtr lsn, XLogRecord *record)
 		case XLOG_HEAP2_CLEAN:
 			heap_xlog_clean(lsn, record);
 			break;
+		case XLOG_HEAP2_FREEZE_PAGE:
+			heap_xlog_freeze_page(lsn, record);
+			break;
 		case XLOG_HEAP2_CLEANUP_INFO:
 			heap_xlog_cleanup_info(lsn, record);
 			break;
diff --git a/src/backend/access/rmgrdesc/heapdesc.c b/src/backend/access/rmgrdesc/heapdesc.c
index e14c053..1b244b1 100644
--- a/src/backend/access/rmgrdesc/heapdesc.c
+++ b/src/backend/access/rmgrdesc/heapdesc.c
@@ -149,6 +149,15 @@ heap2_desc(StringInfo buf, uint8 xl_info, char *rec)
 						 xlrec->node.relNode, xlrec->block,
 						 xlrec->latestRemovedXid);
 	}
+	if (info == XLOG_HEAP2_FREEZE_PAGE)
+	{
+		xl_heap_freeze_page *xlrec = (xl_heap_freeze_page *) rec;
+
+		appendStringInfo(buf, "freeze_page: rel %u/%u/%u; blk %u; cutoff xid %u ntuples %u",
+						 xlrec->node.spcNode, xlrec->node.dbNode,
+						 xlrec->node.relNode, xlrec->block,
+						 xlrec->cutoff_xid, xlrec->ntuples);
+	}
 	else if (info == XLOG_HEAP2_CLEANUP_INFO)
 	{
 		xl_heap_cleanup_info *xlrec = (xl_heap_cleanup_info *) rec;
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 2081470..2a1bf6f 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -286,7 +286,6 @@ static MemoryContext MXactContext = NULL;
 
 /* internal MultiXactId management */
 static void MultiXactIdSetOldestVisible(void);
-static MultiXactId CreateMultiXactId(int nmembers, MultiXactMember *members);
 static void RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
 				   int nmembers, MultiXactMember *members);
 static MultiXactId GetNewMultiXactId(int nmembers, MultiXactOffset *offset);
@@ -672,7 +671,7 @@ ReadNextMultiXactId(void)
  *
  * NB: the passed members[] array will be sorted in-place.
  */
-static MultiXactId
+MultiXactId
 CreateMultiXactId(int nmembers, MultiXactMember *members)
 {
 	MultiXactId multi;
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index fe2d9e7..538f3b8 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -500,13 +500,14 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
 		bool		tupgone,
 					hastup;
 		int			prev_dead_count;
-		OffsetNumber frozen[MaxOffsetNumber];
+		xl_heap_freeze_tuple frozen[MaxOffsetNumber]; /* FIXME: stack ok? */
 		int			nfrozen;
 		Size		freespace;
 		bool		all_visible_according_to_vm;
 		bool		all_visible;
 		bool		has_dead_tuples;
 		TransactionId visibility_cutoff_xid = InvalidTransactionId;
+		int			i;
 
 		if (blkno == next_not_all_visible_block)
 		{
@@ -894,9 +895,9 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
 				 * Each non-removable tuple must be checked to see if it needs
 				 * freezing.  Note we already have exclusive buffer lock.
 				 */
-				if (heap_freeze_tuple(tuple.t_data, FreezeLimit,
-									  MultiXactCutoff))
-					frozen[nfrozen++] = offnum;
+				if (heap_prepare_freeze_tuple(tuple.t_data, FreezeLimit,
+											  MultiXactCutoff, &frozen[nfrozen]))
+					frozen[nfrozen++].off = offnum;
 			}
 		}						/* scan along page */
 
@@ -907,15 +908,32 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
 		 */
 		if (nfrozen > 0)
 		{
+			START_CRIT_SECTION();
+
 			MarkBufferDirty(buf);
+
+			/* execute collected freezes */
+			for (i = 0; i < nfrozen; i++)
+			{
+				ItemId		itemid;
+				HeapTupleHeader htup;
+
+				itemid = PageGetItemId(page, frozen[i].off);
+				htup = (HeapTupleHeader) PageGetItem(page, itemid);
+
+				heap_execute_freeze_tuple(htup, &frozen[i]);
+			}
+
+			/* Now WAL-log freezing if necessary */
 			if (RelationNeedsWAL(onerel))
 			{
 				XLogRecPtr	recptr;
 
 				recptr = log_heap_freeze(onerel, buf, FreezeLimit,
-										 MultiXactCutoff, frozen, nfrozen);
+										 frozen, nfrozen);
 				PageSetLSN(page, recptr);
 			}
+			END_CRIT_SECTION();
 		}
 
 		/*
diff --git a/src/backend/utils/time/tqual.c b/src/backend/utils/time/tqual.c
index 1ebc5ff..67d5fec 100644
--- a/src/backend/utils/time/tqual.c
+++ b/src/backend/utils/time/tqual.c
@@ -598,11 +598,13 @@ HeapTupleSatisfiesUpdate(HeapTuple htup, CommandId curcid,
 
 		/* no member, even just a locker, alive anymore */
 		if (!MultiXactIdIsRunning(HeapTupleHeaderGetRawXmax(tuple)))
+		{
 			SetHintBits(tuple, buffer, HEAP_XMAX_INVALID,
 						InvalidTransactionId);
+			return HeapTupleMayBeUpdated;
+		}
 
-		/* it must have aborted or crashed */
-		return HeapTupleMayBeUpdated;
+		return HeapTupleBeingUpdated;
 	}
 
 	if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetRawXmax(tuple)))
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 0d40398..e5864bb 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -148,8 +148,13 @@ extern HTSU_Result heap_lock_tuple(Relation relation, HeapTuple tuple,
 				bool follow_update,
 				Buffer *buffer, HeapUpdateFailureData *hufd);
 extern void heap_inplace_update(Relation relation, HeapTuple tuple);
+
+struct xl_heap_freeze_tuple;
 extern bool heap_freeze_tuple(HeapTupleHeader tuple, TransactionId cutoff_xid,
-				  TransactionId cutoff_multi);
+					  TransactionId cutoff_multi);
+extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple, TransactionId cutoff_xid,
+					  TransactionId cutoff_multi, struct xl_heap_freeze_tuple *frz);
+extern void heap_execute_freeze_tuple(HeapTupleHeader tuple, struct xl_heap_freeze_tuple *xlrec_tp);
 extern bool heap_tuple_needs_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
 						MultiXactId cutoff_multi, Buffer buf);
 
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 4381778..d2baa5b 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -50,7 +50,7 @@
  */
 #define XLOG_HEAP2_FREEZE		0x00
 #define XLOG_HEAP2_CLEAN		0x10
-/* 0x20 is free, was XLOG_HEAP2_CLEAN_MOVE */
+#define XLOG_HEAP2_FREEZE_PAGE	0x20
 #define XLOG_HEAP2_CLEANUP_INFO 0x30
 #define XLOG_HEAP2_VISIBLE		0x40
 #define XLOG_HEAP2_MULTI_INSERT 0x50
@@ -251,6 +251,33 @@ typedef struct xl_heap_freeze
 
 #define SizeOfHeapFreeze (offsetof(xl_heap_freeze, cutoff_multi) + sizeof(MultiXactId))
 
+/* This is what we need to know about tuple freezing during vacuum */
+typedef struct xl_heap_freeze_tuple
+{
+	OffsetNumber off;
+	bool freeze_xmin;
+	bool invalid_xvac;
+	bool freeze_xvac;
+	TransactionId xmax;
+	uint16		t_infomask2;
+	uint16		t_infomask;
+} xl_heap_freeze_tuple;
+
+#define SizeOfHeapFreezeTuple sizeof(xl_heap_freeze_tuple)
+
+/* This is what we need to know about a page's worth of tuple freezes */
+typedef struct xl_heap_freeze_block
+{
+	RelFileNode node;
+	BlockNumber block;
+	TransactionId cutoff_xid;
+	uint16		ntuples;
+	xl_heap_freeze_tuple tuples[1];
+} xl_heap_freeze_page;
+
+#define MinSizeOfHeapFreezeBlock (offsetof(xl_heap_freeze_block, tuples))
+
+
 /* This is what we need to know about setting a visibility map bit */
 typedef struct xl_heap_visible
 {
@@ -277,8 +304,7 @@ extern XLogRecPtr log_heap_clean(Relation reln, Buffer buffer,
 			   OffsetNumber *nowunused, int nunused,
 			   TransactionId latestRemovedXid);
 extern XLogRecPtr log_heap_freeze(Relation reln, Buffer buffer,
-				TransactionId cutoff_xid, MultiXactId cutoff_multi,
-				OffsetNumber *offsets, int offcnt);
+		  TransactionId cutoff_xid,	xl_heap_freeze_tuple *tuples, int ntuples);
 extern XLogRecPtr log_heap_visible(RelFileNode rnode, Buffer heap_buffer,
 				 Buffer vm_buffer, TransactionId cutoff_xid);
 extern XLogRecPtr log_newpage(RelFileNode *rnode, ForkNumber forkNum,
diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
index 6085ea3..cad91e2 100644
--- a/src/include/access/multixact.h
+++ b/src/include/access/multixact.h
@@ -79,6 +79,7 @@ typedef struct xl_multixact_create
 extern MultiXactId MultiXactIdCreate(TransactionId xid1,
 				  MultiXactStatus status1, TransactionId xid2,
 				  MultiXactStatus status2);
+extern MultiXactId CreateMultiXactId(int nmembers, MultiXactMember *members);
 extern MultiXactId MultiXactIdExpand(MultiXactId multi, TransactionId xid,
 				  MultiXactStatus status);
 extern MultiXactId ReadNextMultiXactId(void);
-- 
1.8.5.rc2.dirty
