On 13/04/18 13:08, Michael Paquier wrote:
> On Fri, Apr 13, 2018 at 02:15:35PM +0530, amul sul wrote:
>> I have looked into this and found that the issue is in heap_xlog_delete: we
>> failed to set the correct offset number from the target_tid when the
>> XLH_DELETE_IS_PARTITION_MOVE flag is set.
>
> Oh, this looks good to me.  So when a row was moved across partitions
> this could have caused incorrect tuple references on a standby, which
> could have caused corruption.

Hmm. So the problem was that HeapTupleHeaderSetMovedPartitions() only sets the block number to InvalidBlockNumber and leaves the offset number unchanged. WAL replay didn't preserve the offset number, so the master and the standby ended up with different offset numbers in the ctid.

Why does HeapTupleHeaderSetMovedPartitions() leave the offset number unchanged? The old offset number is meaningless without the block number. Also, bits and magic values in the tuple header are scarce. We're squandering a whole range of values in the ctid, everything with ip_blkid==InvalidBlockNumber, to mean "moved to different partition", when a single value would suffice.

Let's tighten that up. In the attached (untested) patch, I changed the macros so that "moved to different partition" is indicated by the magic TID (InvalidBlockNumber, 0xfffd). Offset number 0xfffe was already used for speculative insertion tokens, so this follows that precedent.

I kept using InvalidBlockNumber there, so ItemPointerIsValid() still considers those item pointers as invalid. But my gut feeling is actually that it would be better to use e.g. 0 as the block number, so that these item pointers would appear valid. Again, to follow the precedent of speculative insertion tokens. But I'm not sure if there was some well-thought-out reason to make them appear invalid. A comment on that would be nice, at least.

(Amit hinted at this in https://www.postgresql.org/message-id/CAA4eK1KtsTqsGDggDCrz2O9Jgo7ma-Co-B8%2Bv3L2zWMA2NHm6A%40mail.gmail.com. He was OK with the current approach, but I feel pretty strongly that we should also set the offset number.)

- Heikki
diff --git a/src/include/access/htup_details.h b/src/include/access/htup_details.h
index cf56d4ace4..1867a70f6f 100644
--- a/src/include/access/htup_details.h
+++ b/src/include/access/htup_details.h
@@ -83,13 +83,15 @@
  *
  * A word about t_ctid: whenever a new tuple is stored on disk, its t_ctid
  * is initialized with its own TID (location).  If the tuple is ever updated,
- * its t_ctid is changed to point to the replacement version of the tuple or
- * the block number (ip_blkid) is invalidated if the tuple is moved from one
- * partition to another partition relation due to an update of the partition
- * key.  Thus, a tuple is the latest version of its row iff XMAX is invalid or
+ * its t_ctid is changed to point to the replacement version of the tuple.  Or
+ * if the tuple is moved from one partition to another, due to an update of
+ * the partition key, t_ctid is set to a special value to indicate that
+ * (see ItemPointerSetMovedPartitions).  Thus, a tuple is the latest version
+ * of its row iff XMAX is invalid or
  * t_ctid points to itself (in which case, if XMAX is valid, the tuple is
  * either locked or deleted).  One can follow the chain of t_ctid links
- * to find the newest version of the row.  Beware however that VACUUM might
+ * to find the newest version of the row, unless it was moved to a different
+ * partition.  Beware however that VACUUM might
  * erase the pointed-to (newer) tuple before erasing the pointing (older)
  * tuple.  Hence, when following a t_ctid link, it is necessary to check
  * to see if the referenced slot is empty or contains an unrelated tuple.
@@ -288,14 +290,6 @@ struct HeapTupleHeaderData
 #define HEAP_TUPLE_HAS_MATCH	HEAP_ONLY_TUPLE /* tuple has a join match */
 
 /*
- * Special value used in t_ctid.ip_posid, to indicate that it holds a
- * speculative insertion token rather than a real TID.  This must be higher
- * than MaxOffsetNumber, so that it can be distinguished from a valid
- * offset number in a regular item pointer.
- */
-#define SpecTokenOffsetNumber		0xfffe
-
-/*
  * HeapTupleHeader accessor macros
  *
  * Note: beware of multiple evaluations of "tup" argument.  But the Set
@@ -447,11 +441,12 @@ do { \
 	ItemPointerSet(&(tup)->t_ctid, token, SpecTokenOffsetNumber) \
 )
 
-#define HeapTupleHeaderSetMovedPartitions(tup) \
-	ItemPointerSetMovedPartitions(&(tup)->t_ctid)
-
 #define HeapTupleHeaderIndicatesMovedPartitions(tup) \
-	ItemPointerIndicatesMovedPartitions(&tup->t_ctid)
+	(ItemPointerGetOffsetNumber(&(tup)->t_ctid) == MovedPartitionsOffsetNumber && \
+	 ItemPointerGetBlockNumberNoCheck(&(tup)->t_ctid) == MovedPartitionsBlockNumber)
+
+#define HeapTupleHeaderSetMovedPartitions(tup) \
+	ItemPointerSet(&(tup)->t_ctid, MovedPartitionsBlockNumber, MovedPartitionsOffsetNumber)
 
 #define HeapTupleHeaderGetDatumLength(tup) \
 	VARSIZE(tup)
diff --git a/src/include/storage/itemptr.h b/src/include/storage/itemptr.h
index 626c98f969..d87101f270 100644
--- a/src/include/storage/itemptr.h
+++ b/src/include/storage/itemptr.h
@@ -49,6 +49,28 @@ ItemPointerData;
 typedef ItemPointerData *ItemPointer;
 
 /* ----------------
+ *		special values used in heap tuples (t_ctid)
+ * ----------------
+ */
+
+/*
+ * If a heap tuple holds a speculative insertion token rather than a real
+ * TID, ip_posid is set to SpecTokenOffsetNumber, and the token is stored in
+ * ip_blkid. SpecTokenOffsetNumber must be higher than MaxOffsetNumber, so
+ * that it can be distinguished from a valid offset number in a regular item
+ * pointer.
+ */
+#define SpecTokenOffsetNumber		0xfffe
+
+/*
+ * When a tuple is moved to a different partition by UPDATE, the t_ctid of
+ * the old tuple version is set to this magic value.
+ */
+#define MovedPartitionsOffsetNumber 0xfffd
+#define MovedPartitionsBlockNumber	InvalidBlockNumber
+
+
+/* ----------------
  *		support macros
  * ----------------
  */
@@ -160,7 +182,10 @@ typedef ItemPointerData *ItemPointer;
  *		partition.
  */
 #define ItemPointerIndicatesMovedPartitions(pointer) \
-	!BlockNumberIsValid(ItemPointerGetBlockNumberNoCheck(pointer))
+( \
+	ItemPointerGetOffsetNumber(pointer) == MovedPartitionsOffsetNumber && \
+	ItemPointerGetBlockNumberNoCheck(pointer) == MovedPartitionsBlockNumber \
+)
 
 /*
  * ItemPointerSetMovedPartitions
@@ -168,7 +193,7 @@ typedef ItemPointerData *ItemPointer;
  *		different partition.
  */
 #define ItemPointerSetMovedPartitions(pointer) \
-	ItemPointerSetBlockNumber((pointer), InvalidBlockNumber)
+	ItemPointerSet((pointer), MovedPartitionsBlockNumber, MovedPartitionsOffsetNumber)
 
 /* ----------------
  *		externs
