Attached is the new version of patch that addresses the comments from
Asim Praveen and Masahiko-san. It also improves the documentation to
some extent.


On Tue, Aug 18, 2020 at 1:46 PM Ashutosh Sharma <ashu.coe...@gmail.com> wrote:
>
> Hello Masahiko-san,
>
> I've spent some more time trying to understand the code in
> lazy_scan_heap function to know under what all circumstances a VACUUM
> can fail with "found xmin ... before relfrozenxid ..." error for a
> tuple whose xmin is behind relfrozenxid. Here are my observations:
>
> 1) It can fail with this error for a live tuple
>
> OR,
>
> 2) It can also fail with this error if a tuple (that went through
> update) is marked as HEAP_HOT_UPDATED or HEAP_ONLY_TUPLE.
>
> OR,
>
> 3) If there are any concurrent transactions, then the tuple might be
> marked as HEAPTUPLE_INSERT_IN_PROGRESS or HEAPTUPLE_DELETE_IN_PROGRESS
> or HEAPTUPLE_RECENTLY_DEAD in which case also VACUUM can fail with
> this error.
>
> Now, AFAIU, as we will be dealing with a damaged table, the chances of
> point #3 being the cause of this error looks impossible in our case
> because I don't think we will be doing anything in parallel when
> performing surgery on a damaged table, in fact we shouldn't be doing
> any such things. However, it is quite possible that reason #2 could
> cause VACUUM to fail with this sort of error, but, as we are already
> skipping redirected item pointers in heap_force_common(), I think, we
> would never be marking HEAP_HOT_UPDATED tuple as frozen and I don't
> see any problem in marking HEAP_ONLY_TUPLE as frozen. So, probably, we
> may not need to handle point #2 as well.
>
> Further, I also don't see VACUUM reporting this error for a tuple that
> has been moved from one partition to another. So, I think we might not
> need to do any special handling for a tuple that got updated and its
> new version was moved to another partition.
>
> If you feel I am missing something here, please correct me. Thank you.
>
> Moreover, while I was exploring on above, I noticed that in
> lazy_scan_heap(), before we call HeapTupleSatisfiesVacuum() we check
> for a redirected item pointers and if any redirected item pointer is
> detected we do not call HeapTupleSatisfiesVacuum(). So, not sure how
> HeapTupleSatisfiesVacuum would ever return a dead tuple that is marked
> with HEAP_HOT_UPDATED. I am referring to the following code in
> lazy_scan_heap().
>
>         for (offnum = FirstOffsetNumber;
>              offnum <= maxoff;
>              offnum = OffsetNumberNext(offnum))
>         {
>             ItemId      itemid;
>
>             itemid = PageGetItemId(page, offnum);
>
> .............
> .............
>
>
>             /* Redirect items mustn't be touched */ <-- this check
> would bypass the redirected item pointers from being checked for
> HeapTupleSatisfiesVacuum.
>             if (ItemIdIsRedirected(itemid))
>             {
>                 hastup = true;  /* this page won't be truncatable */
>                 continue;
>             }
>
> ..............
> ..............
>
>             switch (HeapTupleSatisfiesVacuum(&tuple, OldestXmin, buf))
>             {
>                 case HEAPTUPLE_DEAD:
>
>                     if (HeapTupleIsHotUpdated(&tuple) ||
>                         HeapTupleIsHeapOnly(&tuple) ||
>                         params->index_cleanup == VACOPT_TERNARY_DISABLED)
>                         nkeep += 1;
>                     else
>                         tupgone = true; /* we can delete the tuple */
> ..............
> ..............
>              }
>
>
> So, the point is, would HeapTupleIsHotUpdated(&tuple) ever be true?
>
> --
> With Regards,
> Ashutosh Sharma
> EnterpriseDB:http://www.enterprisedb.com
>
> On Mon, Aug 17, 2020 at 11:35 AM Ashutosh Sharma <ashu.coe...@gmail.com> 
> wrote:
> >
> > Hello Masahiko-san,
> >
> > Thanks for the review. Please check the comments inline below:
> >
> > On Fri, Aug 14, 2020 at 10:07 AM Masahiko Sawada
> > <masahiko.saw...@2ndquadrant.com> wrote:
> >
> > > Thank you for updating the patch! Here are my comments on v5 patch:
> > >
> > > --- a/contrib/Makefile
> > > +++ b/contrib/Makefile
> > > @@ -35,6 +35,7 @@ SUBDIRS = \
> > >         pg_standby  \
> > >         pg_stat_statements \
> > >         pg_trgm     \
> > > +       pg_surgery  \
> > >         pgcrypto    \
> > >
> > > I guess we use alphabetical order here. So pg_surgery should be placed
> > > before pg_trgm.
> > >
> >
> > Okay, will take care of this in the next version of patch.
> >
> > > ---
> > > +           if (heap_force_opt == HEAP_FORCE_KILL)
> > > +               ItemIdSetDead(itemid);
> > >
> > > I think that if the page is an all-visible page, we should clear an
> > > all-visible bit on the visibility map corresponding to the page and
> > > PD_ALL_VISIBLE on the page header. Otherwise, index only scan would
> > > return the wrong results.
> > >
> >
> > I think we should let VACUUM do that. Please note that this module is
> > intended to be used only on a damaged relation and should only be
> > operated on damaged tuples of such relations. And the execution of any
> > of the functions provided by this module on a damaged relation must be
> > followed by VACUUM with DISABLE_PAGE_SKIPPING option on that relation.
> > This is necessary to bring back a damaged relation to the sane state
> > once a surgery is performed on it. I will try to add this note in the
> > documentation for this module.
> >
> > > ---
> > > +       /*
> > > +        * We do not mark the buffer dirty or do WAL logging for unmodifed
> > > +        * pages.
> > > +        */
> > > +       if (!did_modify_page)
> > > +           goto skip_wal;
> > > +
> > > +       /* Mark buffer dirty before we write WAL. */
> > > +       MarkBufferDirty(buf);
> > > +
> > > +       /* XLOG stuff */
> > > +       if (RelationNeedsWAL(rel))
> > > +           log_newpage_buffer(buf, true);
> > > +
> > > +skip_wal:
> > > +       END_CRIT_SECTION();
> > > +
> > >
> > > s/unmodifed/unmodified/
> > >
> >
> > okay, will fix this typo.
> >
> > > Do we really need to use goto? I think we can modify it like follows:
> > >
> > >     if (did_modity_page)
> > >     {
> > >        /* Mark buffer dirty before we write WAL. */
> > >        MarkBufferDirty(buf);
> > >
> > >        /* XLOG stuff */
> > >        if (RelationNeedsWAL(rel))
> > >            log_newpage_buffer(buf, true);
> > >     }
> > >
> > >     END_CRIT_SECTION();
> > >
> >
> > No, we don't need it. We can achieve the same by checking the status
> > of did_modify_page flag as you suggested. I will do this change in the
> > next version.
> >
> > > ---
> > > pg_force_freeze() can revival a tuple that is already deleted but not
> > > vacuumed yet. Therefore, the user might need to reindex indexes after
> > > using that function. For instance, with the following script, the last
> > > two queries: index scan and seq scan, will return different results.
> > >
> > > set enable_seqscan to off;
> > > set enable_bitmapscan to off;
> > > set enable_indexonlyscan to off;
> > > create table tbl (a int primary key);
> > > insert into tbl values (1);
> > >
> > > update tbl set a = a + 100 where a = 1;
> > >
> > > explain analyze select * from tbl where a < 200;
> > >
> > > -- revive deleted tuple on heap
> > > select heap_force_freeze('tbl', array['(0,1)'::tid]);
> > >
> > > -- index scan returns 2 tuples
> > > explain analyze select * from tbl where a < 200;
> > >
> > > -- seq scan returns 1 tuple
> > > set enable_seqscan to on;
> > > explain analyze select * from tbl;
> > >
> >
> > I am not sure if this is the right use-case of pg_force_freeze
> > function. I think we should only be running pg_force_freeze function
> > on a tuple for which VACUUM reports "found xmin ABC from before
> > relfrozenxid PQR" sort of error otherwise it might worsen the things
> > instead of making it better. Now, the question is - can VACUUM report
> > this type of error for a deleted tuple or it would only report it for
> > a live tuple? AFAIU this won't be reported for the deleted tuples
> > because VACUUM wouldn't consider freezing a tuple that has been
> > deleted.
> >
> > > Also, if a tuple updated and moved to another partition is revived by
> > > heap_force_freeze(), its ctid still has special values:
> > > MovedPartitionsOffsetNumber and MovedPartitionsBlockNumber. I don't
> > > see a problem yet caused by a visible tuple having the special ctid
> > > value, but it might be worth considering either to reset ctid value as
> > > well or to not freezing already-deleted tuple.
> > >
> >
> > For this as well, the answer remains the same as above.
> >
> > --
> > With Regards,
> > Ashutosh Sharma
> > EnterpriseDB:http://www.enterprisedb.com
From d765464116ee87ed64ea270ba75784954bc75377 Mon Sep 17 00:00:00 2001
From: ashu <ashutosh.sha...@enterprisedb.com>
Date: Tue, 18 Aug 2020 16:14:15 +0530
Subject: [PATCH] Add contrib/pg_surgery to perform surgery on a damaged heap
 table.

This contrib module basically adds a couple of functions named
heap_force_kill and heap_force_freeze that can be used in the scripts
to perform surgery on the damaged heap tables.

Ashutosh Sharma.
---
 contrib/Makefile                           |   1 +
 contrib/pg_surgery/Makefile                |  23 ++
 contrib/pg_surgery/expected/pg_surgery.out | 161 +++++++++++++
 contrib/pg_surgery/heap_surgery.c          | 375 +++++++++++++++++++++++++++++
 contrib/pg_surgery/pg_surgery--1.0.sql     |  18 ++
 contrib/pg_surgery/pg_surgery.control      |   5 +
 contrib/pg_surgery/sql/pg_surgery.sql      |  89 +++++++
 doc/src/sgml/contrib.sgml                  |   1 +
 doc/src/sgml/filelist.sgml                 |   1 +
 doc/src/sgml/pgsurgery.sgml                | 121 ++++++++++
 10 files changed, 795 insertions(+)
 create mode 100644 contrib/pg_surgery/Makefile
 create mode 100644 contrib/pg_surgery/expected/pg_surgery.out
 create mode 100644 contrib/pg_surgery/heap_surgery.c
 create mode 100644 contrib/pg_surgery/pg_surgery--1.0.sql
 create mode 100644 contrib/pg_surgery/pg_surgery.control
 create mode 100644 contrib/pg_surgery/sql/pg_surgery.sql
 create mode 100644 doc/src/sgml/pgsurgery.sgml

diff --git a/contrib/Makefile b/contrib/Makefile
index 1846d41..c8d2a16 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -34,6 +34,7 @@ SUBDIRS = \
 		pg_prewarm	\
 		pg_standby	\
 		pg_stat_statements \
+		pg_surgery	\
 		pg_trgm		\
 		pgcrypto	\
 		pgrowlocks	\
diff --git a/contrib/pg_surgery/Makefile b/contrib/pg_surgery/Makefile
new file mode 100644
index 0000000..ecf2e20
--- /dev/null
+++ b/contrib/pg_surgery/Makefile
@@ -0,0 +1,23 @@
+# contrib/pg_surgery/Makefile
+
+MODULE_big = pg_surgery
+OBJS = \
+	$(WIN32RES) \
+	heap_surgery.o
+
+EXTENSION = pg_surgery
+DATA = pg_surgery--1.0.sql
+PGFILEDESC = "pg_surgery - perform surgery on a damaged relation"
+
+REGRESS = pg_surgery
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/pg_surgery
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/pg_surgery/expected/pg_surgery.out b/contrib/pg_surgery/expected/pg_surgery.out
new file mode 100644
index 0000000..9858de2
--- /dev/null
+++ b/contrib/pg_surgery/expected/pg_surgery.out
@@ -0,0 +1,161 @@
+create extension pg_surgery;
+--
+-- check that using heap_force_kill and heap_force_freeze functions with the
+-- supported relations passes.
+--
+-- normal heap table.
+begin;
+create table htab(a int);
+insert into htab values (100), (200), (300), (400), (500);
+select * from htab where xmin = 2;
+ a 
+---
+(0 rows)
+
+select heap_force_freeze('htab'::regclass, ARRAY['(0, 4)']::tid[]);
+ heap_force_freeze 
+-------------------
+ 
+(1 row)
+
+select ctid, xmax from htab where xmin = 2;
+ ctid  | xmax 
+-------+------
+ (0,4) |    0
+(1 row)
+
+select heap_force_kill('htab'::regclass, ARRAY['(0, 4)']::tid[]);
+ heap_force_kill 
+-----------------
+ 
+(1 row)
+
+select * from htab where ctid = '(0, 4)';
+ a 
+---
+(0 rows)
+
+rollback;
+-- materialized view.
+begin;
+create materialized view mvw as select a from generate_series(1, 3) a;
+select * from mvw where xmin = 2;
+ a 
+---
+(0 rows)
+
+select heap_force_freeze('mvw'::regclass, ARRAY['(0, 3)']::tid[]);
+ heap_force_freeze 
+-------------------
+ 
+(1 row)
+
+select * from mvw where xmin = 2;
+ a 
+---
+ 3
+(1 row)
+
+select heap_force_kill('mvw'::regclass, ARRAY['(0, 3)']::tid[]);
+ heap_force_kill 
+-----------------
+ 
+(1 row)
+
+select * from mvw where ctid = '(0, 3)';
+ a 
+---
+(0 rows)
+
+rollback;
+--
+-- check that using heap_force_kill and heap_force_freeze functions with the
+-- unsupported relations fails.
+--
+-- partitioned tables (the parent table) doesn't contain any tuple.
+create table ptab (a int) partition by list (a);
+select heap_force_kill('ptab'::regclass, ARRAY['(0, 1)']::tid[]);
+ERROR:  only the relation using heap_tableam_handler is supported
+select heap_force_freeze('ptab'::regclass, ARRAY['(0, 1)']::tid[]);
+ERROR:  only the relation using heap_tableam_handler is supported
+create index ptab_idx on ptab (a);
+-- indexes are not supported, should fail.
+select heap_force_kill('ptab_idx'::regclass, ARRAY['(0, 1)']::tid[]);
+ERROR:  only the relation using heap_tableam_handler is supported
+select heap_force_freeze('ptab_idx'::regclass, ARRAY['(0, 1)']::tid[]);
+ERROR:  only the relation using heap_tableam_handler is supported
+create view vw as select 1;
+-- views are not supported as well. so, all these should fail.
+select heap_force_kill('vw'::regclass, ARRAY['(0, 1)']::tid[]);
+ERROR:  only the relation using heap_tableam_handler is supported
+select heap_force_freeze('vw'::regclass, ARRAY['(0, 1)']::tid[]);
+ERROR:  only the relation using heap_tableam_handler is supported
+create sequence seq;
+-- sequences are not supported as well. so, all these functions should fail.
+select heap_force_kill('seq'::regclass, ARRAY['(0, 1)']::tid[]);
+ERROR:  "seq" is not a table, materialized view, or TOAST table
+select heap_force_freeze('seq'::regclass, ARRAY['(0, 1)']::tid[]);
+ERROR:  "seq" is not a table, materialized view, or TOAST table
+--
+-- Some negative test-cases with invalid inputs.
+--
+begin;
+create table htab(a int);
+-- invalid block number, should be skipped.
+select heap_force_kill('htab'::regclass, ARRAY['(0, 1)']::tid[]);
+NOTICE:  skipping block 0 for relation "htab" because the block number is out of range
+ heap_force_kill 
+-----------------
+ 
+(1 row)
+
+select heap_force_freeze('htab'::regclass, ARRAY['(0, 1)']::tid[]);
+NOTICE:  skipping block 0 for relation "htab" because the block number is out of range
+ heap_force_freeze 
+-------------------
+ 
+(1 row)
+
+insert into htab values (10);
+-- invalid offset number, should be skipped.
+select heap_force_kill('htab'::regclass, ARRAY['(0, 2)']::tid[]);
+NOTICE:  skipping tid (0, 2) for relation "htab" because the item number is out of range for this block
+ heap_force_kill 
+-----------------
+ 
+(1 row)
+
+select heap_force_freeze('htab'::regclass, ARRAY['(0, 2)']::tid[]);
+NOTICE:  skipping tid (0, 2) for relation "htab" because the item number is out of range for this block
+ heap_force_freeze 
+-------------------
+ 
+(1 row)
+
+-- dead tuple, should be skipped.
+select heap_force_kill('htab'::regclass, ARRAY['(0, 1)']::tid[]);
+ heap_force_kill 
+-----------------
+ 
+(1 row)
+
+select heap_force_kill('htab'::regclass, ARRAY['(0, 1)']::tid[]);
+NOTICE:  skipping tid (0, 1) for relation "htab" because it is marked dead
+ heap_force_kill 
+-----------------
+ 
+(1 row)
+
+select heap_force_freeze('htab'::regclass, ARRAY['(0, 1)']::tid[]);
+NOTICE:  skipping tid (0, 1) for relation "htab" because it is marked dead
+ heap_force_freeze 
+-------------------
+ 
+(1 row)
+
+rollback;
+-- cleanup.
+drop table ptab;
+drop view vw;
+drop sequence seq;
+drop extension pg_surgery;
diff --git a/contrib/pg_surgery/heap_surgery.c b/contrib/pg_surgery/heap_surgery.c
new file mode 100644
index 0000000..68d8205
--- /dev/null
+++ b/contrib/pg_surgery/heap_surgery.c
@@ -0,0 +1,375 @@
+/*-------------------------------------------------------------------------
+ *
+ * heap_surgery.c
+ *	  Functions to perform surgery on the damaged heap table.
+ *
+ * Copyright (c) 2020, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  contrib/pg_surgery/heap_surgery.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/heapam.h"
+#include "catalog/pg_am_d.h"
+#include "catalog/pg_proc_d.h"
+#include "miscadmin.h"
+#include "storage/bufmgr.h"
+#include "utils/acl.h"
+#include "utils/rel.h"
+
+PG_MODULE_MAGIC;
+
+/* Options to forcefully change the state of a heap tuple. */
+typedef enum HeapTupleForceOption
+{
+	HEAP_FORCE_KILL,
+	HEAP_FORCE_FREEZE
+} HeapTupleForceOption;
+
+PG_FUNCTION_INFO_V1(heap_force_kill);
+PG_FUNCTION_INFO_V1(heap_force_freeze);
+
+static int32 tidcmp(const void *a, const void *b);
+static Datum heap_force_common(FunctionCallInfo fcinfo,
+							   HeapTupleForceOption heap_force_opt);
+static void sanity_check_tid_array(ArrayType *ta, int *ntids);
+static void sanity_check_relation(Relation rel);
+static BlockNumber tids_same_page_fetch_offnums(ItemPointer tids, int ntids,
+												OffsetNumber *next_start_ptr,
+												OffsetNumber *offnos);
+
+/*-------------------------------------------------------------------------
+ * heap_force_kill()
+ *
+ * Force kill the tuple(s) pointed to by the item pointer(s) stored in the
+ * given tid array.
+ *
+ * Usage: SELECT heap_force_kill(regclass, tid[]);
+ *-------------------------------------------------------------------------
+ */
+Datum
+heap_force_kill(PG_FUNCTION_ARGS)
+{
+	PG_RETURN_DATUM(heap_force_common(fcinfo, HEAP_FORCE_KILL));
+}
+
+/*-------------------------------------------------------------------------
+ * heap_force_freeze()
+ *
+ * Force freeze the tuple(s) pointed to by the item pointer(s) stored in the
+ * given tid array.
+ *
+ * Usage: SELECT heap_force_freeze(regclass, tid[]);
+ *-------------------------------------------------------------------------
+ */
+Datum
+heap_force_freeze(PG_FUNCTION_ARGS)
+{
+	PG_RETURN_DATUM(heap_force_common(fcinfo, HEAP_FORCE_FREEZE));
+}
+
+/*-------------------------------------------------------------------------
+ * heap_force_common()
+ *
+ * Common code for heap_force_kill and heap_force_freeze
+ *-------------------------------------------------------------------------
+ */
+static Datum
+heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
+{
+	Oid				relid = PG_GETARG_OID(0);
+	ArrayType	   *ta = PG_GETARG_ARRAYTYPE_P_COPY(1);
+	ItemPointer		tids;
+	int				ntids,
+					nblocks;
+	Relation		rel;
+	OffsetNumber   *offnos;
+	OffsetNumber	noffs,
+					curr_start_ptr,
+					next_start_ptr;
+
+	if (RecoveryInProgress())
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("recovery is in progress"),
+				 errhint("heap surgery functions cannot be executed during recovery.")));
+
+	/* Basic sanity checking. */
+	sanity_check_tid_array(ta, &ntids);
+
+	rel = relation_open(relid, RowExclusiveLock);
+
+	sanity_check_relation(rel);
+
+	tids = ((ItemPointer) ARR_DATA_PTR(ta));
+
+	/*
+	 * If there is more than one tid in the array, sort it so that we can
+	 * easily fetch all the tids belonging to one particular page from the
+	 * array.
+	 */
+	if (ntids > 1)
+		qsort((void*) tids, ntids, sizeof(ItemPointerData), tidcmp);
+
+	offnos = (OffsetNumber *) palloc(ntids * sizeof(OffsetNumber));
+	noffs = curr_start_ptr = next_start_ptr = 0;
+	nblocks = RelationGetNumberOfBlocks(rel);
+
+	do
+	{
+		Buffer			buf;
+		Page			page;
+		BlockNumber		blkno;
+		OffsetNumber	maxoffset;
+		int				i;
+		bool			did_modify_page = false;
+
+		/*
+		 * Get the offset numbers from the tids belonging to one particular page
+		 * and process them one by one.
+		 */
+		blkno = tids_same_page_fetch_offnums(tids, ntids, &next_start_ptr,
+											 offnos);
+
+		/* Calculate the number of offsets stored in offnos array. */
+		noffs = next_start_ptr - curr_start_ptr;
+
+		/*
+		 * Update the current start pointer so that next time when
+		 * tids_same_page_fetch_offnums() is called, we can calculate the number
+		 * of offsets present in the offnos array.
+		 */
+		curr_start_ptr = next_start_ptr;
+
+		/* Check whether the block number is valid. */
+		if (blkno >= nblocks)
+		{
+			ereport(NOTICE,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("skipping block %u for relation \"%s\" because the block number is out of range",
+							blkno, RelationGetRelationName(rel))));
+			continue;
+		}
+
+		CHECK_FOR_INTERRUPTS();
+
+		buf = ReadBuffer(rel, blkno);
+		LockBufferForCleanup(buf);
+
+		page = BufferGetPage(buf);
+
+		maxoffset = PageGetMaxOffsetNumber(page);
+
+		/* No ereport(ERROR) from here until all the changes are logged. */
+		START_CRIT_SECTION();
+
+		for (i = 0; i < noffs; i++)
+		{
+			OffsetNumber	offno;
+			ItemId			itemid;
+
+			if (offnos[i] == 0 || offnos[i] > maxoffset)
+			{
+				ereport(NOTICE,
+						 errmsg("skipping tid (%u, %u) for relation \"%s\" because the item number is out of range for this block",
+								blkno, offnos[i], RelationGetRelationName(rel)));
+				continue;
+			}
+
+			itemid = PageGetItemId(page, offnos[i]);
+
+			/* Follow any redirections until we find something useful. */
+			while (ItemIdIsRedirected(itemid))
+			{
+				offno = ItemIdGetRedirect(itemid);
+				itemid = PageGetItemId(page, offno);
+				CHECK_FOR_INTERRUPTS();
+			}
+
+			/* Nothing to do if the itemid is unused or already dead. */
+			if (!ItemIdIsUsed(itemid) || ItemIdIsDead(itemid))
+			{
+				if (!ItemIdIsUsed(itemid))
+					ereport(NOTICE,
+							(errmsg("skipping tid (%u, %u) for relation \"%s\" because it is marked unused",
+									blkno, offnos[i], RelationGetRelationName(rel))));
+				else
+					ereport(NOTICE,
+							(errmsg("skipping tid (%u, %u) for relation \"%s\" because it is marked dead",
+									blkno, offnos[i], RelationGetRelationName(rel))));
+				continue;
+			}
+
+			Assert(ItemIdIsNormal(itemid));
+
+			did_modify_page = true;
+
+			if (heap_force_opt == HEAP_FORCE_KILL)
+				ItemIdSetDead(itemid);
+			else
+			{
+				HeapTupleHeader htup;
+
+				Assert(heap_force_opt == HEAP_FORCE_FREEZE);
+
+				htup = (HeapTupleHeader) PageGetItem(page, itemid);
+
+				HeapTupleHeaderSetXmin(htup, FrozenTransactionId);
+				HeapTupleHeaderSetXmax(htup, InvalidTransactionId);
+
+				/* We might have MOVED_OFF/MOVED_IN tuples in the database */
+				if (htup->t_infomask & HEAP_MOVED)
+				{
+					if (htup->t_infomask & HEAP_MOVED_OFF)
+						HeapTupleHeaderSetXvac(htup, InvalidTransactionId);
+					else
+						HeapTupleHeaderSetXvac(htup, FrozenTransactionId);
+				}
+
+				/*
+				 * Clear all the visibility-related bits of this tuple and mark
+				 * it as frozen. Also, get rid of HOT_UPDATED and KEYS_UPDATES
+				 * bits.
+				 */
+				htup->t_infomask &= ~HEAP_XACT_MASK;
+				htup->t_infomask |= (HEAP_XMIN_FROZEN | HEAP_XMAX_INVALID);
+				htup->t_infomask2 &= ~HEAP_HOT_UPDATED;
+				htup->t_infomask2 &= ~HEAP_KEYS_UPDATED;
+			}
+		}
+
+		/*
+		 * If the page was modified, only then, we mark the buffer dirty or do
+		 * the WAL logging.
+		 */
+		if (did_modify_page)
+		{
+			/* Mark buffer dirty before we write WAL. */
+			MarkBufferDirty(buf);
+
+			/* XLOG stuff */
+			if (RelationNeedsWAL(rel))
+				log_newpage_buffer(buf, true);
+		}
+
+		END_CRIT_SECTION();
+
+		UnlockReleaseBuffer(buf);
+	} while (next_start_ptr != ntids);
+
+	relation_close(rel, RowExclusiveLock);
+
+	pfree(ta);
+	pfree(offnos);
+
+	PG_RETURN_VOID();
+}
+
+/*-------------------------------------------------------------------------
+ * tidcmp()
+ *
+ * Compare two item pointers, return -1, 0, or +1.
+ *
+ * See ItemPointerCompare for details.
+ * ------------------------------------------------------------------------
+ */
+static int32
+tidcmp(const void *a, const void *b)
+{
+	ItemPointer iptr1 = ((const ItemPointer) a);
+	ItemPointer iptr2 = ((const ItemPointer) b);
+
+	return ItemPointerCompare(iptr1, iptr2);
+}
+
+/*-------------------------------------------------------------------------
+ * sanity_check_tid_array()
+ *
+ * Perform sanity check on the given tid array.
+ * ------------------------------------------------------------------------
+ */
+static void
+sanity_check_tid_array(ArrayType *ta, int *ntids)
+{
+	if (ARR_HASNULL(ta) && array_contains_nulls(ta))
+		ereport(ERROR,
+				(errcode(ERRCODE_NULL_VALUE_NOT_ALLOWED),
+				 errmsg("array must not contain nulls")));
+
+	*ntids = ArrayGetNItems(ARR_NDIM(ta), ARR_DIMS(ta));
+
+	if (*ntids == 0)
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("empty tid array")));
+}
+
+/*-------------------------------------------------------------------------
+ * sanity_check_relation()
+ *
+ * Perform sanity check on the given relation.
+ * ------------------------------------------------------------------------
+ */
+static void
+sanity_check_relation(Relation rel)
+{
+	if (rel->rd_amhandler != HEAP_TABLE_AM_HANDLER_OID)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("only the relation using heap_tableam_handler is supported")));
+
+	if (rel->rd_rel->relkind != RELKIND_RELATION &&
+		rel->rd_rel->relkind != RELKIND_MATVIEW &&
+		rel->rd_rel->relkind != RELKIND_TOASTVALUE)
+		ereport(ERROR,
+				(errcode(ERRCODE_WRONG_OBJECT_TYPE),
+				 errmsg("\"%s\" is not a table, materialized view, or TOAST table",
+						RelationGetRelationName(rel))));
+
+	/* Must be owner of the table or superuser. */
+	if (!pg_class_ownercheck(RelationGetRelid(rel), GetUserId()))
+		aclcheck_error(ACLCHECK_NOT_OWNER,
+					   get_relkind_objtype(rel->rd_rel->relkind),
+					   RelationGetRelationName(rel));
+}
+
+/*-------------------------------------------------------------------------
+ * tids_same_page_fetch_offnums()
+ *
+ * Find out all the tids residing in the same page as tids[next_start_ptr] and
+ * fetch the offset number stored in each of them into a caller-allocated offset
+ * number array.
+ * ------------------------------------------------------------------------
+ */
+static BlockNumber
+tids_same_page_fetch_offnums(ItemPointer tids, int ntids,
+							 OffsetNumber *next_start_ptr, OffsetNumber *offnos)
+{
+	int				i;
+	BlockNumber		prev_blkno,
+					blkno;
+	OffsetNumber	offno;
+
+	prev_blkno = blkno = InvalidBlockNumber;
+
+	for (i = *next_start_ptr; i < ntids; i++)
+	{
+		ItemPointerData tid = tids[i];
+
+		blkno = ItemPointerGetBlockNumberNoCheck(&tid);
+		offno = ItemPointerGetOffsetNumberNoCheck(&tid);
+
+		if (i == *next_start_ptr || (prev_blkno == blkno))
+			offnos[i - *next_start_ptr] = offno;
+		else
+			break;
+
+		prev_blkno = blkno;
+	}
+
+	*next_start_ptr = i;
+	return prev_blkno;
+}
\ No newline at end of file
diff --git a/contrib/pg_surgery/pg_surgery--1.0.sql b/contrib/pg_surgery/pg_surgery--1.0.sql
new file mode 100644
index 0000000..2ae7f22
--- /dev/null
+++ b/contrib/pg_surgery/pg_surgery--1.0.sql
@@ -0,0 +1,18 @@
+/* contrib/pg_surgery/pg_surgery--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION pg_surgery" to load this file. \quit
+
+CREATE FUNCTION heap_force_kill(reloid regclass, tids tid[])
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'heap_force_kill'
+LANGUAGE C STRICT;
+
+REVOKE EXECUTE ON FUNCTION heap_force_kill(regclass, tid[]) FROM PUBLIC;
+
+CREATE FUNCTION heap_force_freeze(reloid regclass, tids tid[])
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'heap_force_freeze'
+LANGUAGE C STRICT;
+
+REVOKE EXECUTE ON FUNCTION heap_force_freeze(regclass, tid[]) FROM PUBLIC;
\ No newline at end of file
diff --git a/contrib/pg_surgery/pg_surgery.control b/contrib/pg_surgery/pg_surgery.control
new file mode 100644
index 0000000..2bcdad1
--- /dev/null
+++ b/contrib/pg_surgery/pg_surgery.control
@@ -0,0 +1,5 @@
+# pg_surgery extension
+comment = 'extension to perform surgery on a damaged relation'
+default_version = '1.0'
+module_pathname = '$libdir/pg_surgery'
+relocatable = true
diff --git a/contrib/pg_surgery/sql/pg_surgery.sql b/contrib/pg_surgery/sql/pg_surgery.sql
new file mode 100644
index 0000000..4aef380
--- /dev/null
+++ b/contrib/pg_surgery/sql/pg_surgery.sql
@@ -0,0 +1,89 @@
+create extension pg_surgery;
+
+--
+-- check that using heap_force_kill and heap_force_freeze functions with the
+-- supported relations passes.
+--
+
+-- normal heap table.
+begin;
+create table htab(a int);
+insert into htab values (100), (200), (300), (400), (500);
+
+select * from htab where xmin = 2;
+select heap_force_freeze('htab'::regclass, ARRAY['(0, 4)']::tid[]);
+select ctid, xmax from htab where xmin = 2;
+
+select heap_force_kill('htab'::regclass, ARRAY['(0, 4)']::tid[]);
+select * from htab where ctid = '(0, 4)';
+rollback;
+
+-- materialized view.
+begin;
+create materialized view mvw as select a from generate_series(1, 3) a;
+
+select * from mvw where xmin = 2;
+select heap_force_freeze('mvw'::regclass, ARRAY['(0, 3)']::tid[]);
+select * from mvw where xmin = 2;
+
+select heap_force_kill('mvw'::regclass, ARRAY['(0, 3)']::tid[]);
+select * from mvw where ctid = '(0, 3)';
+rollback;
+
+--
+-- check that using heap_force_kill and heap_force_freeze functions with the
+-- unsupported relations fails.
+--
+
+-- partitioned tables (the parent table) doesn't contain any tuple.
+create table ptab (a int) partition by list (a);
+
+select heap_force_kill('ptab'::regclass, ARRAY['(0, 1)']::tid[]);
+select heap_force_freeze('ptab'::regclass, ARRAY['(0, 1)']::tid[]);
+
+create index ptab_idx on ptab (a);
+
+-- indexes are not supported, should fail.
+select heap_force_kill('ptab_idx'::regclass, ARRAY['(0, 1)']::tid[]);
+select heap_force_freeze('ptab_idx'::regclass, ARRAY['(0, 1)']::tid[]);
+
+create view vw as select 1;
+
+-- views are not supported as well. so, all these should fail.
+select heap_force_kill('vw'::regclass, ARRAY['(0, 1)']::tid[]);
+select heap_force_freeze('vw'::regclass, ARRAY['(0, 1)']::tid[]);
+
+create sequence seq;
+
+-- sequences are not supported as well. so, all these functions should fail.
+select heap_force_kill('seq'::regclass, ARRAY['(0, 1)']::tid[]);
+select heap_force_freeze('seq'::regclass, ARRAY['(0, 1)']::tid[]);
+
+--
+-- Some negative test-cases with invalid inputs.
+--
+begin;
+create table htab(a int);
+
+-- invalid block number, should be skipped.
+select heap_force_kill('htab'::regclass, ARRAY['(0, 1)']::tid[]);
+select heap_force_freeze('htab'::regclass, ARRAY['(0, 1)']::tid[]);
+
+insert into htab values (10);
+
+-- invalid offset number, should be skipped.
+select heap_force_kill('htab'::regclass, ARRAY['(0, 2)']::tid[]);
+select heap_force_freeze('htab'::regclass, ARRAY['(0, 2)']::tid[]);
+
+-- dead tuple, should be skipped.
+select heap_force_kill('htab'::regclass, ARRAY['(0, 1)']::tid[]);
+select heap_force_kill('htab'::regclass, ARRAY['(0, 1)']::tid[]);
+select heap_force_freeze('htab'::regclass, ARRAY['(0, 1)']::tid[]);
+
+rollback;
+
+-- cleanup.
+drop table ptab;
+drop view vw;
+drop sequence seq;
+drop extension pg_surgery;
diff --git a/doc/src/sgml/contrib.sgml b/doc/src/sgml/contrib.sgml
index 261a559..c82dde2 100644
--- a/doc/src/sgml/contrib.sgml
+++ b/doc/src/sgml/contrib.sgml
@@ -125,6 +125,7 @@ CREATE EXTENSION <replaceable>module_name</replaceable>;
  &pgrowlocks;
  &pgstatstatements;
  &pgstattuple;
+ &pgsurgery;
  &pgtrgm;
  &pgvisibility;
  &postgres-fdw;
diff --git a/doc/src/sgml/filelist.sgml b/doc/src/sgml/filelist.sgml
index 64b5da0..828396d 100644
--- a/doc/src/sgml/filelist.sgml
+++ b/doc/src/sgml/filelist.sgml
@@ -139,6 +139,7 @@
 <!ENTITY pgstandby       SYSTEM "pgstandby.sgml">
 <!ENTITY pgstatstatements SYSTEM "pgstatstatements.sgml">
 <!ENTITY pgstattuple     SYSTEM "pgstattuple.sgml">
+<!ENTITY pgsurgery       SYSTEM "pgsurgery.sgml">
 <!ENTITY pgtrgm          SYSTEM "pgtrgm.sgml">
 <!ENTITY pgvisibility    SYSTEM "pgvisibility.sgml">
 <!ENTITY postgres-fdw    SYSTEM "postgres-fdw.sgml">
diff --git a/doc/src/sgml/pgsurgery.sgml b/doc/src/sgml/pgsurgery.sgml
new file mode 100644
index 0000000..8d2ed81
--- /dev/null
+++ b/doc/src/sgml/pgsurgery.sgml
@@ -0,0 +1,121 @@
+<!-- doc/src/sgml/pgsurgery.sgml -->
+
+<sect1 id="pgsurgery" xreflabel="pg_surgery">
+ <title>pg_surgery</title>
+
+ <indexterm zone="pgsurgery">
+  <primary>pg_surgery</primary>
+ </indexterm>
+
+ <para>
+  The <filename>pg_surgery</filename> module provides various functions to
+  perform surgery on a damaged relation. It is intended to be used only with a
+  damaged relation and must be operated only on damaged tuples of such
+  relations.
+ </para>
+
+ <para>
+  By default, only the superusers have <literal>EXECUTE</literal> privilege on
+  the functions provided by this module. Once the extension has been installed,
+  users may issue <command>GRANT</command> commands to change the privileges on
+  the functions to allow others to execute them.
+ </para>
+
+ <note>
+ <para>
+  After a surgery is performed on a damaged relation using this module, we must
+  run VACUUM with DISABLE_PAGE_SKIPPING option on that relation to bring it back
+  into a sane state.
+ </para>
+ </note>
+
+ <note>
+ <para>
+  While performing surgery on a damaged relation, we must not be doing anything
+  else on that relation in parallel. This is to ensure that when we are
+  operating on a damaged tuple there is no other transaction trying to modify
+  that tuple.
+ </para>
+ </note>
+
+ <sect2>
+  <title>Functions</title>
+
+  <variablelist>
+   <varlistentry>
+    <term>
+     <function>heap_force_kill(regclass, tid[]) returns void</function>
+    </term>
+
+    <listitem>
+     <para>
+      <function>heap_force_kill</function> marks a <quote>used</quote> line
+      pointer as <quote>dead</quote> without examining the tuple. This may help
+      users to force-kill a corrupted tuple in the table that may or may not be
+      accessible. It takes the relation oid and an array of item pointers as an
+      input arguments. For example:
+<programlisting>
+test=&gt; select * from t1 where ctid = '(0, 1)';
+ERROR:  could not access status of transaction 4007513275
+DETAIL:  Could not open file "pg_xact/0EED": No such file or directory.
+
+test=# select heap_force_kill('t1'::regclass, ARRAY['(0, 1)']::tid[]);
+-[ RECORD 1 ]---+-
+heap_force_kill | 
+
+test=# select * from t1 where ctid = '(0, 1)';
+(0 rows)
+</programlisting>
+    </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term>
+     <function>heap_force_freeze(regclass, tid[]) returns void</function>
+    </term>
+
+    <listitem>
+     <para>
+      <function>heap_force_freeze</function> marks a live tuple as frozen by
+      setting its xmin to FrozenTransactionId and xmax to InvalidTransactionId.
+      It clears all the visibility related bits in the infomask of the tuple and
+      sets the infomask flag to HEAP_XMIN_FROZEN and HEAP_XMAX_INVALID. This may
+      help users to force-freeze a tuple in the table that actually should have
+      been marked as frozen. Similar to <function>heap_force_kill</function>
+      function, this also takes the relation oid and an array of item pointers
+      as an input arguments. For example:
+<programlisting>
+test=&gt; vacuum t1;
+ERROR:  found xmin 507 from before relfrozenxid 515
+CONTEXT:  while scanning block 0 of relation "public.t1"
+
+test=# select ctid from t1 where xmin = 507;
+ ctid  
+-------
+ (0,3)
+(1 row)
+
+test=# select heap_force_freeze('t1'::regclass, ARRAY['(0, 3)']::tid[]);
+-[ RECORD 1 ]-----+-
+heap_force_freeze | 
+
+test=# vacuum t1;
+VACUUM
+</programlisting>
+     </para>
+    </listitem>
+   </varlistentry>
+
+  </variablelist>
+ </sect2>
+
+ <sect2>
+  <title>Authors</title>
+
+  <para>
+   Ashutosh Sharma <email>ashu.coe...@gmail.com</email>
+  </para>
+ </sect2>
+
+</sect1>
-- 
1.8.3.1

Reply via email to