Attached is the new version of patch that addresses the comments from Asim Praveen and Masahiko-san. It also improves the documentation to some extent.
On Tue, Aug 18, 2020 at 1:46 PM Ashutosh Sharma <ashu.coe...@gmail.com> wrote: > > Hello Masahiko-san, > > I've spent some more time trying to understand the code in > lazy_scan_heap function to know under what all circumstances a VACUUM > can fail with "found xmin ... before relfrozenxid ..." error for a > tuple whose xmin is behind relfrozenxid. Here are my observations: > > 1) It can fail with this error for a live tuple > > OR, > > 2) It can also fail with this error if a tuple (that went through > update) is marked as HEAP_HOT_UPDATED or HEAP_ONLY_TUPLE. > > OR, > > 3) If there are any concurrent transactions, then the tuple might be > marked as HEAPTUPLE_INSERT_IN_PROGRESS or HEAPTUPLE_DELETE_IN_PROGRESS > or HEAPTUPLE_RECENTLY_DEAD in which case also VACUUM can fail with > this error. > > Now, AFAIU, as we will be dealing with a damaged table, the chances of > point #3 being the cause of this error looks impossible in our case > because I don't think we will be doing anything in parallel when > performing surgery on a damaged table, in fact we shouldn't be doing > any such things. However, it is quite possible that reason #2 could > cause VACUUM to fail with this sort of error, but, as we are already > skipping redirected item pointers in heap_force_common(), I think, we > would never be marking HEAP_HOT_UPDATED tuple as frozen and I don't > see any problem in marking HEAP_ONLY_TUPLE as frozen. So, probably, we > may not need to handle point #2 as well. > > Further, I also don't see VACUUM reporting this error for a tuple that > has been moved from one partition to another. So, I think we might not > need to do any special handling for a tuple that got updated and its > new version was moved to another partition. > > If you feel I am missing something here, please correct me. Thank you. > > Moreover, while I was exploring on above, I noticed that in > lazy_scan_heap(), before we call HeapTupleSatisfiesVacuum() we check > for a redirected item pointers and if any redirected item pointer is > detected we do not call HeapTupleSatisfiesVacuum(). So, not sure how > HeapTupleSatisfiesVacuum would ever return a dead tuple that is marked > with HEAP_HOT_UPDATED. I am referring to the following code in > lazy_scan_heap(). > > for (offnum = FirstOffsetNumber; > offnum <= maxoff; > offnum = OffsetNumberNext(offnum)) > { > ItemId itemid; > > itemid = PageGetItemId(page, offnum); > > ............. > ............. > > > /* Redirect items mustn't be touched */ <-- this check > would bypass the redirected item pointers from being checked for > HeapTupleSatisfiesVacuum. > if (ItemIdIsRedirected(itemid)) > { > hastup = true; /* this page won't be truncatable */ > continue; > } > > .............. > .............. > > switch (HeapTupleSatisfiesVacuum(&tuple, OldestXmin, buf)) > { > case HEAPTUPLE_DEAD: > > if (HeapTupleIsHotUpdated(&tuple) || > HeapTupleIsHeapOnly(&tuple) || > params->index_cleanup == VACOPT_TERNARY_DISABLED) > nkeep += 1; > else > tupgone = true; /* we can delete the tuple */ > .............. > .............. > } > > > So, the point is, would HeapTupleIsHotUpdated(&tuple) ever be true? > > -- > With Regards, > Ashutosh Sharma > EnterpriseDB:http://www.enterprisedb.com > > On Mon, Aug 17, 2020 at 11:35 AM Ashutosh Sharma <ashu.coe...@gmail.com> > wrote: > > > > Hello Masahiko-san, > > > > Thanks for the review. Please check the comments inline below: > > > > On Fri, Aug 14, 2020 at 10:07 AM Masahiko Sawada > > <masahiko.saw...@2ndquadrant.com> wrote: > > > > > Thank you for updating the patch! Here are my comments on v5 patch: > > > > > > --- a/contrib/Makefile > > > +++ b/contrib/Makefile > > > @@ -35,6 +35,7 @@ SUBDIRS = \ > > > pg_standby \ > > > pg_stat_statements \ > > > pg_trgm \ > > > + pg_surgery \ > > > pgcrypto \ > > > > > > I guess we use alphabetical order here. So pg_surgery should be placed > > > before pg_trgm. > > > > > > > Okay, will take care of this in the next version of patch. > > > > > --- > > > + if (heap_force_opt == HEAP_FORCE_KILL) > > > + ItemIdSetDead(itemid); > > > > > > I think that if the page is an all-visible page, we should clear an > > > all-visible bit on the visibility map corresponding to the page and > > > PD_ALL_VISIBLE on the page header. Otherwise, index only scan would > > > return the wrong results. > > > > > > > I think we should let VACUUM do that. Please note that this module is > > intended to be used only on a damaged relation and should only be > > operated on damaged tuples of such relations. And the execution of any > > of the functions provided by this module on a damaged relation must be > > followed by VACUUM with DISABLE_PAGE_SKIPPING option on that relation. > > This is necessary to bring back a damaged relation to the sane state > > once a surgery is performed on it. I will try to add this note in the > > documentation for this module. > > > > > --- > > > + /* > > > + * We do not mark the buffer dirty or do WAL logging for unmodifed > > > + * pages. > > > + */ > > > + if (!did_modify_page) > > > + goto skip_wal; > > > + > > > + /* Mark buffer dirty before we write WAL. */ > > > + MarkBufferDirty(buf); > > > + > > > + /* XLOG stuff */ > > > + if (RelationNeedsWAL(rel)) > > > + log_newpage_buffer(buf, true); > > > + > > > +skip_wal: > > > + END_CRIT_SECTION(); > > > + > > > > > > s/unmodifed/unmodified/ > > > > > > > okay, will fix this typo. > > > > > Do we really need to use goto? I think we can modify it like follows: > > > > > > if (did_modity_page) > > > { > > > /* Mark buffer dirty before we write WAL. */ > > > MarkBufferDirty(buf); > > > > > > /* XLOG stuff */ > > > if (RelationNeedsWAL(rel)) > > > log_newpage_buffer(buf, true); > > > } > > > > > > END_CRIT_SECTION(); > > > > > > > No, we don't need it. We can achieve the same by checking the status > > of did_modify_page flag as you suggested. I will do this change in the > > next version. > > > > > --- > > > pg_force_freeze() can revival a tuple that is already deleted but not > > > vacuumed yet. Therefore, the user might need to reindex indexes after > > > using that function. For instance, with the following script, the last > > > two queries: index scan and seq scan, will return different results. > > > > > > set enable_seqscan to off; > > > set enable_bitmapscan to off; > > > set enable_indexonlyscan to off; > > > create table tbl (a int primary key); > > > insert into tbl values (1); > > > > > > update tbl set a = a + 100 where a = 1; > > > > > > explain analyze select * from tbl where a < 200; > > > > > > -- revive deleted tuple on heap > > > select heap_force_freeze('tbl', array['(0,1)'::tid]); > > > > > > -- index scan returns 2 tuples > > > explain analyze select * from tbl where a < 200; > > > > > > -- seq scan returns 1 tuple > > > set enable_seqscan to on; > > > explain analyze select * from tbl; > > > > > > > I am not sure if this is the right use-case of pg_force_freeze > > function. I think we should only be running pg_force_freeze function > > on a tuple for which VACUUM reports "found xmin ABC from before > > relfrozenxid PQR" sort of error otherwise it might worsen the things > > instead of making it better. Now, the question is - can VACUUM report > > this type of error for a deleted tuple or it would only report it for > > a live tuple? AFAIU this won't be reported for the deleted tuples > > because VACUUM wouldn't consider freezing a tuple that has been > > deleted. > > > > > Also, if a tuple updated and moved to another partition is revived by > > > heap_force_freeze(), its ctid still has special values: > > > MovedPartitionsOffsetNumber and MovedPartitionsBlockNumber. I don't > > > see a problem yet caused by a visible tuple having the special ctid > > > value, but it might be worth considering either to reset ctid value as > > > well or to not freezing already-deleted tuple. > > > > > > > For this as well, the answer remains the same as above. > > > > -- > > With Regards, > > Ashutosh Sharma > > EnterpriseDB:http://www.enterprisedb.com
From d765464116ee87ed64ea270ba75784954bc75377 Mon Sep 17 00:00:00 2001 From: ashu <ashutosh.sha...@enterprisedb.com> Date: Tue, 18 Aug 2020 16:14:15 +0530 Subject: [PATCH] Add contrib/pg_surgery to perform surgery on a damaged heap table. This contrib module basically adds a couple of functions named heap_force_kill and heap_force_freeze that can be used in the scripts to perform surgery on the damaged heap tables. Ashutosh Sharma. --- contrib/Makefile | 1 + contrib/pg_surgery/Makefile | 23 ++ contrib/pg_surgery/expected/pg_surgery.out | 161 +++++++++++++ contrib/pg_surgery/heap_surgery.c | 375 +++++++++++++++++++++++++++++ contrib/pg_surgery/pg_surgery--1.0.sql | 18 ++ contrib/pg_surgery/pg_surgery.control | 5 + contrib/pg_surgery/sql/pg_surgery.sql | 89 +++++++ doc/src/sgml/contrib.sgml | 1 + doc/src/sgml/filelist.sgml | 1 + doc/src/sgml/pgsurgery.sgml | 121 ++++++++++ 10 files changed, 795 insertions(+) create mode 100644 contrib/pg_surgery/Makefile create mode 100644 contrib/pg_surgery/expected/pg_surgery.out create mode 100644 contrib/pg_surgery/heap_surgery.c create mode 100644 contrib/pg_surgery/pg_surgery--1.0.sql create mode 100644 contrib/pg_surgery/pg_surgery.control create mode 100644 contrib/pg_surgery/sql/pg_surgery.sql create mode 100644 doc/src/sgml/pgsurgery.sgml diff --git a/contrib/Makefile b/contrib/Makefile index 1846d41..c8d2a16 100644 --- a/contrib/Makefile +++ b/contrib/Makefile @@ -34,6 +34,7 @@ SUBDIRS = \ pg_prewarm \ pg_standby \ pg_stat_statements \ + pg_surgery \ pg_trgm \ pgcrypto \ pgrowlocks \ diff --git a/contrib/pg_surgery/Makefile b/contrib/pg_surgery/Makefile new file mode 100644 index 0000000..ecf2e20 --- /dev/null +++ b/contrib/pg_surgery/Makefile @@ -0,0 +1,23 @@ +# contrib/pg_surgery/Makefile + +MODULE_big = pg_surgery +OBJS = \ + $(WIN32RES) \ + heap_surgery.o + +EXTENSION = pg_surgery +DATA = pg_surgery--1.0.sql +PGFILEDESC = "pg_surgery - perform surgery on a damaged relation" + +REGRESS = pg_surgery + +ifdef USE_PGXS +PG_CONFIG = pg_config +PGXS := $(shell $(PG_CONFIG) --pgxs) +include $(PGXS) +else +subdir = contrib/pg_surgery +top_builddir = ../.. +include $(top_builddir)/src/Makefile.global +include $(top_srcdir)/contrib/contrib-global.mk +endif diff --git a/contrib/pg_surgery/expected/pg_surgery.out b/contrib/pg_surgery/expected/pg_surgery.out new file mode 100644 index 0000000..9858de2 --- /dev/null +++ b/contrib/pg_surgery/expected/pg_surgery.out @@ -0,0 +1,161 @@ +create extension pg_surgery; +-- +-- check that using heap_force_kill and heap_force_freeze functions with the +-- supported relations passes. +-- +-- normal heap table. +begin; +create table htab(a int); +insert into htab values (100), (200), (300), (400), (500); +select * from htab where xmin = 2; + a +--- +(0 rows) + +select heap_force_freeze('htab'::regclass, ARRAY['(0, 4)']::tid[]); + heap_force_freeze +------------------- + +(1 row) + +select ctid, xmax from htab where xmin = 2; + ctid | xmax +-------+------ + (0,4) | 0 +(1 row) + +select heap_force_kill('htab'::regclass, ARRAY['(0, 4)']::tid[]); + heap_force_kill +----------------- + +(1 row) + +select * from htab where ctid = '(0, 4)'; + a +--- +(0 rows) + +rollback; +-- materialized view. +begin; +create materialized view mvw as select a from generate_series(1, 3) a; +select * from mvw where xmin = 2; + a +--- +(0 rows) + +select heap_force_freeze('mvw'::regclass, ARRAY['(0, 3)']::tid[]); + heap_force_freeze +------------------- + +(1 row) + +select * from mvw where xmin = 2; + a +--- + 3 +(1 row) + +select heap_force_kill('mvw'::regclass, ARRAY['(0, 3)']::tid[]); + heap_force_kill +----------------- + +(1 row) + +select * from mvw where ctid = '(0, 3)'; + a +--- +(0 rows) + +rollback; +-- +-- check that using heap_force_kill and heap_force_freeze functions with the +-- unsupported relations fails. +-- +-- partitioned tables (the parent table) doesn't contain any tuple. +create table ptab (a int) partition by list (a); +select heap_force_kill('ptab'::regclass, ARRAY['(0, 1)']::tid[]); +ERROR: only the relation using heap_tableam_handler is supported +select heap_force_freeze('ptab'::regclass, ARRAY['(0, 1)']::tid[]); +ERROR: only the relation using heap_tableam_handler is supported +create index ptab_idx on ptab (a); +-- indexes are not supported, should fail. +select heap_force_kill('ptab_idx'::regclass, ARRAY['(0, 1)']::tid[]); +ERROR: only the relation using heap_tableam_handler is supported +select heap_force_freeze('ptab_idx'::regclass, ARRAY['(0, 1)']::tid[]); +ERROR: only the relation using heap_tableam_handler is supported +create view vw as select 1; +-- views are not supported as well. so, all these should fail. +select heap_force_kill('vw'::regclass, ARRAY['(0, 1)']::tid[]); +ERROR: only the relation using heap_tableam_handler is supported +select heap_force_freeze('vw'::regclass, ARRAY['(0, 1)']::tid[]); +ERROR: only the relation using heap_tableam_handler is supported +create sequence seq; +-- sequences are not supported as well. so, all these functions should fail. +select heap_force_kill('seq'::regclass, ARRAY['(0, 1)']::tid[]); +ERROR: "seq" is not a table, materialized view, or TOAST table +select heap_force_freeze('seq'::regclass, ARRAY['(0, 1)']::tid[]); +ERROR: "seq" is not a table, materialized view, or TOAST table +-- +-- Some negative test-cases with invalid inputs. +-- +begin; +create table htab(a int); +-- invalid block number, should be skipped. +select heap_force_kill('htab'::regclass, ARRAY['(0, 1)']::tid[]); +NOTICE: skipping block 0 for relation "htab" because the block number is out of range + heap_force_kill +----------------- + +(1 row) + +select heap_force_freeze('htab'::regclass, ARRAY['(0, 1)']::tid[]); +NOTICE: skipping block 0 for relation "htab" because the block number is out of range + heap_force_freeze +------------------- + +(1 row) + +insert into htab values (10); +-- invalid offset number, should be skipped. +select heap_force_kill('htab'::regclass, ARRAY['(0, 2)']::tid[]); +NOTICE: skipping tid (0, 2) for relation "htab" because the item number is out of range for this block + heap_force_kill +----------------- + +(1 row) + +select heap_force_freeze('htab'::regclass, ARRAY['(0, 2)']::tid[]); +NOTICE: skipping tid (0, 2) for relation "htab" because the item number is out of range for this block + heap_force_freeze +------------------- + +(1 row) + +-- dead tuple, should be skipped. +select heap_force_kill('htab'::regclass, ARRAY['(0, 1)']::tid[]); + heap_force_kill +----------------- + +(1 row) + +select heap_force_kill('htab'::regclass, ARRAY['(0, 1)']::tid[]); +NOTICE: skipping tid (0, 1) for relation "htab" because it is marked dead + heap_force_kill +----------------- + +(1 row) + +select heap_force_freeze('htab'::regclass, ARRAY['(0, 1)']::tid[]); +NOTICE: skipping tid (0, 1) for relation "htab" because it is marked dead + heap_force_freeze +------------------- + +(1 row) + +rollback; +-- cleanup. +drop table ptab; +drop view vw; +drop sequence seq; +drop extension pg_surgery; diff --git a/contrib/pg_surgery/heap_surgery.c b/contrib/pg_surgery/heap_surgery.c new file mode 100644 index 0000000..68d8205 --- /dev/null +++ b/contrib/pg_surgery/heap_surgery.c @@ -0,0 +1,375 @@ +/*------------------------------------------------------------------------- + * + * heap_surgery.c + * Functions to perform surgery on the damaged heap table. + * + * Copyright (c) 2020, PostgreSQL Global Development Group + * + * IDENTIFICATION + * contrib/pg_surgery/heap_surgery.c + * + *------------------------------------------------------------------------- + */ +#include "postgres.h" + +#include "access/heapam.h" +#include "catalog/pg_am_d.h" +#include "catalog/pg_proc_d.h" +#include "miscadmin.h" +#include "storage/bufmgr.h" +#include "utils/acl.h" +#include "utils/rel.h" + +PG_MODULE_MAGIC; + +/* Options to forcefully change the state of a heap tuple. */ +typedef enum HeapTupleForceOption +{ + HEAP_FORCE_KILL, + HEAP_FORCE_FREEZE +} HeapTupleForceOption; + +PG_FUNCTION_INFO_V1(heap_force_kill); +PG_FUNCTION_INFO_V1(heap_force_freeze); + +static int32 tidcmp(const void *a, const void *b); +static Datum heap_force_common(FunctionCallInfo fcinfo, + HeapTupleForceOption heap_force_opt); +static void sanity_check_tid_array(ArrayType *ta, int *ntids); +static void sanity_check_relation(Relation rel); +static BlockNumber tids_same_page_fetch_offnums(ItemPointer tids, int ntids, + OffsetNumber *next_start_ptr, + OffsetNumber *offnos); + +/*------------------------------------------------------------------------- + * heap_force_kill() + * + * Force kill the tuple(s) pointed to by the item pointer(s) stored in the + * given tid array. + * + * Usage: SELECT heap_force_kill(regclass, tid[]); + *------------------------------------------------------------------------- + */ +Datum +heap_force_kill(PG_FUNCTION_ARGS) +{ + PG_RETURN_DATUM(heap_force_common(fcinfo, HEAP_FORCE_KILL)); +} + +/*------------------------------------------------------------------------- + * heap_force_freeze() + * + * Force freeze the tuple(s) pointed to by the item pointer(s) stored in the + * given tid array. + * + * Usage: SELECT heap_force_freeze(regclass, tid[]); + *------------------------------------------------------------------------- + */ +Datum +heap_force_freeze(PG_FUNCTION_ARGS) +{ + PG_RETURN_DATUM(heap_force_common(fcinfo, HEAP_FORCE_FREEZE)); +} + +/*------------------------------------------------------------------------- + * heap_force_common() + * + * Common code for heap_force_kill and heap_force_freeze + *------------------------------------------------------------------------- + */ +static Datum +heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt) +{ + Oid relid = PG_GETARG_OID(0); + ArrayType *ta = PG_GETARG_ARRAYTYPE_P_COPY(1); + ItemPointer tids; + int ntids, + nblocks; + Relation rel; + OffsetNumber *offnos; + OffsetNumber noffs, + curr_start_ptr, + next_start_ptr; + + if (RecoveryInProgress()) + ereport(ERROR, + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("recovery is in progress"), + errhint("heap surgery functions cannot be executed during recovery."))); + + /* Basic sanity checking. */ + sanity_check_tid_array(ta, &ntids); + + rel = relation_open(relid, RowExclusiveLock); + + sanity_check_relation(rel); + + tids = ((ItemPointer) ARR_DATA_PTR(ta)); + + /* + * If there is more than one tid in the array, sort it so that we can + * easily fetch all the tids belonging to one particular page from the + * array. + */ + if (ntids > 1) + qsort((void*) tids, ntids, sizeof(ItemPointerData), tidcmp); + + offnos = (OffsetNumber *) palloc(ntids * sizeof(OffsetNumber)); + noffs = curr_start_ptr = next_start_ptr = 0; + nblocks = RelationGetNumberOfBlocks(rel); + + do + { + Buffer buf; + Page page; + BlockNumber blkno; + OffsetNumber maxoffset; + int i; + bool did_modify_page = false; + + /* + * Get the offset numbers from the tids belonging to one particular page + * and process them one by one. + */ + blkno = tids_same_page_fetch_offnums(tids, ntids, &next_start_ptr, + offnos); + + /* Calculate the number of offsets stored in offnos array. */ + noffs = next_start_ptr - curr_start_ptr; + + /* + * Update the current start pointer so that next time when + * tids_same_page_fetch_offnums() is called, we can calculate the number + * of offsets present in the offnos array. + */ + curr_start_ptr = next_start_ptr; + + /* Check whether the block number is valid. */ + if (blkno >= nblocks) + { + ereport(NOTICE, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("skipping block %u for relation \"%s\" because the block number is out of range", + blkno, RelationGetRelationName(rel)))); + continue; + } + + CHECK_FOR_INTERRUPTS(); + + buf = ReadBuffer(rel, blkno); + LockBufferForCleanup(buf); + + page = BufferGetPage(buf); + + maxoffset = PageGetMaxOffsetNumber(page); + + /* No ereport(ERROR) from here until all the changes are logged. */ + START_CRIT_SECTION(); + + for (i = 0; i < noffs; i++) + { + OffsetNumber offno; + ItemId itemid; + + if (offnos[i] == 0 || offnos[i] > maxoffset) + { + ereport(NOTICE, + errmsg("skipping tid (%u, %u) for relation \"%s\" because the item number is out of range for this block", + blkno, offnos[i], RelationGetRelationName(rel))); + continue; + } + + itemid = PageGetItemId(page, offnos[i]); + + /* Follow any redirections until we find something useful. */ + while (ItemIdIsRedirected(itemid)) + { + offno = ItemIdGetRedirect(itemid); + itemid = PageGetItemId(page, offno); + CHECK_FOR_INTERRUPTS(); + } + + /* Nothing to do if the itemid is unused or already dead. */ + if (!ItemIdIsUsed(itemid) || ItemIdIsDead(itemid)) + { + if (!ItemIdIsUsed(itemid)) + ereport(NOTICE, + (errmsg("skipping tid (%u, %u) for relation \"%s\" because it is marked unused", + blkno, offnos[i], RelationGetRelationName(rel)))); + else + ereport(NOTICE, + (errmsg("skipping tid (%u, %u) for relation \"%s\" because it is marked dead", + blkno, offnos[i], RelationGetRelationName(rel)))); + continue; + } + + Assert(ItemIdIsNormal(itemid)); + + did_modify_page = true; + + if (heap_force_opt == HEAP_FORCE_KILL) + ItemIdSetDead(itemid); + else + { + HeapTupleHeader htup; + + Assert(heap_force_opt == HEAP_FORCE_FREEZE); + + htup = (HeapTupleHeader) PageGetItem(page, itemid); + + HeapTupleHeaderSetXmin(htup, FrozenTransactionId); + HeapTupleHeaderSetXmax(htup, InvalidTransactionId); + + /* We might have MOVED_OFF/MOVED_IN tuples in the database */ + if (htup->t_infomask & HEAP_MOVED) + { + if (htup->t_infomask & HEAP_MOVED_OFF) + HeapTupleHeaderSetXvac(htup, InvalidTransactionId); + else + HeapTupleHeaderSetXvac(htup, FrozenTransactionId); + } + + /* + * Clear all the visibility-related bits of this tuple and mark + * it as frozen. Also, get rid of HOT_UPDATED and KEYS_UPDATES + * bits. + */ + htup->t_infomask &= ~HEAP_XACT_MASK; + htup->t_infomask |= (HEAP_XMIN_FROZEN | HEAP_XMAX_INVALID); + htup->t_infomask2 &= ~HEAP_HOT_UPDATED; + htup->t_infomask2 &= ~HEAP_KEYS_UPDATED; + } + } + + /* + * If the page was modified, only then, we mark the buffer dirty or do + * the WAL logging. + */ + if (did_modify_page) + { + /* Mark buffer dirty before we write WAL. */ + MarkBufferDirty(buf); + + /* XLOG stuff */ + if (RelationNeedsWAL(rel)) + log_newpage_buffer(buf, true); + } + + END_CRIT_SECTION(); + + UnlockReleaseBuffer(buf); + } while (next_start_ptr != ntids); + + relation_close(rel, RowExclusiveLock); + + pfree(ta); + pfree(offnos); + + PG_RETURN_VOID(); +} + +/*------------------------------------------------------------------------- + * tidcmp() + * + * Compare two item pointers, return -1, 0, or +1. + * + * See ItemPointerCompare for details. + * ------------------------------------------------------------------------ + */ +static int32 +tidcmp(const void *a, const void *b) +{ + ItemPointer iptr1 = ((const ItemPointer) a); + ItemPointer iptr2 = ((const ItemPointer) b); + + return ItemPointerCompare(iptr1, iptr2); +} + +/*------------------------------------------------------------------------- + * sanity_check_tid_array() + * + * Perform sanity check on the given tid array. + * ------------------------------------------------------------------------ + */ +static void +sanity_check_tid_array(ArrayType *ta, int *ntids) +{ + if (ARR_HASNULL(ta) && array_contains_nulls(ta)) + ereport(ERROR, + (errcode(ERRCODE_NULL_VALUE_NOT_ALLOWED), + errmsg("array must not contain nulls"))); + + *ntids = ArrayGetNItems(ARR_NDIM(ta), ARR_DIMS(ta)); + + if (*ntids == 0) + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("empty tid array"))); +} + +/*------------------------------------------------------------------------- + * sanity_check_relation() + * + * Perform sanity check on the given relation. + * ------------------------------------------------------------------------ + */ +static void +sanity_check_relation(Relation rel) +{ + if (rel->rd_amhandler != HEAP_TABLE_AM_HANDLER_OID) + ereport(ERROR, + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("only the relation using heap_tableam_handler is supported"))); + + if (rel->rd_rel->relkind != RELKIND_RELATION && + rel->rd_rel->relkind != RELKIND_MATVIEW && + rel->rd_rel->relkind != RELKIND_TOASTVALUE) + ereport(ERROR, + (errcode(ERRCODE_WRONG_OBJECT_TYPE), + errmsg("\"%s\" is not a table, materialized view, or TOAST table", + RelationGetRelationName(rel)))); + + /* Must be owner of the table or superuser. */ + if (!pg_class_ownercheck(RelationGetRelid(rel), GetUserId())) + aclcheck_error(ACLCHECK_NOT_OWNER, + get_relkind_objtype(rel->rd_rel->relkind), + RelationGetRelationName(rel)); +} + +/*------------------------------------------------------------------------- + * tids_same_page_fetch_offnums() + * + * Find out all the tids residing in the same page as tids[next_start_ptr] and + * fetch the offset number stored in each of them into a caller-allocated offset + * number array. + * ------------------------------------------------------------------------ + */ +static BlockNumber +tids_same_page_fetch_offnums(ItemPointer tids, int ntids, + OffsetNumber *next_start_ptr, OffsetNumber *offnos) +{ + int i; + BlockNumber prev_blkno, + blkno; + OffsetNumber offno; + + prev_blkno = blkno = InvalidBlockNumber; + + for (i = *next_start_ptr; i < ntids; i++) + { + ItemPointerData tid = tids[i]; + + blkno = ItemPointerGetBlockNumberNoCheck(&tid); + offno = ItemPointerGetOffsetNumberNoCheck(&tid); + + if (i == *next_start_ptr || (prev_blkno == blkno)) + offnos[i - *next_start_ptr] = offno; + else + break; + + prev_blkno = blkno; + } + + *next_start_ptr = i; + return prev_blkno; +} \ No newline at end of file diff --git a/contrib/pg_surgery/pg_surgery--1.0.sql b/contrib/pg_surgery/pg_surgery--1.0.sql new file mode 100644 index 0000000..2ae7f22 --- /dev/null +++ b/contrib/pg_surgery/pg_surgery--1.0.sql @@ -0,0 +1,18 @@ +/* contrib/pg_surgery/pg_surgery--1.0.sql */ + +-- complain if script is sourced in psql, rather than via CREATE EXTENSION +\echo Use "CREATE EXTENSION pg_surgery" to load this file. \quit + +CREATE FUNCTION heap_force_kill(reloid regclass, tids tid[]) +RETURNS VOID +AS 'MODULE_PATHNAME', 'heap_force_kill' +LANGUAGE C STRICT; + +REVOKE EXECUTE ON FUNCTION heap_force_kill(regclass, tid[]) FROM PUBLIC; + +CREATE FUNCTION heap_force_freeze(reloid regclass, tids tid[]) +RETURNS VOID +AS 'MODULE_PATHNAME', 'heap_force_freeze' +LANGUAGE C STRICT; + +REVOKE EXECUTE ON FUNCTION heap_force_freeze(regclass, tid[]) FROM PUBLIC; \ No newline at end of file diff --git a/contrib/pg_surgery/pg_surgery.control b/contrib/pg_surgery/pg_surgery.control new file mode 100644 index 0000000..2bcdad1 --- /dev/null +++ b/contrib/pg_surgery/pg_surgery.control @@ -0,0 +1,5 @@ +# pg_surgery extension +comment = 'extension to perform surgery on a damaged relation' +default_version = '1.0' +module_pathname = '$libdir/pg_surgery' +relocatable = true diff --git a/contrib/pg_surgery/sql/pg_surgery.sql b/contrib/pg_surgery/sql/pg_surgery.sql new file mode 100644 index 0000000..4aef380 --- /dev/null +++ b/contrib/pg_surgery/sql/pg_surgery.sql @@ -0,0 +1,89 @@ +create extension pg_surgery; + +-- +-- check that using heap_force_kill and heap_force_freeze functions with the +-- supported relations passes. +-- + +-- normal heap table. +begin; +create table htab(a int); +insert into htab values (100), (200), (300), (400), (500); + +select * from htab where xmin = 2; +select heap_force_freeze('htab'::regclass, ARRAY['(0, 4)']::tid[]); +select ctid, xmax from htab where xmin = 2; + +select heap_force_kill('htab'::regclass, ARRAY['(0, 4)']::tid[]); +select * from htab where ctid = '(0, 4)'; +rollback; + +-- materialized view. +begin; +create materialized view mvw as select a from generate_series(1, 3) a; + +select * from mvw where xmin = 2; +select heap_force_freeze('mvw'::regclass, ARRAY['(0, 3)']::tid[]); +select * from mvw where xmin = 2; + +select heap_force_kill('mvw'::regclass, ARRAY['(0, 3)']::tid[]); +select * from mvw where ctid = '(0, 3)'; +rollback; + +-- +-- check that using heap_force_kill and heap_force_freeze functions with the +-- unsupported relations fails. +-- + +-- partitioned tables (the parent table) doesn't contain any tuple. +create table ptab (a int) partition by list (a); + +select heap_force_kill('ptab'::regclass, ARRAY['(0, 1)']::tid[]); +select heap_force_freeze('ptab'::regclass, ARRAY['(0, 1)']::tid[]); + +create index ptab_idx on ptab (a); + +-- indexes are not supported, should fail. +select heap_force_kill('ptab_idx'::regclass, ARRAY['(0, 1)']::tid[]); +select heap_force_freeze('ptab_idx'::regclass, ARRAY['(0, 1)']::tid[]); + +create view vw as select 1; + +-- views are not supported as well. so, all these should fail. +select heap_force_kill('vw'::regclass, ARRAY['(0, 1)']::tid[]); +select heap_force_freeze('vw'::regclass, ARRAY['(0, 1)']::tid[]); + +create sequence seq; + +-- sequences are not supported as well. so, all these functions should fail. +select heap_force_kill('seq'::regclass, ARRAY['(0, 1)']::tid[]); +select heap_force_freeze('seq'::regclass, ARRAY['(0, 1)']::tid[]); + +-- +-- Some negative test-cases with invalid inputs. +-- +begin; +create table htab(a int); + +-- invalid block number, should be skipped. +select heap_force_kill('htab'::regclass, ARRAY['(0, 1)']::tid[]); +select heap_force_freeze('htab'::regclass, ARRAY['(0, 1)']::tid[]); + +insert into htab values (10); + +-- invalid offset number, should be skipped. +select heap_force_kill('htab'::regclass, ARRAY['(0, 2)']::tid[]); +select heap_force_freeze('htab'::regclass, ARRAY['(0, 2)']::tid[]); + +-- dead tuple, should be skipped. +select heap_force_kill('htab'::regclass, ARRAY['(0, 1)']::tid[]); +select heap_force_kill('htab'::regclass, ARRAY['(0, 1)']::tid[]); +select heap_force_freeze('htab'::regclass, ARRAY['(0, 1)']::tid[]); + +rollback; + +-- cleanup. +drop table ptab; +drop view vw; +drop sequence seq; +drop extension pg_surgery; diff --git a/doc/src/sgml/contrib.sgml b/doc/src/sgml/contrib.sgml index 261a559..c82dde2 100644 --- a/doc/src/sgml/contrib.sgml +++ b/doc/src/sgml/contrib.sgml @@ -125,6 +125,7 @@ CREATE EXTENSION <replaceable>module_name</replaceable>; &pgrowlocks; &pgstatstatements; &pgstattuple; + &pgsurgery; &pgtrgm; &pgvisibility; &postgres-fdw; diff --git a/doc/src/sgml/filelist.sgml b/doc/src/sgml/filelist.sgml index 64b5da0..828396d 100644 --- a/doc/src/sgml/filelist.sgml +++ b/doc/src/sgml/filelist.sgml @@ -139,6 +139,7 @@ <!ENTITY pgstandby SYSTEM "pgstandby.sgml"> <!ENTITY pgstatstatements SYSTEM "pgstatstatements.sgml"> <!ENTITY pgstattuple SYSTEM "pgstattuple.sgml"> +<!ENTITY pgsurgery SYSTEM "pgsurgery.sgml"> <!ENTITY pgtrgm SYSTEM "pgtrgm.sgml"> <!ENTITY pgvisibility SYSTEM "pgvisibility.sgml"> <!ENTITY postgres-fdw SYSTEM "postgres-fdw.sgml"> diff --git a/doc/src/sgml/pgsurgery.sgml b/doc/src/sgml/pgsurgery.sgml new file mode 100644 index 0000000..8d2ed81 --- /dev/null +++ b/doc/src/sgml/pgsurgery.sgml @@ -0,0 +1,121 @@ +<!-- doc/src/sgml/pgsurgery.sgml --> + +<sect1 id="pgsurgery" xreflabel="pg_surgery"> + <title>pg_surgery</title> + + <indexterm zone="pgsurgery"> + <primary>pg_surgery</primary> + </indexterm> + + <para> + The <filename>pg_surgery</filename> module provides various functions to + perform surgery on a damaged relation. It is intended to be used only with a + damaged relation and must be operated only on damaged tuples of such + relations. + </para> + + <para> + By default, only the superusers have <literal>EXECUTE</literal> privilege on + the functions provided by this module. Once the extension has been installed, + users may issue <command>GRANT</command> commands to change the privileges on + the functions to allow others to execute them. + </para> + + <note> + <para> + After a surgery is performed on a damaged relation using this module, we must + run VACUUM with DISABLE_PAGE_SKIPPING option on that relation to bring it back + into a sane state. + </para> + </note> + + <note> + <para> + While performing surgery on a damaged relation, we must not be doing anything + else on that relation in parallel. This is to ensure that when we are + operating on a damaged tuple there is no other transaction trying to modify + that tuple. + </para> + </note> + + <sect2> + <title>Functions</title> + + <variablelist> + <varlistentry> + <term> + <function>heap_force_kill(regclass, tid[]) returns void</function> + </term> + + <listitem> + <para> + <function>heap_force_kill</function> marks a <quote>used</quote> line + pointer as <quote>dead</quote> without examining the tuple. This may help + users to force-kill a corrupted tuple in the table that may or may not be + accessible. It takes the relation oid and an array of item pointers as an + input arguments. For example: +<programlisting> +test=> select * from t1 where ctid = '(0, 1)'; +ERROR: could not access status of transaction 4007513275 +DETAIL: Could not open file "pg_xact/0EED": No such file or directory. + +test=# select heap_force_kill('t1'::regclass, ARRAY['(0, 1)']::tid[]); +-[ RECORD 1 ]---+- +heap_force_kill | + +test=# select * from t1 where ctid = '(0, 1)'; +(0 rows) +</programlisting> + </para> + </listitem> + </varlistentry> + + <varlistentry> + <term> + <function>heap_force_freeze(regclass, tid[]) returns void</function> + </term> + + <listitem> + <para> + <function>heap_force_freeze</function> marks a live tuple as frozen by + setting its xmin to FrozenTransactionId and xmax to InvalidTransactionId. + It clears all the visibility related bits in the infomask of the tuple and + sets the infomask flag to HEAP_XMIN_FROZEN and HEAP_XMAX_INVALID. This may + help users to force-freeze a tuple in the table that actually should have + been marked as frozen. Similar to <function>heap_force_kill</function> + function, this also takes the relation oid and an array of item pointers + as an input arguments. For example: +<programlisting> +test=> vacuum t1; +ERROR: found xmin 507 from before relfrozenxid 515 +CONTEXT: while scanning block 0 of relation "public.t1" + +test=# select ctid from t1 where xmin = 507; + ctid +------- + (0,3) +(1 row) + +test=# select heap_force_freeze('t1'::regclass, ARRAY['(0, 3)']::tid[]); +-[ RECORD 1 ]-----+- +heap_force_freeze | + +test=# vacuum t1; +VACUUM +</programlisting> + </para> + </listitem> + </varlistentry> + + </variablelist> + </sect2> + + <sect2> + <title>Authors</title> + + <para> + Ashutosh Sharma <email>ashu.coe...@gmail.com</email> + </para> + </sect2> + +</sect1> -- 1.8.3.1