On Mon, Feb 5, 2018 at 12:55 PM, Peter Geoghegan <p...@bowt.ie> wrote:
> Anyway, parallel CREATE INDEX added a new "scan" argument to
> IndexBuildHeapScan(), which caused this patch to bitrot. At a minimum,
> an additional NULL argument should be passed by amcheck. However, I
> have a better idea.
>
> ISTM that verify_nbtree.c should manage the heap scan itself, in the
> style of parallel CREATE INDEX. It can acquire its own MVCC snapshot
> for bt_index_check() (which pretends to be a CREATE INDEX
> CONCURRENTLY). There can be an MVCC snapshot acquired per index
> verified, a snapshot that is under the direct control of amcheck. The
> snapshot would be acquired at the start of verification on an index
> (when "heapallindexed = true"), before the verification of the index
> structure even begins, and released at the very end of verification.

Attached patch fixes the parallel index build bitrot in this way. This
is version 6 of the patch.

This approach resulted in a nice reduction in complexity:
bt_index_check() and bt_index_parent_check() heapallindexed
verification operations both work in exactly the same way now, except
that bt_index_check() imitates a CREATE INDEX CONCURRENTLY (to match
the heavyweight relation locks acquired). This doesn't really need to
be explained as a special case anymore; bt_index_parent_check() is
like an ordinary CREATE INDEX, without any additional "TransactionXmin
heap tuple xmin recheck" complication.

A further benefit is that this makes running bt_index_check() checks
against many indexes more thorough, and easier to reason about. Users
won't have to worry about TransactionXmin becoming very stale when
many indexes are verified within a single command.

I made the following additional, unrelated changes based on various feedback:

* Faster modulo operations.

Andrey Borodin suggested that I make k_hashes() do fewer modulo
operations. While I don't want to change the algorithm to make this
happen, the overhead has been reduced. Modulo operations are now
performed through bitwise AND operations, which is possible only
because the bitset size is always a power of two. Apparently this is a
fairly common optimization for Bloom filters that use (enhanced)
double-hashing; we might as well do it this way.
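For illustration, the trick can be sketched outside of bloomfilter.c like this (mod_pow2 is a hypothetical stand-in for mod_m(); names are mine, not from the patch):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for bloomfilter.c's mod_m(): when m is a power of
 * two, "val % m" reduces to "val & (m - 1)", avoiding an integer division.
 */
static inline uint32_t
mod_pow2(uint32_t val, uint64_t m)
{
	/* Only valid when m is a power of two (no greater than 2^32) */
	assert(m != 0 && ((m - 1) & m) == 0);
	return val & (uint32_t) (m - 1);
}
```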

I've really just transcribed the enhanced double hashing pseudo-code
from the Georgia Tech/Dillinger & Manolios paper into C code, so no
real change there; bloomfilter.c's k_hashes() is still closely based
on "5.2 Enhanced Double Hashing" from that same paper. Experience
suggests that we ought to be very conservative about developing novel
hashing techniques. Paranoid, even.
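To make the transcription concrete, here is a standalone sketch of that recurrence (helper name is mine, not from the patch); it mirrors what k_hashes() does once x and y have been derived from the element's 64-bit hash:

```c
#include <stdint.h>

/*
 * Standalone sketch of "5.2 Enhanced Double Hashing" (Dillinger & Manolios).
 * x and y are two independent 32-bit hashes of the element; m is the
 * power-of-two bitset size; out receives k bit positions.
 */
static void
enhanced_double_hash(uint32_t x, uint32_t y, uint64_t m, uint32_t *out, int k)
{
	uint32_t	mask = (uint32_t) (m - 1);
	int			i;

	x &= mask;
	y &= mask;
	out[0] = x;
	for (i = 1; i < k; i++)
	{
		x = (x + y) & mask;		/* x_i = x_{i-1} + y_{i-1} (mod m) */
		y = (y + i) & mask;		/* y_i = y_{i-1} + i (mod m) */
		out[i] = x;
	}
}
```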

* New reference to the modulo bias effect.

Michael Paquier wondered why the Bloom filter was always a
power-of-two, which this addresses. (Of course, the "modulo bitwise
AND" optimization I just mentioned is another reason to limit
ourselves to power-of-two bitset sizes, albeit a new one.)
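The bias itself is easy to demonstrate with small numbers: reducing a uniform 8-bit value mod 100 over-represents the low residues, while a power-of-two modulus keeps the mapping perfectly uniform. A throwaway illustration (not from the patch):

```c
/*
 * Count how often a given residue appears when every 8-bit value is reduced
 * mod "modulus".  With modulus = 100, residues below 256 % 100 = 56 appear
 * one extra time -- the modulo bias effect.  With a power-of-two modulus,
 * every residue appears equally often.
 */
static int
residue_count(int residue, int modulus)
{
	int			count = 0;
	int			v;

	for (v = 0; v < 256; v++)
		if (v % modulus == residue)
			count++;
	return count;
}
```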

* Removed sdbmhash().

Michael also wanted to know more about sdbmhash(), due to some general
concern about its quality. I realized that it is best to avoid adding
a new general-purpose hash function, whose sole purpose is to be
different to hash_any(), when I could instead use
hash_uint32_extended() to get two 32-bit values all at once. Robert
suggested this approach at one point, actually, but for some reason I
didn't follow up until now.
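The point of the switch is that one wide hash call can stand in for two narrow ones. A sketch (hypothetical helper name, not from the patch):

```c
#include <stdint.h>

/*
 * Hypothetical helper showing the idea behind using
 * hash_any_extended()/hash_uint32_extended(): one 64-bit hash yields two
 * independent 32-bit hashes, so no second hash function is needed.
 */
static void
split_hash64(uint64_t hash, uint32_t *x, uint32_t *y)
{
	*x = (uint32_t) hash;			/* low 32 bits */
	*y = (uint32_t) (hash >> 32);	/* high 32 bits */
}
```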

-- 
Peter Geoghegan
From 2ff9dcace49ea590762701717235d87e13b03c6b Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <p...@bowt.ie>
Date: Thu, 24 Aug 2017 20:58:21 -0700
Subject: [PATCH 1/2] Add Bloom filter data structure implementation.

A Bloom filter is a space-efficient, probabilistic data structure that
can be used to test set membership.  Callers will sometimes incur false
positives, but never false negatives.  The rate of false positives is a
function of the total number of elements and the amount of memory
available for the Bloom filter.

Two classic applications of Bloom filters are cache filtering and data
synchronization testing.  Any user of Bloom filters must accept the
possibility of false positives as a cost worth paying for the benefit in
space efficiency.

This commit adds a test harness extension module, test_bloomfilter.  It
can be used to get a sense of how the Bloom filter implementation
performs under varying conditions.
---
 src/backend/lib/Makefile                           |   4 +-
 src/backend/lib/README                             |   2 +
 src/backend/lib/bloomfilter.c                      | 303 +++++++++++++++++++++
 src/include/lib/bloomfilter.h                      |  27 ++
 src/test/modules/Makefile                          |   1 +
 src/test/modules/test_bloomfilter/.gitignore       |   4 +
 src/test/modules/test_bloomfilter/Makefile         |  21 ++
 src/test/modules/test_bloomfilter/README           |  71 +++++
 .../test_bloomfilter/expected/test_bloomfilter.out |  25 ++
 .../test_bloomfilter/sql/test_bloomfilter.sql      |  22 ++
 .../test_bloomfilter/test_bloomfilter--1.0.sql     |  10 +
 .../modules/test_bloomfilter/test_bloomfilter.c    | 138 ++++++++++
 .../test_bloomfilter/test_bloomfilter.control      |   4 +
 src/tools/pgindent/typedefs.list                   |   1 +
 14 files changed, 631 insertions(+), 2 deletions(-)
 create mode 100644 src/backend/lib/bloomfilter.c
 create mode 100644 src/include/lib/bloomfilter.h
 create mode 100644 src/test/modules/test_bloomfilter/.gitignore
 create mode 100644 src/test/modules/test_bloomfilter/Makefile
 create mode 100644 src/test/modules/test_bloomfilter/README
 create mode 100644 src/test/modules/test_bloomfilter/expected/test_bloomfilter.out
 create mode 100644 src/test/modules/test_bloomfilter/sql/test_bloomfilter.sql
 create mode 100644 src/test/modules/test_bloomfilter/test_bloomfilter--1.0.sql
 create mode 100644 src/test/modules/test_bloomfilter/test_bloomfilter.c
 create mode 100644 src/test/modules/test_bloomfilter/test_bloomfilter.control

diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile
index d1fefe4..191ea9b 100644
--- a/src/backend/lib/Makefile
+++ b/src/backend/lib/Makefile
@@ -12,7 +12,7 @@ subdir = src/backend/lib
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = binaryheap.o bipartite_match.o dshash.o hyperloglog.o ilist.o \
-	   knapsack.o pairingheap.o rbtree.o stringinfo.o
+OBJS = binaryheap.o bipartite_match.o bloomfilter.o dshash.o hyperloglog.o \
+       ilist.o knapsack.o pairingheap.o rbtree.o stringinfo.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/lib/README b/src/backend/lib/README
index 5e5ba5e..376ae27 100644
--- a/src/backend/lib/README
+++ b/src/backend/lib/README
@@ -3,6 +3,8 @@ in the backend:
 
 binaryheap.c - a binary heap
 
+bloomfilter.c - probabilistic, space-efficient set membership testing
+
 hyperloglog.c - a streaming cardinality estimator
 
 pairingheap.c - a pairing heap
diff --git a/src/backend/lib/bloomfilter.c b/src/backend/lib/bloomfilter.c
new file mode 100644
index 0000000..a4ca18d
--- /dev/null
+++ b/src/backend/lib/bloomfilter.c
@@ -0,0 +1,303 @@
+/*-------------------------------------------------------------------------
+ *
+ * bloomfilter.c
+ *		Minimal Bloom filter
+ *
+ * A Bloom filter is a probabilistic data structure that is used to test an
+ * element's membership of a set.  False positives are possible, but false
+ * negatives are not; a test of membership of the set returns either "possibly
+ * in set" or "definitely not in set".  This can be very space efficient when
+ * individual elements are larger than a few bytes, because elements are hashed
+ * in order to set bits in the Bloom filter bitset.
+ *
+ * Elements can be added to the set, but not removed.  The more elements that
+ * are added, the larger the probability of false positives.  Caller must hint
+ * an estimated total size of the set when its Bloom filter is initialized.
+ * This is used to balance the use of memory against the final false positive
+ * rate.
+ *
+ * Copyright (c) 2018, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/lib/bloomfilter.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <math.h>
+
+#include "access/hash.h"
+#include "lib/bloomfilter.h"
+
+#define MAX_HASH_FUNCS		10
+
+struct bloom_filter
+{
+	/* K hash functions are used, seeded by caller's seed */
+	int			k_hash_funcs;
+	uint64		seed;
+	/* m is bitset size, in bits.  Must be a power-of-two <= 2^32.  */
+	uint64		m;
+	unsigned char bitset[FLEXIBLE_ARRAY_MEMBER];
+};
+
+static int	my_bloom_power(uint64 target_bitset_bits);
+static int	optimal_k(uint64 bitset_bits, int64 total_elems);
+static void k_hashes(bloom_filter *filter, uint32 *hashes, unsigned char *elem,
+		 size_t len);
+static inline uint32 mod_m(uint32 a, uint64 m);
+
+/*
+ * Create Bloom filter in caller's memory context.  This should get a false
+ * positive rate of between 1% and 2% when bitset is not constrained by memory.
+ *
+ * total_elems is an estimate of the final size of the set.  It ought to be
+ * approximately correct, but we can cope well with it being off by perhaps a
+ * factor of five or more.  See "Bloom Filters in Probabilistic Verification"
+ * (Dillinger & Manolios, 2004) for details of why this is the case.
+ *
+ * bloom_work_mem is sized in KB, in line with the general work_mem convention.
+ *
+ * The Bloom filter behaves non-deterministically when caller passes a random
+ * seed value.  This ensures that the same false positives will not occur from
+ * one run to the next, which is useful to some callers.
+ *
+ * Notes on appropriate use:
+ *
+ * To keep the implementation simple and predictable, the underlying bitset is
+ * always sized as a power-of-two number of bits, and the largest possible
+ * bitset is 512MB.  The implementation rounds down as needed.
+ *
+ * The implementation is well suited to data synchronization problems between
+ * unordered sets, where predictable performance is more important than worst
+ * case guarantees around false positives.  Another problem that the
+ * implementation is well suited for is cache filtering where good performance
+ * already relies upon having a relatively small and/or low cardinality set of
+ * things that are interesting (with perhaps many more uninteresting things
+ * that never populate the filter).
+ */
+bloom_filter *
+bloom_create(int64 total_elems, int bloom_work_mem, uint32 seed)
+{
+	bloom_filter *filter;
+	int			bloom_power;
+	uint64		bitset_bytes;
+	uint64		bitset_bits;
+
+	/*
+	 * Aim for two bytes per element; this is sufficient to get a false
+	 * positive rate below 1%, independent of the size of the bitset or total
+	 * number of elements.  Also, if rounding down the size of the bitset to
+	 * the next lowest power of two turns out to be a significant drop, the
+	 * false positive rate still won't exceed 2% in almost all cases.
+	 */
+	bitset_bytes = Min(bloom_work_mem * 1024L, total_elems * 2);
+	/* Minimum allowable size is 1MB */
+	bitset_bytes = Max(1024L * 1024L, bitset_bytes);
+
+	/* Size in bits should be the highest power of two within budget */
+	bloom_power = my_bloom_power(bitset_bytes * BITS_PER_BYTE);
+	/* Use uint64 to size bitset, since PG_UINT32_MAX is 2^32 - 1, not 2^32 */
+	bitset_bits = UINT64CONST(1) << bloom_power;
+	bitset_bytes = bitset_bits / BITS_PER_BYTE;
+
+	/* Allocate bloom filter as all-zeroes */
+	filter = palloc0(offsetof(bloom_filter, bitset) +
+					 sizeof(unsigned char) * bitset_bytes);
+	filter->k_hash_funcs = optimal_k(bitset_bits, total_elems);
+
+	/*
+	 * Caller will probably use signed 32-bit pseudo-random number, so hash
+	 * caller's value to get 64-bit seed value
+	 */
+	filter->seed = DatumGetUInt64(hash_uint32_extended(seed, 0));
+	filter->m = bitset_bits;
+
+	return filter;
+}
+
+/*
+ * Free Bloom filter
+ */
+void
+bloom_free(bloom_filter *filter)
+{
+	pfree(filter);
+}
+
+/*
+ * Add element to Bloom filter
+ */
+void
+bloom_add_element(bloom_filter *filter, unsigned char *elem, size_t len)
+{
+	uint32		hashes[MAX_HASH_FUNCS];
+	int			i;
+
+	k_hashes(filter, hashes, elem, len);
+
+	/* Map a bit-wise address to a byte-wise address + bit offset */
+	for (i = 0; i < filter->k_hash_funcs; i++)
+	{
+		filter->bitset[hashes[i] >> 3] |= 1 << (hashes[i] & 7);
+	}
+}
+
+/*
+ * Test if Bloom filter definitely lacks element.
+ *
+ * Returns true if the element is definitely not in the set of elements
+ * observed by bloom_add_element().  Otherwise, returns false, indicating that
+ * element is probably present in set.
+ */
+bool
+bloom_lacks_element(bloom_filter *filter, unsigned char *elem, size_t len)
+{
+	uint32		hashes[MAX_HASH_FUNCS];
+	int			i;
+
+	k_hashes(filter, hashes, elem, len);
+
+	/* Map a bit-wise address to a byte-wise address + bit offset */
+	for (i = 0; i < filter->k_hash_funcs; i++)
+	{
+		if (!(filter->bitset[hashes[i] >> 3] & (1 << (hashes[i] & 7))))
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * What proportion of bits are currently set?
+ *
+ * Returns proportion, expressed as a multiplier of filter size.  That should
+ * generally be close to 0.5, even when we have more than enough memory to
+ * ensure a false positive rate within target 1% to 2% band, since more hash
+ * functions are used as more memory is available per element.
+ *
+ * This is the only instrumentation that is low overhead enough to appear in
+ * debug traces.  When debugging Bloom filter code, it's likely to be far more
+ * interesting to directly test the false positive rate.
+ */
+double
+bloom_prop_bits_set(bloom_filter *filter)
+{
+	int			bitset_bytes = filter->m / BITS_PER_BYTE;
+	uint64		bits_set = 0;
+	int			i;
+
+	for (i = 0; i < bitset_bytes; i++)
+	{
+		unsigned char byte = filter->bitset[i];
+
+		while (byte)
+		{
+			bits_set++;
+			byte &= (byte - 1);
+		}
+	}
+
+	return bits_set / (double) filter->m;
+}
+
+/*
+ * Which element in the sequence of powers-of-two is less than or equal to
+ * target_bitset_bits?
+ *
+ * Value returned here must be generally safe as the basis for actual bitset
+ * size.
+ *
+ * Bitset is never allowed to exceed 2 ^ 32 bits (512MB).  This is sufficient
+ * for the needs of all current callers, and allows us to use 32-bit hash
+ * functions.  It also makes it easy to stay under the MaxAllocSize restriction
+ * (caller needs to leave room for non-bitset fields that appear before
+ * flexible array member, so a 1GB bitset would use an allocation that just
+ * exceeds MaxAllocSize).
+ */
+static int
+my_bloom_power(uint64 target_bitset_bits)
+{
+	int			bloom_power = -1;
+
+	while (target_bitset_bits > 0 && bloom_power < 32)
+	{
+		bloom_power++;
+		target_bitset_bits >>= 1;
+	}
+
+	return bloom_power;
+}
+
+/*
+ * Determine optimal number of hash functions based on size of filter in bits,
+ * and projected total number of elements.  The optimal number is the number
+ * that minimizes the false positive rate.
+ */
+static int
+optimal_k(uint64 bitset_bits, int64 total_elems)
+{
+	int			k = round(log(2.0) * bitset_bits / total_elems);
+
+	return Max(1, Min(k, MAX_HASH_FUNCS));
+}
+
+/*
+ * Generate k hash values for element.
+ *
+ * Caller passes array, which is filled-in with k values determined by hashing
+ * caller's element.
+ *
+ * Only 2 real independent hash functions are actually used to support an
+ * interface of up to MAX_HASH_FUNCS hash functions; enhanced double hashing is
+ * used to make this work.  The main reason we prefer enhanced double hashing
+ * to classic double hashing is that the latter has an issue with collisions
+ * when using power-of-two sized bitsets.  See Dillinger & Manolios for full
+ * details.
+ */
+static void
+k_hashes(bloom_filter *filter, uint32 *hashes, unsigned char *elem, size_t len)
+{
+	uint64		hash;
+	uint32		x, y;
+	uint64		m;
+	int			i;
+
+	/* Use 64-bit hashing to get two independent 32-bit hashes */
+	hash = DatumGetUInt64(hash_any_extended(elem, len, filter->seed));
+	x = (uint32) hash;
+	y = (uint32) (hash >> 32);
+	m = filter->m;
+
+	x = mod_m(x, m);
+	y = mod_m(y, m);
+
+	/* Accumulate hashes */
+	hashes[0] = x;
+	for (i = 1; i < filter->k_hash_funcs; i++)
+	{
+		x = mod_m(x + y, m);
+		y = mod_m(y + i, m);
+
+		hashes[i] = x;
+	}
+}
+
+/*
+ * Calculate "val MOD m" inexpensively.
+ *
+ * Assumes that m (which is bitset size) is a power-of-two.
+ *
+ * Using a power-of-two number of bits for bitset size allows us to use bitwise
+ * AND operations to calculate the modulo of a hash value.  It's also a simple
+ * way of avoiding the modulo bias effect.
+ */
+static inline uint32
+mod_m(uint32 val, uint64 m)
+{
+	Assert(m <= PG_UINT32_MAX + UINT64CONST(1));
+	Assert(((m - 1) & m) == 0);
+
+	return val & (m - 1);
+}
diff --git a/src/include/lib/bloomfilter.h b/src/include/lib/bloomfilter.h
new file mode 100644
index 0000000..5bc99c3
--- /dev/null
+++ b/src/include/lib/bloomfilter.h
@@ -0,0 +1,27 @@
+/*-------------------------------------------------------------------------
+ *
+ * bloomfilter.h
+ *	  Minimal Bloom filter
+ *
+ * Copyright (c) 2018, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *    src/include/lib/bloomfilter.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef _BLOOMFILTER_H_
+#define _BLOOMFILTER_H_
+
+typedef struct bloom_filter bloom_filter;
+
+extern bloom_filter *bloom_create(int64 total_elems, int bloom_work_mem,
+			 uint32 seed);
+extern void bloom_free(bloom_filter *filter);
+extern void bloom_add_element(bloom_filter *filter, unsigned char *elem,
+				  size_t len);
+extern bool bloom_lacks_element(bloom_filter *filter, unsigned char *elem,
+					size_t len);
+extern double bloom_prop_bits_set(bloom_filter *filter);
+
+#endif
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index b7ed0af..fb3aae1 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -9,6 +9,7 @@ SUBDIRS = \
 		  commit_ts \
 		  dummy_seclabel \
 		  snapshot_too_old \
+		  test_bloomfilter \
 		  test_ddl_deparse \
 		  test_extensions \
 		  test_parser \
diff --git a/src/test/modules/test_bloomfilter/.gitignore b/src/test/modules/test_bloomfilter/.gitignore
new file mode 100644
index 0000000..5dcb3ff
--- /dev/null
+++ b/src/test/modules/test_bloomfilter/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_bloomfilter/Makefile b/src/test/modules/test_bloomfilter/Makefile
new file mode 100644
index 0000000..808c931
--- /dev/null
+++ b/src/test/modules/test_bloomfilter/Makefile
@@ -0,0 +1,21 @@
+# src/test/modules/test_bloomfilter/Makefile
+
+MODULE_big = test_bloomfilter
+OBJS = test_bloomfilter.o $(WIN32RES)
+PGFILEDESC = "test_bloomfilter - test code for Bloom filter library"
+
+EXTENSION = test_bloomfilter
+DATA = test_bloomfilter--1.0.sql
+
+REGRESS = test_bloomfilter
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_bloomfilter
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_bloomfilter/README b/src/test/modules/test_bloomfilter/README
new file mode 100644
index 0000000..e54ed13
--- /dev/null
+++ b/src/test/modules/test_bloomfilter/README
@@ -0,0 +1,71 @@
+test_bloomfilter overview
+=========================
+
+test_bloomfilter is a test harness module for testing Bloom filter library set
+membership operations.  It consists of a single SQL-callable function,
+test_bloomfilter(), and regression tests.  Membership tests are performed using
+an artificial dataset that is programmatically generated.
+
+The test_bloomfilter() function displays instrumentation at DEBUG1 elog level
+(WARNING when the false positive rate exceeds a 1% threshold).  This can be
+used to get a sense of the performance characteristics of the Postgres Bloom
+filter implementation under varied conditions.
+
+Bitset size
+-----------
+
+The main bloomfilter.c criteria for sizing its bitset is that the false
+positive rate should not exceed 2% when sufficient bloom_work_mem is available
+(and the caller-supplied estimate of the number of elements turns out to have
+been accurate).  A 2% rate is currently assumed to be good enough for all Bloom
+filter callers.
+
+The traditional guarantee Bloom filters offer is that with an optimal K, there
+will be only a 1% false positive rate with just 9.6 bits of memory per element.
+The 2% worst case guarantee exists because there is a need for some slop, to
+account for implementation inflexibility in bitset sizing.  The bitset is kept
+to a power-of-two number of bits in size, so callers may have their
+bloom_work_mem argument truncated down by almost half -- when that happens, the
+guarantee needs to hold up.  In practice callers that always pass a
+bloom_work_mem that is aligned with a power-of-two bitset size will actually
+get the "9.6 bits per element" 1% false positive rate.  (Under-promising in
+this manner is a fudge that allows the contract to be kept simple.)
+
+Strategy
+--------
+
+Our approach to regression testing is to test that bloomfilter.c has only a 1%
+false positive rate for a single bitset size (2 ^ 23, or 1MB).  We test a
+dataset with 838,861 elements, which works out at 10 bits of memory per
+element.  We round up from 9.6 bits to 10 bits to make sure that we reliably
+get under 1% for regression testing.  Note that a random seed is used in the
+regression tests, because the exact false positive rate is inconsistent across
+platforms, which makes non-deterministic hashing something that the regression
+tests need to be tolerant of anyway.
+
+SQL-callable function
+=====================
+
+The SQL-callable function test_bloomfilter() provides the following arguments:
+
+* "power" is the power-of-two used to size the Bloom filter's bitset.
+
+The minimum valid argument value is 23 (2^23 bits), or 1MB of memory.  The
+maximum valid argument value is 32, or 512MB of memory.  These restrictions
+reflect restrictions in bloomfilter.c itself.
+
+* "nelements" is the number of elements to generate for testing purposes.
+
+Adjust argument value to observe changes in the false positive rate for a given
+Bloom filter bitset size.
+
+* "seed" is a seed value for hashing.
+
+A value < 0 is interpreted as "use random seed".  Varying the seed value (or
+specifying -1) should result in small variations in the total number of false
+positives.
+
+* "tests" is the number of tests to run.
+
+This may be increased when it's useful to perform many tests without the
+overhead of setting up and tearing down a pg_regress database each time.
diff --git a/src/test/modules/test_bloomfilter/expected/test_bloomfilter.out b/src/test/modules/test_bloomfilter/expected/test_bloomfilter.out
new file mode 100644
index 0000000..4d60eca
--- /dev/null
+++ b/src/test/modules/test_bloomfilter/expected/test_bloomfilter.out
@@ -0,0 +1,25 @@
+CREATE EXTENSION test_bloomfilter;
+--
+-- These tests don't produce any interesting output, unless they fail.  For an
+-- explanation of the arguments, and the values used here, see README.
+--
+SELECT test_bloomfilter(power => 23,
+    nelements => 838861,
+    seed => -1,
+    tests => 1);
+ test_bloomfilter 
+------------------
+ 
+(1 row)
+
+-- Equivalent "10 bits per element" tests for all possible bitset sizes:
+--
+-- SELECT test_bloomfilter(24, 1677722)
+-- SELECT test_bloomfilter(25, 3355443)
+-- SELECT test_bloomfilter(26, 6710886)
+-- SELECT test_bloomfilter(27, 13421773)
+-- SELECT test_bloomfilter(28, 26843546)
+-- SELECT test_bloomfilter(29, 53687091)
+-- SELECT test_bloomfilter(30, 107374182)
+-- SELECT test_bloomfilter(31, 214748365)
+-- SELECT test_bloomfilter(32, 429496730)
diff --git a/src/test/modules/test_bloomfilter/sql/test_bloomfilter.sql b/src/test/modules/test_bloomfilter/sql/test_bloomfilter.sql
new file mode 100644
index 0000000..cc9d19e
--- /dev/null
+++ b/src/test/modules/test_bloomfilter/sql/test_bloomfilter.sql
@@ -0,0 +1,22 @@
+CREATE EXTENSION test_bloomfilter;
+
+--
+-- These tests don't produce any interesting output, unless they fail.  For an
+-- explanation of the arguments, and the values used here, see README.
+--
+SELECT test_bloomfilter(power => 23,
+    nelements => 838861,
+    seed => -1,
+    tests => 1);
+
+-- Equivalent "10 bits per element" tests for all possible bitset sizes:
+--
+-- SELECT test_bloomfilter(24, 1677722)
+-- SELECT test_bloomfilter(25, 3355443)
+-- SELECT test_bloomfilter(26, 6710886)
+-- SELECT test_bloomfilter(27, 13421773)
+-- SELECT test_bloomfilter(28, 26843546)
+-- SELECT test_bloomfilter(29, 53687091)
+-- SELECT test_bloomfilter(30, 107374182)
+-- SELECT test_bloomfilter(31, 214748365)
+-- SELECT test_bloomfilter(32, 429496730)
diff --git a/src/test/modules/test_bloomfilter/test_bloomfilter--1.0.sql b/src/test/modules/test_bloomfilter/test_bloomfilter--1.0.sql
new file mode 100644
index 0000000..bf1f1cd
--- /dev/null
+++ b/src/test/modules/test_bloomfilter/test_bloomfilter--1.0.sql
@@ -0,0 +1,10 @@
+/* src/test/modules/test_bloomfilter/test_bloomfilter--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_bloomfilter" to load this file. \quit
+
+-- See README for an explanation of each argument
+CREATE FUNCTION test_bloomfilter(power integer, nelements bigint,
+    seed integer DEFAULT -1, tests integer DEFAULT 1)
+	RETURNS pg_catalog.void STRICT
+	AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_bloomfilter/test_bloomfilter.c b/src/test/modules/test_bloomfilter/test_bloomfilter.c
new file mode 100644
index 0000000..74afd36
--- /dev/null
+++ b/src/test/modules/test_bloomfilter/test_bloomfilter.c
@@ -0,0 +1,138 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_bloomfilter.c
+ *		Test false positive rate of Bloom filter against test dataset.
+ *
+ * Copyright (c) 2018, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *		src/test/modules/test_bloomfilter/test_bloomfilter.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "fmgr.h"
+#include "lib/bloomfilter.h"
+#include "miscadmin.h"
+
+PG_MODULE_MAGIC;
+
+/* Must fit decimal representation of PG_INT64_MAX + 2 bytes: */
+#define MAX_ELEMENT_BYTES		21
+/* False positive rate WARNING threshold (1%): */
+#define FPOSITIVE_THRESHOLD		0.01
+
+
+/*
+ * Populate an empty Bloom filter with "nelements" dummy strings.
+ */
+static void
+populate_with_dummy_strings(bloom_filter *filter, int64 nelements)
+{
+	char		element[MAX_ELEMENT_BYTES];
+	int64		i;
+
+	for (i = 0; i < nelements; i++)
+	{
+		CHECK_FOR_INTERRUPTS();
+
+		snprintf(element, sizeof(element), "i" INT64_FORMAT, i);
+		bloom_add_element(filter, (unsigned char *) element, strlen(element));
+	}
+}
+
+/*
+ * Returns number of strings that are indicated as probably appearing in Bloom
+ * filter that were in fact never added by populate_with_dummy_strings().
+ * These are false positives.
+ */
+static int64
+nfalsepos_for_missing_strings(bloom_filter *filter, int64 nelements)
+{
+	char		element[MAX_ELEMENT_BYTES];
+	int64		nfalsepos = 0;
+	int64		i;
+
+	for (i = 0; i < nelements; i++)
+	{
+		CHECK_FOR_INTERRUPTS();
+
+		snprintf(element, sizeof(element), "M" INT64_FORMAT, i);
+		if (!bloom_lacks_element(filter, (unsigned char *) element,
+								 strlen(element)))
+			nfalsepos++;
+	}
+
+	return nfalsepos;
+}
+
+static void
+create_and_test_bloom(int power, int64 nelements, int callerseed)
+{
+	int			bloom_work_mem;
+	uint32		seed;
+	int64		nfalsepos;
+	bloom_filter *filter;
+
+	bloom_work_mem = (1L << power) / 8L / 1024L;
+
+	elog(DEBUG1, "bloom_work_mem (KB): %d", bloom_work_mem);
+
+	/*
+	 * Generate random seed, or use caller's.  Seed should always be a
+	 * positive value less than or equal to PG_INT32_MAX, to ensure that any
+	 * random seed can be recreated through callerseed if the need arises.
+	 * (Don't assume that RAND_MAX cannot exceed PG_INT32_MAX.)
+	 */
+	seed = callerseed < 0 ? random() % PG_INT32_MAX : callerseed;
+
+	/* Create Bloom filter, populate it, and report on false positive rate */
+	filter = bloom_create(nelements, bloom_work_mem, seed);
+	populate_with_dummy_strings(filter, nelements);
+	nfalsepos = nfalsepos_for_missing_strings(filter, nelements);
+
+	ereport((nfalsepos > nelements * FPOSITIVE_THRESHOLD) ? WARNING : DEBUG1,
+			(errmsg_internal("false positives: " INT64_FORMAT " (rate: %.6f, proportion bits set: %.6f, seed: %u)",
+							 nfalsepos, (double) nfalsepos / nelements,
+							 bloom_prop_bits_set(filter), seed)));
+
+	bloom_free(filter);
+}
+
+PG_FUNCTION_INFO_V1(test_bloomfilter);
+
+/*
+ * SQL-callable entry point to perform all tests.
+ *
+ * If a 1% false positive threshold is not met, emits WARNINGs.
+ *
+ * See README for details of arguments.
+ */
+Datum
+test_bloomfilter(PG_FUNCTION_ARGS)
+{
+	int			power = PG_GETARG_INT32(0);
+	int64		nelements = PG_GETARG_INT64(1);
+	int			seed = PG_GETARG_INT32(2);
+	int			tests = PG_GETARG_INT32(3);
+	int			i;
+
+	if (power < 23 || power > 32)
+		elog(ERROR, "power argument must be between 23 and 32 inclusive");
+
+	if (tests <= 0)
+		elog(ERROR, "invalid number of tests: %d", tests);
+
+	if (nelements < 0)
+		elog(ERROR, "invalid number of elements: " INT64_FORMAT, nelements);
+
+	for (i = 0; i < tests; i++)
+	{
+		elog(DEBUG1, "beginning test #%d...", i + 1);
+
+		create_and_test_bloom(power, nelements, seed);
+	}
+
+	PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_bloomfilter/test_bloomfilter.control b/src/test/modules/test_bloomfilter/test_bloomfilter.control
new file mode 100644
index 0000000..99e56ee
--- /dev/null
+++ b/src/test/modules/test_bloomfilter/test_bloomfilter.control
@@ -0,0 +1,4 @@
+comment = 'Test code for Bloom filter library'
+default_version = '1.0'
+module_pathname = '$libdir/test_bloomfilter'
+relocatable = true
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index d4765ce..1b1a996 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2580,6 +2580,7 @@ bitmapword
 bits16
 bits32
 bits8
+bloom_filter
 bool
 brin_column_state
 bytea
-- 
2.7.4

From e492d4a7553c8e736ca03b2013fa6a8ec9302bd5 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <p...@bowt.ie>
Date: Tue, 2 May 2017 00:19:24 -0700
Subject: [PATCH 2/2] Add amcheck verification of indexes against heap.

Add a new, optional capability to bt_index_check() and
bt_index_parent_check():  callers can check that each heap tuple that
ought to have an index entry does in fact have one.  This happens at the
end of the existing verification checks.

This is implemented by using a Bloom filter data structure.  The
implementation performs set membership tests within a callback (the same
type of callback that each index AM registers for CREATE INDEX).  The
Bloom filter is populated during the initial index verification scan.
---
 contrib/amcheck/Makefile                 |   2 +-
 contrib/amcheck/amcheck--1.0--1.1.sql    |  28 +++
 contrib/amcheck/amcheck.control          |   2 +-
 contrib/amcheck/expected/check_btree.out |  14 +-
 contrib/amcheck/sql/check_btree.sql      |   9 +-
 contrib/amcheck/verify_nbtree.c          | 286 ++++++++++++++++++++++++++++---
 doc/src/sgml/amcheck.sgml                | 122 ++++++++++---
 7 files changed, 401 insertions(+), 62 deletions(-)
 create mode 100644 contrib/amcheck/amcheck--1.0--1.1.sql

diff --git a/contrib/amcheck/Makefile b/contrib/amcheck/Makefile
index 43bed91..c5764b5 100644
--- a/contrib/amcheck/Makefile
+++ b/contrib/amcheck/Makefile
@@ -4,7 +4,7 @@ MODULE_big	= amcheck
 OBJS		= verify_nbtree.o $(WIN32RES)
 
 EXTENSION = amcheck
-DATA = amcheck--1.0.sql
+DATA = amcheck--1.0--1.1.sql amcheck--1.0.sql
 PGFILEDESC = "amcheck - function for verifying relation integrity"
 
 REGRESS = check check_btree
diff --git a/contrib/amcheck/amcheck--1.0--1.1.sql b/contrib/amcheck/amcheck--1.0--1.1.sql
new file mode 100644
index 0000000..e6cca0a
--- /dev/null
+++ b/contrib/amcheck/amcheck--1.0--1.1.sql
@@ -0,0 +1,28 @@
+/* contrib/amcheck/amcheck--1.0--1.1.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "ALTER EXTENSION amcheck UPDATE TO '1.1'" to load this file. \quit
+
+--
+-- bt_index_check()
+--
+DROP FUNCTION bt_index_check(regclass);
+CREATE FUNCTION bt_index_check(index regclass,
+    heapallindexed boolean DEFAULT false)
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'bt_index_check'
+LANGUAGE C STRICT PARALLEL RESTRICTED;
+
+--
+-- bt_index_parent_check()
+--
+DROP FUNCTION bt_index_parent_check(regclass);
+CREATE FUNCTION bt_index_parent_check(index regclass,
+    heapallindexed boolean DEFAULT false)
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'bt_index_parent_check'
+LANGUAGE C STRICT PARALLEL RESTRICTED;
+
+-- Don't want these to be available to public
+REVOKE ALL ON FUNCTION bt_index_check(regclass, boolean) FROM PUBLIC;
+REVOKE ALL ON FUNCTION bt_index_parent_check(regclass, boolean) FROM PUBLIC;
diff --git a/contrib/amcheck/amcheck.control b/contrib/amcheck/amcheck.control
index 05e2861..4690484 100644
--- a/contrib/amcheck/amcheck.control
+++ b/contrib/amcheck/amcheck.control
@@ -1,5 +1,5 @@
 # amcheck extension
 comment = 'functions for verifying relation integrity'
-default_version = '1.0'
+default_version = '1.1'
 module_pathname = '$libdir/amcheck'
 relocatable = true
diff --git a/contrib/amcheck/expected/check_btree.out b/contrib/amcheck/expected/check_btree.out
index df3741e..42872b8 100644
--- a/contrib/amcheck/expected/check_btree.out
+++ b/contrib/amcheck/expected/check_btree.out
@@ -16,8 +16,8 @@ RESET ROLE;
 -- we, intentionally, don't check relation permissions - it's useful
 -- to run this cluster-wide with a restricted account, and as tested
 -- above explicit permission has to be granted for that.
-GRANT EXECUTE ON FUNCTION bt_index_check(regclass) TO bttest_role;
-GRANT EXECUTE ON FUNCTION bt_index_parent_check(regclass) TO bttest_role;
+GRANT EXECUTE ON FUNCTION bt_index_check(regclass, boolean) TO bttest_role;
+GRANT EXECUTE ON FUNCTION bt_index_parent_check(regclass, boolean) TO bttest_role;
 SET ROLE bttest_role;
 SELECT bt_index_check('bttest_a_idx');
  bt_index_check 
@@ -56,8 +56,14 @@ SELECT bt_index_check('bttest_a_idx');
  
 (1 row)
 
--- more expansive test
-SELECT bt_index_parent_check('bttest_b_idx');
+-- more expansive tests
+SELECT bt_index_check('bttest_a_idx', true);
+ bt_index_check 
+----------------
+ 
+(1 row)
+
+SELECT bt_index_parent_check('bttest_b_idx', true);
  bt_index_parent_check 
 -----------------------
  
diff --git a/contrib/amcheck/sql/check_btree.sql b/contrib/amcheck/sql/check_btree.sql
index fd90531..5d27969 100644
--- a/contrib/amcheck/sql/check_btree.sql
+++ b/contrib/amcheck/sql/check_btree.sql
@@ -19,8 +19,8 @@ RESET ROLE;
 -- we, intentionally, don't check relation permissions - it's useful
 -- to run this cluster-wide with a restricted account, and as tested
 -- above explicit permission has to be granted for that.
-GRANT EXECUTE ON FUNCTION bt_index_check(regclass) TO bttest_role;
-GRANT EXECUTE ON FUNCTION bt_index_parent_check(regclass) TO bttest_role;
+GRANT EXECUTE ON FUNCTION bt_index_check(regclass, boolean) TO bttest_role;
+GRANT EXECUTE ON FUNCTION bt_index_parent_check(regclass, boolean) TO bttest_role;
 SET ROLE bttest_role;
 SELECT bt_index_check('bttest_a_idx');
 SELECT bt_index_parent_check('bttest_a_idx');
@@ -42,8 +42,9 @@ ROLLBACK;
 
 -- normal check outside of xact
 SELECT bt_index_check('bttest_a_idx');
--- more expansive test
-SELECT bt_index_parent_check('bttest_b_idx');
+-- more expansive tests
+SELECT bt_index_check('bttest_a_idx', true);
+SELECT bt_index_parent_check('bttest_b_idx', true);
 
 BEGIN;
 SELECT bt_index_check('bttest_a_idx');
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index da518da..7e20d52 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -8,6 +8,11 @@
  * (the insertion scankey sort-wise NULL semantics are needed for
  * verification).
  *
+ * When index-to-heap verification is requested, a Bloom filter is used to
+ * fingerprint all tuples in the target index, as the index is traversed to
+ * verify its structure.  A heap scan later verifies the presence in the heap
+ * of all index tuples fingerprinted within the Bloom filter.
+ *
  *
  * Copyright (c) 2017-2018, PostgreSQL Global Development Group
  *
@@ -23,6 +28,7 @@
 #include "catalog/index.h"
 #include "catalog/pg_am.h"
 #include "commands/tablecmds.h"
+#include "lib/bloomfilter.h"
 #include "miscadmin.h"
 #include "storage/lmgr.h"
 #include "utils/memutils.h"
@@ -43,9 +49,10 @@ PG_MODULE_MAGIC;
  * target is the point of reference for a verification operation.
  *
  * Other B-Tree pages may be allocated, but those are always auxiliary (e.g.,
- * they are current target's child pages). Conceptually, problems are only
- * ever found in the current target page. Each page found by verification's
- * left/right, top/bottom scan becomes the target exactly once.
+ * they are current target's child pages).  Conceptually, problems are only
+ * ever found in the current target page (or for a particular heap tuple during
+ * heapallindexed verification).  Each page found by verification's left/right,
+ * top/bottom scan becomes the target exactly once.
  */
 typedef struct BtreeCheckState
 {
@@ -53,10 +60,13 @@ typedef struct BtreeCheckState
 	 * Unchanging state, established at start of verification:
 	 */
 
-	/* B-Tree Index Relation */
+	/* B-Tree Index Relation and associated heap relation */
 	Relation	rel;
+	Relation	heaprel;
 	/* ShareLock held on heap/index, rather than AccessShareLock? */
 	bool		readonly;
+	/* Also verifying heap has no unindexed tuples? */
+	bool		heapallindexed;
 	/* Per-page context */
 	MemoryContext targetcontext;
 	/* Buffer access strategy */
@@ -72,6 +82,15 @@ typedef struct BtreeCheckState
 	BlockNumber targetblock;
 	/* Target page's LSN */
 	XLogRecPtr	targetlsn;
+
+	/*
+	 * Mutable state, for optional heapallindexed verification:
+	 */
+
+	/* Bloom filter fingerprints B-Tree index */
+	bloom_filter *filter;
+	/* Debug counter */
+	int64		heaptuplespresent;
 } BtreeCheckState;
 
 /*
@@ -92,15 +111,20 @@ typedef struct BtreeLevel
 PG_FUNCTION_INFO_V1(bt_index_check);
 PG_FUNCTION_INFO_V1(bt_index_parent_check);
 
-static void bt_index_check_internal(Oid indrelid, bool parentcheck);
+static void bt_index_check_internal(Oid indrelid, bool parentcheck,
+						bool heapallindexed);
 static inline void btree_index_checkable(Relation rel);
-static void bt_check_every_level(Relation rel, bool readonly);
+static void bt_check_every_level(Relation rel, Relation heaprel,
+					 bool readonly, bool heapallindexed);
 static BtreeLevel bt_check_level_from_leftmost(BtreeCheckState *state,
 							 BtreeLevel level);
 static void bt_target_page_check(BtreeCheckState *state);
 static ScanKey bt_right_page_check_scankey(BtreeCheckState *state);
 static void bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
 				  ScanKey targetkey);
+static void bt_tuple_present_callback(Relation index, HeapTuple htup,
+						  Datum *values, bool *isnull,
+						  bool tupleIsAlive, void *checkstate);
 static inline bool offset_is_negative_infinity(BTPageOpaque opaque,
 							OffsetNumber offset);
 static inline bool invariant_leq_offset(BtreeCheckState *state,
@@ -116,37 +140,47 @@ static inline bool invariant_leq_nontarget_offset(BtreeCheckState *state,
 static Page palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum);
 
 /*
- * bt_index_check(index regclass)
+ * bt_index_check(index regclass, heapallindexed boolean)
  *
  * Verify integrity of B-Tree index.
  *
  * Acquires AccessShareLock on heap & index relations.  Does not consider
- * invariants that exist between parent/child pages.
+ * invariants that exist between parent/child pages.  Optionally verifies
+ * that heap does not contain any unindexed or incorrectly indexed tuples.
  */
 Datum
 bt_index_check(PG_FUNCTION_ARGS)
 {
 	Oid			indrelid = PG_GETARG_OID(0);
+	bool		heapallindexed = false;
 
-	bt_index_check_internal(indrelid, false);
+	if (PG_NARGS() == 2)
+		heapallindexed = PG_GETARG_BOOL(1);
+
+	bt_index_check_internal(indrelid, false, heapallindexed);
 
 	PG_RETURN_VOID();
 }
 
 /*
- * bt_index_parent_check(index regclass)
+ * bt_index_parent_check(index regclass, heapallindexed boolean)
  *
  * Verify integrity of B-Tree index.
  *
  * Acquires ShareLock on heap & index relations.  Verifies that downlinks in
- * parent pages are valid lower bounds on child pages.
+ * parent pages are valid lower bounds on child pages.  Optionally verifies
+ * that heap does not contain any unindexed or incorrectly indexed tuples.
  */
 Datum
 bt_index_parent_check(PG_FUNCTION_ARGS)
 {
 	Oid			indrelid = PG_GETARG_OID(0);
+	bool		heapallindexed = false;
 
-	bt_index_check_internal(indrelid, true);
+	if (PG_NARGS() == 2)
+		heapallindexed = PG_GETARG_BOOL(1);
+
+	bt_index_check_internal(indrelid, true, heapallindexed);
 
 	PG_RETURN_VOID();
 }
@@ -155,7 +189,7 @@ bt_index_parent_check(PG_FUNCTION_ARGS)
  * Helper for bt_index_[parent_]check, coordinating the bulk of the work.
  */
 static void
-bt_index_check_internal(Oid indrelid, bool parentcheck)
+bt_index_check_internal(Oid indrelid, bool parentcheck, bool heapallindexed)
 {
 	Oid			heapid;
 	Relation	indrel;
@@ -185,15 +219,20 @@ bt_index_check_internal(Oid indrelid, bool parentcheck)
 	 * Open the target index relations separately (like relation_openrv(), but
 	 * with heap relation locked first to prevent deadlocking).  In hot
 	 * standby mode this will raise an error when parentcheck is true.
+	 *
+	 * There is no need for the usual indcheckxmin usability horizon test here,
+	 * even in the heapallindexed case, because the index undergoing
+	 * verification only needs to have entries for the snapshot that may be
+	 * registered later.  (If this is a parentcheck verification, there is
+	 * no question about committed or recently dead heap tuples lacking
+	 * index entries due to concurrent activity.)
 	 */
 	indrel = index_open(indrelid, lockmode);
 
 	/*
 	 * Since we did the IndexGetRelation call above without any lock, it's
 	 * barely possible that a race against an index drop/recreation could have
-	 * netted us the wrong table.  Although the table itself won't actually be
-	 * examined during verification currently, a recheck still seems like a
-	 * good idea.
+	 * netted us the wrong table.
 	 */
 	if (heaprel == NULL || heapid != IndexGetRelation(indrelid, false))
 		ereport(ERROR,
@@ -204,8 +243,8 @@ bt_index_check_internal(Oid indrelid, bool parentcheck)
 	/* Relation suitable for checking as B-Tree? */
 	btree_index_checkable(indrel);
 
-	/* Check index */
-	bt_check_every_level(indrel, parentcheck);
+	/* Check index, possibly against table it is an index on */
+	bt_check_every_level(indrel, heaprel, parentcheck, heapallindexed);
 
 	/*
 	 * Release locks early. That's ok here because nothing in the called
@@ -253,11 +292,14 @@ btree_index_checkable(Relation rel)
 
 /*
  * Main entry point for B-Tree SQL-callable functions. Walks the B-Tree in
- * logical order, verifying invariants as it goes.
+ * logical order, verifying invariants as it goes.  Optionally, verification
+ * checks if the heap relation contains any tuples that are not represented in
+ * the index but should be.
  *
  * It is the caller's responsibility to acquire appropriate heavyweight lock on
  * the index relation, and advise us if extra checks are safe when a ShareLock
- * is held.
+ * is held.  (A lock of the same type must also have been acquired on the heap
+ * relation.)
  *
  * A ShareLock is generally assumed to prevent any kind of physical
  * modification to the index structure, including modifications that VACUUM may
@@ -272,13 +314,15 @@ btree_index_checkable(Relation rel)
  * parent/child check cannot be affected.)
  */
 static void
-bt_check_every_level(Relation rel, bool readonly)
+bt_check_every_level(Relation rel, Relation heaprel, bool readonly,
+					 bool heapallindexed)
 {
 	BtreeCheckState *state;
 	Page		metapage;
 	BTMetaPageData *metad;
 	uint32		previouslevel;
 	BtreeLevel	current;
+	Snapshot	snapshot = SnapshotAny;
 
 	/*
 	 * RecentGlobalXmin assertion matches index_getnext_tid().  See note on
@@ -291,7 +335,34 @@ bt_check_every_level(Relation rel, bool readonly)
 	 */
 	state = palloc(sizeof(BtreeCheckState));
 	state->rel = rel;
+	state->heaprel = heaprel;
 	state->readonly = readonly;
+	state->heapallindexed = heapallindexed;
+
+	if (state->heapallindexed)
+	{
+		int64		total_elems;
+		uint32		seed;
+
+		/* Size Bloom filter based on estimated number of tuples in index */
+		total_elems = (int64) state->rel->rd_rel->reltuples;
+		/* Random seed relies on backend srandom() call to avoid repetition */
+		seed = random();
+		/* Create Bloom filter to fingerprint index */
+		state->filter = bloom_create(total_elems, maintenance_work_mem, seed);
+		state->heaptuplespresent = 0;
+
+		/*
+		 * Register our own snapshot in !readonly case, rather than asking
+		 * IndexBuildHeapScan() to do this for us later.  This needs to happen
+		 * before index fingerprinting begins, so we can later be certain that
+		 * index fingerprinting should have reached all tuples returned by
+		 * IndexBuildHeapScan().
+		 */
+		if (!state->readonly)
+			snapshot = RegisterSnapshot(GetTransactionSnapshot());
+	}
+
 	/* Create context for page */
 	state->targetcontext = AllocSetContextCreate(CurrentMemoryContext,
 												 "amcheck context",
@@ -345,6 +416,63 @@ bt_check_every_level(Relation rel, bool readonly)
 		previouslevel = current.level;
 	}
 
+	/*
+	 * * Heap contains unindexed/malformed tuples check *
+	 */
+	if (state->heapallindexed)
+	{
+		IndexInfo  *indexinfo = BuildIndexInfo(state->rel);
+		HeapScanDesc scan;
+
+		/*
+		 * Create our own scan for IndexBuildHeapScan(), like a parallel index
+		 * build.  We do things this way because it lets us use the MVCC
+		 * snapshot we acquired before index fingerprinting began (in the
+		 * !readonly case).
+		 */
+		scan = heap_beginscan_strat(state->heaprel, /* relation */
+									snapshot,	/* snapshot */
+									0,	/* number of keys */
+									NULL,	/* scan key */
+									true,	/* buffer access strategy OK */
+									true);	/* syncscan OK? */
+
+		/*
+		 * Scan will behave as the first scan of a CREATE INDEX CONCURRENTLY
+		 * behaves when only AccessShareLock held.  This is really only needed
+		 * to prevent confusion within IndexBuildHeapScan() about how to
+		 * interpret the state we pass.
+		 */
+		indexinfo->ii_Concurrent = !state->readonly;
+
+		/*
+		 * Don't wait for uncommitted tuple xact commit/abort when index is a
+		 * unique index on a catalog (or an index used by an exclusion
+		 * constraint).  This could otherwise happen in the readonly case.
+		 */
+		indexinfo->ii_Unique = false;
+		indexinfo->ii_ExclusionOps = NULL;
+		indexinfo->ii_ExclusionProcs = NULL;
+		indexinfo->ii_ExclusionStrats = NULL;
+
+		elog(DEBUG1, "verifying that tuples from index \"%s\" are present in \"%s\"",
+			 RelationGetRelationName(state->rel),
+			 RelationGetRelationName(state->heaprel));
+
+		IndexBuildHeapScan(state->heaprel, state->rel, indexinfo, true,
+						   bt_tuple_present_callback, (void *) state, scan);
+
+		ereport(DEBUG1,
+				(errmsg_internal("finished verifying presence of " INT64_FORMAT " tuples (proportion of bits set: %f) from table \"%s\"",
+								 state->heaptuplespresent, bloom_prop_bits_set(state->filter),
+								 RelationGetRelationName(heaprel))));
+
+		if (snapshot != SnapshotAny)
+			UnregisterSnapshot(snapshot);
+
+		bloom_free(state->filter);
+	}
+
 	/* Be tidy: */
 	MemoryContextDelete(state->targetcontext);
 }
@@ -497,7 +625,7 @@ bt_check_level_from_leftmost(BtreeCheckState *state, BtreeLevel level)
 					 errdetail_internal("Block pointed to=%u expected level=%u level in pointed to block=%u.",
 										current, level.level, opaque->btpo.level)));
 
-		/* Verify invariants for page -- all important checks occur here */
+		/* Verify invariants for page */
 		bt_target_page_check(state);
 
 nextpage:
@@ -544,6 +672,9 @@ nextpage:
  *
  * - That all child pages respect downlinks lower bound.
  *
+ * This is also where heapallindexed callers use their Bloom filter to
+ * fingerprint IndexTuples.
+ *
  * Note:  Memory allocated in this routine is expected to be released by caller
  * resetting state->targetcontext.
  */
@@ -587,6 +718,11 @@ bt_target_page_check(BtreeCheckState *state)
 		itup = (IndexTuple) PageGetItem(state->target, itemid);
 		skey = _bt_mkscankey(state->rel, itup);
 
+		/* Fingerprint leaf page tuples (those that point to the heap) */
+		if (state->heapallindexed && P_ISLEAF(topaque) && !ItemIdIsDead(itemid))
+			bloom_add_element(state->filter, (unsigned char *) itup,
+							  IndexTupleSize(itup));
+
 		/*
 		 * * High key check *
 		 *
@@ -680,8 +816,10 @@ bt_target_page_check(BtreeCheckState *state)
 		 * * Last item check *
 		 *
 		 * Check last item against next/right page's first data item's when
-		 * last item on page is reached.  This additional check can detect
-		 * transposed pages.
+		 * last item on page is reached.  This additional check will detect
+		 * transposed pages iff the supposed right sibling page happens to
+		 * belong before target in the key space.  (Otherwise, a subsequent
+		 * heap verification will probably detect the problem.)
 		 *
 		 * This check is similar to the item order check that will have
 		 * already been performed for every other "real" item on target page
@@ -1060,6 +1198,106 @@ bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
 }
 
 /*
+ * Per-tuple callback from IndexBuildHeapScan, used to determine if index has
+ * all the entries that definitely should have been observed in leaf pages of
+ * the target index (that is, all IndexTuples that were fingerprinted by our
+ * Bloom filter).  All heapallindexed checks occur here.
+ *
+ * The redundancy between an index and the table it indexes provides a good
+ * opportunity to detect corruption, especially corruption within the table.
+ * The high level principle behind the verification performed here is that any
+ * IndexTuple that should be in an index following a fresh CREATE INDEX (based
+ * on the same index definition) should also have been in the original,
+ * existing index, which should have used exactly the same representation.
+ *
+ * Since the overall structure of the index has already been verified, the most
+ * likely explanation for error here is a corrupt heap page (could be logical
+ * or physical corruption).  Index corruption may still be detected here,
+ * though.  Only readonly callers will have verified that left links and right
+ * links are in agreement, and so it's possible that a leaf page transposition
+ * within index is actually the source of corruption detected here (for
+ * !readonly callers).  The checks performed only for readonly callers might
+ * more accurately frame the problem as a cross-page invariant issue (this
+ * could even be due to recovery not replaying all WAL records).  The !readonly
+ * ERROR message raised here includes a HINT about retrying with readonly
+ * verification, just in case it's a cross-page invariant issue, though that
+ * isn't particularly likely.
+ *
+ * IndexBuildHeapScan() expects to be able to find the root tuple when a
+ * heap-only tuple (the live tuple at the end of some HOT chain) needs to be
+ * indexed, in order to replace the actual tuple's TID with the root tuple's
+ * TID (which is what we're actually passed back here).  The index build heap
+ * scan code will raise an error when a tuple that claims to be the root of the
+ * heap-only tuple's HOT chain cannot be located.  This catches cases where the
+ * original root item offset/root tuple for a HOT chain indicates (for whatever
+ * reason) that the entire HOT chain is dead, despite the fact that the latest
+ * heap-only tuple should be indexed.  When this happens, sequential scans may
+ * always give correct answers, and all indexes may be considered structurally
+ * consistent (i.e. the nbtree structural checks would not detect corruption).
+ * It may be the case that only index scans give wrong answers, and yet heap or
+ * SLRU corruption is the real culprit.  (While it's true that LP_DEAD bit
+ * setting will probably also leave the index in a corrupt state before too
+ * long, the problem is nonetheless that there is heap corruption.)
+ *
+ * Heap-only tuple handling within IndexBuildHeapScan() works in a way that
+ * helps us to detect index tuples that contain the wrong values (values that
+ * don't match the latest tuple in the HOT chain).  This can happen when there
+ * is no superseding index tuple due to a faulty assessment of HOT safety,
+ * perhaps during the original CREATE INDEX.  Because the latest tuple's
+ * contents are used with the root TID, an error will be raised when a tuple
+ * with the same TID but non-matching attribute values is passed back to us.
+ * Faulty assessment of HOT-safety was behind at least two distinct CREATE
+ * INDEX CONCURRENTLY bugs that made it into stable releases, one of which was
+ * undetected for many years.  In short, the same principle that allows a
+ * REINDEX to repair corruption when there was an (undetected) broken HOT chain
+ * also allows us to detect the corruption in many cases.
+ */
+static void
+bt_tuple_present_callback(Relation index, HeapTuple htup, Datum *values,
+						  bool *isnull, bool tupleIsAlive, void *checkstate)
+{
+	BtreeCheckState *state = (BtreeCheckState *) checkstate;
+	IndexTuple	itup;
+
+	Assert(state->heapallindexed);
+
+	/*
+	 * Generate an index tuple for fingerprinting.
+	 *
+	 * Index tuple formation is assumed to be deterministic, and IndexTuples
+	 * are assumed immutable.  While the LP_DEAD bit is mutable in leaf pages,
+	 * that's ItemId metadata, which was not fingerprinted.  (There will often
+	 * be some dead-to-everyone IndexTuples fingerprinted by the Bloom filter,
+	 * but we only try to detect the absence of needed tuples, so that's okay.)
+	 *
+	 * Note that we rely on deterministic index_form_tuple() TOAST compression.
+	 * If index_form_tuple() was ever enhanced to compress datums out-of-line,
+	 * or otherwise varied when or how compression was applied, our assumption
+	 * would break, leading to false positive reports of corruption.  For now,
+	 * we don't decompress/normalize toasted values as part of fingerprinting.
+	 */
+	itup = index_form_tuple(RelationGetDescr(index), values, isnull);
+	itup->t_tid = htup->t_self;
+
+	/* Probe Bloom filter -- tuple should be present */
+	if (bloom_lacks_element(state->filter, (unsigned char *) itup,
+							IndexTupleSize(itup)))
+		ereport(ERROR,
+				(errcode(ERRCODE_DATA_CORRUPTED),
+				 errmsg("heap tuple (%u,%u) from table \"%s\" lacks matching index tuple within index \"%s\"",
+						ItemPointerGetBlockNumber(&(itup->t_tid)),
+						ItemPointerGetOffsetNumber(&(itup->t_tid)),
+						RelationGetRelationName(state->heaprel),
+						RelationGetRelationName(state->rel)),
+				 !state->readonly
+				 ? errhint("Retrying verification using the function bt_index_parent_check() might provide a more specific error.")
+				 : 0));
+
+	state->heaptuplespresent++;
+	pfree(itup);
+}
+
+/*
  * Is particular offset within page (whose special state is passed by caller)
  * the page negative-infinity item?
  *
diff --git a/doc/src/sgml/amcheck.sgml b/doc/src/sgml/amcheck.sgml
index 852e260..f6be1b3 100644
--- a/doc/src/sgml/amcheck.sgml
+++ b/doc/src/sgml/amcheck.sgml
@@ -44,7 +44,7 @@
   <variablelist>
    <varlistentry>
     <term>
-     <function>bt_index_check(index regclass) returns void</function>
+     <function>bt_index_check(index regclass, heapallindexed boolean DEFAULT false) returns void</function>
      <indexterm>
       <primary>bt_index_check</primary>
      </indexterm>
@@ -55,7 +55,9 @@
       <function>bt_index_check</function> tests that its target, a
       B-Tree index, respects a variety of invariants.  Example usage:
 <screen>
-test=# SELECT bt_index_check(c.oid), c.relname, c.relpages
+test=# SELECT bt_index_check(index =&gt; c.oid, heapallindexed =&gt; i.indisunique),
+               c.relname,
+               c.relpages
 FROM pg_index i
 JOIN pg_opclass op ON i.indclass[0] = op.oid
 JOIN pg_am am ON op.opcmethod = am.oid
@@ -83,9 +85,11 @@ ORDER BY c.relpages DESC LIMIT 10;
 </screen>
       This example shows a session that performs verification of every
       catalog index in the database <quote>test</quote>.  Details of just
-      the 10 largest indexes verified are displayed.  Since no error
-      is raised, all indexes tested appear to be logically consistent.
-      Naturally, this query could easily be changed to call
+      the 10 largest indexes verified are displayed.  Verification of
+      the presence of heap tuples as index tuples is requested for
+      unique indexes only.  Since no error is raised, all indexes
+      tested appear to be logically consistent.  Naturally, this query
+      could easily be changed to call
       <function>bt_index_check</function> for every index in the
       database where verification is supported.
      </para>
@@ -95,10 +99,11 @@ ORDER BY c.relpages DESC LIMIT 10;
       is the same lock mode acquired on relations by simple
       <literal>SELECT</literal> statements.
       <function>bt_index_check</function> does not verify invariants
-      that span child/parent relationships, nor does it verify that
-      the target index is consistent with its heap relation.  When a
-      routine, lightweight test for corruption is required in a live
-      production environment, using
+      that span child/parent relationships, but will verify the
+      presence of all heap tuples as index tuples within the index
+      when <parameter>heapallindexed</parameter> is
+      <literal>true</literal>.  When a routine, lightweight test for
+      corruption is required in a live production environment, using
       <function>bt_index_check</function> often provides the best
       trade-off between thoroughness of verification and limiting the
       impact on application performance and availability.
@@ -108,7 +113,7 @@ ORDER BY c.relpages DESC LIMIT 10;
 
    <varlistentry>
     <term>
-     <function>bt_index_parent_check(index regclass) returns void</function>
+     <function>bt_index_parent_check(index regclass, heapallindexed boolean DEFAULT false) returns void</function>
      <indexterm>
       <primary>bt_index_parent_check</primary>
      </indexterm>
@@ -117,19 +122,21 @@ ORDER BY c.relpages DESC LIMIT 10;
     <listitem>
      <para>
       <function>bt_index_parent_check</function> tests that its
-      target, a B-Tree index, respects a variety of invariants.  The
-      checks performed by <function>bt_index_parent_check</function>
-      are a superset of the checks performed by
-      <function>bt_index_check</function>.
+      target, a B-Tree index, respects a variety of invariants.
+      Optionally, when the <parameter>heapallindexed</parameter>
+      argument is <literal>true</literal>, the function verifies the
+      presence of all heap tuples that should be found within the
+      index.  The checks that can be performed by
+      <function>bt_index_parent_check</function> are a superset of the
+      checks that can be performed by <function>bt_index_check</function>.
       <function>bt_index_parent_check</function> can be thought of as
       a more thorough variant of <function>bt_index_check</function>:
       unlike <function>bt_index_check</function>,
       <function>bt_index_parent_check</function> also checks
-      invariants that span parent/child relationships.  However, it
-      does not verify that the target index is consistent with its
-      heap relation.  <function>bt_index_parent_check</function>
-      follows the general convention of raising an error if it finds a
-      logical inconsistency or other problem.
+      invariants that span parent/child relationships.
+      <function>bt_index_parent_check</function> follows the general
+      convention of raising an error if it finds a logical
+      inconsistency or other problem.
      </para>
      <para>
       A <literal>ShareLock</literal> is required on the target index by
@@ -159,6 +166,47 @@ ORDER BY c.relpages DESC LIMIT 10;
  </sect2>
 
  <sect2>
+  <title>Optional <parameter>heapallindexed</parameter> verification</title>
+ <para>
+  When the <parameter>heapallindexed</parameter> argument to
+  verification functions is <literal>true</literal>, an additional
+  phase of verification is performed against the table associated with
+  the target index relation.  This consists of a <quote>dummy</quote>
+  <command>CREATE INDEX</command> operation, which checks for the
+  presence of all hypothetical new index tuples against a temporary,
+  in-memory summarizing structure (this is built when needed during
+  the basic first phase of verification).  The summarizing structure
+  <quote>fingerprints</quote> every tuple found within the target
+  index.  The high level principle behind
+  <parameter>heapallindexed</parameter> verification is that a new
+  index that is equivalent to the existing, target index must only
+  have entries that can be found in the existing structure.
+ </para>
+ <para>
+  The additional <parameter>heapallindexed</parameter> phase adds
+  significant overhead: verification will typically take several times
+  longer.  However, there is no change to the relation-level locks
+  acquired when <parameter>heapallindexed</parameter> verification is
+  performed.
+ </para>
+ <para>
+  The summarizing structure is bound in size by
+  <varname>maintenance_work_mem</varname>.  In order to ensure that
+  there is no more than a 2% probability of failure to detect an
+  inconsistency for each heap tuple that should be represented in the
+  index, approximately 2 bytes of memory are needed per tuple.  As
+  less memory is made available per tuple, the probability of missing
+  an inconsistency slowly increases.  This approach limits the
+  overhead of verification significantly, while only slightly reducing
+  the probability of detecting a problem, especially for installations
+  where verification is treated as a routine maintenance task.  Any
+  single absent or malformed tuple has a new opportunity to be
+  detected with each new verification attempt.
+ </para>
+
+ </sect2>
+
+ <sect2>
   <title>Using <filename>amcheck</filename> effectively</title>
 
  <para>
@@ -199,16 +247,29 @@ ORDER BY c.relpages DESC LIMIT 10;
    </listitem>
    <listitem>
     <para>
+     Structural inconsistencies between indexes and the heap relations
+     that are indexed (when <parameter>heapallindexed</parameter>
+     verification is performed).
+    </para>
+    <para>
+     There is no cross-checking of indexes against their heap relation
+     during normal operation.  Symptoms of heap corruption can be subtle.
+    </para>
+   </listitem>
+   <listitem>
+    <para>
      Corruption caused by hypothetical undiscovered bugs in the
-     underlying <productname>PostgreSQL</productname> access method code or sort
-     code.
+     underlying <productname>PostgreSQL</productname> access method
+     code, sort code, or transaction management code.
     </para>
     <para>
      Automatic verification of the structural integrity of indexes
      plays a role in the general testing of new or proposed
      <productname>PostgreSQL</productname> features that could plausibly allow a
-     logical inconsistency to be introduced.  One obvious testing
-     strategy is to call <filename>amcheck</filename> functions continuously
+     logical inconsistency to be introduced.  Verification of table
+     structure and associated visibility and transaction status
+     information plays a similar role.  One obvious testing strategy
+     is to call <filename>amcheck</filename> functions continuously
      when running the standard regression tests.  See <xref
      linkend="regress-run"/> for details on running the tests.
     </para>
@@ -242,6 +303,12 @@ ORDER BY c.relpages DESC LIMIT 10;
      <emphasis>absolute</emphasis> protection against failures that
      result in memory corruption.
     </para>
+    <para>
+     When <parameter>heapallindexed</parameter> verification is
+     performed, there is generally a greatly increased chance of
+     detecting single-bit errors, since strict binary equality is
+     tested, and the indexed attributes within the heap are tested.
+    </para>
    </listitem>
   </itemizedlist>
   In general, <filename>amcheck</filename> can only prove the presence of
@@ -253,11 +320,10 @@ ORDER BY c.relpages DESC LIMIT 10;
   <title>Repairing corruption</title>
  <para>
   No error concerning corruption raised by <filename>amcheck</filename> should
-  ever be a false positive.  In practice, <filename>amcheck</filename> is more
-  likely to find software bugs than problems with hardware.
-  <filename>amcheck</filename> raises errors in the event of conditions that,
-  by definition, should never happen, and so careful analysis of
-  <filename>amcheck</filename> errors is often required.
+  ever be a false positive.  <filename>amcheck</filename> raises
+  errors in the event of conditions that, by definition, should never
+  happen, and so careful analysis of <filename>amcheck</filename>
+  errors is often required.
  </para>
  <para>
   There is no general method of repairing problems that
-- 
2.7.4
