Thanks Justin! I've applied all the fixes you proposed, and (hopefully)
improved a couple other comments.

I've been working on this over the past couple days, trying to polish
and commit it over the weekend - both into master and backbranches.
Sadly, the backpatching part turned out to be a bit more complicated
than I expected, because of the BRIN reworks in PG14 (done by me, as
foundation for the new opclasses, so ... well).

Anyway, I got it done, but it's a bit uglier than I hoped for and I
don't feel like pushing this on Sunday midnight. I think it's correct,
but maybe another pass to polish it a bit more is better.

So here are two patches - one for 11-13, the other for 14-master.

There's also a separate patch with pageinspect tests, but only as a
demonstration of various (non)broken cases, not for commit. And then
also a bash script generating indexes with random data, randomized
summarization etc. - on unpatched systems this happens to fail in about
1/3 of the runs (at least for me). I haven't seen any failures with the
patches attached (on any branch).

As for the issue / fix, I don't think there's a better solution than
what the patch does - we need to distinguish empty / all-nulls ranges,
but we can't add a flag because of on-disk format / ABI. So using the
existing flags seems like the only option - I haven't heard any other
ideas so far, and I couldn't come up with any myself either.

I've also thought about alternative "encodings" into allnulls/hasnulls,
instead of treating (true,true) as "empty" - but none of that ended up
being any simpler, quite the opposite actually, as it would change what
the individual flags mean etc. So AFAICS this is the best / least
disruptive option.

I went over all the places touching these flags, to double check if any
of those needs some tweaks (similar to union_tuples, which I missed for
a long time). But I haven't found anything else, so I think this version
of the patches is complete.

As for assessing how many indexes are affected - in principle, any index
on columns with NULLs may be broken. But it only matters if the index is
used for IS NULL queries, other queries are not affected.

I also realized that this only affects insertion of individual tuples
into existing all-null summaries, not "bulk" summarization that sees all
values at once. This happens because in this case add_values_to_range
sets hasnulls=true for the first (NULL) value, and then calls the
addValue procedure for the second (non-NULL) one, which resets the
allnulls flag to false.

But when inserting individual rows, we first set hasnulls=true, but
brin_form_tuple ignores that because of allnulls=true. And then when
inserting the second row, we start with hasnulls=false again, and the
opclass quietly resets the allnulls flag.

I guess this further reduces the number of broken indexes, especially
for data sets with small null_frac, or for append-only (or -mostly)
tables where most of the summarization is bulk.

I still feel a bit uneasy about this, but I think the patch is solid.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From 033286cff9733c24fdc7c3f774d947c9f1072aa0 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas.von...@postgresql.org>
Date: Sun, 8 Jan 2023 23:42:41 +0100
Subject: [PATCH] extra tests

---
 contrib/pageinspect/Makefile                |   2 +-
 contrib/pageinspect/expected/brin-fails.out | 152 ++++++++++++++++++++
 contrib/pageinspect/sql/brin-fails.sql      |  85 +++++++++++
 3 files changed, 238 insertions(+), 1 deletion(-)
 create mode 100644 contrib/pageinspect/expected/brin-fails.out
 create mode 100644 contrib/pageinspect/sql/brin-fails.sql

diff --git a/contrib/pageinspect/Makefile b/contrib/pageinspect/Makefile
index 95e030b396..69a28a6d3d 100644
--- a/contrib/pageinspect/Makefile
+++ b/contrib/pageinspect/Makefile
@@ -22,7 +22,7 @@ DATA =  pageinspect--1.11--1.12.sql pageinspect--1.10--1.11.sql \
 	pageinspect--1.0--1.1.sql
 PGFILEDESC = "pageinspect - functions to inspect contents of database pages"
 
-REGRESS = page btree brin gin gist hash checksum oldextversions
+REGRESS = page btree brin gin gist hash checksum oldextversions brin-fails
 
 ifdef USE_PGXS
 PG_CONFIG = pg_config
diff --git a/contrib/pageinspect/expected/brin-fails.out b/contrib/pageinspect/expected/brin-fails.out
new file mode 100644
index 0000000000..a12c761fc8
--- /dev/null
+++ b/contrib/pageinspect/expected/brin-fails.out
@@ -0,0 +1,152 @@
+create table t (a int);
+create extension pageinspect;
+-- works
+drop index if exists t_a_idx;
+NOTICE:  index "t_a_idx" does not exist, skipping
+truncate t;
+insert into t values (null), (1);
+create index on t using brin (a);
+select * from brin_page_items(get_raw_page('t_a_idx', 2), 't_a_idx'::regclass);
+ itemoffset | blknum | attnum | allnulls | hasnulls | placeholder |  value   
+------------+--------+--------+----------+----------+-------------+----------
+          1 |      0 |      1 | f        | t        | f           | {1 .. 1}
+(1 row)
+
+-- works
+drop index if exists t_a_idx;
+truncate t;
+insert into t values (1), (null);
+create index on t using brin (a);
+select * from brin_page_items(get_raw_page('t_a_idx', 2), 't_a_idx'::regclass);
+ itemoffset | blknum | attnum | allnulls | hasnulls | placeholder |  value   
+------------+--------+--------+----------+----------+-------------+----------
+          1 |      0 |      1 | f        | t        | f           | {1 .. 1}
+(1 row)
+
+-- fails
+drop index if exists t_a_idx;
+truncate t;
+insert into t values (null);
+create index on t using brin (a);
+insert into t values (1);
+select * from brin_page_items(get_raw_page('t_a_idx', 2), 't_a_idx'::regclass);
+ itemoffset | blknum | attnum | allnulls | hasnulls | placeholder |  value   
+------------+--------+--------+----------+----------+-------------+----------
+          1 |      0 |      1 | f        | t        | f           | {1 .. 1}
+(1 row)
+
+-- fails
+drop index if exists t_a_idx;
+truncate t;
+create index on t using brin (a);
+insert into t values (null);
+insert into t values (1);
+select * from brin_page_items(get_raw_page('t_a_idx', 2), 't_a_idx'::regclass);
+ itemoffset | blknum | attnum | allnulls | hasnulls | placeholder |  value   
+------------+--------+--------+----------+----------+-------------+----------
+          1 |      0 |      1 | f        | t        | f           | {1 .. 1}
+(1 row)
+
+-- works
+drop index if exists t_a_idx;
+truncate t;
+create index on t using brin (a) with (pages_per_range=1);
+insert into t select null from generate_series(1,291); -- fill first page
+insert into t values (null), (1);
+select brin_summarize_new_values('t_a_idx');
+ brin_summarize_new_values 
+---------------------------
+                         1
+(1 row)
+
+select * from brin_page_items(get_raw_page('t_a_idx', 2), 't_a_idx'::regclass);
+ itemoffset | blknum | attnum | allnulls | hasnulls | placeholder |  value   
+------------+--------+--------+----------+----------+-------------+----------
+          1 |      0 |      1 | t        | f        | f           | 
+          2 |      1 |      1 | f        | t        | f           | {1 .. 1}
+(2 rows)
+
+-- fails
+drop index if exists t_a_idx;
+truncate t;
+create index on t using brin (a) with (pages_per_range=1);
+insert into t select null from generate_series(1,291); -- fill first page
+insert into t values (null);
+select brin_summarize_new_values('t_a_idx');
+ brin_summarize_new_values 
+---------------------------
+                         1
+(1 row)
+
+insert into t values (1);
+select * from brin_page_items(get_raw_page('t_a_idx', 2), 't_a_idx'::regclass);
+ itemoffset | blknum | attnum | allnulls | hasnulls | placeholder |  value   
+------------+--------+--------+----------+----------+-------------+----------
+          1 |      0 |      1 | t        | f        | f           | 
+          2 |      1 |      1 | f        | t        | f           | {1 .. 1}
+(2 rows)
+
+-- works
+drop index if exists t_a_idx;
+truncate t;
+create index on t using brin (a) with (pages_per_range=1);
+insert into t select null from generate_series(1,291); -- fill first page
+insert into t values (null);
+insert into t values (1);
+select brin_summarize_new_values('t_a_idx');
+ brin_summarize_new_values 
+---------------------------
+                         1
+(1 row)
+
+insert into t values (1);
+select * from brin_page_items(get_raw_page('t_a_idx', 2), 't_a_idx'::regclass);
+ itemoffset | blknum | attnum | allnulls | hasnulls | placeholder |  value   
+------------+--------+--------+----------+----------+-------------+----------
+          1 |      0 |      1 | t        | f        | f           | 
+          2 |      1 |      1 | f        | t        | f           | {1 .. 1}
+(2 rows)
+
+-- fails
+drop index if exists t_a_idx;
+truncate t;
+create index on t using brin (a) with (pages_per_range=1);
+insert into t select null from generate_series(1,291); -- fill first page
+insert into t values (null);
+insert into t values (null);
+select brin_summarize_new_values('t_a_idx');
+ brin_summarize_new_values 
+---------------------------
+                         1
+(1 row)
+
+insert into t values (1);
+select * from brin_page_items(get_raw_page('t_a_idx', 2), 't_a_idx'::regclass);
+ itemoffset | blknum | attnum | allnulls | hasnulls | placeholder |  value   
+------------+--------+--------+----------+----------+-------------+----------
+          1 |      0 |      1 | t        | f        | f           | 
+          2 |      1 |      1 | f        | t        | f           | {1 .. 1}
+(2 rows)
+
+-- fails
+drop index if exists t_a_idx;
+truncate t;
+create index on t using brin (a) with (pages_per_range=1);
+insert into t select null from generate_series(1,291); -- fill first page
+insert into t values (null);
+insert into t values (null);
+select brin_summarize_new_values('t_a_idx');
+ brin_summarize_new_values 
+---------------------------
+                         1
+(1 row)
+
+insert into t values (null);
+insert into t values (1);
+select * from brin_page_items(get_raw_page('t_a_idx', 2), 't_a_idx'::regclass);
+ itemoffset | blknum | attnum | allnulls | hasnulls | placeholder |  value   
+------------+--------+--------+----------+----------+-------------+----------
+          1 |      0 |      1 | t        | f        | f           | 
+          2 |      1 |      1 | f        | t        | f           | {1 .. 1}
+(2 rows)
+
diff --git a/contrib/pageinspect/sql/brin-fails.sql b/contrib/pageinspect/sql/brin-fails.sql
new file mode 100644
index 0000000000..ca57ba7e03
--- /dev/null
+++ b/contrib/pageinspect/sql/brin-fails.sql
@@ -0,0 +1,85 @@
+create table t (a int);
+create extension pageinspect;
+
+-- works
+drop index if exists t_a_idx;
+truncate t;
+insert into t values (null), (1);
+create index on t using brin (a);
+select * from brin_page_items(get_raw_page('t_a_idx', 2), 't_a_idx'::regclass);
+
+-- works
+drop index if exists t_a_idx;
+truncate t;
+insert into t values (1), (null);
+create index on t using brin (a);
+select * from brin_page_items(get_raw_page('t_a_idx', 2), 't_a_idx'::regclass);
+
+-- fails
+drop index if exists t_a_idx;
+truncate t;
+insert into t values (null);
+create index on t using brin (a);
+insert into t values (1);
+select * from brin_page_items(get_raw_page('t_a_idx', 2), 't_a_idx'::regclass);
+
+-- fails
+drop index if exists t_a_idx;
+truncate t;
+create index on t using brin (a);
+insert into t values (null);
+insert into t values (1);
+select * from brin_page_items(get_raw_page('t_a_idx', 2), 't_a_idx'::regclass);
+
+-- works
+drop index if exists t_a_idx;
+truncate t;
+create index on t using brin (a) with (pages_per_range=1);
+insert into t select null from generate_series(1,291); -- fill first page
+insert into t values (null), (1);
+select brin_summarize_new_values('t_a_idx');
+select * from brin_page_items(get_raw_page('t_a_idx', 2), 't_a_idx'::regclass);
+
+-- fails
+drop index if exists t_a_idx;
+truncate t;
+create index on t using brin (a) with (pages_per_range=1);
+insert into t select null from generate_series(1,291); -- fill first page
+insert into t values (null);
+select brin_summarize_new_values('t_a_idx');
+insert into t values (1);
+select * from brin_page_items(get_raw_page('t_a_idx', 2), 't_a_idx'::regclass);
+
+-- works
+drop index if exists t_a_idx;
+truncate t;
+create index on t using brin (a) with (pages_per_range=1);
+insert into t select null from generate_series(1,291); -- fill first page
+insert into t values (null);
+insert into t values (1);
+select brin_summarize_new_values('t_a_idx');
+insert into t values (1);
+select * from brin_page_items(get_raw_page('t_a_idx', 2), 't_a_idx'::regclass);
+
+-- fails
+drop index if exists t_a_idx;
+truncate t;
+create index on t using brin (a) with (pages_per_range=1);
+insert into t select null from generate_series(1,291); -- fill first page
+insert into t values (null);
+insert into t values (null);
+select brin_summarize_new_values('t_a_idx');
+insert into t values (1);
+select * from brin_page_items(get_raw_page('t_a_idx', 2), 't_a_idx'::regclass);
+
+-- fails
+drop index if exists t_a_idx;
+truncate t;
+create index on t using brin (a) with (pages_per_range=1);
+insert into t select null from generate_series(1,291); -- fill first page
+insert into t values (null);
+insert into t values (null);
+select brin_summarize_new_values('t_a_idx');
+insert into t values (null);
+insert into t values (1);
+select * from brin_page_items(get_raw_page('t_a_idx', 2), 't_a_idx'::regclass);
-- 
2.39.0

From 26905146ebb93422152f0f6ec3a835f62b1b8327 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas.von...@postgresql.org>
Date: Sun, 8 Jan 2023 16:43:06 +0100
Subject: [PATCH] Fix handling of NULLs in BRIN indexes

BRIN indexes did not properly distinguish between summaries for empty
(no rows) and all-NULL ranges. All summaries were initialized with
allnulls=true, and the opclasses simply reset allnulls to false when
processing the first non-NULL value. This however fails if the range
starts with a NULL value (or a sequence of NULL values), in which case
we forget the range contains NULL values.

This happens because the allnulls flag is used for two separate
purposes - to mark empty ranges (not representing any rows yet) and
ranges containing only NULL values.

Opclasses don't know which of these cases it is, and so don't know
whether to set hasnulls=true. Setting hasnulls=true in both cases would
make it correct, but it would also make BRIN indexes useless for queries
with IS NULL clauses - all ranges start empty (and thus allnulls=true),
so all ranges would end up with either allnulls=true or hasnulls=true.

The severity of the issue is somewhat reduced by the fact that it only
happens when adding values to an existing summary with allnulls=true,
not when the summarization is processing values in bulk (e.g. during
CREATE INDEX or automatic summarization). In this case the flags were
updated in a slightly different way, not forgetting the NULL values.

The best solution would be to introduce a new flag marking index tuples
representing ranges with no rows, but that would break on-disk format
and/or ABI, depending on where we put the flag. Considering we need to
backpatch this, that's not acceptable.

So instead we use an "impossible" combination of both flags (allnulls
and hasnulls) set to true, to mark "empty" ranges with no rows. In
principle "empty" is a feature of the whole index tuple, which may
contain multiple summaries in a multi-column index, but this is where
the flags are, unfortunately.

We could also skip storing index tuples for empty summaries, but then
we'd have to always process such ranges - even if there are no rows in
large parts of the table (e.g. after a bulk DELETE), it would still
require reading the pages etc. So we store them, but ignore them when
building the bitmap.

Backpatch to 11. The issue exists since BRIN indexes were introduced in
9.5, but older releases are already EOL.

Backpatch-through: 11
Reviewed-by: Justin Pryzby, Matthias van de Meent
Discussion: https://postgr.es/m/402430e4-7d9d-6cf1-09ef-464d80aff...@enterprisedb.com
---
 src/backend/access/brin/brin.c                | 77 ++++++++++++++++++-
 src/backend/access/brin/brin_tuple.c          | 16 +++-
 ...summarization-and-inprogress-insertion.out |  8 +-
 ...ummarization-and-inprogress-insertion.spec |  1 +
 4 files changed, 94 insertions(+), 8 deletions(-)

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index de1427a1e0..aa8d4017a7 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -591,6 +591,17 @@ bringetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
 
 					bval = &dtup->bt_columns[attno - 1];
 
+					/*
+					 * If the range has both allnulls and hasnulls set, it means
+					 * there are no rows in the range, so we can skip it (we have
+					 * scan keys, and we know there's nothing to match).
+					 */
+					if (bval->bv_allnulls && bval->bv_hasnulls)
+					{
+						addrange = false;
+						break;
+					}
+
 					/*
 					 * First check if there are any IS [NOT] NULL scan keys,
 					 * and if we're violating them. In that case we can
@@ -1608,11 +1619,32 @@ union_tuples(BrinDesc *bdesc, BrinMemTuple *a, BrinTuple *b)
 
 		if (opcinfo->oi_regular_nulls)
 		{
-			/* Adjust "hasnulls". */
+			/*
+			 * If B is empty (represents no rows), ignore it and just keep
+			 * A as is (might be empty etc.).
+			 */
+			if (col_b->bv_allnulls && col_b->bv_hasnulls)
+				continue;
+
+			/*
+			 * Adjust "hasnulls".
+			 *
+			 * It may happen A has allnulls=true, and we should reset it. We
+			 * need to copy the values from B first, which happens later.
+			 * We know the next condition can't trigger, because B is not
+			 * empty so only one of the flags is true.
+			 */
 			if (!col_a->bv_hasnulls && col_b->bv_hasnulls)
 				col_a->bv_hasnulls = true;
 
-			/* If there are no values in B, there's nothing left to do. */
+			/*
+			 * If there are no values in B, there's nothing left to do.
+			 *
+			 * Note this is mutually exclusive with the preceding condition.
+			 * We have skipped "empty" B, so hasnulls and allnulls can't be
+			 * both true. So if we adjusted hasnulls for A, there have to be
+			 * values for B (i.e. we're not terminating here).
+			 */
 			if (col_b->bv_allnulls)
 				continue;
 
@@ -1626,6 +1658,7 @@ union_tuples(BrinDesc *bdesc, BrinMemTuple *a, BrinTuple *b)
 			{
 				int			i;
 
+				/* This also applies if we adjusted hasnulls=true earlier. */
 				col_a->bv_allnulls = false;
 
 				for (i = 0; i < opcinfo->oi_nstored; i++)
@@ -1710,16 +1743,41 @@ add_values_to_range(Relation idxRel, BrinDesc *bdesc, BrinMemTuple *dtup,
 		Datum		result;
 		BrinValues *bval;
 		FmgrInfo   *addValue;
+		bool		first_row;
+		bool		hasnulls = false;
 
 		bval = &dtup->bt_columns[keyno];
 
+		/*
+		 * Is this the first tuple we're adding to the range range? We track
+		 * that by setting both bv_hasnulls and bval->bv_allnulls to true
+		 * during initialization. But it's not a valid combination (at most
+		 * one of those flags should be set), so we reset the second flag.
+		 */
+		first_row = (bval->bv_hasnulls && bval->bv_allnulls);
+
+		if (bval->bv_hasnulls && bval->bv_allnulls)
+		{
+			bval->bv_hasnulls = false;
+			modified = true;
+		}
+
 		if (bdesc->bd_info[keyno]->oi_regular_nulls && nulls[keyno])
 		{
 			/*
 			 * If the new value is null, we record that we saw it if it's the
 			 * first one; otherwise, there's nothing to do.
+			 *
+			 * We used to check "bv_hasnulls" which might result in having both
+			 * flags set. That used to be OK, because we just ignore hasnulls
+			 * flag in brin_form_tuple when bv_allnulls=true.
+			 *
+			 * But now we interpret this combination as "first row" so it would
+			 * confuse following calls. So make sure to only set one of these
+			 * flags - when allnulls=true we're done, as it already marks the
+			 * range as containing values.
 			 */
-			if (!bval->bv_hasnulls)
+			if (!bval->bv_allnulls)
 			{
 				bval->bv_hasnulls = true;
 				modified = true;
@@ -1728,6 +1786,12 @@ add_values_to_range(Relation idxRel, BrinDesc *bdesc, BrinMemTuple *dtup,
 			continue;
 		}
 
+		/*
+		 * Does the range already have NULL values? Either of the flags can
+		 * be set, but we ignore the state before adding first row.
+		 */
+		hasnulls = (bval->bv_hasnulls || bval->bv_allnulls) && !first_row;
+
 		addValue = index_getprocinfo(idxRel, keyno + 1,
 									 BRIN_PROCNUM_ADDVALUE);
 		result = FunctionCall4Coll(addValue,
@@ -1738,6 +1802,13 @@ add_values_to_range(Relation idxRel, BrinDesc *bdesc, BrinMemTuple *dtup,
 								   nulls[keyno]);
 		/* if that returned true, we need to insert the updated tuple */
 		modified |= DatumGetBool(result);
+
+		/*
+		 * If we had NULLS, and the opclass didn't set allnulls=true, clear
+		 * hasnulls so that we remember there are NULL values.
+		 */
+		if (hasnulls && !bval->bv_allnulls)
+			bval->bv_hasnulls = true;
 	}
 
 	return modified;
diff --git a/src/backend/access/brin/brin_tuple.c b/src/backend/access/brin/brin_tuple.c
index 84b79dbfc0..5078754f1e 100644
--- a/src/backend/access/brin/brin_tuple.c
+++ b/src/backend/access/brin/brin_tuple.c
@@ -516,8 +516,15 @@ brin_memtuple_initialize(BrinMemTuple *dtuple, BrinDesc *brdesc)
 	for (i = 0; i < brdesc->bd_tupdesc->natts; i++)
 	{
 		dtuple->bt_columns[i].bv_attno = i + 1;
+
+		/*
+		 * Each memtuple starts as if it represents no rows, which is indicated
+		 * by having bot allnulls and hasnulls set to true. We track this for
+		 * all columns, because we don't have a flag for the whole memtuple.
+		 */
 		dtuple->bt_columns[i].bv_allnulls = true;
-		dtuple->bt_columns[i].bv_hasnulls = false;
+		dtuple->bt_columns[i].bv_hasnulls = true;
+
 		dtuple->bt_columns[i].bv_values = (Datum *) currdatum;
 
 		dtuple->bt_columns[i].bv_mem_value = PointerGetDatum(NULL);
@@ -585,6 +592,13 @@ brin_deform_tuple(BrinDesc *brdesc, BrinTuple *tuple, BrinMemTuple *dMemtuple)
 	{
 		int			i;
 
+		/*
+		 * Make sure to overwrite the hasnulls flag, because it was initialized
+		 * to true by brin_memtuple_initialize and we don't want to skip it if
+		 * allnulls=true.
+		 */
+		dtup->bt_columns[keyno].bv_hasnulls = hasnulls[keyno];
+
 		if (allnulls[keyno])
 		{
 			valueno += brdesc->bd_info[keyno]->oi_nstored;
diff --git a/src/test/modules/brin/expected/summarization-and-inprogress-insertion.out b/src/test/modules/brin/expected/summarization-and-inprogress-insertion.out
index 2a4755d099..584ac2602f 100644
--- a/src/test/modules/brin/expected/summarization-and-inprogress-insertion.out
+++ b/src/test/modules/brin/expected/summarization-and-inprogress-insertion.out
@@ -4,7 +4,7 @@ starting permutation: s2check s1b s2b s1i s2summ s1c s2c s2check
 step s2check: SELECT * FROM brin_page_items(get_raw_page('brinidx', 2), 'brinidx'::regclass);
 itemoffset|blknum|attnum|allnulls|hasnulls|placeholder|value   
 ----------+------+------+--------+--------+-----------+--------
-         1|     0|     1|f       |f       |f          |{1 .. 1}
+         1|     0|     1|f       |t       |f          |{1 .. 1}
 (1 row)
 
 step s1b: BEGIN ISOLATION LEVEL REPEATABLE READ;
@@ -26,7 +26,7 @@ step s2c: COMMIT;
 step s2check: SELECT * FROM brin_page_items(get_raw_page('brinidx', 2), 'brinidx'::regclass);
 itemoffset|blknum|attnum|allnulls|hasnulls|placeholder|value      
 ----------+------+------+--------+--------+-----------+-----------
-         1|     0|     1|f       |f       |f          |{1 .. 1}   
+         1|     0|     1|f       |t       |f          |{1 .. 1}   
          2|     1|     1|f       |f       |f          |{1 .. 1000}
 (2 rows)
 
@@ -35,7 +35,7 @@ starting permutation: s2check s1b s1i s2vacuum s1c s2check
 step s2check: SELECT * FROM brin_page_items(get_raw_page('brinidx', 2), 'brinidx'::regclass);
 itemoffset|blknum|attnum|allnulls|hasnulls|placeholder|value   
 ----------+------+------+--------+--------+-----------+--------
-         1|     0|     1|f       |f       |f          |{1 .. 1}
+         1|     0|     1|f       |t       |f          |{1 .. 1}
 (1 row)
 
 step s1b: BEGIN ISOLATION LEVEL REPEATABLE READ;
@@ -45,7 +45,7 @@ step s1c: COMMIT;
 step s2check: SELECT * FROM brin_page_items(get_raw_page('brinidx', 2), 'brinidx'::regclass);
 itemoffset|blknum|attnum|allnulls|hasnulls|placeholder|value      
 ----------+------+------+--------+--------+-----------+-----------
-         1|     0|     1|f       |f       |f          |{1 .. 1}   
+         1|     0|     1|f       |t       |f          |{1 .. 1}   
          2|     1|     1|f       |f       |f          |{1 .. 1000}
 (2 rows)
 
diff --git a/src/test/modules/brin/specs/summarization-and-inprogress-insertion.spec b/src/test/modules/brin/specs/summarization-and-inprogress-insertion.spec
index 19ac18a2e8..18ba92b7ba 100644
--- a/src/test/modules/brin/specs/summarization-and-inprogress-insertion.spec
+++ b/src/test/modules/brin/specs/summarization-and-inprogress-insertion.spec
@@ -9,6 +9,7 @@ setup
     ) WITH (fillfactor=10);
     CREATE INDEX brinidx ON brin_iso USING brin (value) WITH (pages_per_range=1);
     -- this fills the first page
+    INSERT INTO brin_iso VALUES (NULL);
     DO $$
     DECLARE curtid tid;
     BEGIN
-- 
2.39.0

From 16047a0dabca1a0a31cc8d86b274859cd1166438 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas.von...@postgresql.org>
Date: Sun, 8 Jan 2023 22:04:41 +0100
Subject: [PATCH] Fix handling of NULLs in BRIN indexes

BRIN indexes did not properly distinguish between summaries for empty
(no rows) and all-NULL ranges. All summaries were initialized with
allnulls=true, and the opclasses simply reset allnulls to false when
processing the first non-NULL value. This however fails if the range
starts with a NULL value (or a sequence of NULL values), in which case
we forget the range contains NULL values.

This happens because the allnulls flag is used for two separate
purposes - to mark empty ranges (not representing any rows yet) and
ranges containing only NULL values.

Opclasses don't know which of these cases it is, and so don't know
whether to set hasnulls=true. Setting hasnulls=true in both cases would
make it correct, but it would also make BRIN indexes useless for queries
with IS NULL clauses - all ranges start empty (and thus allnulls=true),
so all ranges would end up with either allnulls=true or hasnulls=true.

The severity of the issue is somewhat reduced by the fact that it only
happens when adding values to an existing summary with allnulls=true,
not when the summarization is processing values in bulk (e.g. during
CREATE INDEX or automatic summarization). In this case the flags were
updated in a slightly different way, not forgetting the NULL values.

The best solution would be to introduce a new flag marking index tuples
representing ranges with no rows, but that would break on-disk format
and/or ABI, depending on where we put the flag. Considering we need to
backpatch this, that's not acceptable.

So instead we use an "impossible" combination of both flags (allnulls
and hasnulls) set to true, to mark "empty" ranges with no rows. In
principle "empty" is a feature of the whole index tuple, which may
contain multiple summaries in a multi-column index, but this is where
the flags are, unfortunately.

We could also skip storing index tuples for empty summaries, but then
we'd have to always process such ranges - even if there are no rows in
large parts of the table (e.g. after a bulk DELETE), it would still
require reading the pages etc. So we store them, but ignore them when
building the bitmap.

Backpatch to 11. The issue exists since BRIN indexes were introduced in
9.5, but older releases are already EOL.

Backpatch-through: 11
Reviewed-by: Justin Pryzby, Matthias van de Meent
Discussion: https://postgr.es/m/402430e4-7d9d-6cf1-09ef-464d80aff...@enterprisedb.com
---
 src/backend/access/brin/brin.c                | 46 ++++++++++++++++
 src/backend/access/brin/brin_minmax.c         | 54 +++++++++++++++++--
 src/backend/access/brin/brin_tuple.c          | 16 +++++-
 ...summarization-and-inprogress-insertion.out |  8 +--
 ...ummarization-and-inprogress-insertion.spec |  1 +
 5 files changed, 116 insertions(+), 9 deletions(-)

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 0becfde113..f1dd39e016 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -253,8 +253,19 @@ brininsert(Relation idxRel, Datum *values, bool *nulls,
 			Datum		result;
 			BrinValues *bval;
 			FmgrInfo   *addValue;
+			bool		first_row;
+			bool		hasnulls;
 
 			bval = &dtup->bt_columns[keyno];
+
+			first_row = (bval->bv_hasnulls && bval->bv_allnulls);
+
+			/*
+			 * Does the range already have NULL values? Either of the flags can
+			 * be set, but we ignore the state before adding first row.
+			 */
+			hasnulls = (bval->bv_hasnulls || bval->bv_allnulls) && !first_row;
+
 			addValue = index_getprocinfo(idxRel, keyno + 1,
 										 BRIN_PROCNUM_ADDVALUE);
 			result = FunctionCall4Coll(addValue,
@@ -265,6 +276,13 @@ brininsert(Relation idxRel, Datum *values, bool *nulls,
 									   nulls[keyno]);
 			/* if that returned true, we need to insert the updated tuple */
 			need_insert |= DatumGetBool(result);
+
+			/*
+			 * If we had NULLS, and the opclass didn't set allnulls=true, clear
+			 * hasnulls so that we remember there are NULL values.
+			 */
+			if (hasnulls && !bval->bv_allnulls)
+				bval->bv_hasnulls = true;
 		}
 
 		if (!need_insert)
@@ -508,6 +526,17 @@ bringetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
 									   CurrentMemoryContext);
 					}
 
+					/*
+					 * If the range has both allnulls and hasnulls set, it means
+					 * there are no rows in the range, so we can skip it (we have
+					 * scan keys, and we know there's nothing to match).
+					 */
+					if (bval->bv_allnulls && bval->bv_hasnulls)
+					{
+						addrange = false;
+						break;
+					}
+
 					/*
 					 * Check whether the scan key is consistent with the page
 					 * range values; if so, have the pages in the range added
@@ -645,8 +674,19 @@ brinbuildCallback(Relation index,
 		FmgrInfo   *addValue;
 		BrinValues *col;
 		Form_pg_attribute attr = TupleDescAttr(state->bs_bdesc->bd_tupdesc, i);
+		bool		first_row;
+		bool		hasnulls;
 
 		col = &state->bs_dtuple->bt_columns[i];
+
+		first_row = (col->bv_hasnulls && col->bv_allnulls);
+
+		/*
+		 * Does the range already have NULL values? Either of the flags can
+		 * be set, but we ignore the state before adding first row.
+		 */
+		hasnulls = (col->bv_hasnulls || col->bv_allnulls) && !first_row;
+
 		addValue = index_getprocinfo(index, i + 1,
 									 BRIN_PROCNUM_ADDVALUE);
 
@@ -658,6 +698,12 @@ brinbuildCallback(Relation index,
 						  PointerGetDatum(state->bs_bdesc),
 						  PointerGetDatum(col),
 						  values[i], isnull[i]);
+		/*
+		 * If we had NULLS, and the opclass didn't set allnulls=true, clear
+		 * hasnulls so that we remember there are NULL values.
+		 */
+		if (hasnulls && !col->bv_allnulls)
+			col->bv_hasnulls = true;
 	}
 }
 
diff --git a/src/backend/access/brin/brin_minmax.c b/src/backend/access/brin/brin_minmax.c
index 4b5d6a7213..fe19536c90 100644
--- a/src/backend/access/brin/brin_minmax.c
+++ b/src/backend/access/brin/brin_minmax.c
@@ -75,14 +75,38 @@ brin_minmax_add_value(PG_FUNCTION_ARGS)
 	Form_pg_attribute attr;
 	AttrNumber	attno;
 
+	/*
+	 * Is this the first tuple we're adding to the range range? We track
+	 * that by setting both bv_hasnulls and bval->bv_allnulls to true
+	 * during initialization. But it's not a valid combination (at most
+	 * one of those flags should be set), so we reset the second flag.
+	 *
+	 * XXX The caller is responsible for tracking first_row=true.
+	 */
+	if (column->bv_hasnulls && column->bv_allnulls)
+	{
+		column->bv_hasnulls = false;
+		updated = true;
+	}
+
 	/*
 	 * If the new value is null, we record that we saw it if it's the first
 	 * one; otherwise, there's nothing to do.
 	 */
 	if (isnull)
 	{
-		if (column->bv_hasnulls)
-			PG_RETURN_BOOL(false);
+		/*
+		 * We used to check "bv_hasnulls" which might result in having both
+		 * flags set. That used to be OK, because we just ignore hasnulls
+		 * flag in brin_form_tuple when bv_allnulls=true.
+		 *
+		 * But now we interpret this combination as "first row" so it would
+		 * confuse following calls. So make sure to only set one of these
+		 * flags - when allnulls=true we're done, as it already marks the
+		 * range as containing values.
+		 */
+		if (column->bv_allnulls)
+			PG_RETURN_BOOL(updated);
 
 		column->bv_hasnulls = true;
 		PG_RETURN_BOOL(true);
@@ -250,11 +274,32 @@ brin_minmax_union(PG_FUNCTION_ARGS)
 
 	Assert(col_a->bv_attno == col_b->bv_attno);
 
-	/* Adjust "hasnulls" */
+	/*
+	 * If B is empty (represents no rows), ignore it and just keep
+	 * A as is (might be empty etc.).
+	 */
+	if (col_b->bv_allnulls && col_b->bv_hasnulls)
+		PG_RETURN_VOID();
+
+	/*
+	 * Adjust "hasnulls".
+	 *
+	 * It may happen A has allnulls=true, and we should reset it. We
+	 * need to copy the values from B first, which happens later.
+	 * We know the next condition can't trigger, because B is not
+	 * empty so only one of the flags is true.
+	 */
 	if (!col_a->bv_hasnulls && col_b->bv_hasnulls)
 		col_a->bv_hasnulls = true;
 
-	/* If there are no values in B, there's nothing left to do */
+	/*
+	 * If there are no values in B, there's nothing left to do.
+	 *
+	 * Note this is mutually exclusive with the preceding condition.
+	 * We have skipped "empty" B, so hasnulls and allnulls can't be
+	 * both true. So if we adjusted hasnulls for A, there have to be
+	 * values for B (i.e. we're not terminating here).
+	 */
 	if (col_b->bv_allnulls)
 		PG_RETURN_VOID();
 
@@ -269,6 +314,7 @@ brin_minmax_union(PG_FUNCTION_ARGS)
 	 */
 	if (col_a->bv_allnulls)
 	{
+		/* This also applies if we adjusted hasnulls=true earlier. */
 		col_a->bv_allnulls = false;
 		col_a->bv_values[0] = datumCopy(col_b->bv_values[0],
 										attr->attbyval, attr->attlen);
diff --git a/src/backend/access/brin/brin_tuple.c b/src/backend/access/brin/brin_tuple.c
index b3b453aed1..fa0bfd8699 100644
--- a/src/backend/access/brin/brin_tuple.c
+++ b/src/backend/access/brin/brin_tuple.c
@@ -493,8 +493,15 @@ brin_memtuple_initialize(BrinMemTuple *dtuple, BrinDesc *brdesc)
 	for (i = 0; i < brdesc->bd_tupdesc->natts; i++)
 	{
 		dtuple->bt_columns[i].bv_attno = i + 1;
+
+		/*
+		 * Each memtuple starts as if it represents no rows, which is indicated
+		 * by having bot allnulls and hasnulls set to true. We track this for
+		 * all columns, because we don't have a flag for the whole memtuple.
+		 */
 		dtuple->bt_columns[i].bv_allnulls = true;
-		dtuple->bt_columns[i].bv_hasnulls = false;
+		dtuple->bt_columns[i].bv_hasnulls = true;
+
 		dtuple->bt_columns[i].bv_values = (Datum *) currdatum;
 		currdatum += sizeof(Datum) * brdesc->bd_info[i]->oi_nstored;
 	}
@@ -557,6 +564,13 @@ brin_deform_tuple(BrinDesc *brdesc, BrinTuple *tuple, BrinMemTuple *dMemtuple)
 	{
 		int			i;
 
+		/*
+		 * Make sure to overwrite the hasnulls flag, because it was initialized
+		 * to true by brin_memtuple_initialize and we don't want to skip it if
+		 * allnulls=true.
+		 */
+		dtup->bt_columns[keyno].bv_hasnulls = hasnulls[keyno];
+
 		if (allnulls[keyno])
 		{
 			valueno += brdesc->bd_info[keyno]->oi_nstored;
diff --git a/src/test/modules/brin/expected/summarization-and-inprogress-insertion.out b/src/test/modules/brin/expected/summarization-and-inprogress-insertion.out
index 2a4755d099..584ac2602f 100644
--- a/src/test/modules/brin/expected/summarization-and-inprogress-insertion.out
+++ b/src/test/modules/brin/expected/summarization-and-inprogress-insertion.out
@@ -4,7 +4,7 @@ starting permutation: s2check s1b s2b s1i s2summ s1c s2c s2check
 step s2check: SELECT * FROM brin_page_items(get_raw_page('brinidx', 2), 'brinidx'::regclass);
 itemoffset|blknum|attnum|allnulls|hasnulls|placeholder|value   
 ----------+------+------+--------+--------+-----------+--------
-         1|     0|     1|f       |f       |f          |{1 .. 1}
+         1|     0|     1|f       |t       |f          |{1 .. 1}
 (1 row)
 
 step s1b: BEGIN ISOLATION LEVEL REPEATABLE READ;
@@ -26,7 +26,7 @@ step s2c: COMMIT;
 step s2check: SELECT * FROM brin_page_items(get_raw_page('brinidx', 2), 'brinidx'::regclass);
 itemoffset|blknum|attnum|allnulls|hasnulls|placeholder|value      
 ----------+------+------+--------+--------+-----------+-----------
-         1|     0|     1|f       |f       |f          |{1 .. 1}   
+         1|     0|     1|f       |t       |f          |{1 .. 1}   
          2|     1|     1|f       |f       |f          |{1 .. 1000}
 (2 rows)
 
@@ -35,7 +35,7 @@ starting permutation: s2check s1b s1i s2vacuum s1c s2check
 step s2check: SELECT * FROM brin_page_items(get_raw_page('brinidx', 2), 'brinidx'::regclass);
 itemoffset|blknum|attnum|allnulls|hasnulls|placeholder|value   
 ----------+------+------+--------+--------+-----------+--------
-         1|     0|     1|f       |f       |f          |{1 .. 1}
+         1|     0|     1|f       |t       |f          |{1 .. 1}
 (1 row)
 
 step s1b: BEGIN ISOLATION LEVEL REPEATABLE READ;
@@ -45,7 +45,7 @@ step s1c: COMMIT;
 step s2check: SELECT * FROM brin_page_items(get_raw_page('brinidx', 2), 'brinidx'::regclass);
 itemoffset|blknum|attnum|allnulls|hasnulls|placeholder|value      
 ----------+------+------+--------+--------+-----------+-----------
-         1|     0|     1|f       |f       |f          |{1 .. 1}   
+         1|     0|     1|f       |t       |f          |{1 .. 1}   
          2|     1|     1|f       |f       |f          |{1 .. 1000}
 (2 rows)
 
diff --git a/src/test/modules/brin/specs/summarization-and-inprogress-insertion.spec b/src/test/modules/brin/specs/summarization-and-inprogress-insertion.spec
index 19ac18a2e8..18ba92b7ba 100644
--- a/src/test/modules/brin/specs/summarization-and-inprogress-insertion.spec
+++ b/src/test/modules/brin/specs/summarization-and-inprogress-insertion.spec
@@ -9,6 +9,7 @@ setup
     ) WITH (fillfactor=10);
     CREATE INDEX brinidx ON brin_iso USING brin (value) WITH (pages_per_range=1);
     -- this fills the first page
+    INSERT INTO brin_iso VALUES (NULL);
     DO $$
     DECLARE curtid tid;
     BEGIN
-- 
2.39.0

Attachment: stress-test.sh
Description: application/shellscript

Reply via email to