I wrote:
> Well, we don't have a most common element in this scenario --- the
> whole point is that the occurrence counts resulting from the lossy
> counting algorithm are too low to be trustworthy.  However, what we
> do have is the cutoff frequency, and it seems to me that we could use
> that as the estimate of the maximum frequency of the non-MCEs.

Here's a less crude patch for that.  The array_typanalyze.c changes
are the same as before, but I reconsidered what to do about this
stumbling block:

> Assuming we like this direction, the main thing that makes this a hack
> not a polished patch is that I had to strongarm the code into storing
> a zero-length values array.  What update_attstats would normally do
> is leave the values column of the MCELEM stats slot NULL, which then
> causes get_attstatsslot to throw a that-shouldn't-be-null error.
> An alternative answer is to change get_attstatsslot to allow a null,
> but I'm not sure that that's any cleaner.  Either way it seems like
> there's a hazard of breaking some code that isn't expecting the case.

I concluded that letting get_attstatsslot accept nulls is too risky;
there is probably code that assumes that get_attstatsslot would've
rejected that.  Instead, this version changes update_attstats' rule
for when to store an array rather than a null.  Now it will do so
if the passed-in stavalues pointer isn't null, even if numvalues
is zero.  I think that this doesn't change the outcome for any
existing analyze routines because they wouldn't pass a data pointer
if they have no values; and even if it does, storing an empty array
not a null seems like it should be pretty harmless.

> An alternative that feels cleaner but a good bit more invasive
> is to get rid of the convention of storing the min/max/nulls
> frequencies as extra entries in the MCELEM numbers entry ---
> which surely is a hack too --- and put them into some new slot
> type.  I'm not sure if that's enough nicer to be worth the
> conversion pain.

A year ago I would have seriously considered doing it that way.
But now that we have code to dump-n-restore stats, that code would
have to be adjusted to convert the old representation.  It's not
worth it for this case.

Hence, v1 attached, now with a commit message.

                        regards, tom lane

From ac1f22903594ce3613d615fe5fba09c77f216769 Mon Sep 17 00:00:00 2001
From: Tom Lane <t...@sss.pgh.pa.us>
Date: Fri, 18 Jul 2025 17:43:21 -0400
Subject: [PATCH v1] Track the maximum possible frequency of non-MCE array
 elements.

The lossy-counting algorithm that ANALYZE uses to identify
most-common array elements may decide that none of the values
are common enough to be called MCEs.  In such cases we stored
nothing in the STATISTIC_KIND_MCELEM pg_statistic slot, which
resulted in array_selfuncs.c falling back to default estimates.
However, we do in fact have valuable information to convey:
we know that none of the array elements are as common as the
cutoff frequency that lossy counting reported.  If we report
that as the minimum MCE frequency, array_selfuncs.c will
arrive at significantly-better-than-default estimates.

So we want to construct a STATISTIC_KIND_MCELEM entry that contains
no "values" but does have "numbers", to wit the three extra numbers
that the MCELEM entry type defines.  A small obstacle is that
update_attstats() has traditionally stored a null, not an empty array,
when passed zero "values" for a slot.  That gives rise to an MCELEM
entry that get_attstatsslot() will spit up on.  The least risky
solution seems to be to adjust update_attstats() so that it will emit
a non-null (but possibly empty) array when the passed stavalues array
pointer isn't NULL, rather than conditioning that on numvalues > 0.
In other existing cases I don't believe that that changes anything.
For consistency, handle the stanumbers array the same way.

Reported-by: Mark Frost <frost...@uk.ibm.com>
Author: Tom Lane <t...@sss.pgh.pa.us>
Discussion: https://postgr.es/m/ph3ppf1c905d6e6f24a5c1a1a1d8345b593e1...@ph3ppf1c905d6e6.namprd15.prod.outlook.com
---
 src/backend/commands/analyze.c           |  7 +++----
 src/backend/utils/adt/array_typanalyze.c | 15 ++++++++++++++-
 2 files changed, 17 insertions(+), 5 deletions(-)

diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 7111d5d5334..a47e23eaa8c 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -1712,10 +1712,9 @@ update_attstats(Oid relid, bool inh, int natts, VacAttrStats **vacattrstats)
 		i = Anum_pg_statistic_stanumbers1 - 1;
 		for (k = 0; k < STATISTIC_NUM_SLOTS; k++)
 		{
-			int			nnum = stats->numnumbers[k];
-
-			if (nnum > 0)
+			if (stats->stanumbers[k] != NULL)
 			{
+				int			nnum = stats->numnumbers[k];
 				Datum	   *numdatums = (Datum *) palloc(nnum * sizeof(Datum));
 				ArrayType  *arry;
 
@@ -1733,7 +1732,7 @@ update_attstats(Oid relid, bool inh, int natts, VacAttrStats **vacattrstats)
 		i = Anum_pg_statistic_stavalues1 - 1;
 		for (k = 0; k < STATISTIC_NUM_SLOTS; k++)
 		{
-			if (stats->numvalues[k] > 0)
+			if (stats->stavalues[k] != NULL)
 			{
 				ArrayType  *arry;
 
diff --git a/src/backend/utils/adt/array_typanalyze.c b/src/backend/utils/adt/array_typanalyze.c
index 6f61629b977..346c190eaf3 100644
--- a/src/backend/utils/adt/array_typanalyze.c
+++ b/src/backend/utils/adt/array_typanalyze.c
@@ -497,6 +497,15 @@ compute_array_stats(VacAttrStats *stats, AnalyzeAttrFetchFunc fetchfunc,
 		 * If we obtained more elements than we really want, get rid of those
 		 * with least frequencies.  The easiest way is to qsort the array into
 		 * descending frequency order and truncate the array.
+		 *
+		 * On the other extreme, we might have found no elements with
+		 * frequency above the cutoff.  Then there's nothing to store so far
+		 * as element values go, but we still want to record the cutoff
+		 * frequency, since array_selfuncs.c can use that as an upper bound on
+		 * the frequency of non-MCEs.  (Note: what we store is the minimum
+		 * frequency that would have been accepted as a valid MCE.  In
+		 * particular, this ensures we don't create a bogus stats entry with
+		 * frequency zero.)
 		 */
 		if (num_mcelem < track_len)
 		{
@@ -506,10 +515,14 @@ compute_array_stats(VacAttrStats *stats, AnalyzeAttrFetchFunc fetchfunc,
 			minfreq = sort_table[num_mcelem - 1]->frequency;
 		}
 		else
+		{
 			num_mcelem = track_len;
+			if (track_len == 0)
+				minfreq = maxfreq = cutoff_freq + 1;
+		}
 
 		/* Generate MCELEM slot entry */
-		if (num_mcelem > 0)
+		if (num_mcelem >= 0)
 		{
 			MemoryContext old_context;
 			Datum	   *mcelem_values;
-- 
2.43.5

Reply via email to