I wrote: > Well, we don't have a most common element in this scenario --- the > whole point is that the occurrence counts resulting from the lossy > counting algorithm are too low to be trustworthy. However, what we > do have is the cutoff frequency, and it seems to me that we could use > that as the estimate of the maximum frequency of the non-MCEs.
Here's a less crude patch for that. The array_typanalyze.c changes are the same as before, but I reconsidered what to do about this stumbling block: > Assuming we like this direction, the main thing that makes this a hack > not a polished patch is that I had to strongarm the code into storing > a zero-length values array. What update_attstats would normally do > is leave the values column of the MCELEM stats slot NULL, which then > causes get_attstatsslot to throw a that-shouldn't-be-null error. > An alternative answer is to change get_attstatsslot to allow a null, > but I'm not sure that that's any cleaner. Either way it seems like > there's a hazard of breaking some code that isn't expecting the case. I concluded that letting get_attstatsslot accept nulls is too risky; there is probably code that assumes that get_attstatsslot would've rejected that. Instead, this version changes update_attstats' rule for when to store an array rather than a null. Now it will do so if the passed-in stavalues pointer isn't null, even if numvalues is zero. I think that this doesn't change the outcome for any existing analyze routines because they wouldn't pass a data pointer if they have no values; and even if it does, storing an empty array not a null seems like it should be pretty harmless. > An alternative that feels cleaner but a good bit more invasive > is to get rid of the convention of storing the min/max/nulls > frequencies as extra entries in the MCELEM numbers entry --- > which surely is a hack too --- and put them into some new slot > type. I'm not sure if that's enough nicer to be worth the > conversion pain. A year ago I would have seriously considered doing it that way. But now that we have code to dump-n-restore stats, that code would have to be adjusted to convert the old representation. It's not worth it for this case. Hence, v1 attached, now with a commit message. regards, tom lane
From ac1f22903594ce3613d615fe5fba09c77f216769 Mon Sep 17 00:00:00 2001 From: Tom Lane <t...@sss.pgh.pa.us> Date: Fri, 18 Jul 2025 17:43:21 -0400 Subject: [PATCH v1] Track the maximum possible frequency of non-MCE array elements. The lossy-counting algorithm that ANALYZE uses to identify most-common array elements may decide that none of the values are common enough to be called MCEs. In such cases we stored nothing in the STATISTIC_KIND_MCELEM pg_statistic slot, which resulted in array_selfuncs.c falling back to default estimates. However, we do in fact have valuable information to convey: we know that none of the array elements are as common as the cutoff frequency that lossy counting reported. If we report that as the minimum MCE frequency, array_selfuncs.c will arrive at significantly-better-than-default estimates. So we want to construct a STATISTIC_KIND_MCELEM entry that contains no "values" but does have "numbers", to wit the three extra numbers that the MCELEM entry type defines. A small obstacle is that update_attstats() has traditionally stored a null, not an empty array, when passed zero "values" for a slot. That gives rise to an MCELEM entry that get_attstatsslot() will spit up on. The least risky solution seems to be to adjust update_attstats() so that it will emit a non-null (but possibly empty) array when the passed stavalues array pointer isn't NULL, rather than conditioning that on numvalues > 0. In other existing cases I don't believe that that changes anything. For consistency, handle the stanumbers array the same way. Reported-by: Mark Frost <frost...@uk.ibm.com> Author: Tom Lane <t...@sss.pgh.pa.us> Discussion: https://postgr.es/m/ph3ppf1c905d6e6f24a5c1a1a1d8345b593e1...@ph3ppf1c905d6e6.namprd15.prod.outlook.com --- src/backend/commands/analyze.c | 7 +++---- src/backend/utils/adt/array_typanalyze.c | 15 ++++++++++++++- 2 files changed, 17 insertions(+), 5 deletions(-) diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c index 7111d5d5334..a47e23eaa8c 100644 --- a/src/backend/commands/analyze.c +++ b/src/backend/commands/analyze.c @@ -1712,10 +1712,9 @@ update_attstats(Oid relid, bool inh, int natts, VacAttrStats **vacattrstats) i = Anum_pg_statistic_stanumbers1 - 1; for (k = 0; k < STATISTIC_NUM_SLOTS; k++) { - int nnum = stats->numnumbers[k]; - - if (nnum > 0) + if (stats->stanumbers[k] != NULL) { + int nnum = stats->numnumbers[k]; Datum *numdatums = (Datum *) palloc(nnum * sizeof(Datum)); ArrayType *arry; @@ -1733,7 +1732,7 @@ update_attstats(Oid relid, bool inh, int natts, VacAttrStats **vacattrstats) i = Anum_pg_statistic_stavalues1 - 1; for (k = 0; k < STATISTIC_NUM_SLOTS; k++) { - if (stats->numvalues[k] > 0) + if (stats->stavalues[k] != NULL) { ArrayType *arry; diff --git a/src/backend/utils/adt/array_typanalyze.c b/src/backend/utils/adt/array_typanalyze.c index 6f61629b977..346c190eaf3 100644 --- a/src/backend/utils/adt/array_typanalyze.c +++ b/src/backend/utils/adt/array_typanalyze.c @@ -497,6 +497,15 @@ compute_array_stats(VacAttrStats *stats, AnalyzeAttrFetchFunc fetchfunc, * If we obtained more elements than we really want, get rid of those * with least frequencies. The easiest way is to qsort the array into * descending frequency order and truncate the array. + * + * On the other extreme, we might have found no elements with + * frequency above the cutoff. Then there's nothing to store so far + * as element values go, but we still want to record the cutoff + * frequency, since array_selfuncs.c can use that as an upper bound on + * the frequency of non-MCEs. (Note: what we store is the minimum + * frequency that would have been accepted as a valid MCE. In + * particular, this ensures we don't create a bogus stats entry with + * frequency zero.) */ if (num_mcelem < track_len) { @@ -506,10 +515,14 @@ compute_array_stats(VacAttrStats *stats, AnalyzeAttrFetchFunc fetchfunc, minfreq = sort_table[num_mcelem - 1]->frequency; } else + { num_mcelem = track_len; + if (track_len == 0) + minfreq = maxfreq = cutoff_freq + 1; + } /* Generate MCELEM slot entry */ - if (num_mcelem > 0) + if (num_mcelem >= 0) { MemoryContext old_context; Datum *mcelem_values; -- 2.43.5