Not to let perfect be the enemy of better, but we're facing a variant of
this issue that would not be addressed by the proposed patch. If you're
interested, the real-world use case is described in
https://github.com/medplum/medplum/issues/7310. Each row has an array
containing one common element, present in every row, and one unique
element. With the statistics target at its default of 100, the common
element pushes the threshold at which row counts for the unique elements
become vastly overestimated down to just 7,937 rows.

In this case, the proposed patch has no effect, since the
most_common_elems array is not empty. I'm not a statistician, so maybe
this wouldn't be valid, but using the highest frequency among the elements
that did not qualify for the MCE list, instead of the 0.5% default
frequency, seems like it could be an elegant, though more invasive,
solution.
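
To put rough numbers on that: in the repro below, each unique tag occurs
exactly once in 7,937 rows, so its true frequency is 1/7937 ≈ 0.000126.
Any frequency retained from the elements that just missed the MCE cutoff
should be of the same order, so an estimate based on it would land near
the actual 1 row, instead of the roughly 0.005 * 7937 ≈ 40 rows the 0.5%
default yields.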

Simple repro of the problem:

CREATE TABLE public.test(
    id integer,
    tags text[]
);

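-- Each row gets the common tag plus one unique tag; 7,937 is the first
-- row count at which the unique tags drop out of the MCE list.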
INSERT INTO public.test (id, tags)
    SELECT n, ARRAY['common', TO_CHAR(n, 'fm00000000')]
    FROM generate_series(1, 7937) AS n;

ANALYZE public.test;

SELECT most_common_elems, most_common_elem_freqs
FROM pg_stats
WHERE schemaname = 'public' AND tablename = 'test' AND attname = 'tags';

EXPLAIN (ANALYZE,BUFFERS,VERBOSE)
        SELECT * FROM test WHERE tags && ARRAY['common'];

EXPLAIN (ANALYZE,BUFFERS,VERBOSE)
        SELECT * FROM test WHERE tags && ARRAY['00000001'];

With 7,936 rows:
    most_common_elems has 1,000 entries and looks like
        {00000001,00000003,00000009,...,common}
    most_common_elem_freqs has 1,003 entries and looks like
        {0.00012600806,0.00012600806,0.00012600806,...,1,0.00012600806,1,0}

With 7,937+ rows:
    most_common_elems is {common}
    most_common_elem_freqs is {1,1,1,0}
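
If I have the arithmetic right, once the unique tags fall out of the MCE
list the planner falls back to the 0.5% default, so the second EXPLAIN
estimates roughly 0.005 * 7937 ≈ 40 rows for tags && ARRAY['00000001']
where only 1 row actually matches; at 7,936 rows the stored frequency of
~0.000126 keeps that estimate at ~1 row.

Raising the per-column statistics target does appear to push the
threshold out (at the cost of a larger MCE list), so as a stopgap
something along these lines seems to work:

ALTER TABLE public.test ALTER COLUMN tags SET STATISTICS 1000;
ANALYZE public.test;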

On Mon, Sep 8, 2025 at 11:27 AM Tom Lane <t...@sss.pgh.pa.us> wrote:

> I wrote:
> > Well, we don't have a most common element in this scenario --- the
> > whole point is that the occurrence counts resulting from the lossy
> > counting algorithm are too low to be trustworthy.  However, what we
> > do have is the cutoff frequency, and it seems to me that we could use
> > that as the estimate of the maximum frequency of the non-MCEs.
>
> Here's a less crude patch for that.  The array_typanalyze.c changes
> are the same as before, but I reconsidered what to do about this
> stumbling block:
>
> > Assuming we like this direction, the main thing that makes this a hack
> > not a polished patch is that I had to strongarm the code into storing
> > a zero-length values array.  What update_attstats would normally do
> > is leave the values column of the MCELEM stats slot NULL, which then
> > causes get_attstatsslot to throw a that-shouldn't-be-null error.
> > An alternative answer is to change get_attstatsslot to allow a null,
> > but I'm not sure that that's any cleaner.  Either way it seems like
> > there's a hazard of breaking some code that isn't expecting the case.
>
> I concluded that letting get_attstatsslot accept nulls is too risky;
> there is probably code that assumes that get_attstatsslot would've
> rejected that.  Instead, this version changes update_attstats' rule
> for when to store an array rather than a null.  Now it will do so
> if the passed-in stavalues pointer isn't null, even if numvalues
> is zero.  I think that this doesn't change the outcome for any
> existing analyze routines because they wouldn't pass a data pointer
> if they have no values; and even if it does, storing an empty array
> not a null seems like it should be pretty harmless.
>
> > An alternative that feels cleaner but a good bit more invasive
> > is to get rid of the convention of storing the min/max/nulls
> > frequencies as extra entries in the MCELEM numbers entry ---
> > which surely is a hack too --- and put them into some new slot
> > type.  I'm not sure if that's enough nicer to be worth the
> > conversion pain.
>
> A year ago I would have seriously considered doing it that way.
> But now that we have code to dump-n-restore stats, that code would
> have to be adjusted to convert the old representation.  It's not
> worth it for this case.
>
> Hence, v1 attached, now with a commit message.
>
>                         regards, tom lane