Re: [BUG?] estimate_hash_bucket_stats uses wrong ndistinct for avgfreq

Tom Lane Fri, 27 Feb 2026 11:15:46 -0800

Tender Wang <[email protected]> writes:
> I added Tom to the cc list. He may know more about this.


Hmm, git blame says I originated this function 25 years ago
(f905d65ee), but I don't claim to remember that.

Looking at it now, though, I think that bd3e3e9e5 is indeed
wrong but not in the way Joel suggests.  The longstanding
way to compute mcv_freq is

            /*
             * The first MCV stat is for the most common value.
             */
            if (sslot.nnumbers > 0)
                *mcv_freq = sslot.numbers[0];

*This number is a fraction measured on the raw relation.*
(Necessarily so, because it's just a number computed by ANALYZE.)
Then bd3e3e9e5 added

            /*
             * If there are no recorded MCVs, but we do have a histogram, then
             * assume that ANALYZE determined that the column is unique.
             */
            if (vardata.rel && vardata.rel->rows > 0)
                *mcv_freq = 1.0 / vardata.rel->rows;

This is a pure thinko.  rel->rows is the estimated number
of filtered rows.  What I should have used is rel->tuples,
which is the estimated raw relation size, so as to get a
number that is commensurate with the longstanding way
of calculating mcv_freq.  Then that also matches up with
computing avgfreq on the raw relation.

So I think the correct fix is basically

-            if (vardata.rel && vardata.rel->rows > 0)
-                *mcv_freq = 1.0 / vardata.rel->rows;
+            if (vardata.rel && vardata.rel->tuples > 0)
+                *mcv_freq = 1.0 / vardata.rel->tuples;

and I wonder if that will wind up in reverting a lot of the plan
choice changes seen in bd3e3e9e5.

Joel, do you want to run this to ground, and in particular
see if that way of fixing it passes your sanity tests?

                        regards, tom lane

Re: [BUG?] estimate_hash_bucket_stats uses wrong ndistinct for avgfreq

Reply via email to