I finally got around to testing your patch on a realistic data set. In short, the patch worked beautifully even with the division by 2 removed. In case it's helpful, the full write-up of my investigation can be found at https://gist.github.com/mattlong/0617bec6e1cf5bc6b70c6c2951901df7
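
To put rough numbers on it before getting into the details, here is a tiny standalone sketch of the arithmetic involved. This is not the actual PostgreSQL code: the Min() clamp and the DEFAULT_TS_SEL name come from the snippet quoted below, the 0.005 cap is what I believe the current sources use for the default match selectivity, and the minfreq value is just an assumption picked so that the no-division case reproduces the 28-row estimate discussed next.

    /*
     * Back-of-envelope sketch (not the real PostgreSQL code) of how the
     * non-MCE fallback turns into a row estimate, with and without the
     * division by 2.  The inputs are assumptions chosen to match the
     * ~2.1M-row test described below.
     */
    #include <stdio.h>

    #define DEFAULT_TS_SEL 0.005    /* assumed default match selectivity cap */

    static double
    non_mce_selec(double minfreq, int divide_by_two)
    {
        double cap = divide_by_two ? minfreq / 2 : minfreq;

        /* equivalent of Min(DEFAULT_TS_SEL, cap) */
        return (cap < DEFAULT_TS_SEL) ? cap : DEFAULT_TS_SEL;
    }

    int
    main(void)
    {
        double ntuples = 2.1e6;             /* rows in the test table */
        double minfreq = 28.0 / ntuples;    /* assumed lowest MCE frequency */

        printf("no division : %.0f rows\n", non_mce_selec(minfreq, 0) * ntuples);
        printf("divide by 2 : %.0f rows\n", non_mce_selec(minfreq, 1) * ntuples);
        return 0;
    }

With those assumed inputs it prints 28 rows for the no-division case and 14 rows for the divide-by-two case, which is the scale of difference we are talking about.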
Your reasoning about the division by 2 being a way to approximate the "average" non-MCE value seems logical. The patch as tested above, without a division, yielded a row estimate of 28 for a non-MCE element on a table with ~2.1 million rows where all non-MCE elements occur exactly once. For our particular data distribution, the most conservative option, i.e. no division, was good enough to produce optimal query plans, but this is of course a very specific distribution.

Thinking more generally, there are two cases: when the MCE list is full and when it is NOT full.

When the MCE list is NOT full (the case this patch addresses), the data is by definition heavily skewed, so minfreq will already be quite low. Whether or not it gets divided by 2 (or some other number) probably won't make much of a difference either way.

When the MCE list is full, the data is less skewed and there will generally be a longer tail of element frequencies. Dividing by 2 will be more impactful here as an attempt to approximate the "average" non-MCE element frequency.

Putting that together, it seems like keeping the division by 2 would be preferable?

On Tue, Sep 9, 2025 at 12:19 PM Tom Lane <t...@sss.pgh.pa.us> wrote:
>
> I wrote:
> > * The selectivity functions believe that the upper bound on the
> > frequency of non-MCEs is minfreq / 2, not the stored minfreq.
> > This seems like complete brain fade: there could easily be
> > elements with frequency just less than minfreq, and probably are
> > if the data distribution follows Zipf's law. I did not dig into
> > the git history, but I wonder if the divide-by-two business
> > predates the introduction of the lossy-counting algorithm, and
> > if so whether it was less insane with the original collection
> > algorithm. In any case, this patch removes the divisions by 2,
> > and makes some nearby cosmetic improvements.
>
> In the light of morning, I started to have second thoughts about
> this aspect of the patch. I checked the git history this time,
> and found that the oldest instance of "minfreq / 2" dates to
> 4e57668da which originated from this discussion:
>
> https://www.postgresql.org/message-id/flat/488DAEB8.3000402%40students.mimuw.edu.pl
>
> It's already coded like that in Jan's initial patch, and there
> was no discussion in the thread, so not a lot to be gleaned:
>
> +     /*
> +      * The element is not in MCELEM. Punt, but assure that the
> +      * selectivity cannot be more than minfreq / 2.
> +      */
> +     return (Selectivity) Min(DEFAULT_TS_SEL, minfreq / 2);
>
> But looking at this, I think the problem is really that the comment
> fails to describe his thought process. We know that the frequency
> of this not-in-the-MCE-list value cannot be more than minfreq, but
> we have no idea how much less it is. I think the idea of the
> division might have been to assume that an "average" non-MCE value
> would have a frequency about half that of the lowest MCE. It does
> seem reasonable to use some number lower than the cutoff, although
> I dunno if 0.5 is exactly the right factor.
>
> So now I'm wondering if we should leave the divisions alone and
> instead fix the comments to explain why they are there. Bolstering
> this is that on the two test cases you guys submitted, the patch
> with the divisions gets a spot-on estimate (1 row) while removing
> the divisions yields an estimate of 2 rows. I don't put a lot of
> weight on that, since these are toy examples.
> But I wonder if you
> guys could test the patch on some of your real-world cases and
> see if it looks like we should keep the divisions.
>
>                         regards, tom lane