> If we could increase the sampling ratio beyond the hard coded 300x to get a
> more representative sample and use that to estimate ndistinct (and also the
> frequency of the most common values) but only actually stored the 100 MCVs
> (or whatever the stats target is set to for the system or column) then the
> issue may be mitigated without increasing planning time because of stats that
> are larger than prudent, and the "only" cost should be longer processing time
> when (auto) analyzing... plus overhead for considering this potential new
> setting in all analyze cases I suppose.
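For reference, the knob in question is the per-column statistics target: it
is set with ALTER TABLE ... SET STATISTICS (or globally via
default_statistics_target), and ANALYZE then samples roughly 300 * target
rows and stores at most that many MCVs. A minimal sketch, with made-up table
and column names:

  -- raise the per-column target, then re-analyze;
  -- the sample is on the order of 300 * 500 = 150,000 rows,
  -- and at most 500 most-common values are kept for this column
  ALTER TABLE bridge ALTER COLUMN b SET STATISTICS 500;
  ANALYZE bridge;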
I found another large deviation in one of my bridge tables. It is an
(int, int) table of 900M rows whose B column contains 2.7M distinct values,
yet pg_stats claims there are only about 10,400. These numbers are with a
statistics target of 500. I'm not sure it really matters for the planner
with the queries I run, but it makes me a little nervous :)

Also, is it just my data samples, or is n_distinct underestimated, and by a
larger factor, far more often than it is overestimated?

K
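PS: this is roughly how I compared the two numbers, plus the per-column
n_distinct override (available on newer servers) that could serve as a
stopgap; table and column names are made up:

  -- what the planner statistics think the column contains
  SELECT n_distinct FROM pg_stats
   WHERE tablename = 'bridge' AND attname = 'b';

  -- what is actually there (slow on 900M rows, but exact)
  SELECT count(DISTINCT b) FROM bridge;

  -- possible stopgap: pin the estimate by hand and re-analyze
  ALTER TABLE bridge ALTER COLUMN b SET (n_distinct = 2700000);
  ANALYZE bridge;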