> If we could increase the sampling ratio beyond the hard-coded 300x to get a 
> more representative sample, use that to estimate ndistinct (and also the 
> frequencies of the most common values), but still store only the 100 MCVs 
> (or whatever the stats target is set to for the system or column), then the 
> issue might be mitigated without increasing planning time due to stats that 
> are larger than prudent. The "only" cost should be longer processing time 
> when (auto)analyzing... plus the overhead of considering this potential new 
> setting in all analyze cases, I suppose.
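
For context, the only knob I know of today couples those two things: raising
the per-column statistics target makes ANALYZE sample more rows (roughly
300 * target), but it also stores that many MCVs/histogram buckets. A minimal
sketch, with placeholder table/column names:

  -- Samples ~300 * 1000 rows on the next ANALYZE, but also keeps up to
  -- 1000 MCVs and histogram buckets (names here are hypothetical).
  ALTER TABLE some_table ALTER COLUMN some_col SET STATISTICS 1000;
  ANALYZE some_table;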

I found another large deviation in one of my bridge tables. It is an (int, int) 
table of 900M rows whose B column contains 2.7M distinct values, yet pg_stats 
claims only 10,400. These numbers are with a statistics target of 500. I'm not 
sure it really matters to the planner for the queries I run, but it makes me a 
little nervous :)
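
For what it's worth, a check along these lines (placeholder names, not my real
schema) shows the deviation, and pinning n_distinct by hand seems to be the
available stopgap until sampling can be decoupled from the stats target:

  -- What ANALYZE stored vs. the real distinct count.
  SELECT n_distinct FROM pg_stats
   WHERE tablename = 'bridge' AND attname = 'b';   -- reports ~10400

  SELECT count(DISTINCT b) FROM bridge;            -- ~2.7M, slow on 900M rows

  -- Possible workaround: override the estimate so ANALYZE stops guessing.
  ALTER TABLE bridge ALTER COLUMN b SET (n_distinct = 2700000);
  ANALYZE bridge;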

Also, is it just my data samples, or is n_distinct underestimated far more 
often, and by a larger ratio, than it is overestimated?

K
