On Mon, Sep 29, 2014 at 2:54 PM, Josh Berkus <j...@agliodbs.com> wrote:
> On 09/26/2014 01:06 AM, Simon Riggs wrote:
> > On 23 September 2014 00:56, Josh Berkus <j...@agliodbs.com> wrote:
> >
> >> We've hashed that out a bit, but frankly I think it's much more
> >> profitable to pursue fixing the actual problem than providing a
> >> workaround like "risk", such as:
> >>
> >> a) fixing n_distinct estimation
> >> b) estimating stacked quals using better math (i.e. not assuming total
> >>    randomness)
> >> c) developing some kind of correlation stats
> >>
> >> Otherwise we would be just providing users with another knob there's no
> >> rational way to set.
> >
> > I believe this is a serious issue for PostgreSQL users and one that
> > needs to be addressed.
> >
> > n_distinct can be fixed manually, so that is less of an issue.
>
> It's an issue for the 99.8% of our users who don't know what n_distinct
> is, let alone how to calculate it. Also, changing it requires an
> exclusive lock on the table. Of course, you and I have been over this
> issue before.

If 99.6% of our users don't have a problem with n_distinct in their
system, then (combined with your 99.8% figure) that would mean that only
50% of the people who do have the problem don't know how to solve it.
And those people can usually get excellent free help on the internet.

But if the problem is not with n_distinct, but rather with
most_common_freqs (which I encounter more often than problems with
n_distinct), all I can do is shrug and say "yeah, I know about that
problem. Either crank the statistics target up as high as it will go, or
it sucks to be you."

> One thing I'm wondering is why our estimator creates n_distinct as a
> % so seldom. Really, any time n_distinct is over 10K we should be
> estimating a % instead. Now, estimating that % has its own issues, but
> it does seem like a peculiar quirk of our stats model.
>
> Anyway, in the particular case I posted, fixing n_distinct to realistic
> numbers (%) fixed the query plan.

But wouldn't fixing the absolute number also have fixed the plan? If you
are going to set a number manually and then nail it in place so that
ANALYZE stops changing it, then I can certainly see how the fractional
method is desirable. But if the goal is not to do that, but rather to
have the correct value estimated in the first place, I don't really see
much benefit from converting the estimate into a fraction and then back
again.

> > The problem, as I see it, is different. We assume that if there are
> > 100 distinct values and you use LIMIT 1 that you would only need to
> > scan 1% of rows. We assume that the data is arranged in the table in a
> > very homogeneous layout. When data is not, and it seldom is, we get
> > problems.
> >
> > Simply put, assuming that LIMIT will reduce the size of all scans is
> > just way wrong. I've seen many plans where increasing the LIMIT
> > dramatically improves the plan.
> >
> > If we can at least agree it is a problem, we can try to move forwards.

I don't think anyone doubts there is a problem (or rather, many more
than one of them); there is just disagreement about the priority and
what can be done about it.

> That is certainly another problem. Does the correlation stat figure in
> the LIMIT calculation at all, currently? That's what the correlation
> stat is for, no?

I don't think correlation is up to the task as a complete solution,
although it might help a little. There is no way a simple correlation
can encode the fact that John retired 15 years ago and hasn't logged on
since, while Johannes was hired yesterday and never logged on before
then.

Cheers,

Jeff
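
P.S. For anyone following along who hasn't touched these knobs, here is a
minimal sketch of the manual overrides discussed above. The table and
column names (app_event, user_id, event_time) are made up for
illustration; only the ALTER TABLE / ANALYZE forms are real syntax.

    -- Pin n_distinct by hand.  A positive value is an absolute count of
    -- distinct values; a negative value between -1 and 0 is a fraction
    -- of the row count (the "%" form Josh mentions).  The override only
    -- takes effect at the next ANALYZE.
    ALTER TABLE app_event ALTER COLUMN user_id SET (n_distinct = -0.05);
    ANALYZE app_event;

    -- "Crank the statistics target up as high as it will go" for one
    -- column (10000 is the maximum), which enlarges most_common_vals,
    -- most_common_freqs and the histogram gathered by ANALYZE.
    ALTER TABLE app_event ALTER COLUMN user_id SET STATISTICS 10000;
    ANALYZE app_event;

And this is the shape of query where the LIMIT assumption Simon describes
bites (assuming an index on event_time exists): the planner assumes rows
for user_id = 42 are spread evenly through the table, so it expects an
ordered index scan on event_time to stop almost immediately. If that
user's rows all sit at the far end of the index, like John who last
logged on 15 years ago, the scan reads most of the table before it
returns anything.

    SELECT *
    FROM   app_event
    WHERE  user_id = 42
    ORDER  BY event_time DESC
    LIMIT  1;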