On Thu, 2016-03-03 at 11:42 -0800, Mark Dilger wrote: > > On Mar 3, 2016, at 11:27 AM, Alexander Korotkov <a.korot...@postgrespro.ru> > > wrote: > > > > On Thu, Mar 3, 2016 at 10:16 PM, Tomas Vondra > > <tomas.von...@2ndquadrant.com> wrote: > > So yes, each estimator works great for exactly the opposite cases. But > > notice that typically, the results of the new formula is much higher than > > the old one, sometimes by two orders of magnitude (and it shouldn't be > > difficult to construct examples of much higher differences). > > > > The table also includes the 'average' estimator you propose, but it's > > rather obvious that the result is always much closer to the new value, > > simply because > > > > (small number) + (huge number) > > ------------------------------ > > 2 > > > > is always much closer to the huge number. We're usually quite happy > > when the estimates are within the same order of magnitude, so whether > > it's K or K/2 makes pretty much no difference. > > > > I believe that Mark means geometrical average, i.e. sqrt((small number) * > > (huge number)).

Ah, OK. Haven't realized you meant geometric mean. With that it looks like this: 1) independent 10 50 100 500 1000 5000 ------------------------------------------------------------------ actual 919 3829 6244 9944 10001 10001 current 10 50 102 516 1018 4996 new 973 4001 6382 9897 9951 9951 average 491 2025 3205 5206 5484 7473 geom. mean 99 447 807 2260 3183 7051 2) dependent 10 50 100 500 1000 5000 ------------------------------------------------------------------ actual 10 50 100 500 1000 5000 current 10 53 105 508 1016 5014 new 880 4105 6472 9955 10018 10018 average 445 2079 3288 5231 5517 7516 geom. mean 94 466 824 2249 3190 7087 To better illustrate the differences, we may use the "ratio error" defined as err(a,b) = MAX(a/b, b/a) where 'a' is the actual value and 'b' is the estimate. That gives us: 1) independent 10 50 100 500 1000 5000 ------------------------------------------------------------------ current 91.90 76.58 61.22 19.27 9.82 2.00 new 1.06 1.04 1.02 1.00 1.01 1.01 average 1.87 1.89 1.95 1.91 1.82 1.34 geom. mean 9.32 8.56 7.74 4.40 3.14 1.42 2) dependent 10 50 100 500 1000 5000 ------------------------------------------------------------------ current 1.00 1.06 1.05 1.02 1.02 1.00 new 88.00 82.10 64.72 19.91 10.02 2.00 average 44.50 41.58 32.88 10.46 5.52 1.50 geom. mean 9.38 9.33 8.24 4.50 3.19 1.42 So the geometric mean seems to be working better than plain average. But I don't think we can simply conclude we should use the geometric mean of the estimates, as it assumes the risks of over- and under-estimating the cardinality are the same. Because they aren't. > > Yes, that is what I proposed upthread. I'm not wedded to that, though. > In general, I am with Tomas on this one, believing that his estimate > will be much better than the current estimate. But I believe the *best* > estimate will be somewhere between his and the current one, and I'm > fishing for any decent way of coming up with a weighted average that > is closer to his than to the current, but not simply equal to his proposal. > > The reason I want the formula to be closer to Tomas's than to the > current is that I think that on average, across all tables, across all > databases, in practice it will be closer to the right estimate than the > current formula. That's just my intuition, and so I can't defend it. > But if my intuition is right, the best formula we can adopt would be one > that is moderated from his by a little bit, and in the direction of the > estimate that the current code generates. I think looking merely at what fraction of data sets (or queries) uses dependent GROUP BY and WHERE clauses is not sufficient. The general behavior of the two GROUP BY estimators is that the current one tends to under-estimate, while the new one tends to over-estimate. (It's not difficult to construct counter-examples by using more complex dependencies between the columns etc. but let's ignore that.) The risk associated with over-estimation is that we may switch from HashAggregate to GroupAggregate, and generally select paths better suited for large number of rows. Not great, because the plan may not be optimal and thus run much slower - but that usually only happens on small tables, because on large tables we would probably end up using those same paths anyway. With under-estimation, the main risks are much graver, as we may end up using HashAggregate only to get killed by OOM. When we're lucky not to hit that, my experience this often leads to a cascade of NestedLoop nodes processing orders of magnitude more tuples than expected. Awful. So I think the risk is asymmetric, and that's why I like the new estimator more, because it tends to over-estimate, but in a very reasonable way. > > I could easily lose this debate merely for lack of a principled basis > for saying how far toward the current estimate the new estimate should > be adjusted. The geometric average is one suggestion, but I don't have > a principled argument for it. > > Like I said above, I'm fishing for a decent formula here. Thanks for the valuable feedback! > > Mark Dilger regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers