# Re: [HACKERS] improving GROUP BY estimation

```On Thu, 2016-03-03 at 11:42 -0800, Mark Dilger wrote:
> > On Mar 3, 2016, at 11:27 AM, Alexander Korotkov <a.korot...@postgrespro.ru>
> > wrote:
> >
> > On Thu, Mar 3, 2016 at 10:16 PM, Tomas Vondra
> > So yes, each estimator works great for exactly the opposite cases. But
> > notice that typically, the results of the new formula is much higher than
> > the old one, sometimes by two orders of magnitude (and it shouldn't be
> > difficult to construct examples of much higher differences).
> >
> > The table also includes the 'average' estimator you propose, but it's
> > rather obvious that the result is always much closer to the new value,
> > simply because
> >
> >    (small number) + (huge number)
> >    ------------------------------
> >                   2
> >
> > is always much closer to the huge number. We're usually quite happy
> > when the estimates are within the same order of magnitude, so whether
> > it's K or K/2 makes pretty much no difference.
> >
> > I believe that Mark means geometrical average, i.e. sqrt((small number) *
> > (huge number)).```
```
Ah, OK. Haven't realized you meant geometric mean. With that it looks
like this:

1) independent

10      50     100     500    1000    5000
------------------------------------------------------------------
actual               919    3829    6244    9944   10001   10001
current               10      50     102     516    1018    4996
new                  973    4001    6382    9897    9951    9951
average              491    2025    3205    5206    5484    7473
geom. mean            99     447     807    2260    3183    7051

2) dependent

10      50     100     500    1000    5000
------------------------------------------------------------------
actual                10      50     100     500    1000    5000
current               10      53     105     508    1016    5014
new                  880    4105    6472    9955   10018   10018
average              445    2079    3288    5231    5517    7516
geom. mean            94     466     824    2249    3190    7087

To better illustrate the differences, we may use the "ratio error"
defined as

err(a,b) = MAX(a/b, b/a)

where 'a' is the actual value and 'b' is the estimate. That gives us:

1) independent

10       50      100      500    1000    5000
------------------------------------------------------------------
current       91.90    76.58    61.22    19.27    9.82    2.00
new            1.06     1.04     1.02     1.00    1.01    1.01
average        1.87     1.89     1.95     1.91    1.82    1.34
geom. mean     9.32     8.56     7.74     4.40    3.14    1.42

2) dependent

10       50      100      500    1000    5000
------------------------------------------------------------------
current        1.00     1.06     1.05     1.02    1.02    1.00
new           88.00    82.10    64.72    19.91   10.02    2.00
average       44.50    41.58    32.88    10.46    5.52    1.50
geom. mean     9.38     9.33     8.24     4.50    3.19    1.42

So the geometric mean seems to be working better than plain average. But
I don't think we can simply conclude we should use the geometric mean of
the estimates, as it assumes the risks of over- and under-estimating the
cardinality are the same. Because they aren't.

>
> Yes, that is what I proposed upthread.  I'm not wedded to that, though.
> In general, I am with Tomas on this one, believing that his estimate
> will be much better than the current estimate.  But I believe the *best*
> estimate will be somewhere between his and the current one, and I'm
> fishing for any decent way of coming up with a weighted average that
> is closer to his than to the current, but not simply equal to his proposal.
>
> The reason I want the formula to be closer to Tomas's than to the
> current is that I think that on average, across all tables, across all
> databases, in practice it will be closer to the right estimate than the
> current formula.  That's just my intuition, and so I can't defend it.
> But if my intuition is right, the best formula we can adopt would be one
> that is moderated from his by a little bit, and in the direction of the
> estimate that the current code generates.

I think looking merely at what fraction of data sets (or queries) uses
dependent GROUP BY and WHERE clauses is not sufficient.

The general behavior of the two GROUP BY estimators is that the current
one tends to under-estimate, while the new one tends to over-estimate.
(It's not difficult to construct counter-examples by using more complex
dependencies between the columns etc. but let's ignore that.)

The risk associated with over-estimation is that we may switch from
HashAggregate to GroupAggregate, and generally select paths better
suited for large number of rows. Not great, because the plan may not be
optimal and thus run much slower - but that usually only happens on
small tables, because on large tables we would probably end up using
those same paths anyway.

With under-estimation, the main risks are much graver, as we may end up
using HashAggregate only to get killed by OOM. When we're lucky not to
hit that, my experience this often leads to a cascade of NestedLoop
nodes processing orders of magnitude more tuples than expected. Awful.

So I think the risk is asymmetric, and that's why I like the new
estimator more, because it tends to over-estimate, but in a very
reasonable way.

>
> I could easily lose this debate merely for lack of a principled basis
> for saying how far toward the current estimate the new estimate should
> be adjusted.  The geometric average is one suggestion, but I don't have
> a principled argument for it.
>
> Like I said above, I'm fishing for a decent formula here.

Thanks for the valuable feedback!

>
> Mark Dilger

regards

--