On 4 March 2016 at 13:10, Tomas Vondra <tomas.von...@2ndquadrant.com> wrote:
> The risk associated with over-estimation is that we may switch from
> HashAggregate to GroupAggregate, and generally select paths better
> suited for large numbers of rows. Not great, because the plan may not
> be optimal and thus run much slower - but that usually only happens on
> small tables, because on large tables we would probably end up using
> those same paths anyway.
>
> With under-estimation, the main risks are much graver, as we may end
> up using HashAggregate only to get killed by OOM. When we're lucky
> enough not to hit that, in my experience this often leads to a cascade
> of NestedLoop nodes processing orders of magnitude more tuples than
> expected. Awful.
>
> So I think the risk is asymmetric, and that's why I like the new
> estimator more, because it tends to over-estimate, but in a very
> reasonable way.
>

Agreed. Under-estimating is worse than over-estimating.

-        reldistinct *= rel->rows / rel->tuples;
+        reldistinct *= (1 - powl((reldistinct - 1) / reldistinct,
+                                 rel->rows));

Looking at this, I agree that this new formula from [1] is generally
better than the old one in most (but not all) cases.

One problem with it is that it no longer takes into account
rel->tuples, which means that when returning all the tuples from the
table, it no longer gives an estimate equal to the total number of
distinct values in the table. In fact it tends to underestimate when
returning nearly all the rows from the table.

The old formula is clearly not a good general-purpose estimate.
However, there is one case where it does return an accurate estimate
-- when all the values in the underlying table are distinct. In this
case the new formula consistently produces underestimates, while the
old formula works well. For example:

  CREATE TABLE t AS
    SELECT (100000000 * random())::int a,
           i::int b
      FROM generate_series(1,1000000) s(i);
  ANALYSE t;

  EXPLAIN ANALYSE
  SELECT a FROM t GROUP BY a;

So there are clearly around 1 million distinct values for "a", but the
new formula gives an estimate of around 630,000. That's not a massive
underestimate, but it's sufficiently obvious and visible that it would
be good to do better if we can.

I think that a better formula to use would be

  reldistinct *= (1 - powl(1 - rel->rows / rel->tuples,
                           rel->tuples / reldistinct));

This comes from [2], which gives a general formula for selection
without replacement, and then gives the above as an approximation to
the uniform distribution case. It's not really made clear in that
paper, but that approximation is actually the "with replacement"
approximation, starting from different initial assumptions to give a
formula that has better boundary conditions and special-case
behaviour, as described below.
For the record, here's a quick analysis of how the 2 formulae come
about:

Assume there are:
  N rows in the table
  n distinct values (n <= N)
  p rows are chosen at random (p <= N)

so the problem is to estimate how many distinct values there will be
in the p rows chosen. Both approaches work by first estimating the
probability that some particular value x is *not* chosen.

[1] starts from the assumption that each row of the table has a
probability of 1/n of being equal to x. So the probability that x is
not the first value chosen is

  P(not x, 1) = 1 - 1/n

Then, if the value chosen first is replaced, the probability that x is
not the second value chosen is the same. The "with replacement"
approximation makes each choice an independent probability, so the
overall probability that x is not in any of the p rows chosen is just
the product of the p individual probabilities, which is just

  P(not x, p) = (1 - 1/n)^p

Then of course the probability that x *is* chosen at least once in the
p rows is

  P(x, p) = 1 - (1 - 1/n)^p

Finally, since there are n possible values of x, and they're all
equally likely in the table, the expected number of distinct values is

  D(p) = n * (1 - (1 - 1/n)^p)

The flaw in this approach is that for large p the "with replacement"
approximation gets worse and worse, and the probabilities P(x, p)
systematically under-estimate the actual probability that x is chosen,
which increases as more values not equal to x are chosen. By the time
the last value is chosen, P(x, p=N) should actually be 1.

[2] starts from a different initial assumption (the uniform
distribution case) -- for any given value x, assume that the table
contains N/n instances of x (ignoring the fact that that might not be
an integer). Note that with this assumption there is guaranteed to be
at least one instance of x, which is not the case with the above
approach. Now consider the first instance of x in the table.
If we choose p rows from the table, the probability that that first
instance of x is not chosen is

  P(not x[1], p) = 1 - p / N

Making the same "with replacement" simplification, the probability
that the second instance of x is not chosen is the same, and they're
independent probabilities again. So, if there are N/n instances of x,
the overall probability that none of them is chosen is

  P(not x, p) = (1 - p / N)^(N/n)

So then the formula for the expected number of distinct values becomes

  D(p) = n * (1 - (1 - p / N)^(N/n))

This alternate formula has a few nice properties:

1). D(p=0) = 0

2). D(p=N) = n (choosing all the rows results in all the distinct
    values being chosen)

3). When n=N, D(p) = p (when all the values in the table are distinct,
    so are all the values chosen). This case matches the current
    formula.

The first formula for D(p) lacks properties (2) and (3).

The reason this second formula generally works better is that the
"with replacement" approximation is only being made for up to (N/n)
selections, rather than up to N selections. So it remains a good
approximation as long as (N/n) is small compared to N, i.e., as long
as n is fairly large. In fact, even for n=1 or 2, it still works
adequately if the results are rounded to the nearest integer.

The following snippet can be used in gnuplot to visualise the results
for various values of n and N. I've included the exact "without
replacement" formula too, by rewriting it in terms of the gamma
function:

  N=1000000.0; n=500000.0
  plot [p=0:N] [0:n] \
    p*n/N, \
    n*(1-(1-1/n)**p), \
    n*(1-(1-p/N)**(N/n)), \
    n*(1 - exp(lgamma(N-N/n+1) + lgamma(N-p+1) - lgamma(N+1) - lgamma(N-N/n-p+1)))

For most values of N and n, the approximation from [2] is almost
indistinguishable from the exact formula, whereas the formula from [1]
tends to under-estimate when selecting a large number of distinct
values (e.g., try setting n=900000 in the plot above).
Regards,
Dean

[1] http://math.stackexchange.com/questions/72223/finding-expected-number-of-distinct-values-selected-from-a-set-of-integers
[2] http://www.adellera.it/investigations/distinct_balls/distinct_balls.pdf