Re: Additional improvements to extended statistics

Tomas Vondra Mon, 09 Mar 2020 11:19:56 -0700

On Mon, Mar 09, 2020 at 08:35:48AM +0000, Dean Rasheed wrote:

On Mon, 9 Mar 2020 at 00:02, Tomas Vondra <tomas.von...@2ndquadrant.com> wrote:


Speaking of which, would you take a look at [1]? I think supporting SAOP
is fine, but I wonder if you agree with my conclusion we can't really
support inclusion @> as explained in [2].


Hmm, I'm not sure. However, thinking about your example in [2] reminds
me of a thought I had a while ago, but then forgot about --- there is
a flaw in the formula used for computing probabilities with functional
dependencies:

 P(a,b) = P(a) * [f + (1-f)*P(b)]

because it might return a value that is larger that P(b), which
obviously should not be possible.


Hmmm, yeah. It took me a while to reproduce this - the trick is in
"inverting" the dependency on a subset of data, e.g. like this:

  create table t (a int, b int);
  insert into t select mod(i,50), mod(i,25)
    from generate_series(1,5000) s(i);
  update t set a = 0 where a < 3;
  create statistics s (dependencies) on a,b from t;

which then does this:

  test=# explain select * from t where a = 0;
                       QUERY PLAN
  ----------------------------------------------------
   Seq Scan on t  (cost=0.00..86.50 rows=300 width=8)
     Filter: (a = 0)
  (2 rows)

  test=# explain select * from t where b = 0;
                       QUERY PLAN
  ----------------------------------------------------
   Seq Scan on t  (cost=0.00..86.50 rows=200 width=8)
     Filter: (b = 0)
  (2 rows)

  test=# explain select * from t where a = 0 and b = 0;
                       QUERY PLAN
  ----------------------------------------------------
   Seq Scan on t  (cost=0.00..99.00 rows=283 width=8)
     Filter: ((a = 0) AND (b = 0))
  (2 rows)

Which I think is the issue you've described ...

We should amend that formula to prevent a result larger than P(b). The
obvious way to do that would be to use:

 P(a,b) = Min(P(a) * [f + (1-f)*P(b)], P(b))

but actually I think it would be better and more principled to use:

 P(a,b) = f*Min(P(a),P(b)) + (1-f)*P(a)*P(b)

I.e., for those rows believed to be functionally dependent, we use the
minimum probability, and for the rows believed to be independent, we
use the product.


Hmmm, yeah. The trouble is that we currently don't really have both
selectivities in dependencies_clauselist_selectivity :-(

We get both clauses, but we only compute selectivity of the "implied"
clause, and we leave the other one as not estimated (possibly up to
clauselist_selectivity).

It's also not clear to me how would this work for more than two clauses,
that are all functionally dependent. Like (a => b => c), for example.
But I haven't thought about this very much yet.

I think that would solve the problem with the example you gave at the
end of [2], but I'm not sure if it helps with the general case.


I don't follow. I think the issue with [2] is that we can't really apply
stats about the array values to queries on individual array elements.
Can you explain how would the proposed changes deal with this?


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Additional improvements to extended statistics

Reply via email to