Re: [HACKERS] multivariate statistics (v25)

Tomas Vondra Tue, 04 Apr 2017 11:20:13 -0700

On 04/04/2017 09:55 AM, David Rowley wrote:

On 1 April 2017 at 04:25, David Rowley <david.row...@2ndquadrant.com> wrote:

I've attached an updated patch.


I've made another pass at this and ended up removing the tryextstats
variable. We now only try to use extended statistics when
clauselist_selectivity() is given a valid RelOptInfo with rtekind ==
RTE_RELATION, and of course, it must also have some extended stats
defined too.

I've also cleaned up a few more comments, many of which I managed to
omit updating when I refactored how the selectivity estimates ties
into clauselist_selectivity()

I'm quite happy with all of this now, and would also be happy for
other people to take a look and comment.

As a reviewer, I'd be marking this ready for committer, but I've moved
a little way from just reviewing this now, having spent two weeks
hacking at it.

The latest patch is attached.

Thanks David, I agree the reworked patch is much cleaner that the lastversion I posted. Thanks for spending your time on it.


Two minor comments:

1) DEPENDENCY_MIN_GROUP_SIZE

I'm not sure we still need the min_group_size, when evaluatingdependencies. It was meant to deal with 'noisy' data, but I think itafter switching to the 'degree' it might actually be a bad idea.


Consider this:

    create table t (a int, b int);
    insert into t select 1, 1 from generate_series(1, 10000) s(i);
    insert into t select i, i from generate_series(2, 20000) s(i);
    create statistics s with (dependencies) on (a,b) from t;
    analyze t;

    select stadependencies from pg_statistic_ext ;
                  stadependencies
    --------------------------------------------
     [{1 => 2 : 0.333344}, {2 => 1 : 0.333344}]
    (1 row)

So the degree of the dependency is just ~0.333 although it's obviously aperfect dependency, i.e. a knowledge of 'a' determines 'b'. The reasonis that we discard 2/3 of rows, because those groups are only a singlerow each, except for the one large group (1/3 of rows).


Without the mininum group size limitation, the dependencies are:

    test=# select stadependencies from pg_statistic_ext ;
                  stadependencies
    --------------------------------------------
     [{1 => 2 : 1.000000}, {2 => 1 : 1.000000}]
    (1 row)

which seems way more reasonable, I think.


2) A minor detail is that instead of this

    if (estimatedclauses != NULL &&
        bms_is_member(listidx, estimatedclauses))
        continue;

perhaps we should do just this:

    if (bms_is_member(listidx, estimatedclauses))
        continue;

bms_is_member does the same NULL check right at the beginning, so Idon't think this might make a measurable difference.



kind regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] multivariate statistics (v25)

Reply via email to