On Thu, 12 Nov 2020 at 14:18, Tomas Vondra <tomas.von...@2ndquadrant.com> wrote: > > Here is an improved WIP version of the patch series, modified to address > the issue with repeatedly applying the extended statistics, as discussed > with Dean in this thread. It's a bit rough and not committable, but I > need some feedback so I'm posting it in this state.
As it stands, it doesn't compile if 0003 is applied, because it missed one of the callers of clauselist_selectivity_simple(), but that's easily fixed. > 0001 is the original patch improving estimates of OR clauses > > 0002 adds thin wrappers for clause[list]_selectivity, with "internal" > functions allowing to specify whether to keep considering extended stats > > 0003 does the same for the "simple" functions > > > I've kept it like this to demonstrate that 0002 is not sufficient. In my > response from March 24 I wrote this: > > > Isn't it the case that clauselist_selectivity_simple (and the OR > > variant) should ignore extended stats entirely? That is, we'd need > > to add a flag (or _simple variant) to clause_selectivity, so that it > > calls causelist_selectivity_simple_or. > But that's actually wrong, as 0002 shows (as it breaks a couple of > regression tests), because of the way we handle OR clauses. At the top > level, an OR-clause is actually just a single clause and it may get > passed to clauselist_selectivity_simple. So entirely disabling extended > stats for the "simple" functions would also mean disabling extended > stats for a large number of OR clauses. Which is clearly wrong. > > So 0003 addresses that, by adding a flag to the two "simple" functions. > Ultimately, this should probably do the same thing as 0002 and add thin > wrappers, because the existing functions are part of the public API. I agree that, taken together, these patches fix the multiple-extended-stats-evaluation issue. However: I think this has ended up with too many variants of these functions, since we now have "_internal" and "_simple" variants, and you're proposing adding more. The original purpose of the "_simple" variants was to compute selectivities without looking at extended stats, and now the "_internal" variants compute selectivities with an additional "use_extended_stats" flag to control whether or not to look at extended stats. Thus they're basically the same, and could be rolled together. Additionally, it's worth noting that the "_simple" variants expose the "estimatedclauses" bitmap as an argument, which IMO is a bit messy as an API. All callers of the "_simple" functions outside of clausesel.c actually pass in estimatedclauses=NULL, so it's possible to refactor and get rid of that, turning estimatedclauses into a purely internal variable. Also, it's quite messy that clauselist_selectivity_simple_or() needs to be passed a Selectivity input (the final argument) that is the selectivity of any already-estimated clauses, or the value to return if no not-already-estimated clauses are found, and must be 0.0 when called from the extended stats code. Attached is the kind of thing I had in mind (as a single patch, since I don't think it's worth splitting up). This replaces the "_simple" and "_internal" variants of these functions with "_opt_ext_stats" variants whose signatures match the originals except for having the single extra "use_extended_stats" boolean parameter. Additionally, the "_simple" functions are merged into the originals (making them more like they were in PG11) so that the "estimatedclauses" bitmap and partial-OR-list Selectivity become internal details, no longer exposed in the API. Regards, Dean
diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c new file mode 100644 index 37a735b..ab16a4c --- a/src/backend/optimizer/path/clausesel.c +++ b/src/backend/optimizer/path/clausesel.c @@ -44,6 +44,12 @@ static void addRangeClause(RangeQueryCla bool varonleft, bool isLTsel, Selectivity s2); static RelOptInfo *find_single_rel_for_clauses(PlannerInfo *root, List *clauses); +static Selectivity clause_selectivity_opt_ext_stats(PlannerInfo *root, + Node *clause, + int varRelid, + JoinType jointype, + SpecialJoinInfo *sjinfo, + bool use_extended_stats); /**************************************************************************** * ROUTINES TO COMPUTE SELECTIVITIES @@ -61,64 +67,8 @@ static RelOptInfo *find_single_rel_for_c * * The basic approach is to apply extended statistics first, on as many * clauses as possible, in order to capture cross-column dependencies etc. - * The remaining clauses are then estimated using regular statistics tracked - * for individual columns. This is done by simply passing the clauses to - * clauselist_selectivity_simple. - */ -Selectivity -clauselist_selectivity(PlannerInfo *root, - List *clauses, - int varRelid, - JoinType jointype, - SpecialJoinInfo *sjinfo) -{ - Selectivity s1 = 1.0; - RelOptInfo *rel; - Bitmapset *estimatedclauses = NULL; - - /* - * Determine if these clauses reference a single relation. If so, and if - * it has extended statistics, try to apply those. - */ - rel = find_single_rel_for_clauses(root, clauses); - if (rel && rel->rtekind == RTE_RELATION && rel->statlist != NIL) - { - /* - * Estimate as many clauses as possible using extended statistics. - * - * 'estimatedclauses' tracks the 0-based list position index of - * clauses that we've estimated using extended statistics, and that - * should be ignored. - */ - s1 *= statext_clauselist_selectivity(root, clauses, varRelid, - jointype, sjinfo, rel, - &estimatedclauses); - } - - /* - * Apply normal selectivity estimates for the remaining clauses, passing - * 'estimatedclauses' so that it skips already estimated ones. - */ - return s1 * clauselist_selectivity_simple(root, clauses, varRelid, - jointype, sjinfo, - estimatedclauses); -} - -/* - * clauselist_selectivity_simple - - * Compute the selectivity of an implicitly-ANDed list of boolean - * expression clauses. The list can be empty, in which case 1.0 - * must be returned. List elements may be either RestrictInfos - * or bare expression clauses --- the former is preferred since - * it allows caching of results. The estimatedclauses bitmap tracks - * clauses that have already been estimated by other means. - * - * See clause_selectivity() for the meaning of the additional parameters. - * - * Our basic approach is to take the product of the selectivities of the - * subclauses. However, that's only right if the subclauses have independent - * probabilities, and in reality they are often NOT independent. So, - * we want to be smarter where we can. + * The remaining clauses are then estimated by taking the product of their + * selectivities. * * We also recognize "range queries", such as "x > 34 AND x < 42". Clauses * are recognized as possible range query components if they are restriction @@ -147,28 +97,64 @@ clauselist_selectivity(PlannerInfo *root * selectivity functions; perhaps some day we can generalize the approach. */ Selectivity -clauselist_selectivity_simple(PlannerInfo *root, - List *clauses, - int varRelid, - JoinType jointype, - SpecialJoinInfo *sjinfo, - Bitmapset *estimatedclauses) +clauselist_selectivity(PlannerInfo *root, + List *clauses, + int varRelid, + JoinType jointype, + SpecialJoinInfo *sjinfo) +{ + return clauselist_selectivity_opt_ext_stats(root, clauses, varRelid, + jointype, sjinfo, true); +} + +Selectivity +clauselist_selectivity_opt_ext_stats(PlannerInfo *root, + List *clauses, + int varRelid, + JoinType jointype, + SpecialJoinInfo *sjinfo, + bool use_extended_stats) { Selectivity s1 = 1.0; + RelOptInfo *rel; + Bitmapset *estimatedclauses = NULL; RangeQueryClause *rqlist = NULL; ListCell *l; int listidx; /* - * If there's exactly one clause (and it was not estimated yet), just go - * directly to clause_selectivity(). None of what we might do below is - * relevant. + * If there's exactly one clause, just go directly to + * clause_selectivity(). None of what we might do below is relevant. */ - if (list_length(clauses) == 1 && bms_is_empty(estimatedclauses)) - return clause_selectivity(root, (Node *) linitial(clauses), - varRelid, jointype, sjinfo); + if (list_length(clauses) == 1) + return clause_selectivity_opt_ext_stats(root, + (Node *) linitial(clauses), + varRelid, jointype, sjinfo, + use_extended_stats); + + /* + * Determine if these clauses reference a single relation. If so, and if + * it has extended statistics, try to apply those. + */ + rel = find_single_rel_for_clauses(root, clauses); + if (use_extended_stats && rel && rel->rtekind == RTE_RELATION && rel->statlist != NIL) + { + /* + * Estimate as many clauses as possible using extended statistics. + * + * 'estimatedclauses' tracks the 0-based list position index of + * clauses that we've estimated using extended statistics, and that + * should be ignored. + */ + s1 *= statext_clauselist_selectivity(root, clauses, varRelid, + jointype, sjinfo, rel, + &estimatedclauses, false); + } /* + * Apply normal selectivity estimates for remaining clauses. We'll be + * careful to skip any clauses which were already estimated above. + * * Anything that doesn't look like a potential rangequery clause gets * multiplied into s1 and forgotten. Anything that does gets inserted into * an rqlist entry. @@ -190,7 +176,9 @@ clauselist_selectivity_simple(PlannerInf continue; /* Always compute the selectivity using clause_selectivity */ - s2 = clause_selectivity(root, clause, varRelid, jointype, sjinfo); + s2 = clause_selectivity_opt_ext_stats(root, clause, varRelid, + jointype, sjinfo, + use_extended_stats); /* * Check for being passed a RestrictInfo. @@ -351,6 +339,91 @@ clauselist_selectivity_simple(PlannerInf } /* + * clauselist_selectivity_or - + * Compute the selectivity of an implicitly-ORed list of boolean + * expression clauses. The list can be empty, in which case 0.0 + * must be returned. List elements may be either RestrictInfos + * or bare expression clauses --- the former is preferred since + * it allows caching of results. + * + * See clause_selectivity() for the meaning of the additional parameters. + * + * The basic approach is to apply extended statistics first, on as many + * clauses as possible, in order to capture cross-column dependencies etc. + * The remaining clauses are then estimated as if they were independent. + */ +Selectivity +clauselist_selectivity_or(PlannerInfo *root, + List *clauses, + int varRelid, + JoinType jointype, + SpecialJoinInfo *sjinfo, + bool use_extended_stats) +{ + Selectivity s1 = 0.0; + RelOptInfo *rel; + Bitmapset *estimatedclauses = NULL; + ListCell *lc; + int listidx; + + /* + * Determine if these clauses reference a single relation. If so, and if + * it has extended statistics, try to apply those. + */ + rel = find_single_rel_for_clauses(root, clauses); + if (use_extended_stats && rel && rel->rtekind == RTE_RELATION && rel->statlist != NIL) + { + /* + * Estimate as many clauses as possible using extended statistics. + * + * 'estimatedclauses' tracks the 0-based list position index of + * clauses that we've estimated using extended statistics, and that + * should be ignored. + * + * XXX We can't multiply with current value, because for OR clauses we + * start with 0.0, so we simply assign to 's' directly. + */ + s1 = statext_clauselist_selectivity(root, clauses, varRelid, + jointype, sjinfo, rel, + &estimatedclauses, true); + } + + /* + * Estimate the remaining clauses. + * + * Selectivities for an OR clause are computed as s1+s2 - s1*s2 to account + * for the probable overlap of selected tuple sets. + * + * XXX is this too conservative? + */ + listidx = -1; + foreach(lc, clauses) + { + Selectivity s2; + + listidx++; + + /* + * Skip this clause if it's already been estimated by some other + * statistics above. + */ + if (bms_is_member(listidx, estimatedclauses)) + continue; + + s2 = clause_selectivity_opt_ext_stats(root, + (Node *) lfirst(lc), + varRelid, + jointype, + sjinfo, + use_extended_stats); + + s1 = s1 + s2 - s1 * s2; + } + + return s1; +} + +/* * addRangeClause --- add a new range clause for clauselist_selectivity * * Here is where we try to match up pairs of range-query clauses @@ -602,6 +675,18 @@ clause_selectivity(PlannerInfo *root, JoinType jointype, SpecialJoinInfo *sjinfo) { + return clause_selectivity_opt_ext_stats(root, clause, varRelid, + jointype, sjinfo, true); +} + +static Selectivity +clause_selectivity_opt_ext_stats(PlannerInfo *root, + Node *clause, + int varRelid, + JoinType jointype, + SpecialJoinInfo *sjinfo, + bool use_extended_stats) +{ Selectivity s1 = 0.5; /* default for any unhandled clause type */ RestrictInfo *rinfo = NULL; bool cacheable = false; @@ -716,42 +801,35 @@ clause_selectivity(PlannerInfo *root, else if (is_notclause(clause)) { /* inverse of the selectivity of the underlying clause */ - s1 = 1.0 - clause_selectivity(root, - (Node *) get_notclausearg((Expr *) clause), - varRelid, - jointype, - sjinfo); + s1 = 1.0 - clause_selectivity_opt_ext_stats(root, + (Node *) get_notclausearg((Expr *) clause), + varRelid, + jointype, + sjinfo, + use_extended_stats); } else if (is_andclause(clause)) { /* share code with clauselist_selectivity() */ - s1 = clauselist_selectivity(root, - ((BoolExpr *) clause)->args, - varRelid, - jointype, - sjinfo); + s1 = clauselist_selectivity_opt_ext_stats(root, + ((BoolExpr *) clause)->args, + varRelid, + jointype, + sjinfo, + use_extended_stats); } else if (is_orclause(clause)) { /* - * Selectivities for an OR clause are computed as s1+s2 - s1*s2 to - * account for the probable overlap of selected tuple sets. - * - * XXX is this too conservative? + * Almost the same thing as clauselist_selectivity, but with the + * clauses connected by OR. */ - ListCell *arg; - - s1 = 0.0; - foreach(arg, ((BoolExpr *) clause)->args) - { - Selectivity s2 = clause_selectivity(root, - (Node *) lfirst(arg), - varRelid, - jointype, - sjinfo); - - s1 = s1 + s2 - s1 * s2; - } + s1 = clauselist_selectivity_or(root, + ((BoolExpr *) clause)->args, + varRelid, + jointype, + sjinfo, + use_extended_stats); } else if (is_opclause(clause) || IsA(clause, DistinctExpr)) { @@ -852,20 +930,22 @@ clause_selectivity(PlannerInfo *root, else if (IsA(clause, RelabelType)) { /* Not sure this case is needed, but it can't hurt */ - s1 = clause_selectivity(root, - (Node *) ((RelabelType *) clause)->arg, - varRelid, - jointype, - sjinfo); + s1 = clause_selectivity_opt_ext_stats(root, + (Node *) ((RelabelType *) clause)->arg, + varRelid, + jointype, + sjinfo, + use_extended_stats); } else if (IsA(clause, CoerceToDomain)) { /* Not sure this case is needed, but it can't hurt */ - s1 = clause_selectivity(root, - (Node *) ((CoerceToDomain *) clause)->arg, - varRelid, - jointype, - sjinfo); + s1 = clause_selectivity_opt_ext_stats(root, + (Node *) ((CoerceToDomain *) clause)->arg, + varRelid, + jointype, + sjinfo, + use_extended_stats); } else { diff --git a/src/backend/statistics/dependencies.c b/src/backend/statistics/dependencies.c new file mode 100644 index d950b4e..d05afa0 --- a/src/backend/statistics/dependencies.c +++ b/src/backend/statistics/dependencies.c @@ -1073,8 +1073,9 @@ clauselist_apply_dependencies(PlannerInf } } - simple_sel = clauselist_selectivity_simple(root, attr_clauses, varRelid, - jointype, sjinfo, NULL); + simple_sel = clauselist_selectivity_opt_ext_stats(root, attr_clauses, + varRelid, jointype, + sjinfo, false); attr_sel[attidx++] = simple_sel; } diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c new file mode 100644 index 3632692..06213b6 --- a/src/backend/statistics/extended_stats.c +++ b/src/backend/statistics/extended_stats.c @@ -1282,16 +1282,17 @@ statext_is_compatible_clause(PlannerInfo static Selectivity statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varRelid, JoinType jointype, SpecialJoinInfo *sjinfo, - RelOptInfo *rel, Bitmapset **estimatedclauses) + RelOptInfo *rel, Bitmapset **estimatedclauses, + bool is_or) { ListCell *l; Bitmapset **list_attnums; int listidx; - Selectivity sel = 1.0; + Selectivity sel = (is_or) ? 0.0 : 1.0; /* check if there's any stats that might be useful for us. */ if (!has_stats_of_kind(rel->statlist, STATS_EXT_MCV)) - return 1.0; + return sel; list_attnums = (Bitmapset **) palloc(sizeof(Bitmapset *) * list_length(clauses)); @@ -1377,8 +1378,17 @@ statext_mcv_clauselist_selectivity(Plann * columns/clauses. We'll then use the various selectivities computed * from MCV list to improve it. */ - simple_sel = clauselist_selectivity_simple(root, stat_clauses, varRelid, - jointype, sjinfo, NULL); + if (is_or) + simple_sel = clauselist_selectivity_or(root, stat_clauses, + varRelid, jointype, sjinfo, + false); + else + simple_sel = clauselist_selectivity_opt_ext_stats(root, + stat_clauses, + varRelid, + jointype, + sjinfo, + false); /* * Now compute the multi-column estimate from the MCV list, along with @@ -1386,7 +1396,7 @@ statext_mcv_clauselist_selectivity(Plann */ mcv_sel = mcv_clauselist_selectivity(root, stat, stat_clauses, varRelid, jointype, sjinfo, rel, - &mcv_basesel, &mcv_totalsel); + &mcv_basesel, &mcv_totalsel, is_or); /* Estimated selectivity of values not covered by MCV matches */ other_sel = simple_sel - mcv_basesel; @@ -1404,7 +1414,10 @@ statext_mcv_clauselist_selectivity(Plann CLAMP_PROBABILITY(stat_sel); /* Factor the estimate from this MCV to the overall estimate. */ - sel *= stat_sel; + if (is_or) + sel = sel + stat_sel - sel * stat_sel; + else + sel *= stat_sel; } return sel; @@ -1417,13 +1430,21 @@ statext_mcv_clauselist_selectivity(Plann Selectivity statext_clauselist_selectivity(PlannerInfo *root, List *clauses, int varRelid, JoinType jointype, SpecialJoinInfo *sjinfo, - RelOptInfo *rel, Bitmapset **estimatedclauses) + RelOptInfo *rel, Bitmapset **estimatedclauses, + bool is_or) { Selectivity sel; /* First, try estimating clauses using a multivariate MCV list. */ sel = statext_mcv_clauselist_selectivity(root, clauses, varRelid, jointype, - sjinfo, rel, estimatedclauses); + sjinfo, rel, estimatedclauses, is_or); + + /* + * Functional dependencies only work for clauses connected by AND, so for + * OR clauses we're done. + */ + if (is_or) + return sel; /* * Then, apply functional dependencies on the remaining clauses by calling diff --git a/src/backend/statistics/mcv.c b/src/backend/statistics/mcv.c new file mode 100644 index 6a262f1..7c5841e --- a/src/backend/statistics/mcv.c +++ b/src/backend/statistics/mcv.c @@ -1904,7 +1904,8 @@ mcv_clauselist_selectivity(PlannerInfo * List *clauses, int varRelid, JoinType jointype, SpecialJoinInfo *sjinfo, RelOptInfo *rel, - Selectivity *basesel, Selectivity *totalsel) + Selectivity *basesel, Selectivity *totalsel, + bool is_or) { int i; MCVList *mcv; @@ -1917,7 +1918,7 @@ mcv_clauselist_selectivity(PlannerInfo * mcv = statext_mcv_load(stat->statOid); /* build a match bitmap for the clauses */ - matches = mcv_get_match_bitmap(root, clauses, stat->keys, mcv, false); + matches = mcv_get_match_bitmap(root, clauses, stat->keys, mcv, is_or); /* sum frequencies for all the matching MCV items */ *basesel = 0.0; diff --git a/src/include/optimizer/optimizer.h b/src/include/optimizer/optimizer.h new file mode 100644 index 3e41710..27f7985 --- a/src/include/optimizer/optimizer.h +++ b/src/include/optimizer/optimizer.h @@ -58,17 +58,23 @@ extern Selectivity clause_selectivity(Pl int varRelid, JoinType jointype, SpecialJoinInfo *sjinfo); -extern Selectivity clauselist_selectivity_simple(PlannerInfo *root, - List *clauses, - int varRelid, - JoinType jointype, - SpecialJoinInfo *sjinfo, - Bitmapset *estimatedclauses); extern Selectivity clauselist_selectivity(PlannerInfo *root, List *clauses, int varRelid, JoinType jointype, SpecialJoinInfo *sjinfo); +extern Selectivity clauselist_selectivity_opt_ext_stats(PlannerInfo *root, + List *clauses, + int varRelid, + JoinType jointype, + SpecialJoinInfo *sjinfo, + bool use_extended_stats); +Selectivity clauselist_selectivity_or(PlannerInfo *root, + List *clauses, + int varRelid, + JoinType jointype, + SpecialJoinInfo *sjinfo, + bool use_extended_stats); /* in path/costsize.c: */ diff --git a/src/include/statistics/extended_stats_internal.h b/src/include/statistics/extended_stats_internal.h new file mode 100644 index 61e6969..08d639e --- a/src/include/statistics/extended_stats_internal.h +++ b/src/include/statistics/extended_stats_internal.h @@ -107,6 +107,7 @@ extern Selectivity mcv_clauselist_select SpecialJoinInfo *sjinfo, RelOptInfo *rel, Selectivity *basesel, - Selectivity *totalsel); + Selectivity *totalsel, + bool is_or); #endif /* EXTENDED_STATS_INTERNAL_H */ diff --git a/src/include/statistics/statistics.h b/src/include/statistics/statistics.h new file mode 100644 index 50fce49..c9ed211 --- a/src/include/statistics/statistics.h +++ b/src/include/statistics/statistics.h @@ -116,7 +116,8 @@ extern Selectivity statext_clauselist_se JoinType jointype, SpecialJoinInfo *sjinfo, RelOptInfo *rel, - Bitmapset **estimatedclauses); + Bitmapset **estimatedclauses, + bool is_or); extern bool has_stats_of_kind(List *stats, char requiredkind); extern StatisticExtInfo *choose_best_statistics(List *stats, char requiredkind, Bitmapset **clause_attnums, diff --git a/src/test/regress/expected/stats_ext.out b/src/test/regress/expected/stats_ext.out new file mode 100644 index 4c3edd2..abab5d6 --- a/src/test/regress/expected/stats_ext.out +++ b/src/test/regress/expected/stats_ext.out @@ -1113,6 +1113,12 @@ SELECT * FROM check_estimated_rows('SELE 200 | 200 (1 row) +SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = ''1'' OR c = 1 OR d IS NOT NULL'); + estimated | actual +-----------+-------- + 200 | 200 +(1 row) + SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IN (1, 2, 51, 52) AND b IN ( ''1'', ''2'')'); estimated | actual -----------+-------- @@ -1173,13 +1179,6 @@ SELECT * FROM check_estimated_rows('SELE 100 | 100 (1 row) --- we can't use the statistic for OR clauses that are not fully covered (missing 'd' attribute) -SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = ''1'' OR c = 1 OR d IS NOT NULL'); - estimated | actual ------------+-------- - 343 | 200 -(1 row) - -- check change of unrelated column type does not reset the MCV statistics ALTER TABLE mcv_lists ALTER COLUMN d TYPE VARCHAR(64); SELECT d.stxdmcv IS NOT NULL @@ -1477,6 +1476,110 @@ SELECT * FROM check_estimated_rows('SELE 1 | 0 (1 row) +-- mcv covering just a small fraction of data +CREATE TABLE mcv_lists_partial ( + a INT, + b INT, + c INT +); +-- 10 frequent groups, each with 100 elements +INSERT INTO mcv_lists_partial (a, b, c) + SELECT + mod(i,10), + mod(i,10), + mod(i,10) + FROM generate_series(0,999) s(i); +-- 100 groups that will make it to the MCV list (includes the 10 frequent ones) +INSERT INTO mcv_lists_partial (a, b, c) + SELECT + i, + i, + i + FROM generate_series(0,99) s(i); +-- 4000 groups in total, most of which won't make it (just a single item) +INSERT INTO mcv_lists_partial (a, b, c) + SELECT + i, + i, + i + FROM generate_series(0,3999) s(i); +ANALYZE mcv_lists_partial; +SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 AND b = 0 AND c = 0'); + estimated | actual +-----------+-------- + 1 | 102 +(1 row) + +SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 OR b = 0 OR c = 0'); + estimated | actual +-----------+-------- + 300 | 102 +(1 row) + +SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 10 AND b = 10 AND c = 10'); + estimated | actual +-----------+-------- + 1 | 2 +(1 row) + +SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 10 OR b = 10 OR c = 10'); + estimated | actual +-----------+-------- + 6 | 2 +(1 row) + +SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 AND b = 0 AND c = 10'); + estimated | actual +-----------+-------- + 1 | 0 +(1 row) + +SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 OR b = 0 OR c = 10'); + estimated | actual +-----------+-------- + 204 | 104 +(1 row) + +CREATE STATISTICS mcv_lists_partial_stats (mcv) ON a, b, c + FROM mcv_lists_partial; +ANALYZE mcv_lists_partial; +SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 AND b = 0 AND c = 0'); + estimated | actual +-----------+-------- + 102 | 102 +(1 row) + +SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 OR b = 0 OR c = 0'); + estimated | actual +-----------+-------- + 402 | 102 +(1 row) + +SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 10 AND b = 10 AND c = 10'); + estimated | actual +-----------+-------- + 2 | 2 +(1 row) + +SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 10 OR b = 10 OR c = 10'); + estimated | actual +-----------+-------- + 8 | 2 +(1 row) + +SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 AND b = 0 AND c = 10'); + estimated | actual +-----------+-------- + 1 | 0 +(1 row) + +SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 OR b = 0 OR c = 10'); + estimated | actual +-----------+-------- + 308 | 104 +(1 row) + +DROP TABLE mcv_lists_partial; -- check the ability to use multiple MCV lists CREATE TABLE mcv_lists_multi ( a INTEGER, @@ -1506,12 +1609,36 @@ SELECT * FROM check_estimated_rows('SELE 102 | 714 (1 row) +SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 AND c = 0'); + estimated | actual +-----------+-------- + 143 | 142 +(1 row) + +SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 OR c = 0'); + estimated | actual +-----------+-------- + 1571 | 1572 +(1 row) + SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0 AND c = 0 AND d = 0'); estimated | actual -----------+-------- 4 | 142 (1 row) +SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE (a = 0 AND b = 0) OR (c = 0 AND d = 0)'); + estimated | actual +-----------+-------- + 298 | 1572 +(1 row) + +SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 OR b = 0 OR c = 0 OR d = 0'); + estimated | actual +-----------+-------- + 2649 | 1572 +(1 row) + -- create separate MCV statistics CREATE STATISTICS mcv_lists_multi_1 (mcv) ON a, b FROM mcv_lists_multi; CREATE STATISTICS mcv_lists_multi_2 (mcv) ON c, d FROM mcv_lists_multi; @@ -1528,12 +1655,36 @@ SELECT * FROM check_estimated_rows('SELE 714 | 714 (1 row) +SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 AND c = 0'); + estimated | actual +-----------+-------- + 143 | 142 +(1 row) + +SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 OR c = 0'); + estimated | actual +-----------+-------- + 1571 | 1572 +(1 row) + SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0 AND c = 0 AND d = 0'); estimated | actual -----------+-------- 143 | 142 (1 row) +SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE (a = 0 AND b = 0) OR (c = 0 AND d = 0)'); + estimated | actual +-----------+-------- + 1571 | 1572 +(1 row) + +SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 OR b = 0 OR c = 0 OR d = 0'); + estimated | actual +-----------+-------- + 1571 | 1572 +(1 row) + DROP TABLE mcv_lists_multi; -- Permission tests. Users should not be able to see specific data values in -- the extended statistics, if they lack permission to see those values in diff --git a/src/test/regress/sql/stats_ext.sql b/src/test/regress/sql/stats_ext.sql new file mode 100644 index 9781e59..3ec6dda --- a/src/test/regress/sql/stats_ext.sql +++ b/src/test/regress/sql/stats_ext.sql @@ -561,6 +561,8 @@ SELECT * FROM check_estimated_rows('SELE SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = ''1'' OR c = 1'); +SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = ''1'' OR c = 1 OR d IS NOT NULL'); + SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IN (1, 2, 51, 52) AND b IN ( ''1'', ''2'')'); SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IN (1, 2, 51, 52, NULL) AND b IN ( ''1'', ''2'', NULL)'); @@ -581,9 +583,6 @@ SELECT * FROM check_estimated_rows('SELE SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a < ALL (ARRAY[4, 5]) AND b IN (''1'', ''2'', NULL, ''3'') AND c > ANY (ARRAY[1, 2, NULL, 3])'); --- we can't use the statistic for OR clauses that are not fully covered (missing 'd' attribute) -SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = ''1'' OR c = 1 OR d IS NOT NULL'); - -- check change of unrelated column type does not reset the MCV statistics ALTER TABLE mcv_lists ALTER COLUMN d TYPE VARCHAR(64); @@ -777,6 +776,70 @@ SELECT * FROM check_estimated_rows('SELE SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a AND b AND NOT c'); +-- mcv covering just a small fraction of data +CREATE TABLE mcv_lists_partial ( + a INT, + b INT, + c INT +); + +-- 10 frequent groups, each with 100 elements +INSERT INTO mcv_lists_partial (a, b, c) + SELECT + mod(i,10), + mod(i,10), + mod(i,10) + FROM generate_series(0,999) s(i); + +-- 100 groups that will make it to the MCV list (includes the 10 frequent ones) +INSERT INTO mcv_lists_partial (a, b, c) + SELECT + i, + i, + i + FROM generate_series(0,99) s(i); + +-- 4000 groups in total, most of which won't make it (just a single item) +INSERT INTO mcv_lists_partial (a, b, c) + SELECT + i, + i, + i + FROM generate_series(0,3999) s(i); + +ANALYZE mcv_lists_partial; + +SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 AND b = 0 AND c = 0'); + +SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 OR b = 0 OR c = 0'); + +SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 10 AND b = 10 AND c = 10'); + +SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 10 OR b = 10 OR c = 10'); + +SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 AND b = 0 AND c = 10'); + +SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 OR b = 0 OR c = 10'); + +CREATE STATISTICS mcv_lists_partial_stats (mcv) ON a, b, c + FROM mcv_lists_partial; + +ANALYZE mcv_lists_partial; + +SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 AND b = 0 AND c = 0'); + +SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 OR b = 0 OR c = 0'); + +SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 10 AND b = 10 AND c = 10'); + +SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 10 OR b = 10 OR c = 10'); + +SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 AND b = 0 AND c = 10'); + +SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 OR b = 0 OR c = 10'); + +DROP TABLE mcv_lists_partial; + -- check the ability to use multiple MCV lists CREATE TABLE mcv_lists_multi ( a INTEGER, @@ -799,7 +862,11 @@ ANALYZE mcv_lists_multi; -- estimates without any mcv statistics SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0'); SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE c = 0 AND d = 0'); +SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 AND c = 0'); +SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 OR c = 0'); SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0 AND c = 0 AND d = 0'); +SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE (a = 0 AND b = 0) OR (c = 0 AND d = 0)'); +SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 OR b = 0 OR c = 0 OR d = 0'); -- create separate MCV statistics CREATE STATISTICS mcv_lists_multi_1 (mcv) ON a, b FROM mcv_lists_multi; @@ -809,7 +876,11 @@ ANALYZE mcv_lists_multi; SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0'); SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE c = 0 AND d = 0'); +SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 AND c = 0'); +SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 OR c = 0'); SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0 AND c = 0 AND d = 0'); +SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE (a = 0 AND b = 0) OR (c = 0 AND d = 0)'); +SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 OR b = 0 OR c = 0 OR d = 0'); DROP TABLE mcv_lists_multi;