On Thu, 12 Nov 2020 at 14:18, Tomas Vondra <[email protected]> wrote:
>
> Here is an improved WIP version of the patch series, modified to address
> the issue with repeatedly applying the extended statistics, as discussed
> with Dean in this thread. It's a bit rough and not committable, but I
> need some feedback so I'm posting it in this state.
As it stands, it doesn't compile if 0003 is applied, because it missed
one of the callers of clauselist_selectivity_simple(), but that's
easily fixed.
> 0001 is the original patch improving estimates of OR clauses
>
> 0002 adds thin wrappers for clause[list]_selectivity, with "internal"
> functions allowing to specify whether to keep considering extended stats
>
> 0003 does the same for the "simple" functions
>
>
> I've kept it like this to demonstrate that 0002 is not sufficient. In my
> response from March 24 I wrote this:
>
> > Isn't it the case that clauselist_selectivity_simple (and the OR
> > variant) should ignore extended stats entirely? That is, we'd need
> > to add a flag (or _simple variant) to clause_selectivity, so that it
> > calls causelist_selectivity_simple_or.
> But that's actually wrong, as 0002 shows (as it breaks a couple of
> regression tests), because of the way we handle OR clauses. At the top
> level, an OR-clause is actually just a single clause and it may get
> passed to clauselist_selectivity_simple. So entirely disabling extended
> stats for the "simple" functions would also mean disabling extended
> stats for a large number of OR clauses. Which is clearly wrong.
>
> So 0003 addresses that, by adding a flag to the two "simple" functions.
> Ultimately, this should probably do the same thing as 0002 and add thin
> wrappers, because the existing functions are part of the public API.
I agree that, taken together, these patches fix the
multiple-extended-stats-evaluation issue. However:
I think this has ended up with too many variants of these functions,
since we now have "_internal" and "_simple" variants, and you're
proposing adding more. The original purpose of the "_simple" variants
was to compute selectivities without looking at extended stats, and
now the "_internal" variants compute selectivities with an additional
"use_extended_stats" flag to control whether or not to look at
extended stats. Thus they're basically the same, and could be rolled
together.
Additionally, it's worth noting that the "_simple" variants expose the
"estimatedclauses" bitmap as an argument, which IMO is a bit messy as
an API. All callers of the "_simple" functions outside of clausesel.c
actually pass in estimatedclauses=NULL, so it's possible to refactor
and get rid of that, turning estimatedclauses into a purely internal
variable.
Also, it's quite messy that clauselist_selectivity_simple_or() needs
to be passed a Selectivity input (the final argument) that is the
selectivity of any already-estimated clauses, or the value to return
if no not-already-estimated clauses are found, and must be 0.0 when
called from the extended stats code.
Attached is the kind of thing I had in mind (as a single patch, since
I don't think it's worth splitting up). This replaces the "_simple"
and "_internal" variants of these functions with "_opt_ext_stats"
variants whose signatures match the originals except for having the
single extra "use_extended_stats" boolean parameter. Additionally, the
"_simple" functions are merged into the originals (making them more
like they were in PG11) so that the "estimatedclauses" bitmap and
partial-OR-list Selectivity become internal details, no longer exposed
in the API.
Regards,
Dean
diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
new file mode 100644
index 37a735b..ab16a4c
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -44,6 +44,12 @@ static void addRangeClause(RangeQueryCla
bool varonleft, bool isLTsel, Selectivity s2);
static RelOptInfo *find_single_rel_for_clauses(PlannerInfo *root,
List *clauses);
+static Selectivity clause_selectivity_opt_ext_stats(PlannerInfo *root,
+ Node *clause,
+ int varRelid,
+ JoinType jointype,
+ SpecialJoinInfo *sjinfo,
+ bool use_extended_stats);
/****************************************************************************
* ROUTINES TO COMPUTE SELECTIVITIES
@@ -61,64 +67,8 @@ static RelOptInfo *find_single_rel_for_c
*
* The basic approach is to apply extended statistics first, on as many
* clauses as possible, in order to capture cross-column dependencies etc.
- * The remaining clauses are then estimated using regular statistics tracked
- * for individual columns. This is done by simply passing the clauses to
- * clauselist_selectivity_simple.
- */
-Selectivity
-clauselist_selectivity(PlannerInfo *root,
- List *clauses,
- int varRelid,
- JoinType jointype,
- SpecialJoinInfo *sjinfo)
-{
- Selectivity s1 = 1.0;
- RelOptInfo *rel;
- Bitmapset *estimatedclauses = NULL;
-
- /*
- * Determine if these clauses reference a single relation. If so, and if
- * it has extended statistics, try to apply those.
- */
- rel = find_single_rel_for_clauses(root, clauses);
- if (rel && rel->rtekind == RTE_RELATION && rel->statlist != NIL)
- {
- /*
- * Estimate as many clauses as possible using extended statistics.
- *
- * 'estimatedclauses' tracks the 0-based list position index of
- * clauses that we've estimated using extended statistics, and that
- * should be ignored.
- */
- s1 *= statext_clauselist_selectivity(root, clauses, varRelid,
- jointype, sjinfo, rel,
- &estimatedclauses);
- }
-
- /*
- * Apply normal selectivity estimates for the remaining clauses, passing
- * 'estimatedclauses' so that it skips already estimated ones.
- */
- return s1 * clauselist_selectivity_simple(root, clauses, varRelid,
- jointype, sjinfo,
- estimatedclauses);
-}
-
-/*
- * clauselist_selectivity_simple -
- * Compute the selectivity of an implicitly-ANDed list of boolean
- * expression clauses. The list can be empty, in which case 1.0
- * must be returned. List elements may be either RestrictInfos
- * or bare expression clauses --- the former is preferred since
- * it allows caching of results. The estimatedclauses bitmap tracks
- * clauses that have already been estimated by other means.
- *
- * See clause_selectivity() for the meaning of the additional parameters.
- *
- * Our basic approach is to take the product of the selectivities of the
- * subclauses. However, that's only right if the subclauses have independent
- * probabilities, and in reality they are often NOT independent. So,
- * we want to be smarter where we can.
+ * The remaining clauses are then estimated by taking the product of their
+ * selectivities.
*
* We also recognize "range queries", such as "x > 34 AND x < 42". Clauses
* are recognized as possible range query components if they are restriction
@@ -147,28 +97,64 @@ clauselist_selectivity(PlannerInfo *root
* selectivity functions; perhaps some day we can generalize the approach.
*/
Selectivity
-clauselist_selectivity_simple(PlannerInfo *root,
- List *clauses,
- int varRelid,
- JoinType jointype,
- SpecialJoinInfo *sjinfo,
- Bitmapset *estimatedclauses)
+clauselist_selectivity(PlannerInfo *root,
+ List *clauses,
+ int varRelid,
+ JoinType jointype,
+ SpecialJoinInfo *sjinfo)
+{
+ return clauselist_selectivity_opt_ext_stats(root, clauses, varRelid,
+ jointype, sjinfo, true);
+}
+
+Selectivity
+clauselist_selectivity_opt_ext_stats(PlannerInfo *root,
+ List *clauses,
+ int varRelid,
+ JoinType jointype,
+ SpecialJoinInfo *sjinfo,
+ bool use_extended_stats)
{
Selectivity s1 = 1.0;
+ RelOptInfo *rel;
+ Bitmapset *estimatedclauses = NULL;
RangeQueryClause *rqlist = NULL;
ListCell *l;
int listidx;
/*
- * If there's exactly one clause (and it was not estimated yet), just go
- * directly to clause_selectivity(). None of what we might do below is
- * relevant.
+ * If there's exactly one clause, just go directly to
+ * clause_selectivity(). None of what we might do below is relevant.
*/
- if (list_length(clauses) == 1 && bms_is_empty(estimatedclauses))
- return clause_selectivity(root, (Node *) linitial(clauses),
- varRelid, jointype, sjinfo);
+ if (list_length(clauses) == 1)
+ return clause_selectivity_opt_ext_stats(root,
+ (Node *) linitial(clauses),
+ varRelid, jointype, sjinfo,
+ use_extended_stats);
+
+ /*
+ * Determine if these clauses reference a single relation. If so, and if
+ * it has extended statistics, try to apply those.
+ */
+ rel = find_single_rel_for_clauses(root, clauses);
+ if (use_extended_stats && rel && rel->rtekind == RTE_RELATION && rel->statlist != NIL)
+ {
+ /*
+ * Estimate as many clauses as possible using extended statistics.
+ *
+ * 'estimatedclauses' tracks the 0-based list position index of
+ * clauses that we've estimated using extended statistics, and that
+ * should be ignored.
+ */
+ s1 *= statext_clauselist_selectivity(root, clauses, varRelid,
+ jointype, sjinfo, rel,
+ &estimatedclauses, false);
+ }
/*
+ * Apply normal selectivity estimates for remaining clauses. We'll be
+ * careful to skip any clauses which were already estimated above.
+ *
* Anything that doesn't look like a potential rangequery clause gets
* multiplied into s1 and forgotten. Anything that does gets inserted into
* an rqlist entry.
@@ -190,7 +176,9 @@ clauselist_selectivity_simple(PlannerInf
continue;
/* Always compute the selectivity using clause_selectivity */
- s2 = clause_selectivity(root, clause, varRelid, jointype, sjinfo);
+ s2 = clause_selectivity_opt_ext_stats(root, clause, varRelid,
+ jointype, sjinfo,
+ use_extended_stats);
/*
* Check for being passed a RestrictInfo.
@@ -351,6 +339,91 @@ clauselist_selectivity_simple(PlannerInf
}
/*
+ * clauselist_selectivity_or -
+ * Compute the selectivity of an implicitly-ORed list of boolean
+ * expression clauses. The list can be empty, in which case 0.0
+ * must be returned. List elements may be either RestrictInfos
+ * or bare expression clauses --- the former is preferred since
+ * it allows caching of results.
+ *
+ * See clause_selectivity() for the meaning of the additional parameters.
+ *
+ * The basic approach is to apply extended statistics first, on as many
+ * clauses as possible, in order to capture cross-column dependencies etc.
+ * The remaining clauses are then estimated as if they were independent.
+ */
+Selectivity
+clauselist_selectivity_or(PlannerInfo *root,
+ List *clauses,
+ int varRelid,
+ JoinType jointype,
+ SpecialJoinInfo *sjinfo,
+ bool use_extended_stats)
+{
+ Selectivity s1 = 0.0;
+ RelOptInfo *rel;
+ Bitmapset *estimatedclauses = NULL;
+ ListCell *lc;
+ int listidx;
+
+ /*
+ * Determine if these clauses reference a single relation. If so, and if
+ * it has extended statistics, try to apply those.
+ */
+ rel = find_single_rel_for_clauses(root, clauses);
+ if (use_extended_stats && rel && rel->rtekind == RTE_RELATION && rel->statlist != NIL)
+ {
+ /*
+ * Estimate as many clauses as possible using extended statistics.
+ *
+ * 'estimatedclauses' tracks the 0-based list position index of
+ * clauses that we've estimated using extended statistics, and that
+ * should be ignored.
+ *
+ * XXX We can't multiply with current value, because for OR clauses we
+ * start with 0.0, so we simply assign to 's' directly.
+ */
+ s1 = statext_clauselist_selectivity(root, clauses, varRelid,
+ jointype, sjinfo, rel,
+ &estimatedclauses, true);
+ }
+
+ /*
+ * Estimate the remaining clauses.
+ *
+ * Selectivities for an OR clause are computed as s1+s2 - s1*s2 to account
+ * for the probable overlap of selected tuple sets.
+ *
+ * XXX is this too conservative?
+ */
+ listidx = -1;
+ foreach(lc, clauses)
+ {
+ Selectivity s2;
+
+ listidx++;
+
+ /*
+ * Skip this clause if it's already been estimated by some other
+ * statistics above.
+ */
+ if (bms_is_member(listidx, estimatedclauses))
+ continue;
+
+ s2 = clause_selectivity_opt_ext_stats(root,
+ (Node *) lfirst(lc),
+ varRelid,
+ jointype,
+ sjinfo,
+ use_extended_stats);
+
+ s1 = s1 + s2 - s1 * s2;
+ }
+
+ return s1;
+}
+
+/*
* addRangeClause --- add a new range clause for clauselist_selectivity
*
* Here is where we try to match up pairs of range-query clauses
@@ -602,6 +675,18 @@ clause_selectivity(PlannerInfo *root,
JoinType jointype,
SpecialJoinInfo *sjinfo)
{
+ return clause_selectivity_opt_ext_stats(root, clause, varRelid,
+ jointype, sjinfo, true);
+}
+
+static Selectivity
+clause_selectivity_opt_ext_stats(PlannerInfo *root,
+ Node *clause,
+ int varRelid,
+ JoinType jointype,
+ SpecialJoinInfo *sjinfo,
+ bool use_extended_stats)
+{
Selectivity s1 = 0.5; /* default for any unhandled clause type */
RestrictInfo *rinfo = NULL;
bool cacheable = false;
@@ -716,42 +801,35 @@ clause_selectivity(PlannerInfo *root,
else if (is_notclause(clause))
{
/* inverse of the selectivity of the underlying clause */
- s1 = 1.0 - clause_selectivity(root,
- (Node *) get_notclausearg((Expr *) clause),
- varRelid,
- jointype,
- sjinfo);
+ s1 = 1.0 - clause_selectivity_opt_ext_stats(root,
+ (Node *) get_notclausearg((Expr *) clause),
+ varRelid,
+ jointype,
+ sjinfo,
+ use_extended_stats);
}
else if (is_andclause(clause))
{
/* share code with clauselist_selectivity() */
- s1 = clauselist_selectivity(root,
- ((BoolExpr *) clause)->args,
- varRelid,
- jointype,
- sjinfo);
+ s1 = clauselist_selectivity_opt_ext_stats(root,
+ ((BoolExpr *) clause)->args,
+ varRelid,
+ jointype,
+ sjinfo,
+ use_extended_stats);
}
else if (is_orclause(clause))
{
/*
- * Selectivities for an OR clause are computed as s1+s2 - s1*s2 to
- * account for the probable overlap of selected tuple sets.
- *
- * XXX is this too conservative?
+ * Almost the same thing as clauselist_selectivity, but with the
+ * clauses connected by OR.
*/
- ListCell *arg;
-
- s1 = 0.0;
- foreach(arg, ((BoolExpr *) clause)->args)
- {
- Selectivity s2 = clause_selectivity(root,
- (Node *) lfirst(arg),
- varRelid,
- jointype,
- sjinfo);
-
- s1 = s1 + s2 - s1 * s2;
- }
+ s1 = clauselist_selectivity_or(root,
+ ((BoolExpr *) clause)->args,
+ varRelid,
+ jointype,
+ sjinfo,
+ use_extended_stats);
}
else if (is_opclause(clause) || IsA(clause, DistinctExpr))
{
@@ -852,20 +930,22 @@ clause_selectivity(PlannerInfo *root,
else if (IsA(clause, RelabelType))
{
/* Not sure this case is needed, but it can't hurt */
- s1 = clause_selectivity(root,
- (Node *) ((RelabelType *) clause)->arg,
- varRelid,
- jointype,
- sjinfo);
+ s1 = clause_selectivity_opt_ext_stats(root,
+ (Node *) ((RelabelType *) clause)->arg,
+ varRelid,
+ jointype,
+ sjinfo,
+ use_extended_stats);
}
else if (IsA(clause, CoerceToDomain))
{
/* Not sure this case is needed, but it can't hurt */
- s1 = clause_selectivity(root,
- (Node *) ((CoerceToDomain *) clause)->arg,
- varRelid,
- jointype,
- sjinfo);
+ s1 = clause_selectivity_opt_ext_stats(root,
+ (Node *) ((CoerceToDomain *) clause)->arg,
+ varRelid,
+ jointype,
+ sjinfo,
+ use_extended_stats);
}
else
{
diff --git a/src/backend/statistics/dependencies.c b/src/backend/statistics/dependencies.c
new file mode 100644
index d950b4e..d05afa0
--- a/src/backend/statistics/dependencies.c
+++ b/src/backend/statistics/dependencies.c
@@ -1073,8 +1073,9 @@ clauselist_apply_dependencies(PlannerInf
}
}
- simple_sel = clauselist_selectivity_simple(root, attr_clauses, varRelid,
- jointype, sjinfo, NULL);
+ simple_sel = clauselist_selectivity_opt_ext_stats(root, attr_clauses,
+ varRelid, jointype,
+ sjinfo, false);
attr_sel[attidx++] = simple_sel;
}
diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
new file mode 100644
index 3632692..06213b6
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -1282,16 +1282,17 @@ statext_is_compatible_clause(PlannerInfo
static Selectivity
statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varRelid,
JoinType jointype, SpecialJoinInfo *sjinfo,
- RelOptInfo *rel, Bitmapset **estimatedclauses)
+ RelOptInfo *rel, Bitmapset **estimatedclauses,
+ bool is_or)
{
ListCell *l;
Bitmapset **list_attnums;
int listidx;
- Selectivity sel = 1.0;
+ Selectivity sel = (is_or) ? 0.0 : 1.0;
/* check if there's any stats that might be useful for us. */
if (!has_stats_of_kind(rel->statlist, STATS_EXT_MCV))
- return 1.0;
+ return sel;
list_attnums = (Bitmapset **) palloc(sizeof(Bitmapset *) *
list_length(clauses));
@@ -1377,8 +1378,17 @@ statext_mcv_clauselist_selectivity(Plann
* columns/clauses. We'll then use the various selectivities computed
* from MCV list to improve it.
*/
- simple_sel = clauselist_selectivity_simple(root, stat_clauses, varRelid,
- jointype, sjinfo, NULL);
+ if (is_or)
+ simple_sel = clauselist_selectivity_or(root, stat_clauses,
+ varRelid, jointype, sjinfo,
+ false);
+ else
+ simple_sel = clauselist_selectivity_opt_ext_stats(root,
+ stat_clauses,
+ varRelid,
+ jointype,
+ sjinfo,
+ false);
/*
* Now compute the multi-column estimate from the MCV list, along with
@@ -1386,7 +1396,7 @@ statext_mcv_clauselist_selectivity(Plann
*/
mcv_sel = mcv_clauselist_selectivity(root, stat, stat_clauses, varRelid,
jointype, sjinfo, rel,
- &mcv_basesel, &mcv_totalsel);
+ &mcv_basesel, &mcv_totalsel, is_or);
/* Estimated selectivity of values not covered by MCV matches */
other_sel = simple_sel - mcv_basesel;
@@ -1404,7 +1414,10 @@ statext_mcv_clauselist_selectivity(Plann
CLAMP_PROBABILITY(stat_sel);
/* Factor the estimate from this MCV to the overall estimate. */
- sel *= stat_sel;
+ if (is_or)
+ sel = sel + stat_sel - sel * stat_sel;
+ else
+ sel *= stat_sel;
}
return sel;
@@ -1417,13 +1430,21 @@ statext_mcv_clauselist_selectivity(Plann
Selectivity
statext_clauselist_selectivity(PlannerInfo *root, List *clauses, int varRelid,
JoinType jointype, SpecialJoinInfo *sjinfo,
- RelOptInfo *rel, Bitmapset **estimatedclauses)
+ RelOptInfo *rel, Bitmapset **estimatedclauses,
+ bool is_or)
{
Selectivity sel;
/* First, try estimating clauses using a multivariate MCV list. */
sel = statext_mcv_clauselist_selectivity(root, clauses, varRelid, jointype,
- sjinfo, rel, estimatedclauses);
+ sjinfo, rel, estimatedclauses, is_or);
+
+ /*
+ * Functional dependencies only work for clauses connected by AND, so for
+ * OR clauses we're done.
+ */
+ if (is_or)
+ return sel;
/*
* Then, apply functional dependencies on the remaining clauses by calling
diff --git a/src/backend/statistics/mcv.c b/src/backend/statistics/mcv.c
new file mode 100644
index 6a262f1..7c5841e
--- a/src/backend/statistics/mcv.c
+++ b/src/backend/statistics/mcv.c
@@ -1904,7 +1904,8 @@ mcv_clauselist_selectivity(PlannerInfo *
List *clauses, int varRelid,
JoinType jointype, SpecialJoinInfo *sjinfo,
RelOptInfo *rel,
- Selectivity *basesel, Selectivity *totalsel)
+ Selectivity *basesel, Selectivity *totalsel,
+ bool is_or)
{
int i;
MCVList *mcv;
@@ -1917,7 +1918,7 @@ mcv_clauselist_selectivity(PlannerInfo *
mcv = statext_mcv_load(stat->statOid);
/* build a match bitmap for the clauses */
- matches = mcv_get_match_bitmap(root, clauses, stat->keys, mcv, false);
+ matches = mcv_get_match_bitmap(root, clauses, stat->keys, mcv, is_or);
/* sum frequencies for all the matching MCV items */
*basesel = 0.0;
diff --git a/src/include/optimizer/optimizer.h b/src/include/optimizer/optimizer.h
new file mode 100644
index 3e41710..27f7985
--- a/src/include/optimizer/optimizer.h
+++ b/src/include/optimizer/optimizer.h
@@ -58,17 +58,23 @@ extern Selectivity clause_selectivity(Pl
int varRelid,
JoinType jointype,
SpecialJoinInfo *sjinfo);
-extern Selectivity clauselist_selectivity_simple(PlannerInfo *root,
- List *clauses,
- int varRelid,
- JoinType jointype,
- SpecialJoinInfo *sjinfo,
- Bitmapset *estimatedclauses);
extern Selectivity clauselist_selectivity(PlannerInfo *root,
List *clauses,
int varRelid,
JoinType jointype,
SpecialJoinInfo *sjinfo);
+extern Selectivity clauselist_selectivity_opt_ext_stats(PlannerInfo *root,
+ List *clauses,
+ int varRelid,
+ JoinType jointype,
+ SpecialJoinInfo *sjinfo,
+ bool use_extended_stats);
+Selectivity clauselist_selectivity_or(PlannerInfo *root,
+ List *clauses,
+ int varRelid,
+ JoinType jointype,
+ SpecialJoinInfo *sjinfo,
+ bool use_extended_stats);
/* in path/costsize.c: */
diff --git a/src/include/statistics/extended_stats_internal.h b/src/include/statistics/extended_stats_internal.h
new file mode 100644
index 61e6969..08d639e
--- a/src/include/statistics/extended_stats_internal.h
+++ b/src/include/statistics/extended_stats_internal.h
@@ -107,6 +107,7 @@ extern Selectivity mcv_clauselist_select
SpecialJoinInfo *sjinfo,
RelOptInfo *rel,
Selectivity *basesel,
- Selectivity *totalsel);
+ Selectivity *totalsel,
+ bool is_or);
#endif /* EXTENDED_STATS_INTERNAL_H */
diff --git a/src/include/statistics/statistics.h b/src/include/statistics/statistics.h
new file mode 100644
index 50fce49..c9ed211
--- a/src/include/statistics/statistics.h
+++ b/src/include/statistics/statistics.h
@@ -116,7 +116,8 @@ extern Selectivity statext_clauselist_se
JoinType jointype,
SpecialJoinInfo *sjinfo,
RelOptInfo *rel,
- Bitmapset **estimatedclauses);
+ Bitmapset **estimatedclauses,
+ bool is_or);
extern bool has_stats_of_kind(List *stats, char requiredkind);
extern StatisticExtInfo *choose_best_statistics(List *stats, char requiredkind,
Bitmapset **clause_attnums,
diff --git a/src/test/regress/expected/stats_ext.out b/src/test/regress/expected/stats_ext.out
new file mode 100644
index 4c3edd2..abab5d6
--- a/src/test/regress/expected/stats_ext.out
+++ b/src/test/regress/expected/stats_ext.out
@@ -1113,6 +1113,12 @@ SELECT * FROM check_estimated_rows('SELE
200 | 200
(1 row)
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = ''1'' OR c = 1 OR d IS NOT NULL');
+ estimated | actual
+-----------+--------
+ 200 | 200
+(1 row)
+
SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IN (1, 2, 51, 52) AND b IN ( ''1'', ''2'')');
estimated | actual
-----------+--------
@@ -1173,13 +1179,6 @@ SELECT * FROM check_estimated_rows('SELE
100 | 100
(1 row)
--- we can't use the statistic for OR clauses that are not fully covered (missing 'd' attribute)
-SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = ''1'' OR c = 1 OR d IS NOT NULL');
- estimated | actual
------------+--------
- 343 | 200
-(1 row)
-
-- check change of unrelated column type does not reset the MCV statistics
ALTER TABLE mcv_lists ALTER COLUMN d TYPE VARCHAR(64);
SELECT d.stxdmcv IS NOT NULL
@@ -1477,6 +1476,110 @@ SELECT * FROM check_estimated_rows('SELE
1 | 0
(1 row)
+-- mcv covering just a small fraction of data
+CREATE TABLE mcv_lists_partial (
+ a INT,
+ b INT,
+ c INT
+);
+-- 10 frequent groups, each with 100 elements
+INSERT INTO mcv_lists_partial (a, b, c)
+ SELECT
+ mod(i,10),
+ mod(i,10),
+ mod(i,10)
+ FROM generate_series(0,999) s(i);
+-- 100 groups that will make it to the MCV list (includes the 10 frequent ones)
+INSERT INTO mcv_lists_partial (a, b, c)
+ SELECT
+ i,
+ i,
+ i
+ FROM generate_series(0,99) s(i);
+-- 4000 groups in total, most of which won't make it (just a single item)
+INSERT INTO mcv_lists_partial (a, b, c)
+ SELECT
+ i,
+ i,
+ i
+ FROM generate_series(0,3999) s(i);
+ANALYZE mcv_lists_partial;
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 AND b = 0 AND c = 0');
+ estimated | actual
+-----------+--------
+ 1 | 102
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 OR b = 0 OR c = 0');
+ estimated | actual
+-----------+--------
+ 300 | 102
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 10 AND b = 10 AND c = 10');
+ estimated | actual
+-----------+--------
+ 1 | 2
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 10 OR b = 10 OR c = 10');
+ estimated | actual
+-----------+--------
+ 6 | 2
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 AND b = 0 AND c = 10');
+ estimated | actual
+-----------+--------
+ 1 | 0
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 OR b = 0 OR c = 10');
+ estimated | actual
+-----------+--------
+ 204 | 104
+(1 row)
+
+CREATE STATISTICS mcv_lists_partial_stats (mcv) ON a, b, c
+ FROM mcv_lists_partial;
+ANALYZE mcv_lists_partial;
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 AND b = 0 AND c = 0');
+ estimated | actual
+-----------+--------
+ 102 | 102
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 OR b = 0 OR c = 0');
+ estimated | actual
+-----------+--------
+ 402 | 102
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 10 AND b = 10 AND c = 10');
+ estimated | actual
+-----------+--------
+ 2 | 2
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 10 OR b = 10 OR c = 10');
+ estimated | actual
+-----------+--------
+ 8 | 2
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 AND b = 0 AND c = 10');
+ estimated | actual
+-----------+--------
+ 1 | 0
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 OR b = 0 OR c = 10');
+ estimated | actual
+-----------+--------
+ 308 | 104
+(1 row)
+
+DROP TABLE mcv_lists_partial;
-- check the ability to use multiple MCV lists
CREATE TABLE mcv_lists_multi (
a INTEGER,
@@ -1506,12 +1609,36 @@ SELECT * FROM check_estimated_rows('SELE
102 | 714
(1 row)
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 AND c = 0');
+ estimated | actual
+-----------+--------
+ 143 | 142
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 OR c = 0');
+ estimated | actual
+-----------+--------
+ 1571 | 1572
+(1 row)
+
SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0 AND c = 0 AND d = 0');
estimated | actual
-----------+--------
4 | 142
(1 row)
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE (a = 0 AND b = 0) OR (c = 0 AND d = 0)');
+ estimated | actual
+-----------+--------
+ 298 | 1572
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 OR b = 0 OR c = 0 OR d = 0');
+ estimated | actual
+-----------+--------
+ 2649 | 1572
+(1 row)
+
-- create separate MCV statistics
CREATE STATISTICS mcv_lists_multi_1 (mcv) ON a, b FROM mcv_lists_multi;
CREATE STATISTICS mcv_lists_multi_2 (mcv) ON c, d FROM mcv_lists_multi;
@@ -1528,12 +1655,36 @@ SELECT * FROM check_estimated_rows('SELE
714 | 714
(1 row)
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 AND c = 0');
+ estimated | actual
+-----------+--------
+ 143 | 142
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 OR c = 0');
+ estimated | actual
+-----------+--------
+ 1571 | 1572
+(1 row)
+
SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0 AND c = 0 AND d = 0');
estimated | actual
-----------+--------
143 | 142
(1 row)
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE (a = 0 AND b = 0) OR (c = 0 AND d = 0)');
+ estimated | actual
+-----------+--------
+ 1571 | 1572
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 OR b = 0 OR c = 0 OR d = 0');
+ estimated | actual
+-----------+--------
+ 1571 | 1572
+(1 row)
+
DROP TABLE mcv_lists_multi;
-- Permission tests. Users should not be able to see specific data values in
-- the extended statistics, if they lack permission to see those values in
diff --git a/src/test/regress/sql/stats_ext.sql b/src/test/regress/sql/stats_ext.sql
new file mode 100644
index 9781e59..3ec6dda
--- a/src/test/regress/sql/stats_ext.sql
+++ b/src/test/regress/sql/stats_ext.sql
@@ -561,6 +561,8 @@ SELECT * FROM check_estimated_rows('SELE
SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = ''1'' OR c = 1');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = ''1'' OR c = 1 OR d IS NOT NULL');
+
SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IN (1, 2, 51, 52) AND b IN ( ''1'', ''2'')');
SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IN (1, 2, 51, 52, NULL) AND b IN ( ''1'', ''2'', NULL)');
@@ -581,9 +583,6 @@ SELECT * FROM check_estimated_rows('SELE
SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a < ALL (ARRAY[4, 5]) AND b IN (''1'', ''2'', NULL, ''3'') AND c > ANY (ARRAY[1, 2, NULL, 3])');
--- we can't use the statistic for OR clauses that are not fully covered (missing 'd' attribute)
-SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = ''1'' OR c = 1 OR d IS NOT NULL');
-
-- check change of unrelated column type does not reset the MCV statistics
ALTER TABLE mcv_lists ALTER COLUMN d TYPE VARCHAR(64);
@@ -777,6 +776,70 @@ SELECT * FROM check_estimated_rows('SELE
SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a AND b AND NOT c');
+-- mcv covering just a small fraction of data
+CREATE TABLE mcv_lists_partial (
+ a INT,
+ b INT,
+ c INT
+);
+
+-- 10 frequent groups, each with 100 elements
+INSERT INTO mcv_lists_partial (a, b, c)
+ SELECT
+ mod(i,10),
+ mod(i,10),
+ mod(i,10)
+ FROM generate_series(0,999) s(i);
+
+-- 100 groups that will make it to the MCV list (includes the 10 frequent ones)
+INSERT INTO mcv_lists_partial (a, b, c)
+ SELECT
+ i,
+ i,
+ i
+ FROM generate_series(0,99) s(i);
+
+-- 4000 groups in total, most of which won't make it (just a single item)
+INSERT INTO mcv_lists_partial (a, b, c)
+ SELECT
+ i,
+ i,
+ i
+ FROM generate_series(0,3999) s(i);
+
+ANALYZE mcv_lists_partial;
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 AND b = 0 AND c = 0');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 OR b = 0 OR c = 0');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 10 AND b = 10 AND c = 10');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 10 OR b = 10 OR c = 10');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 AND b = 0 AND c = 10');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 OR b = 0 OR c = 10');
+
+CREATE STATISTICS mcv_lists_partial_stats (mcv) ON a, b, c
+ FROM mcv_lists_partial;
+
+ANALYZE mcv_lists_partial;
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 AND b = 0 AND c = 0');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 OR b = 0 OR c = 0');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 10 AND b = 10 AND c = 10');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 10 OR b = 10 OR c = 10');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 AND b = 0 AND c = 10');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE a = 0 OR b = 0 OR c = 10');
+
+DROP TABLE mcv_lists_partial;
+
-- check the ability to use multiple MCV lists
CREATE TABLE mcv_lists_multi (
a INTEGER,
@@ -799,7 +862,11 @@ ANALYZE mcv_lists_multi;
-- estimates without any mcv statistics
SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0');
SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE c = 0 AND d = 0');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 AND c = 0');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 OR c = 0');
SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0 AND c = 0 AND d = 0');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE (a = 0 AND b = 0) OR (c = 0 AND d = 0)');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 OR b = 0 OR c = 0 OR d = 0');
-- create separate MCV statistics
CREATE STATISTICS mcv_lists_multi_1 (mcv) ON a, b FROM mcv_lists_multi;
@@ -809,7 +876,11 @@ ANALYZE mcv_lists_multi;
SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0');
SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE c = 0 AND d = 0');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 AND c = 0');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE b = 0 OR c = 0');
SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0 AND c = 0 AND d = 0');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE (a = 0 AND b = 0) OR (c = 0 AND d = 0)');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 OR b = 0 OR c = 0 OR d = 0');
DROP TABLE mcv_lists_multi;