Peter Eisentraut <[email protected]> writes:
> On lör, 2011-08-27 at 13:32 -0400, Tom Lane wrote:
>> The larger problem is that if a subquery didn't get flattened, it's
>> often because it's got LIMIT, or GROUP BY, or some similar clause that
>> makes it highly suspect whether the statistics available for the table
>> column are reasonable to use for the subquery outputs. It wouldn't be
>> that hard to grab the stats for test2.sha1, but then how do you want
>> to adjust them to reflect the LIMIT?
> It turns out that this is a regression introduced in 8.4.8;
Well, the fact that examine_variable punts on subqueries is certainly
not a "regression introduced in 8.4.8"; it's always been like that.
I think your observation that 8.4.8 is worse has to be blamed on
commit 0ae8b300388c2a3eaf90e6e6f13d6be1f4d4ac2d, which introduced a
fallback rule of assuming 0.5 selectivity for a semijoin if we didn't
have non-default ndistinct estimates on both sides. Before that, 8.4.x
would go ahead and apply its heuristic rule, essentially Min(nd2/nd1, 1),
even when one or both ndistinct values were completely made-up.
I'm not sure what we could do instead. Perhaps you could argue that
we should just revert that commit on the grounds that it's doing more
harm than good, but I don't really believe that --- I think reverting
would just move the pain points around. It's pure luck that 8.4.8
is worse rather than better on the particular example you cite.
On a longer-term basis, I'm looking into what we could do with
extracting stats from subqueries, but that doesn't seem like material
for a backpatch. I have a draft patch that I've been playing with
(attached). The main thing I feel unsure about is whether it's
reasonable to extract stats in this way from a subquery that has GROUP
BY or DISTINCT. ISTM it's probably okay to ignore joining, sorting, or
limiting in the subquery: those might affect the stats of the subquery
output, but this is no worse than using the unmodified column statistics
for any other join-level selectivity estimate (where we already ignore
the effects of scan-level restriction clauses that will filter the
column values). But GROUP BY or DISTINCT would entirely invalidate the
column frequency statistics, which makes me think that ignoring the
pg_statistic entry might be the thing to do. Comments?
regards, tom lane
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index ef29374fcccae23cb663c04470f12c22321a0e2c..a703f4a910c0e5f521f09cf9564a05c73cf803b8 100644
*** a/src/backend/utils/adt/selfuncs.c
--- b/src/backend/utils/adt/selfuncs.c
***************
*** 95,100 ****
--- 95,101 ----
#include "access/sysattr.h"
#include "catalog/index.h"
#include "catalog/pg_collation.h"
+ #include "catalog/pg_inherits_fn.h"
#include "catalog/pg_opfamily.h"
#include "catalog/pg_statistic.h"
#include "catalog/pg_type.h"
*************** static double convert_one_bytea_to_scala
*** 168,173 ****
--- 169,176 ----
int rangelo, int rangehi);
static char *convert_string_datum(Datum value, Oid typid);
static double convert_timevalue_to_scalar(Datum value, Oid typid);
+ static void examine_variable_recurse(Query *subquery, AttrNumber resno,
+ VariableStatData *vardata);
static bool get_variable_range(PlannerInfo *root, VariableStatData *vardata,
Oid sortop, Datum *min, Datum *max);
static bool get_actual_variable_range(PlannerInfo *root,
*************** examine_variable(PlannerInfo *root, Node
*** 4176,4196 ****
}
else if (rte->rtekind == RTE_RELATION)
{
vardata->statsTuple = SearchSysCache3(STATRELATTINH,
ObjectIdGetDatum(rte->relid),
Int16GetDatum(var->varattno),
BoolGetDatum(rte->inh));
vardata->freefunc = ReleaseSysCache;
}
else
{
/*
! * XXX This means the Var comes from a JOIN or sub-SELECT. Later
! * add code to dig down into the join etc and see if we can trace
! * the variable to something with stats. (But beware of
! * sub-SELECTs with DISTINCT/GROUP BY/etc. Perhaps there are no
! * cases where this would really be useful, because we'd have
! * flattened the subselect if it is??)
*/
}
--- 4179,4205 ----
}
else if (rte->rtekind == RTE_RELATION)
{
+ /* plain table, so look up the column in pg_statistic */
vardata->statsTuple = SearchSysCache3(STATRELATTINH,
ObjectIdGetDatum(rte->relid),
Int16GetDatum(var->varattno),
BoolGetDatum(rte->inh));
vardata->freefunc = ReleaseSysCache;
}
+ else if (rte->rtekind == RTE_SUBQUERY)
+ {
+ /* recurse into subquery */
+ examine_variable_recurse(rte->subquery, var->varattno,
+ vardata);
+ }
else
{
/*
! * Otherwise, the Var comes from a FUNCTION, VALUES, or CTE RTE.
! * (We won't see RTE_JOIN here because join alias Vars have
! * already been flattened.) There's not much we can do with
! * function outputs, but maybe someday try to be smarter about
! * VALUES and/or CTEs.
*/
}
*************** examine_variable(PlannerInfo *root, Node
*** 4335,4340 ****
--- 4344,4435 ----
}
/*
+ * examine_variable_recurse
+ * Handle a Var referring to a subquery result for examine_variable
+ *
+ * If the Var ultimately refers to a column of a table, we can fill in the
+ * statsTuple; none of the other fields of vardata need to be changed.
+ * (In particular, we don't try to set isunique, since even if the underlying
+ * column is unique, the subquery may have joined to other tables in a way
+ * that creates duplicates.)
+ *
+ * XXX For the moment, we ignore the possibility that the subquery may change
+ * the column's statistics so much that we'd be better off not consulting
+ * pg_statistic at all. Possibly we should give up if the subquery uses
+ * GROUP BY or DISTINCT, for instance.
+ *
+ * The subquery we are looking at has not been through planner preprocessing,
+ * so some things have to be done differently here than in examine_variable.
+ * (Perhaps we should rearrange things so that the sub-root data structure
+ * is kept around after we plan a subquery?)
+ */
+ static void
+ examine_variable_recurse(Query *subquery, AttrNumber resno,
+ VariableStatData *vardata)
+ {
+ TargetEntry *ste;
+ Var *var;
+ RangeTblEntry *rte;
+
+ /* Get the subquery output expression referenced by the upper Var */
+ ste = get_tle_by_resno(subquery->targetList, resno);
+ if (ste == NULL || ste->resjunk)
+ elog(ERROR, "subquery does not have attribute %d", resno);
+ var = (Var *) ste->expr;
+
+ restart:
+ /* Can only handle a simple Var of subquery's query level */
+ if (var && IsA(var, Var) &&
+ var->varlevelsup == 0)
+ {
+ /* OK, fetch the RTE referenced by the Var */
+ rte = rt_fetch(var->varno, subquery->rtable);
+
+ /* XXX Can't call get_relation_stats_hook for lack of subroot */
+
+ if (rte->rtekind == RTE_RELATION)
+ {
+ /*
+ * We can't just rely on rte->inh here, because in parser output
+ * that only signifies the absence of ONLY. We have to check
+ * relhassubclass as well. For speed, we don't actually examine
+ * pg_inherits, so we might get fooled by a table that once had
+ * children but no longer does. But that's okay, because ANALYZE
+ * will still produce inheritance-tree statistics in such cases,
+ * so we will still find an applicable pg_statistic entry.
+ */
+ bool inh;
+
+ inh = rte->inh && has_subclass(rte->relid);
+ vardata->statsTuple = SearchSysCache3(STATRELATTINH,
+ ObjectIdGetDatum(rte->relid),
+ Int16GetDatum(var->varattno),
+ BoolGetDatum(inh));
+ vardata->freefunc = ReleaseSysCache;
+ }
+ else if (rte->rtekind == RTE_SUBQUERY)
+ {
+ /* recurse some more ... */
+ examine_variable_recurse(rte->subquery, var->varattno,
+ vardata);
+ }
+ else if (rte->rtekind == RTE_JOIN)
+ {
+ /*
+ * Since join alias variables haven't been flattened in the
+ * subquery, we have to be prepared to dereference them.
+ */
+ if (var->varattno == InvalidAttrNumber)
+ return; /* punt on whole-row reference */
+ Assert(var->varattno > 0);
+ var = (Var *) list_nth(rte->joinaliasvars, var->varattno - 1);
+ goto restart;
+ }
+ /* As in examine_variable, do nothing for other RTE kinds */
+ }
+ }
+
+ /*
* get_variable_numdistinct
* Estimate the number of distinct values of a variable.
*
--
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers