Peter Eisentraut <pete...@gmx.net> writes: > On lör, 2011-08-27 at 13:32 -0400, Tom Lane wrote: >> The larger problem is that if a subquery didn't get flattened, it's >> often because it's got LIMIT, or GROUP BY, or some similar clause that >> makes it highly suspect whether the statistics available for the table >> column are reasonable to use for the subquery outputs. It wouldn't be >> that hard to grab the stats for test2.sha1, but then how do you want >> to adjust them to reflect the LIMIT?
> It turns out that this is a regression introduced in 8.4.8; Well, the fact that examine_variable punts on subqueries is certainly not a "regression introduced in 8.4.8"; it's always been like that. I think your observation that 8.4.8 is worse has to be blamed on commit 0ae8b300388c2a3eaf90e6e6f13d6be1f4d4ac2d, which introduced a fallback rule of assuming 0.5 selectivity for a semijoin if we didn't have non-default ndistinct estimates on both sides. Before that, 8.4.x would go ahead and apply its heuristic rule, essentially Min(nd2/nd1, 1), even when one or both ndistinct values were completely made-up. I'm not sure what we could do instead. Perhaps you could argue that we should just revert that commit on the grounds that it's doing more harm than good, but I don't really believe that --- I think reverting would just move the pain points around. It's pure luck that 8.4.8 is worse rather than better on the particular example you cite. On a longer-term basis, I'm looking into what we could do with extracting stats from subqueries, but that doesn't seem like material for a backpatch. I have a draft patch that I've been playing with (attached). The main thing I feel unsure about is whether it's reasonable to extract stats in this way from a subquery that has GROUP BY or DISTINCT. ISTM it's probably okay to ignore joining, sorting, or limiting in the subquery: those might affect the stats of the subquery output, but this is no worse than using the unmodified column statistics for any other join-level selectivity estimate (where we already ignore the effects of scan-level restriction clauses that will filter the column values). But GROUP BY or DISTINCT would entirely invalidate the column frequency statistics, which makes me think that ignoring the pg_statistic entry might be the thing to do. Comments? regards, tom lane
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c index ef29374fcccae23cb663c04470f12c22321a0e2c..a703f4a910c0e5f521f09cf9564a05c73cf803b8 100644 *** a/src/backend/utils/adt/selfuncs.c --- b/src/backend/utils/adt/selfuncs.c *************** *** 95,100 **** --- 95,101 ---- #include "access/sysattr.h" #include "catalog/index.h" #include "catalog/pg_collation.h" + #include "catalog/pg_inherits_fn.h" #include "catalog/pg_opfamily.h" #include "catalog/pg_statistic.h" #include "catalog/pg_type.h" *************** static double convert_one_bytea_to_scala *** 168,173 **** --- 169,176 ---- int rangelo, int rangehi); static char *convert_string_datum(Datum value, Oid typid); static double convert_timevalue_to_scalar(Datum value, Oid typid); + static void examine_variable_recurse(Query *subquery, AttrNumber resno, + VariableStatData *vardata); static bool get_variable_range(PlannerInfo *root, VariableStatData *vardata, Oid sortop, Datum *min, Datum *max); static bool get_actual_variable_range(PlannerInfo *root, *************** examine_variable(PlannerInfo *root, Node *** 4176,4196 **** } else if (rte->rtekind == RTE_RELATION) { vardata->statsTuple = SearchSysCache3(STATRELATTINH, ObjectIdGetDatum(rte->relid), Int16GetDatum(var->varattno), BoolGetDatum(rte->inh)); vardata->freefunc = ReleaseSysCache; } else { /* ! * XXX This means the Var comes from a JOIN or sub-SELECT. Later ! * add code to dig down into the join etc and see if we can trace ! * the variable to something with stats. (But beware of ! * sub-SELECTs with DISTINCT/GROUP BY/etc. Perhaps there are no ! * cases where this would really be useful, because we'd have ! * flattened the subselect if it is??) */ } --- 4179,4205 ---- } else if (rte->rtekind == RTE_RELATION) { + /* plain table, so look up the column in pg_statistic */ vardata->statsTuple = SearchSysCache3(STATRELATTINH, ObjectIdGetDatum(rte->relid), Int16GetDatum(var->varattno), BoolGetDatum(rte->inh)); vardata->freefunc = ReleaseSysCache; } + else if (rte->rtekind == RTE_SUBQUERY) + { + /* recurse into subquery */ + examine_variable_recurse(rte->subquery, var->varattno, + vardata); + } else { /* ! * Otherwise, the Var comes from a FUNCTION, VALUES, or CTE RTE. ! * (We won't see RTE_JOIN here because join alias Vars have ! * already been flattened.) There's not much we can do with ! * function outputs, but maybe someday try to be smarter about ! * VALUES and/or CTEs. */ } *************** examine_variable(PlannerInfo *root, Node *** 4335,4340 **** --- 4344,4435 ---- } /* + * examine_variable_recurse + * Handle a Var referring to a subquery result for examine_variable + * + * If the Var ultimately refers to a column of a table, we can fill in the + * statsTuple; none of the other fields of vardata need to be changed. + * (In particular, we don't try to set isunique, since even if the underlying + * column is unique, the subquery may have joined to other tables in a way + * that creates duplicates.) + * + * XXX For the moment, we ignore the possibility that the subquery may change + * the column's statistics so much that we'd be better off not consulting + * pg_statistic at all. Possibly we should give up if the subquery uses + * GROUP BY or DISTINCT, for instance. + * + * The subquery we are looking at has not been through planner preprocessing, + * so some things have to be done differently here than in examine_variable. + * (Perhaps we should rearrange things so that the sub-root data structure + * is kept around after we plan a subquery?) + */ + static void + examine_variable_recurse(Query *subquery, AttrNumber resno, + VariableStatData *vardata) + { + TargetEntry *ste; + Var *var; + RangeTblEntry *rte; + + /* Get the subquery output expression referenced by the upper Var */ + ste = get_tle_by_resno(subquery->targetList, resno); + if (ste == NULL || ste->resjunk) + elog(ERROR, "subquery does not have attribute %d", resno); + var = (Var *) ste->expr; + + restart: + /* Can only handle a simple Var of subquery's query level */ + if (var && IsA(var, Var) && + var->varlevelsup == 0) + { + /* OK, fetch the RTE referenced by the Var */ + rte = rt_fetch(var->varno, subquery->rtable); + + /* XXX Can't call get_relation_stats_hook for lack of subroot */ + + if (rte->rtekind == RTE_RELATION) + { + /* + * We can't just rely on rte->inh here, because in parser output + * that only signifies the absence of ONLY. We have to check + * relhassubclass as well. For speed, we don't actually examine + * pg_inherits, so we might get fooled by a table that once had + * children but no longer does. But that's okay, because ANALYZE + * will still produce inheritance-tree statistics in such cases, + * so we will still find an applicable pg_statistic entry. + */ + bool inh; + + inh = rte->inh && has_subclass(rte->relid); + vardata->statsTuple = SearchSysCache3(STATRELATTINH, + ObjectIdGetDatum(rte->relid), + Int16GetDatum(var->varattno), + BoolGetDatum(inh)); + vardata->freefunc = ReleaseSysCache; + } + else if (rte->rtekind == RTE_SUBQUERY) + { + /* recurse some more ... */ + examine_variable_recurse(rte->subquery, var->varattno, + vardata); + } + else if (rte->rtekind == RTE_JOIN) + { + /* + * Since join alias variables haven't been flattened in the + * subquery, we have to be prepared to dereference them. + */ + if (var->varattno == InvalidAttrNumber) + return; /* punt on whole-row reference */ + Assert(var->varattno > 0); + var = (Var *) list_nth(rte->joinaliasvars, var->varattno - 1); + goto restart; + } + /* As in examine_variable, do nothing for other RTE kinds */ + } + } + + /* * get_variable_numdistinct * Estimate the number of distinct values of a variable. *
-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers