Hi Tomas and Hackers, Thanks for your reply and feedback!
> I don't understand how you could achieve this by mutating pg_statistic, without also breaking estimation for queries that only have Y<20. I agree, if we mutate pg_statistics, we will break lots of stuff and the process becomes complicated. That's also why I think mutating the bounds makes more sense and is easier to achieve. > Maybe, but it's really hard to comment on this without seeing any PoC patches. We don't know where you you'd like the hook called, what info would it have access to, how would it tweak the selectivities etc. I have attached a PoC patch to this mail. Essentially in this patch, I only try to pass the pointer of the constval in ```scalarineqsql``` function. It is enough from the Postgres side. With that, I can handle other things in an independent extension. I hope this makes sense. Best regards, Xiaozhe On Wed, Nov 17, 2021 at 2:49 PM Tomas Vondra <tomas.von...@enterprisedb.com> wrote: > > > On 11/17/21 2:24 PM, Xiaozhe Yao wrote: > > Hi hackers, > > > > I am currently working on improving the cardinality estimation component > > in PostgreSQL with machine learning. I came up with a solution that > > mutates the bounds for different columns. For example, assume that we > > have a query > > > > ``` > > select * from test where X<10 and Y<20; > > ``` > > > > Our approach tries to learn the relation between X and Y. For example, > > if we have a linear relation, Y=X+10. Then Y<20 is essentially > > equivalent to X<10. Therefore we can mutate the Y<20 to Y<INT_MAX so > > that the selectivity will be 1, and we will have a more accurate > estimation. > > > > OK. FWIW the extended statistics patch originally included a patch for > multi-dimensional histograms, and that would have worked for this > example just fine, I guess. But yeah, there are various other > dependencies for which a histogram would not help. And ML might discover > that and help ... > > > It seems to me that we can achieve something similar by mutating the > > pg_statistics, however, mutating the bounds is something more > > straightforward to me and less expensive. > > > > I don't understand how you could achieve this by mutating pg_statistic, > without also breaking estimation for queries that only have Y<20. > > > I am wondering if it is possible to have such an extension? Or if there > > is a better solution to this? I have already implemented this stuff in a > > private repository, and if this is something you like, I can further > > propose the patch to the list. > > > > Maybe, but it's really hard to comment on this without seeing any PoC > patches. We don't know where you you'd like the hook called, what info > would it have access to, how would it tweak the selectivities etc. > > If you think this would work, write a PoC patch and we'll see. > > > regards > > -- > Tomas Vondra > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company >
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c index 10895fb287..cdf4762c60 100644 --- a/src/backend/utils/adt/selfuncs.c +++ b/src/backend/utils/adt/selfuncs.c @@ -143,6 +143,8 @@ /* Hooks for plugins to get control when we ask for stats */ get_relation_stats_hook_type get_relation_stats_hook = NULL; get_index_stats_hook_type get_index_stats_hook = NULL; +mutate_bounds_hook_type mutate_bounds_hook = NULL; + static double eqsel_internal(PG_FUNCTION_ARGS, bool negate); static double eqjoinsel_inner(Oid opfuncoid, Oid collation, @@ -669,6 +671,9 @@ scalarineqsel(PlannerInfo *root, Oid operator, bool isgt, bool iseq, fmgr_info(get_opcode(operator), &opproc); + if (mutate_bounds_hook) { + mutate_bounds_hook(root, &constval, isgt, iseq); + } /* * If we have most-common-values info, add up the fractions of the MCV * entries that satisfy MCV OP CONST. These fractions contribute directly diff --git a/src/include/utils/selfuncs.h b/src/include/utils/selfuncs.h index 9dd444e1ff..b406b0aa2b 100644 --- a/src/include/utils/selfuncs.h +++ b/src/include/utils/selfuncs.h @@ -145,6 +145,9 @@ typedef bool (*get_index_stats_hook_type) (PlannerInfo *root, VariableStatData *vardata); extern PGDLLIMPORT get_index_stats_hook_type get_index_stats_hook; +typedef bool (*mutate_bounds_hook_type) (PlannerInfo *root, Datum *bound, bool isgt, bool iseq); +extern PGDLLIMPORT mutate_bounds_hook_type mutate_bounds_hook; + /* Functions in selfuncs.c */ extern void examine_variable(PlannerInfo *root, Node *node, int varRelid,