Hello Konstantin,
Sorry for not responding to this thread earlier. I definitely agree the features proposed here are very interesting and useful, and I appreciate that you kept rebasing the patch.

I think the patch improving join estimates can be treated as separate, and I see it already has a separate CF entry - it still points to this thread, though, which will be confusing. I suggest we start a different thread for it, to keep the discussions separate. I'll focus on the auto_explain part here.

I did have some ideas about adaptive query optimization too, although maybe in a slightly different form. My plan was to collect information about estimated / actual cardinalities, and then use this knowledge to directly tweak the estimates - without creating extended statistics, but treating the collected info about estimates / row counts as a kind of ad hoc statistics. (I'm not sure if this is what the AQE extension does.)

What is being proposed here - an extension suggesting which statistics to create (and possibly creating them automatically) - is certainly useful, but I'm not sure I'd call it "adaptive query optimization". To me, "adaptive" means the extension directly modifies the estimates based on past executions. So I propose calling it "statistics advisor" or something like that.

A couple of additional points:

1) I think we should create a new extension for this. auto_explain has a fairly well-defined purpose, and I don't think this is consistent with it. It's quite likely this will require stuff like shared memory, etc., which auto_explain does not (and should not) need. Let's call it statistics_advisor, or something like that. It will use roughly the same planner/executor callbacks as auto_explain, but that's fine, I think.

2) I'm not sure creating statistics automatically based on a single query execution is a good idea. I think we'll need to collect data from multiple runs (in shared memory), and make suggestions based on that.
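To illustrate what I mean by "multiple runs" - a toy sketch (in Python, not the actual C extension code; all names and thresholds here are made up for illustration) of accumulating per-qual estimate vs. actual row counts across executions, and only flagging a qual once enough runs show a consistently large misestimate:

```python
# Hypothetical sketch: accumulate misestimates per qual across runs,
# suggest extended statistics only after repeated, large errors.
# MIN_RUNS / MIN_ROWS / MIN_LOG_ERROR are invented thresholds.
import math
from collections import defaultdict

MIN_RUNS = 10                   # need evidence from multiple executions
MIN_ROWS = 1000                 # ignore tiny row counts
MIN_LOG_ERROR = math.log(10.0)  # ~10x average over/underestimate

class QualStats:
    def __init__(self):
        self.nruns = 0
        self.total_log_error = 0.0
        self.max_actual = 0

    def record(self, est_rows, actual_rows):
        # Symmetric log-scale error: 10x over == 10x under.
        self.nruns += 1
        self.total_log_error += abs(
            math.log(max(actual_rows, 1) / max(est_rows, 1)))
        self.max_actual = max(self.max_actual, actual_rows)

    def suggest(self):
        return (self.nruns >= MIN_RUNS
                and self.max_actual >= MIN_ROWS
                and self.total_log_error / self.nruns >= MIN_LOG_ERROR)

# Key = (table, columns referenced by the qual); in the extension this
# would live in shared memory, keyed by some qual fingerprint.
stats = defaultdict(QualStats)
for _ in range(10):
    stats[("t", ("a", "b"))].record(est_rows=100, actual_rows=50000)

print([qual for qual, s in stats.items() if s.suggest()])
```

A single execution with a 500x misestimate would not trigger a suggestion here; ten of them would. The in-memory dict stands in for the shared-memory hash table the extension would actually need.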
3) I wonder if it should also consider the duration of the query (who cares about the estimates if the query still executed in 10ms)? Similarly, it should probably require some minimal number of rows (1 vs. 10 rows is likely different from 1M vs. 10M rows, even though both are a 10x difference).

4) Ideally, it'd evaluate the impact of the improved estimates on the whole query plan (you may fix one node, but the cost difference for the whole query may be negligible). But that seems very hard/expensive :-(

5) I think AddMultiColumnStatisticsForQual() needs refactoring - it mixes stuff at many different levels of abstraction (generating names, deciding which statistics to build, ...). I think it'll also need some improvements to better identify which Vars to consider for statistics, and once we get support for statistics on expressions committed (which seems to be fairly close now), also to handle expressions.

BTW, why is "qual" in

    static void AddMultiColumnStatisticsForQual(void *qual, ExplainState *es)

declared as "void *"? Shouldn't that be "List *"?

6) I'm not sure about automatically creating the stats. I can't imagine anyone actually enabling that in production, TBH (I myself probably would not do that). I suggest we instead provide an easy way to show which statistics are suggested. For one execution that might be integrated into EXPLAIN ANALYZE, I guess (through some callback, which seems fairly easy to do). For many executions (you can leave it running for a couple of days, then see what the suggestion is based on X runs) we could have a view or something. This would also work for read-only replicas, where just creating the statistics is impossible.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company