Hello Konstantin,

Sorry for not responding to this thread earlier. I definitely agree the
features proposed here are very interesting and useful, and I appreciate
that you kept rebasing the patch.

I think the patch improving join estimates can be treated as separate,
and I see it already has a separate CF entry - however, it still points
to this thread, which is confusing. I suggest we start a different
thread for it, to keep the discussions separate.

I'll focus on the auto_explain part here.


I did have some ideas about adaptive query optimization too, although
maybe in a slightly different form. My plan was to collect information
about estimated / actual cardinalities, and then use this knowledge to
directly tweak the estimates. Directly, without creating extended stats,
treating the collected info about estimates / row counts as a kind of
ad hoc statistics. (Not sure if this is what the AQE extension does.)


What is being proposed here - an extension suggesting which statistics
to create (and possibly creating them automatically) is certainly
useful, but I'm not sure I'd call it "adaptive query optimization". I
think "adaptive" means the extension directly modifies the estimates
based on past executions. So I propose calling it maybe "statistics
advisor" or something like that.


A couple of additional points:

1) I think we should create a new extension for this.

auto_explain has a fairly well-defined purpose, and I don't think this
is consistent with it. It's quite likely this will require stuff like
shared memory, etc., which auto_explain does not (and should not) need.

Let's call it statistics_advisor, or something like that. It will use
about the same planner/executor callbacks as auto_explain, but I think
that's fine.


2) I'm not sure creating statistics automatically based on a single
query execution is a good idea. I think we'll need to collect data from
multiple runs (in shared memory), and do suggestions based on that.


3) I wonder if it should also consider the duration of the query (who
cares about estimates if it still executed in 10ms)? Similarly, it
probably should require some minimal number of rows (1 vs. 10 rows is
likely different from 1M vs. 10M rows, but both are a 10x difference).


4) Ideally it'd evaluate impact of the improved estimates on the whole
query plan (you may fix one node, but the cost difference for the whole
query may be negligible). But that seems very hard/expensive :-(


5) I think AddMultiColumnStatisticsForQual() needs refactoring - it
mixes stuff at many different levels of abstraction (generating names,
deciding which statistics to build, ...). I think it'll also need some
improvements to better identify which Vars to consider for statistics,
and once we get support for statistics on expressions committed (which
seems to be fairly close now) also to handle expressions.

BTW Why is "qual" in

  static void
  AddMultiColumnStatisticsForQual(void* qual, ExplainState *es)

declared as "void *"? Shouldn't that be "List *"?


6) I'm not sure about automatically creating the stats. I can't imagine
anyone actually enabling that on production, TBH (I myself probably
would not do that). I suggest we instead provide an easy way to show
which statistics are suggested.

For one execution that might be integrated into EXPLAIN ANALYZE, I guess
(through some callback, which seems fairly easy to do).

For many executions (you can leave it running for a couple of days, then
see what the suggestion is based on X runs) we could have a view or
something. This would also work for read-only replicas, where just
creating the statistics is impossible.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
