I consider there to be two fairly independent sub-problems:

1. Improve our cardinality estimates (especially for queries with
complex joins and aggregation).

2. Calibrate physical operators so that, given a good estimate of the
number of rows they will see, we can come up with a reasonable
estimate of the physical cost (e.g. how long the query will take to
execute, and how much memory it will use).

For 1, the paper you cite (and the join-order benchmark it introduces)
is an excellent contribution to the field. It inspired me to do work
on data profiling [1]. I would encourage you to build on the work I
have already done.

For 2, I have not done any work personally. An approach would be to
give each physical operator (e.g. EnumerableHashJoin) a cost model
parameterized by certain constants, and then run experiments to
determine the values of those constants empirically. Perhaps we could
write a "TuningTool" that generates an "operator constants file", and
thereby start to formalize the process.
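
To make that concrete, here is a minimal sketch of what such a
parameterized cost model could look like. None of this is existing
Calcite code: the class name, property keys and default values are all
hypothetical, and a real implementation would feed into RelOptCost /
computeSelfCost rather than stand alone.

import java.io.IOException;
import java.io.Reader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Properties;

/**
 * Hypothetical sketch (not existing Calcite code): a hash-join cost
 * model whose constants come from an "operator constants file"
 * produced by a tuning tool for the target hardware.
 */
public class HashJoinCostModel {
  private final double cpuPerBuildRow;   // cost units per build-side row
  private final double cpuPerProbeRow;   // cost units per probe-side row
  private final double bytesPerBuildRow; // memory per row in the hash table

  HashJoinCostModel(Properties constants) {
    // Defaults are placeholders; a TuningTool would replace them with
    // empirically measured values.
    this.cpuPerBuildRow = parse(constants, "hashjoin.cpuPerBuildRow", 1.0);
    this.cpuPerProbeRow = parse(constants, "hashjoin.cpuPerProbeRow", 0.5);
    this.bytesPerBuildRow = parse(constants, "hashjoin.bytesPerBuildRow", 48.0);
  }

  private static double parse(Properties p, String key, double defaultValue) {
    final String value = p.getProperty(key);
    return value == null ? defaultValue : Double.parseDouble(value);
  }

  /** Estimated CPU cost, given the planner's row-count estimates. */
  double cpuCost(double buildRows, double probeRows) {
    return cpuPerBuildRow * buildRows + cpuPerProbeRow * probeRows;
  }

  /** Estimated memory (bytes) for the build-side hash table. */
  double memoryCost(double buildRows) {
    return bytesPerBuildRow * buildRows;
  }

  public static void main(String[] args) throws IOException {
    final Properties constants = new Properties();
    final Path file = Paths.get("operator-constants.properties");
    if (Files.exists(file)) {
      try (Reader reader = Files.newBufferedReader(file)) {
        constants.load(reader);
      }
    }
    final HashJoinCostModel model = new HashJoinCostModel(constants);
    System.out.println("cpu    = " + model.cpuCost(1_000, 1_000_000));
    System.out.println("memory = " + model.memoryCost(1_000));
  }
}

A TuningTool could then run micro-benchmarks over inputs of varying
sizes, fit the constants (e.g. by linear regression), and write them
to the operator constants file for the planner to load.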

Julian

[1] https://www.slideshare.net/julianhyde/data-profiling-in-apache-calcite

On Fri, Aug 7, 2020 at 6:57 AM Thomas Rebele <[email protected]> wrote:
>
> Hi all,
>
> I'm working on basic query optimization. I once stumbled on a case where
> two operators had the same row count but one had a much higher CPU cost.
> Unfortunately, the default cost model takes only the row count into
> account (see [1]). Stamatis pointed out in another mail that the row count
> might be much more important than the other cost components [2]. However,
> if there are two possible choices with the same row count, we should
> prefer the one with the lower CPU cost. I'm wondering whether the
> assumption that a smaller row count is better in most cases is actually
> correct. Also, what does "better" mean in this context? The query plan
> with the shortest execution time? Maybe there's a plan that is just <10%
> slower, but consumes much less CPU/memory/etc.
>
> So I thought about the cost model in general, and how to improve it. I
> assume that the more closely the estimated cost corresponds to the real
> cost, the better the optimized plans will be. So the first step would be
> to collect real-world statistics, and the second step to adapt the cost
> estimation so that it corresponds more closely to those measurements. To
> begin with, I would just measure how many rows each RelNode produced and
> how much time it spent during query execution. Is there already a way to
> do this in Calcite? Does this make sense at all?
>
> [1]
> https://github.com/apache/calcite/blob/52a57078ba081b24b9d086ed363c715485d1a519/core/src/main/java/org/apache/calcite/plan/volcano/VolcanoCost.java#L100
> [2]
> https://15721.courses.cs.cmu.edu/spring2019/papers/24-costmodels/p204-leis.pdf
>
> Cordialement / Best Regards,
> Thomas Rebele | R&D Developer | 18 rue du 4 septembre, 75002 Paris, France
> | www.tibco.com
