Re: Gather real world statistics about row count, CPU, and IO cost

Thomas Rebele Mon, 10 Aug 2020 03:48:51 -0700

Thank you for your analysis and the references. I would be more interested
in sub-problem 2.
A kind of tuning tool would be nice. If a project using Calcite  introduces
custom operators, it could help to find a cost model for them.


I would start with a simple case, e.g., deciding whether a particular join
should be translated into a hash join, merge join, or correlate.
With a different cardinality of the inputs, one algorithm or the other
might be better. The real-world statistics might help to decide which one
to choose.

Cordialement / Best Regards,
*Thomas Rebele* | R&D Developer | 18 rue du 4 septembre, 75002 Paris, France
| www.tibco.com


On Fri, Aug 7, 2020 at 11:32 PM Julian Hyde <[email protected]> wrote:

> Also, check out the paper "Learning to Optimize Join Queries With Deep
> Reinforcement Learning" by Krishnan, Yang, Goldberg, Hellerstein,
> Stoica 2019 (which aims to improve the join-order benchmark and uses
> Calcite as one of its test platforms):
> https://arxiv.org/pdf/1808.03196.pdf
>
> On Fri, Aug 7, 2020 at 2:02 PM Julian Hyde <[email protected]> wrote:
> >
> > I consider there to be two fairly independent sub-problems:
> >
> > 1. Improve our cardinality estimates (especially for queries with
> > complex joins and aggregation).
> >
> > 2. Calibrate physical operators so that, given a good estimate of the
> > number of rows they will see, we can come up with a reasonable
> > estimate of the physical cost (e.g. how long the query will take to
> > execute, and how much memory).
> >
> > For 1, the paper you cite, and the join-order benchmark it introduces,
> > is an excellent contribution to the field. It inspired me to do work
> > on profiling [1]. I would encourage you to build on the work I have
> > already done.
> >
> > For 2, I have not done any work personally. An approach would be to
> > give each physical operator (e.g. EnumerableHashJoin) a cost model
> > parameterized by certain constants, and then run experiments to
> > determine the values of those constants empirically. Perhaps we could
> > write a "TuningTool" that generates an "operator constants file", and
> > thereby start to formalize the process.
> >
> > Julian
> >
> > [1]
> https://www.slideshare.net/julianhyde/data-profiling-in-apache-calcite
> >
> > On Fri, Aug 7, 2020 at 6:57 AM Thomas Rebele <[email protected]>
> wrote:
> > >
> > > Hi all,
> > >
> > > I'm working on basic query optimization. I once stumbled on the case
> that
> > > two operators had the same row count but one had a much higher CPU
> cost.
> > > Unfortunately the default cost model only takes the row count into
> account
> > > (see [1]). Stamatis had pointed out in another mail that the row count
> > > might be much more important than the other costs [2]. However, if
> there
> > > are two possible choices with the same row count, we should prefer the
> one
> > > with the least CPU cost. I'm wondering whether the assumption that a
> > > smaller row count is better in most cases is actually correct. Also,
> what
> > > is "better" in this context? The query plan with the least execution
> time?
> > > Maybe there's a plan that is just <10% slower, but consumes much less
> > > CPU/memory/etc.
> > >
> > > So I thought about the cost model in general, and how to improve it. I
> > > assume the better the estimated cost corresponds to the real cost, the
> > > better the optimized plans. So the first step would be to collect the
> real
> > > world statistics and the second step to adapt the cost estimation so
> that
> > > there's a better correspondence. For the beginning I would just
> measure how
> > > many rows have been in the result and how much time has passed for each
> > > RelNode during query execution. Is there already a way to do this in
> > > Calcite? Does this make sense at all?
> > >
> > > [1]
> > >
> https://github.com/apache/calcite/blob/52a57078ba081b24b9d086ed363c715485d1a519/core/src/main/java/org/apache/calcite/plan/volcano/VolcanoCost.java#L100
> > > [2]
> > >
> https://15721.courses.cs.cmu.edu/spring2019/papers/24-costmodels/p204-leis.pdf
> > >
> > > Cordialement / Best Regards,
> > > *Thomas Rebele* | R&D Developer | 18 rue du 4 septembre, 75002 Paris,
> France
> > > | www.tibco.com
>

Re: Gather real world statistics about row count, CPU, and IO cost

Reply via email to