Re: Calcite's CBO

Lekshmi Tue, 12 Feb 2019 05:35:54 -0800

Hi Stamatis,
   Thanks a lot for the quick response and support. I think,  now I can
give an overview of the cost model used in calcite to my other group
members. ,


Thanks and Regards

Lekshmi B.G
Email: [email protected]




On Tue, Feb 12, 2019 at 1:58 PM Stamatis Zampetakis <[email protected]>
wrote:

> Hi Lekshmi,
>
> There is really no need to apologise. Hope you and your kid are better now.
>
> I don't think there is right or wrong regarding the cost model since it
> depends mostly on the use-case.
> If you have a look at the RelOptCost interface you will notice that it is
> an implementation detail when
> a cost is less than (Lt) another cost. If you implement an optimizer
> yourself it is up to you to pass
> the appropriate cost factory in the constructor of the Volcano planner. If
> you are interested about the
> current state and what happens by default in the VolcanoPlanner have a look
> in the VolcanoCost class.
> You will observe that the default implementation of cost for Volcano is
> based only on the number
> of rows (io and cpu are ignored) so the algorithms in [1] are not used.
>
> Best,
> Stamatis
>
> Στις Τρί, 12 Φεβ 2019 στις 10:52 π.μ., ο/η Lekshmi <[email protected]>
> έγραψε:
>
> > Hi,
> >    Thanks a lot all for your support. I'm really sorry for the late
> > response because my kid and myself were affected with flu since the last
> > few days.
> > I have one more doubt,
> > > records are a partial order: (1, 3) is less than (2, 4), but (1, 3) is
> > neither less than nor greater than (2, 2).
> > If this is the case, then how should apache calcite find the best plan?
> Is
> > that using any of the specified algorithms in that referenced paper? ([1]
> > http://www.vldb.org/pvldb/vol8/p221-trummer.pdf)
> > or
> > if we have 2 records like (a,b,c) and (x,y,z) and then comparison happens
> > between them like,
> > if (a<x), then go with the plan having (a,b,c)
> >  else if (a=x, and b<y), then also go with the plan having cost (a,b,c)
> >  and else if (a=x and b=y and c>z) then go with the plan having (x,y,z)
> ...
> > comparing the record elements one by one..
> > Is this correct? to compare the cost records in Calcite?
> >
> >
> > Thanks and Regards
> >
> > Lekshmi B.G
> > Email: [email protected]
> >
> >
> >
> >
> > On Sat, Feb 9, 2019 at 12:51 AM Julian Hyde <[email protected]> wrote:
> >
> > > In retrospect, I think it was a mistake to make cost a record. Of
> course
> > > we would like to represent different aspects of cost (cpu, io, number
> of
> > > bytes moved, amount of memory used) but the optimization algorithm has
> > > trouble if the cost function does not have a total order (i.e. c1 <= c3
> > if
> > > and only if c1 <= c2 and c2 <= c3), and records are a partial order:
> (1,
> > 3)
> > > is less than (2, 4), but (1, 3) is neither less than nor greater than
> (2,
> > > 2).
> > >
> > > Multi-objective parametric query optimization [1] is an approach that
> > > works with costs that are partial orders.
> > >
> > > Julian
> > >
> > > [1] http://www.vldb.org/pvldb/vol8/p221-trummer.pdf
> > >
> > > > On Feb 8, 2019, at 3:34 PM, Michael Mior <[email protected]> wrote:
> > > >
> > > > Cost is always a record in Calcite, so yes. Costs from different
> > > > conventions are compared directly. It's assumed that all the costs
> are
> > > > expressed in a way they can be correctly compared. Again, this is not
> > > > ideal, but it's what we have now and it works relatively well. The
> > > > planner finds the cheapest plan using essentially the same algorithm
> > > > as used in the Volcano planner which I referenced earlier. You can
> see
> > > > the code in VolcanoPlanner#findBestExp.
> > > > --
> > > > Michael Mior
> > > > [email protected]
> > > >
> > > > Le ven. 8 févr. 2019 à 14:34, Lekshmi <[email protected]> a
> écrit
> > :
> > > >>
> > > >> Hi,
> > > >>  I understand... If we are overriding the calcite's cost function in
> > > >> adapter, then should we also generate cost as record? And how does
> > > calcite
> > > >> compare the costs of different nodes in different convention? And
> how
> > > >> calcite generate the cheapest plan? Is there any algorithm dedicated
> > for
> > > >> that? I would like to know more about that. Because we would like to
> > > >> develop a system in which Calcite plays the major role of
> optimization
> > > and
> > > >> execution. So for that we would be developing a cost function and
> > costs
> > > >> estimated for this should be in the comparable format of calcite.
> > > >>
> > > >> Thank you so much for the support and suggestions....
> > > >>
> > > >> On Thu, 7 Feb 2019, 4:07 pm Michael Mior <[email protected] wrote:
> > > >>
> > > >>> I/O cost is always zero because Calcite itself doesn't try to
> > estimate
> > > >>> I/O cost. As I mentioned. my previous reply, this can be overridden
> > by
> > > >>> adapters. Row count is an estimate which starts off as the
> estimated
> > > >>> number of rows in the table but then is impacted by the estimated
> > > >>> filter factors of queries. So if the estimated number of rows is
> 100
> > > >>> and filters are applied that remove an estimated 95% and 30% of
> rows,
> > > >>> then the estimated rows produced by the query will be 3.5.
> > > >>>
> > > >>> --
> > > >>> Michael Mior
> > > >>> [email protected]
> > > >>>
> > > >>> Le jeu. 7 févr. 2019 à 04:55, Lekshmi <[email protected]> a
> > écrit
> > > :
> > > >>>>
> > > >>>> Dear Michael Mior,
> > > >>>>   Thank you so much for the reply. But still, I'm confused with
> > > >>> following
> > > >>>>
> > > >>>> When I connected PostgreSQL with Calcite and execute TPCH queries,
> > > >>> provided
> > > >>>> data residing in the Postgres database, I got the cheapest plan
> like
> > > >>> below
> > > >>>> for TPCH query 1.
> > > >>>>
> > > >>>> EnumerableProject(l_returnflag=[$0], l_linestatus=[$1],
> > > >>> SUM_QTY=[CASE(=($3,
> > > >>>> 0), null, $2)], SUM_BASE_PRICE=[CASE(=($5, 0), null, $4)],
> > > >>>> SUM_DISC_PRICE=[CASE(=($7, 0), null, $6)], SUM_CHARGE=[CASE(=($9,
> > 0),
> > > >>> null,
> > > >>>> $8)], AVG_QTY=[/(CASE(=($3, 0), null, $2), $3)],
> > > AVG_PRICE=[/(CASE(=($5,
> > > >>>> 0), null, $4), $5)], AVG_DISC=[/(CASE(=($11, 0), null, $10),
> $11)],
> > > >>>> COUNT_ORDER=[$12]): rowcount = 1.0, cumulative cost = {216.75
> rows,
> > > >>>> 1870.344248356904 cpu, 0.0 io}, id = 316
> > > >>>>
> > > >>>>  EnumerableLimit(fetch=[1]): rowcount = 1.0, cumulative cost =
> > {215.75
> > > >>>> rows, 1860.344248356904 cpu, 0.0 io}, id = 315
> > > >>>>
> > > >>>>    EnumerableSort(sort0=[$0], sort1=[$1], dir0=[ASC], dir1=[ASC]):
> > > >>>> rowcount = 10.0, cumulative cost = {214.75 rows, 1859.344248356904
> > > cpu,
> > > >>> 0.0
> > > >>>> io}, id = 314
> > > >>>>
> > > >>>>      JdbcToEnumerableConverter: rowcount = 10.0, cumulative cost =
> > > >>> {204.75
> > > >>>> rows, 662.0 cpu, 0.0 io}, id = 313
> > > >>>>
> > > >>>>        JdbcAggregate(group=[{0, 1}], SUM_QTY=[$SUM0($2)],
> > > >>>> agg#1=[COUNT($2)], SUM_BASE_PRICE=[$SUM0($3)], agg#3=[COUNT($3)],
> > > >>>> SUM_DISC_PRICE=[$SUM0($4)], agg#5=[COUNT($4)],
> > SUM_CHARGE=[$SUM0($5)],
> > > >>>> agg#7=[COUNT($5)], agg#8=[$SUM0($6)], agg#9=[COUNT($6)],
> > > >>>> COUNT_ORDER=[COUNT()]): rowcount = 10.0, cumulative cost = {203.75
> > > rows,
> > > >>>> 661.0 cpu, 0.0 io}, id = 312
> > > >>>>
> > > >>>>          JdbcProject(l_returnflag=[$8], l_linestatus=[$9],
> > > >>>> l_quantity=[$4], l_extendedprice=[$5], $f4=[*($5, -(1, $6))],
> > > >>> $f5=[*(*($5,
> > > >>>> -(1, $6)), +(1, $7))], l_discount=[$6]): rowcount = 100.0,
> > cumulative
> > > >>> cost
> > > >>>> = {180.0 rows, 661.0 cpu, 0.0 io}, id = 311
> > > >>>>
> > > >>>>            JdbcTableScan(table=[[TPCH, lineitem]]): rowcount =
> > 100.0,
> > > >>>> cumulative cost = {100.0 rows, 101.0 cpu, 0.0 io}, id = 0
> > > >>>>
> > > >>>>
> > > >>>> But I didn't understand why I/O cost is always zero? And why row
> > > count is
> > > >>>> not an integer? Is row count is not actually a count? What is an
> I/O
> > > >>> cost?
> > > >>>> Is it count of I/O read for required input or time taken for
> reading
> > > >>>> required input? because i/o is also not an integer above.
> > > >>>>
> > > >>>>
> > > >>>>
> > > >>>>
> > > >>>>
> > > >>>>
> > > >>>>
> > > >>>> Thanks and Regards
> > > >>>>
> > > >>>> Lekshmi B.G
> > > >>>> Email: [email protected]
> > > >>>>
> > > >>>>
> > > >>>>
> > > >>>>
> > > >>>> On Thu, Feb 7, 2019 at 1:37 AM Michael Mior <[email protected]>
> > wrote:
> > > >>>>
> > > >>>>> You're correct that it's not a single value. That said, in
> > practice,
> > > >>>>> not much is actually done with. CPU or I/O cost. You'll see in
> > > >>>>> TableScan that computeSelfCost returns a cost which really only
> > makes
> > > >>>>> use of the number of rows in the table. Individual adapters may
> > > >>>>> override this, but few do.
> > > >>>>> --
> > > >>>>> Michael Mior
> > > >>>>> [email protected]
> > > >>>>>
> > > >>>>> Le mer. 6 févr. 2019 à 05:22, Lekshmi <[email protected]> a
> > > écrit
> > > >>> :
> > > >>>>>>
> > > >>>>>> Hi,
> > > >>>>>> In [0], they suggest, "the optimizer implementer can choose the
> > > >>> cost to
> > > >>>>> be
> > > >>>>>> a number or record". Which does one Apache Calcite use? I found,
> > it
> > > >>> as a
> > > >>>>>> record, when I run queries in debug mode.  Is that correct? Then
> > can
> > > >>> you
> > > >>>>>> please define (rows, CPU, I/O) in Calcite? Also, when, we are
> > > >>> connecting
> > > >>>>>> calcite across multiple systems, then 'i/o' should be greater
> than
> > > >>> '0.0',
> > > >>>>>> right? Because it needs to read the results from the push-down
> > > >>>>> operations?
> > > >>>>>> Can you please explain these in a bit more detail? I appreciate
> > your
> > > >>>>>> support and suggestions.
> > > >>>>>>
> > > >>>>>>
> > > >>>>>> Thanks and Regards
> > > >>>>>>
> > > >>>>>> Lekshmi B.G
> > > >>>>>> Email: [email protected]
> > > >>>>>>
> > > >>>>>>
> > > >>>>>>
> > > >>>>>>
> > > >>>>>> On Tue, Feb 5, 2019 at 11:30 PM Michael Mior <[email protected]>
> > > >>> wrote:
> > > >>>>>>
> > > >>>>>>> Calcite's optimizer is based off the Volcano optimizer[0]. In
> > that
> > > >>>>>>> paper you'll find an outline of the algorithm which is
> basically
> > > >>>>>>> equivalent to what Calcite uses. Adding multiple systems
> doesn't
> > > >>>>>>> complicate things very much. The main addition used by Calcite
> is
> > > >>> what
> > > >>>>>>> we call a "convention" trait that allows the optimizer to deal
> > with
> > > >>>>>>> expressions across multiple systems. More details are available
> > in
> > > >>> a
> > > >>>>>>> recently published paper on Calcite [1].
> > > >>>>>>>
> > > >>>>>>> One important caveat to note is that the cost model used is not
> > > >>> likely
> > > >>>>>>> to reflect the actual cost of query execution in many cases.
> It's
> > > >>>>>>> generally "good enough" in that the ordering of plans by cost
> > will
> > > >>> be
> > > >>>>>>> approximately correct. So although the optimal plan will be
> > > >>> selected
> > > >>>>>>> according to the cost model, the plan which is actually the
> best
> > in
> > > >>>>>>> practice may not be selected. That said, I would expect Calcite
> > > >>> will
> > > >>>>>>> pick a plan which is generally quite close to the optimal,  but
> > we
> > > >>>>>>> have no guarantee of this.
> > > >>>>>>>
> > > >>>>>>> [0]
> > > >>>>>>>
> > > >>>>>
> > > >>>
> > >
> >
> https://pdfs.semanticscholar.org/a817/a3e74d1663d9eb35b4baf3161ab16f57df85.pdf
> > > >>>>>>> [1] https://arxiv.org/pdf/1802.10233.pdf
> > > >>>>>>>
> > > >>>>>>> --
> > > >>>>>>> Michael Mior
> > > >>>>>>> [email protected]
> > > >>>>>>>
> > > >>>>>>> Le mar. 5 févr. 2019 à 15:52, Lekshmi <[email protected]>
> a
> > > >>> écrit
> > > >>>>> :
> > > >>>>>>>>
> > > >>>>>>>> Hi,
> > > >>>>>>>>   I would like to know about the Calcite CBO in detail,
> > > >>> including
> > > >>>>> how it
> > > >>>>>>>> deals with global optimization when multiple processing
> systems
> > > >>> are
> > > >>>>>>>> associated with it. Any documentation, pointers are much
> > > >>> appreciated.
> > > >>>>>>>>
> > > >>>>>>>>
> > > >>>>>>>> Thanks and Regards
> > > >>>>>>>>
> > > >>>>>>>> Lekshmi B.G
> > > >>>>>>>> Email: [email protected]
> > > >>>>>>>
> > > >>>>>
> > > >>>
> > >
> > >
> >
>

Re: Calcite's CBO

Reply via email to