Hi Stamatis, Thanks a lot for the quick response and support. I think, now I can give an overview of the cost model used in calcite to my other group members. ,
Thanks and Regards Lekshmi B.G Email: lekshmib...@gmail.com On Tue, Feb 12, 2019 at 1:58 PM Stamatis Zampetakis <zabe...@gmail.com> wrote: > Hi Lekshmi, > > There is really no need to apologise. Hope you and your kid are better now. > > I don't think there is right or wrong regarding the cost model since it > depends mostly on the use-case. > If you have a look at the RelOptCost interface you will notice that it is > an implementation detail when > a cost is less than (Lt) another cost. If you implement an optimizer > yourself it is up to you to pass > the appropriate cost factory in the constructor of the Volcano planner. If > you are interested about the > current state and what happens by default in the VolcanoPlanner have a look > in the VolcanoCost class. > You will observe that the default implementation of cost for Volcano is > based only on the number > of rows (io and cpu are ignored) so the algorithms in [1] are not used. > > Best, > Stamatis > > Στις Τρί, 12 Φεβ 2019 στις 10:52 π.μ., ο/η Lekshmi <lekshmib...@gmail.com> > έγραψε: > > > Hi, > > Thanks a lot all for your support. I'm really sorry for the late > > response because my kid and myself were affected with flu since the last > > few days. > > I have one more doubt, > > > records are a partial order: (1, 3) is less than (2, 4), but (1, 3) is > > neither less than nor greater than (2, 2). > > If this is the case, then how should apache calcite find the best plan? > Is > > that using any of the specified algorithms in that referenced paper? ([1] > > http://www.vldb.org/pvldb/vol8/p221-trummer.pdf) > > or > > if we have 2 records like (a,b,c) and (x,y,z) and then comparison happens > > between them like, > > if (a<x), then go with the plan having (a,b,c) > > else if (a=x, and b<y), then also go with the plan having cost (a,b,c) > > and else if (a=x and b=y and c>z) then go with the plan having (x,y,z) > ... > > comparing the record elements one by one.. > > Is this correct? to compare the cost records in Calcite? > > > > > > Thanks and Regards > > > > Lekshmi B.G > > Email: lekshmib...@gmail.com > > > > > > > > > > On Sat, Feb 9, 2019 at 12:51 AM Julian Hyde <jh...@apache.org> wrote: > > > > > In retrospect, I think it was a mistake to make cost a record. Of > course > > > we would like to represent different aspects of cost (cpu, io, number > of > > > bytes moved, amount of memory used) but the optimization algorithm has > > > trouble if the cost function does not have a total order (i.e. c1 <= c3 > > if > > > and only if c1 <= c2 and c2 <= c3), and records are a partial order: > (1, > > 3) > > > is less than (2, 4), but (1, 3) is neither less than nor greater than > (2, > > > 2). > > > > > > Multi-objective parametric query optimization [1] is an approach that > > > works with costs that are partial orders. > > > > > > Julian > > > > > > [1] http://www.vldb.org/pvldb/vol8/p221-trummer.pdf > > > > > > > On Feb 8, 2019, at 3:34 PM, Michael Mior <mm...@apache.org> wrote: > > > > > > > > Cost is always a record in Calcite, so yes. Costs from different > > > > conventions are compared directly. It's assumed that all the costs > are > > > > expressed in a way they can be correctly compared. Again, this is not > > > > ideal, but it's what we have now and it works relatively well. The > > > > planner finds the cheapest plan using essentially the same algorithm > > > > as used in the Volcano planner which I referenced earlier. You can > see > > > > the code in VolcanoPlanner#findBestExp. > > > > -- > > > > Michael Mior > > > > mm...@apache.org > > > > > > > > Le ven. 8 févr. 2019 à 14:34, Lekshmi <lekshmib...@gmail.com> a > écrit > > : > > > >> > > > >> Hi, > > > >> I understand... If we are overriding the calcite's cost function in > > > >> adapter, then should we also generate cost as record? And how does > > > calcite > > > >> compare the costs of different nodes in different convention? And > how > > > >> calcite generate the cheapest plan? Is there any algorithm dedicated > > for > > > >> that? I would like to know more about that. Because we would like to > > > >> develop a system in which Calcite plays the major role of > optimization > > > and > > > >> execution. So for that we would be developing a cost function and > > costs > > > >> estimated for this should be in the comparable format of calcite. > > > >> > > > >> Thank you so much for the support and suggestions.... > > > >> > > > >> On Thu, 7 Feb 2019, 4:07 pm Michael Mior <mm...@apache.org wrote: > > > >> > > > >>> I/O cost is always zero because Calcite itself doesn't try to > > estimate > > > >>> I/O cost. As I mentioned. my previous reply, this can be overridden > > by > > > >>> adapters. Row count is an estimate which starts off as the > estimated > > > >>> number of rows in the table but then is impacted by the estimated > > > >>> filter factors of queries. So if the estimated number of rows is > 100 > > > >>> and filters are applied that remove an estimated 95% and 30% of > rows, > > > >>> then the estimated rows produced by the query will be 3.5. > > > >>> > > > >>> -- > > > >>> Michael Mior > > > >>> mm...@apache.org > > > >>> > > > >>> Le jeu. 7 févr. 2019 à 04:55, Lekshmi <lekshmib...@gmail.com> a > > écrit > > > : > > > >>>> > > > >>>> Dear Michael Mior, > > > >>>> Thank you so much for the reply. But still, I'm confused with > > > >>> following > > > >>>> > > > >>>> When I connected PostgreSQL with Calcite and execute TPCH queries, > > > >>> provided > > > >>>> data residing in the Postgres database, I got the cheapest plan > like > > > >>> below > > > >>>> for TPCH query 1. > > > >>>> > > > >>>> EnumerableProject(l_returnflag=[$0], l_linestatus=[$1], > > > >>> SUM_QTY=[CASE(=($3, > > > >>>> 0), null, $2)], SUM_BASE_PRICE=[CASE(=($5, 0), null, $4)], > > > >>>> SUM_DISC_PRICE=[CASE(=($7, 0), null, $6)], SUM_CHARGE=[CASE(=($9, > > 0), > > > >>> null, > > > >>>> $8)], AVG_QTY=[/(CASE(=($3, 0), null, $2), $3)], > > > AVG_PRICE=[/(CASE(=($5, > > > >>>> 0), null, $4), $5)], AVG_DISC=[/(CASE(=($11, 0), null, $10), > $11)], > > > >>>> COUNT_ORDER=[$12]): rowcount = 1.0, cumulative cost = {216.75 > rows, > > > >>>> 1870.344248356904 cpu, 0.0 io}, id = 316 > > > >>>> > > > >>>> EnumerableLimit(fetch=[1]): rowcount = 1.0, cumulative cost = > > {215.75 > > > >>>> rows, 1860.344248356904 cpu, 0.0 io}, id = 315 > > > >>>> > > > >>>> EnumerableSort(sort0=[$0], sort1=[$1], dir0=[ASC], dir1=[ASC]): > > > >>>> rowcount = 10.0, cumulative cost = {214.75 rows, 1859.344248356904 > > > cpu, > > > >>> 0.0 > > > >>>> io}, id = 314 > > > >>>> > > > >>>> JdbcToEnumerableConverter: rowcount = 10.0, cumulative cost = > > > >>> {204.75 > > > >>>> rows, 662.0 cpu, 0.0 io}, id = 313 > > > >>>> > > > >>>> JdbcAggregate(group=[{0, 1}], SUM_QTY=[$SUM0($2)], > > > >>>> agg#1=[COUNT($2)], SUM_BASE_PRICE=[$SUM0($3)], agg#3=[COUNT($3)], > > > >>>> SUM_DISC_PRICE=[$SUM0($4)], agg#5=[COUNT($4)], > > SUM_CHARGE=[$SUM0($5)], > > > >>>> agg#7=[COUNT($5)], agg#8=[$SUM0($6)], agg#9=[COUNT($6)], > > > >>>> COUNT_ORDER=[COUNT()]): rowcount = 10.0, cumulative cost = {203.75 > > > rows, > > > >>>> 661.0 cpu, 0.0 io}, id = 312 > > > >>>> > > > >>>> JdbcProject(l_returnflag=[$8], l_linestatus=[$9], > > > >>>> l_quantity=[$4], l_extendedprice=[$5], $f4=[*($5, -(1, $6))], > > > >>> $f5=[*(*($5, > > > >>>> -(1, $6)), +(1, $7))], l_discount=[$6]): rowcount = 100.0, > > cumulative > > > >>> cost > > > >>>> = {180.0 rows, 661.0 cpu, 0.0 io}, id = 311 > > > >>>> > > > >>>> JdbcTableScan(table=[[TPCH, lineitem]]): rowcount = > > 100.0, > > > >>>> cumulative cost = {100.0 rows, 101.0 cpu, 0.0 io}, id = 0 > > > >>>> > > > >>>> > > > >>>> But I didn't understand why I/O cost is always zero? And why row > > > count is > > > >>>> not an integer? Is row count is not actually a count? What is an > I/O > > > >>> cost? > > > >>>> Is it count of I/O read for required input or time taken for > reading > > > >>>> required input? because i/o is also not an integer above. > > > >>>> > > > >>>> > > > >>>> > > > >>>> > > > >>>> > > > >>>> > > > >>>> > > > >>>> Thanks and Regards > > > >>>> > > > >>>> Lekshmi B.G > > > >>>> Email: lekshmib...@gmail.com > > > >>>> > > > >>>> > > > >>>> > > > >>>> > > > >>>> On Thu, Feb 7, 2019 at 1:37 AM Michael Mior <mm...@apache.org> > > wrote: > > > >>>> > > > >>>>> You're correct that it's not a single value. That said, in > > practice, > > > >>>>> not much is actually done with. CPU or I/O cost. You'll see in > > > >>>>> TableScan that computeSelfCost returns a cost which really only > > makes > > > >>>>> use of the number of rows in the table. Individual adapters may > > > >>>>> override this, but few do. > > > >>>>> -- > > > >>>>> Michael Mior > > > >>>>> mm...@apache.org > > > >>>>> > > > >>>>> Le mer. 6 févr. 2019 à 05:22, Lekshmi <lekshmib...@gmail.com> a > > > écrit > > > >>> : > > > >>>>>> > > > >>>>>> Hi, > > > >>>>>> In [0], they suggest, "the optimizer implementer can choose the > > > >>> cost to > > > >>>>> be > > > >>>>>> a number or record". Which does one Apache Calcite use? I found, > > it > > > >>> as a > > > >>>>>> record, when I run queries in debug mode. Is that correct? Then > > can > > > >>> you > > > >>>>>> please define (rows, CPU, I/O) in Calcite? Also, when, we are > > > >>> connecting > > > >>>>>> calcite across multiple systems, then 'i/o' should be greater > than > > > >>> '0.0', > > > >>>>>> right? Because it needs to read the results from the push-down > > > >>>>> operations? > > > >>>>>> Can you please explain these in a bit more detail? I appreciate > > your > > > >>>>>> support and suggestions. > > > >>>>>> > > > >>>>>> > > > >>>>>> Thanks and Regards > > > >>>>>> > > > >>>>>> Lekshmi B.G > > > >>>>>> Email: lekshmib...@gmail.com > > > >>>>>> > > > >>>>>> > > > >>>>>> > > > >>>>>> > > > >>>>>> On Tue, Feb 5, 2019 at 11:30 PM Michael Mior <mm...@apache.org> > > > >>> wrote: > > > >>>>>> > > > >>>>>>> Calcite's optimizer is based off the Volcano optimizer[0]. In > > that > > > >>>>>>> paper you'll find an outline of the algorithm which is > basically > > > >>>>>>> equivalent to what Calcite uses. Adding multiple systems > doesn't > > > >>>>>>> complicate things very much. The main addition used by Calcite > is > > > >>> what > > > >>>>>>> we call a "convention" trait that allows the optimizer to deal > > with > > > >>>>>>> expressions across multiple systems. More details are available > > in > > > >>> a > > > >>>>>>> recently published paper on Calcite [1]. > > > >>>>>>> > > > >>>>>>> One important caveat to note is that the cost model used is not > > > >>> likely > > > >>>>>>> to reflect the actual cost of query execution in many cases. > It's > > > >>>>>>> generally "good enough" in that the ordering of plans by cost > > will > > > >>> be > > > >>>>>>> approximately correct. So although the optimal plan will be > > > >>> selected > > > >>>>>>> according to the cost model, the plan which is actually the > best > > in > > > >>>>>>> practice may not be selected. That said, I would expect Calcite > > > >>> will > > > >>>>>>> pick a plan which is generally quite close to the optimal, but > > we > > > >>>>>>> have no guarantee of this. > > > >>>>>>> > > > >>>>>>> [0] > > > >>>>>>> > > > >>>>> > > > >>> > > > > > > https://pdfs.semanticscholar.org/a817/a3e74d1663d9eb35b4baf3161ab16f57df85.pdf > > > >>>>>>> [1] https://arxiv.org/pdf/1802.10233.pdf > > > >>>>>>> > > > >>>>>>> -- > > > >>>>>>> Michael Mior > > > >>>>>>> mm...@apache.org > > > >>>>>>> > > > >>>>>>> Le mar. 5 févr. 2019 à 15:52, Lekshmi <lekshmib...@gmail.com> > a > > > >>> écrit > > > >>>>> : > > > >>>>>>>> > > > >>>>>>>> Hi, > > > >>>>>>>> I would like to know about the Calcite CBO in detail, > > > >>> including > > > >>>>> how it > > > >>>>>>>> deals with global optimization when multiple processing > systems > > > >>> are > > > >>>>>>>> associated with it. Any documentation, pointers are much > > > >>> appreciated. > > > >>>>>>>> > > > >>>>>>>> > > > >>>>>>>> Thanks and Regards > > > >>>>>>>> > > > >>>>>>>> Lekshmi B.G > > > >>>>>>>> Email: lekshmib...@gmail.com > > > >>>>>>> > > > >>>>> > > > >>> > > > > > > > > >