Raajay,

Let's set up a call. I would like to better understand what you are trying to achieve. It would be good to revive Hive's use of Volcano.
I am out of town till next Monday.

Thanks
John

On 10/8/15, 10:14 AM, "Raajay" <[email protected]> wrote:

>I was finally able to get the Volcano planner working with Hive by adding
>the other rules defined in "ql/optimizer/calcite/rules/" in the Hive repo :)
>
>Thanks John for the pointers; it would be good to learn about traits, so
>that we can reduce the number of active rules and improve the Volcano
>planner's efficiency.
>
>Thanks
>Raajay
>
>On Thu, Oct 8, 2015 at 11:12 AM, Raajay <[email protected]> wrote:
>
>> Thanks for the suggestions.
>>
>> 1. The current version of Calcite does not have SwapJoinRule defined;
>> I guess the equivalent now is JoinCommuteRule. Having that alone is not
>> sufficient to get the alternate join order.
>>
>> 2. What are some good pointers for learning more about what traits mean
>> and how they are used in query planning? Thanks.
>>
>> - Raajay
>>
>> On Wed, Oct 7, 2015 at 9:43 PM, John Pullokkaran <
>> [email protected]> wrote:
>>
>>> #1 You need SwapJoinRule.
>>> Actually, if you go to Hive branch 13 or the Hive CBO branch you should
>>> get to the code that used Volcano.
>>>
>>> #2 I believe it is possible, since the trait propagated from below
>>> could be different.
>>>
>>> John
>>>
>>> On 10/7/15, 3:53 PM, "Raajay" <[email protected]> wrote:
>>>
>>> >Hello,
>>> >
>>> >I am trying to optimize a TPC-DS query (#3) in Hive using the Volcano
>>> >planner. I have included snippets of the query and the pre-Volcano
>>> >optimization query plan below. HiveSort, HiveTableScan, etc. are
>>> >extensions of the Sort and TableScan relational operators defined in
>>> >Calcite. Hive by default uses the HepPlanner, whereas I wish to use
>>> >the Volcano planner.
>>> >
>>> >For this query in particular, I clear all the default rules from the
>>> >Volcano planner and include just the following two rules:
>>> >
>>> >JoinPushThroughJoinRule:right and JoinPushThroughJoinRule:left
>>> >
>>> >While executing the optimization I can observe that the "left" rule
>>> >kicks in and an alternate join order is generated. I can also see
>>> >that the cumulative cost of the new join order is less than that of
>>> >the original join order. Please find a snippet of the recursive
>>> >display of the new join order below.
>>> >
>>> >However, findBestExp() does not return a plan with the modified join
>>> >order :(
>>> >
>>> >1. Are these two rules sufficient? If not, why not, and what other
>>> >rules are required for this particular query?
>>> >
>>> >2. Is it possible for a new sub-tree, created by a rule match on a
>>> >root node, to not be placed in the same RelSubset as the root node?
>>> >If yes, will the newly generated plan still be considered while
>>> >building the cheapest plan? I ask this specifically because I found
>>> >that the new operators that were generated (HiveProject, id=194
>>> >below) were not put in the same RelSubset, but were in the same
>>> >RelSet.
>>> >
>>> >Thanks a lot for your patience in reading this long mail :) Hoping
>>> >that I get some info to get the Volcano planner going for Hive.
>>> >
>>> >Thanks
>>> >Raajay
>>> >
>>> >*The query looks like this:*
>>> >
>>> >select dt.d_year
>>> >      ,item.i_brand_id brand_id
>>> >      ,item.i_brand brand
>>> >      ,sum(ss_ext_sales_price) sum_agg
>>> > from date_dim dt
>>> >     ,store_sales
>>> >     ,item
>>> > where dt.d_date_sk = store_sales.ss_sold_date_sk
>>> >   and store_sales.ss_item_sk = item.i_item_sk
>>> >   and item.i_manufact_id = 436
>>> >   and dt.d_moy = 12
>>> > group by dt.d_year
>>> >         ,item.i_brand
>>> >         ,item.i_brand_id
>>> > order by dt.d_year
>>> >         ,sum_agg desc
>>> >         ,brand_id
>>> > limit 100;
>>> >
>>> >*The query plan before passing to the Volcano planner looks like this:*
>>> >
>>> >HiveSort(fetch=[100]): rowcount = 354.9838716449557, cumulative cost = {3133795.037494761 rows, 0.0 cpu, 0.0 io}, id = 141
>>> >  HiveSort(sort0=[$0], sort1=[$3], sort2=[$1], dir0=[ASC], dir1=[DESC], dir2=[ASC]): rowcount = 354.9838716449557, cumulative cost = {3133795.037494761 rows, 0.0 cpu, 0.0 io}, id = 139
>>> >    HiveProject(d_year=[$0], brand_id=[$2], brand=[$1], sum_agg=[$3]): rowcount = 354.9838716449557, cumulative cost = {3133795.037494761 rows, 0.0 cpu, 0.0 io}, id = 137
>>> >      HiveAggregate(group=[{0, 1, 2}], agg#0=[sum($3)]): rowcount = 354.9838716449557, cumulative cost = {3133795.037494761 rows, 0.0 cpu, 0.0 io}, id = 135
>>> >        HiveProject($f0=[$1], $f1=[$8], $f2=[$7], $f3=[$5]): rowcount = 358.53076132315454, cumulative cost = {3133795.037494761 rows, 0.0 cpu, 0.0 io}, id = 133
>>> >          HiveJoin(condition=[=($4, $6)], joinType=[inner], algorithm=[none], cost=[{247770.8067255299 rows, 0.0 cpu, 0.0 io}]): rowcount = 358.53076132315454, cumulative cost = {3133795.037494761 rows, 0.0 cpu, 0.0 io}, id = 131
>>> >            HiveJoin(condition=[=($0, $3)], joinType=[inner], algorithm=[none], cost=[{2886024.230769231 rows, 0.0 cpu, 0.0 io}]): rowcount = 247744.7560742998, cumulative cost = {2886024.230769231 rows, 0.0 cpu, 0.0 io}, id = 124
>>> >              HiveProject(d_date_sk=[$0], d_year=[$6], d_moy=[$8]): rowcount = 5619.2307692307695, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 151
>>> >                HiveFilter(condition=[=($8, 12)]): rowcount = 5619.2307692307695, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 148
>>> >                  HiveTableScan(table=[[tpcds_small.date_dim]]): rowcount = 73050.0, cumulative cost = {0}, id = 101
>>> >              HiveProject(ss_sold_date_sk=[$0], ss_item_sk=[$2], ss_ext_sales_price=[$15]): rowcount = 2880405.0, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 122
>>> >                HiveTableScan(table=[[tpcds_small.store_sales]]): rowcount = 2880405.0, cumulative cost = {0}, id = 104
>>> >            HiveProject(i_item_sk=[$0], i_brand_id=[$7], i_brand=[$8], i_manufact_id=[$13]): rowcount = 26.050651230101302, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 146
>>> >              HiveFilter(condition=[=($13, 436)]): rowcount = 26.050651230101302, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 143
>>> >                HiveTableScan(table=[[tpcds_small.item]]): rowcount = 18001.0, cumulative cost = {0}, id = 107
>>> >
>>> >*The new join order looks like this:*
>>> >
>>> >HiveProject(d_date_sk=[$7], d_year=[$8], d_moy=[$9], ss_sold_date_sk=[$4], ss_item_sk=[$5], ss_ext_sales_price=[$6], i_item_sk=[$0], i_brand_id=[$1], i_brand=[$2], i_manufact_id=[$3]): rowcount = 197.5727739722679, cumulative cost = {2888347.3617503582 rows, 0.0 cpu, 0.0 io}, id = 194
>>> >  HiveJoin(condition=[=($7, $4)], joinType=[inner], algorithm=[none], cost=[{7916.31109912852 rows, 0.0 cpu, 0.0 io}]): rowcount = 197.5727739722679, cumulative cost = {2888347.3617503582 rows, 0.0 cpu, 0.0 io}, id = 193
>>> >    HiveJoin(condition=[=($5, $0)], joinType=[inner], algorithm=[none], cost=[{2880431.05065123 rows, 0.0 cpu, 0.0 io}]): rowcount = 2297.0803298977507, cumulative cost = {2880431.05065123 rows, 0.0 cpu, 0.0 io}, id = 192
>>> >      HiveProject(i_item_sk=[$0], i_brand_id=[$7], i_brand=[$8], i_manufact_id=[$13]): rowcount = 26.050651230101302, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 166
>>> >        HiveFilter(subset=[rel#165:Subset#7.HIVE.[]], condition=[=($13, 436)]): rowcount = 26.050651230101302, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 164
>>> >          HiveTableScan(subset=[rel#163:Subset#6.HIVE.[]], table=[[tpcds_small.item]]): rowcount = 18001.0, cumulative cost = {0}, id = 107
>>> >      HiveProject(subset=[rel#160:Subset#4.HIVE.[]], ss_sold_date_sk=[$0], ss_item_sk=[$2], ss_ext_sales_price=[$15]): rowcount = 2880405.0, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 159
>>> >        HiveTableScan(subset=[rel#158:Subset#3.HIVE.[]], table=[[tpcds_small.store_sales]]): rowcount = 2880405.0, cumulative cost = {0}, id = 104
>>> >    HiveProject(subset=[rel#157:Subset#2.HIVE.[]], d_date_sk=[$0], d_year=[$6], d_moy=[$8]): rowcount = 5619.2307692307695, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 156
>>> >      HiveFilter(subset=[rel#155:Subset#1.HIVE.[]], condition=[=($8, 12)]): rowcount = 5619.2307692307695, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 154
>>> >        HiveTableScan(subset=[rel#153:Subset#0.HIVE.[]], table=[[tpcds_small.date_dim]]): rowcount = 73050.0, cumulative cost = {0}, id = 101
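For readers following the thread: the setup described above — clearing the Volcano planner's default rules and registering only the join-reordering rules — can be sketched against Calcite's Java API roughly as below. This is a sketch under assumptions, not Hive's actual code: `rootRel` is a placeholder for the relational tree Hive produces from the SQL, the surrounding planner bootstrap (trait defs, cost factory) is elided, and exact rule names and availability of `clear()` vary across Calcite versions. It is not runnable without Calcite on the classpath.

```java
// Sketch only (assumes the Calcite library is on the classpath).
import org.apache.calcite.plan.volcano.VolcanoPlanner;
import org.apache.calcite.rel.RelNode;
import org.apache.calcite.rel.rules.JoinCommuteRule;
import org.apache.calcite.rel.rules.JoinPushThroughJoinRule;

public class JoinReorderSketch {

  /** rootRel is hypothetical: the tree Hive hands over after SQL-to-rel translation. */
  public static RelNode optimize(VolcanoPlanner planner, RelNode rootRel) {
    planner.clear();  // drop whatever default rules the planner was created with

    // The two rules from the mail: rotate (A join B) join C in both directions.
    planner.addRule(JoinPushThroughJoinRule.LEFT);
    planner.addRule(JoinPushThroughJoinRule.RIGHT);

    // JoinCommuteRule (the modern equivalent of the old SwapJoinRule, per the
    // thread); push-through alone cannot reach every join order without it.
    planner.addRule(JoinCommuteRule.INSTANCE);

    // Register the root, then ask the planner for the cheapest expression it
    // found among the equivalence classes (RelSets/RelSubsets) it built.
    planner.setRoot(rootRel);
    return planner.findBestExp();
  }
}
```

Note that `findBestExp()` only picks among plans reachable in the root's RelSubset (same RelSet *and* same traits), which is why the RelSubset-vs-RelSet distinction raised in question 2 matters for whether the alternate join order is ever returned.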
