Re: Volcano Planner for TPCDS query

Raajay Thu, 08 Oct 2015 10:15:19 -0700

I was finally able to get Volcano Planner working with Hive by adding other
rules defined in "ql/optimizer/calcite/rules/" in the hive repo :)



Thanks John for the pointers; it would be good to learn about traits, so
that we can reducer the number of active rules to improve Volcano Planner's
efficiency.

Thanks
Raajay




On Thu, Oct 8, 2015 at 11:12 AM, Raajay <[email protected]> wrote:

> Thanks for the suggestions.
>
> 1. The current version of Calcite does not have the SwapJoinRule defined.
> I guess the equivalent now is the JoinCommuteRule. Having that is not
> sufficient to get the alternate join order.
>
> 2. What are some good pointers to learning more about what traits mean and
> how they are used in query planning? Thanks.
>
> - Raajay
>
>
> On Wed, Oct 7, 2015 at 9:43 PM, John Pullokkaran <
> [email protected]> wrote:
>
>> #1 You need SwapJoinRule.
>> Actually if you go to Hive branch 13 or Hive CBO branch you should get to
>> the code that used Volcano.
>>
>> #2 I belive it is possible; since the trait could be different propagated
>> from below.
>>
>> John
>>
>> On 10/7/15, 3:53 PM, "Raajay" <[email protected]> wrote:
>>
>> >Hello,
>> >
>> >I am trying to optimize  a TPCDS query (#3) in Hive using the Volcano
>> >planner. I have included snippets of the query and the pre-Volcano
>> >optimization query plan below. HiveSort, HiveTableScan, etc  are
>> basically
>> >extensions of Sort, TableScan Relational operators defined in calcite.
>> >Hive by default uses the HepPlanner, where as I wish to use the Volcano
>> >planner.
>> >
>> >For this query in particular, I clear all the default rules from the
>> >Volcano Planner and just include the following two rules:
>> >
>> >JoinPushThroughJoinRule:right and JoinPushThroughJoinRule:left
>> >
>> >
>> >While executing the optimization I am able to observe that the "left"
>> rule
>> >kicks in and an alternate join order in generated. I can also see that
>> the
>> >cumulative cost of the new join order is less than the original join
>> >order.  Please find a snippet of the recursive display of the new join
>> >order below.
>> >
>> >However, findBestExp does not return a plan with the modified join order
>> >:(
>> >
>> >
>> >1. Are these two rules sufficient ? If not, why ? Also, what other rules
>> >required for this particular query.
>> >
>> >
>> >2. Is it possible that a new sub-tree created upon a rule match on a root
>> >node, to be not put in the same RelSubSet as the root node. If yes, will
>> >the new generated plan be considered while building the cheapest plan. I
>> >ask this question specifically because, I found that the new operators
>> >(HiveProject, id=194 below) that were generated were not put in the same
>> >RelSubSet but were in the same RelSet.
>> >
>> >
>> >Thanks a lot for your patience in reading this long mail :) Hoping, that
>> I
>> >get some info to get Volcano Planner going for hive.
>> >
>> >Thanks
>> >Raajay
>> >
>> >
>> >
>> >* The query looks like this:*
>> >
>> >select  dt.d_year
>> >       ,item.i_brand_id brand_id
>> >       ,item.i_brand brand
>> >       ,sum(ss_ext_sales_price) sum_agg
>> > from  date_dim dt
>> >      ,store_sales
>> >      ,item
>> > where dt.d_date_sk = store_sales.ss_sold_date_sk
>> >   and store_sales.ss_item_sk = item.i_item_sk
>> >   and item.i_manufact_id = 436
>> >   and dt.d_moy=12
>> > group by dt.d_year
>> >      ,item.i_brand
>> >      ,item.i_brand_id
>> > order by dt.d_year
>> >         ,sum_agg desc
>> >         ,brand_id
>> > limit 100;
>> >
>> >
>> >*The query plan before passing to Volcano planner is looks like this:*
>> >
>> >HiveSort(fetch=[100]): rowcount = 354.9838716449557, cumulative cost =
>> >{3133795.037494761 rows, 0.0 cpu, 0.0 io}, id = 141
>> >  HiveSort(sort0=[$0], sort1=[$3], sort2=[$1], dir0=[ASC], dir1=[DESC],
>> >dir2=[ASC]): rowcount = 354.9838716449557, cumulative cost =
>> >{3133795.037494761 rows, 0.0 cpu, 0.0 io}, id = 139
>> >    HiveProject(d_year=[$0], brand_id=[$2], brand=[$1], sum_agg=[$3]):
>> >rowcount = 354.9838716449557, cumulative cost = {3133795.037494761 rows,
>> >0.0 cpu, 0.0 io}, id = 137
>> >      HiveAggregate(group=[{0, 1, 2}], agg#0=[sum($3)]): rowcount =
>> >354.9838716449557, cumulative cost = {3133795.037494761 rows, 0.0 cpu,
>> 0.0
>> >io}, id = 135
>> >        HiveProject($f0=[$1], $f1=[$8], $f2=[$7], $f3=[$5]): rowcount =
>> >358.53076132315454, cumulative cost = {3133795.037494761 rows, 0.0 cpu,
>> >0.0
>> >io}, id = 133
>> >          HiveJoin(condition=[=($4, $6)], joinType=[inner],
>> >algorithm=[none], cost=[{247770.8067255299 rows, 0.0 cpu, 0.0 io}]):
>> >rowcount = 358.53076132315454, cumulative cost = {3133795.037494761 rows,
>> >0.0 cpu, 0.0 io}, id = 131
>> >            HiveJoin(condition=[=($0, $3)], joinType=[inner],
>> >algorithm=[none], cost=[{2886024.230769231 rows, 0.0 cpu, 0.0 io}]):
>> >rowcount = 247744.7560742998, cumulative cost = {2886024.230769231 rows,
>> >0.0 cpu, 0.0 io}, id = 124
>> >              HiveProject(d_date_sk=[$0], d_year=[$6], d_moy=[$8]):
>> >rowcount = 5619.2307692307695, cumulative cost = {0.0 rows, 0.0 cpu, 0.0
>> >io}, id = 151
>> >                HiveFilter(condition=[=($8, 12)]): rowcount =
>> >5619.2307692307695, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id =
>> >148
>> >                  HiveTableScan(table=[[tpcds_small.date_dim]]): rowcount
>> >=
>> >73050.0, cumulative cost = {0}, id = 101
>> >              HiveProject(ss_sold_date_sk=[$0], ss_item_sk=[$2],
>> >ss_ext_sales_price=[$15]): rowcount = 2880405.0, cumulative cost = {0.0
>> >rows, 0.0 cpu, 0.0 io}, id = 122
>> >                HiveTableScan(table=[[tpcds_small.store_sales]]):
>> rowcount
>> >= 2880405.0, cumulative cost = {0}, id = 104
>> >            HiveProject(i_item_sk=[$0], i_brand_id=[$7], i_brand=[$8],
>> >i_manufact_id=[$13]): rowcount = 26.050651230101302, cumulative cost =
>> >{0.0
>> >rows, 0.0 cpu, 0.0 io}, id = 146
>> >              HiveFilter(condition=[=($13, 436)]): rowcount =
>> >26.050651230101302, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id =
>> >143
>> >                HiveTableScan(table=[[tpcds_small.item]]): rowcount =
>> >18001.0, cumulative cost = {0}, id = 107
>> >
>> >
>> >*The new join order looks like this:*
>> >
>> >HiveProject(d_date_sk=[$7], d_year=[$8], d_moy=[$9],
>> ss_sold_date_sk=[$4],
>> >ss_item_sk=[$5], ss_ext_sales_price=[$6], i_item_sk=[$0],
>> i_brand_id=[$1],
>> >i_brand=[$2], i_manufact_id=[$3]): rowcount = 197.5727739722679,
>> >cumulative
>> >cost = {2888347.3617503582 rows, 0.0 cpu, 0.0 io}, id = 194
>> >  HiveJoin(condition=[=($7, $4)], joinType=[inner], algorithm=[none],
>> >cost=[{7916.31109912852 rows, 0.0 cpu, 0.0 io}]): rowcount =
>> >197.5727739722679, cumulative cost = {2888347.3617503582 rows, 0.0 cpu,
>> >0.0
>> >io}, id = 193
>> >    HiveJoin(condition=[=($5, $0)], joinType=[inner], algorithm=[none],
>> >cost=[{2880431.05065123 rows, 0.0 cpu, 0.0 io}]): rowcount =
>> >2297.0803298977507, cumulative cost = {2880431.05065123 rows, 0.0 cpu,
>> 0.0
>> >io}, id = 192
>> >      HiveProject(i_item_sk=[$0], i_brand_id=[$7], i_brand=[$8],
>> >i_manufact_id=[$13]): rowcount = 26.050651230101302, cumulative cost =
>> >{0.0
>> >rows, 0.0 cpu, 0.0 io}, id = 166
>> >        HiveFilter(subset=[rel#165:Subset#7.HIVE.[]], condition=[=($13,
>> >436)]): rowcount = 26.050651230101302, cumulative cost = {0.0 rows, 0.0
>> >cpu, 0.0 io}, id = 164
>> >          HiveTableScan(subset=[rel#163:Subset#6.HIVE.[]],
>> >table=[[tpcds_small.item]]): rowcount = 18001.0, cumulative cost = {0},
>> id
>> >= 107
>> >      HiveProject(subset=[rel#160:Subset#4.HIVE.[]],
>> ss_sold_date_sk=[$0],
>> >ss_item_sk=[$2], ss_ext_sales_price=[$15]): rowcount = 2880405.0,
>> >cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 159
>> >        HiveTableScan(subset=[rel#158:Subset#3.HIVE.[]],
>> >table=[[tpcds_small.store_sales]]): rowcount = 2880405.0, cumulative cost
>> >=
>> >{0}, id = 104
>> >    HiveProject(subset=[rel#157:Subset#2.HIVE.[]], d_date_sk=[$0],
>> >d_year=[$6], d_moy=[$8]): rowcount = 5619.2307692307695, cumulative cost
>> =
>> >{0.0 rows, 0.0 cpu, 0.0 io}, id = 156
>> >      HiveFilter(subset=[rel#155:Subset#1.HIVE.[]], condition=[=($8,
>> >12)]):
>> >rowcount = 5619.2307692307695, cumulative cost = {0.0 rows, 0.0 cpu, 0.0
>> >io}, id = 154
>> >        HiveTableScan(subset=[rel#153:Subset#0.HIVE.[]],
>> >table=[[tpcds_small.date_dim]]): rowcount = 73050.0, cumulative cost =
>> >{0},
>> >id = 101
>>
>>
>

Re: Volcano Planner for TPCDS query

Reply via email to