#1 You need SwapJoinRule. Actually if you go to Hive branch 13 or Hive CBO branch you should get to the code that used Volcano.
#2 I belive it is possible; since the trait could be different propagated from below. John On 10/7/15, 3:53 PM, "Raajay" <[email protected]> wrote: >Hello, > >I am trying to optimize a TPCDS query (#3) in Hive using the Volcano >planner. I have included snippets of the query and the pre-Volcano >optimization query plan below. HiveSort, HiveTableScan, etc are basically >extensions of Sort, TableScan Relational operators defined in calcite. >Hive by default uses the HepPlanner, where as I wish to use the Volcano >planner. > >For this query in particular, I clear all the default rules from the >Volcano Planner and just include the following two rules: > >JoinPushThroughJoinRule:right and JoinPushThroughJoinRule:left > > >While executing the optimization I am able to observe that the "left" rule >kicks in and an alternate join order in generated. I can also see that the >cumulative cost of the new join order is less than the original join >order. Please find a snippet of the recursive display of the new join >order below. > >However, findBestExp does not return a plan with the modified join order >:( > > >1. Are these two rules sufficient ? If not, why ? Also, what other rules >required for this particular query. > > >2. Is it possible that a new sub-tree created upon a rule match on a root >node, to be not put in the same RelSubSet as the root node. If yes, will >the new generated plan be considered while building the cheapest plan. I >ask this question specifically because, I found that the new operators >(HiveProject, id=194 below) that were generated were not put in the same >RelSubSet but were in the same RelSet. > > >Thanks a lot for your patience in reading this long mail :) Hoping, that I >get some info to get Volcano Planner going for hive. > >Thanks >Raajay > > > >* The query looks like this:* > >select dt.d_year > ,item.i_brand_id brand_id > ,item.i_brand brand > ,sum(ss_ext_sales_price) sum_agg > from date_dim dt > ,store_sales > ,item > where dt.d_date_sk = store_sales.ss_sold_date_sk > and store_sales.ss_item_sk = item.i_item_sk > and item.i_manufact_id = 436 > and dt.d_moy=12 > group by dt.d_year > ,item.i_brand > ,item.i_brand_id > order by dt.d_year > ,sum_agg desc > ,brand_id > limit 100; > > >*The query plan before passing to Volcano planner is looks like this:* > >HiveSort(fetch=[100]): rowcount = 354.9838716449557, cumulative cost = >{3133795.037494761 rows, 0.0 cpu, 0.0 io}, id = 141 > HiveSort(sort0=[$0], sort1=[$3], sort2=[$1], dir0=[ASC], dir1=[DESC], >dir2=[ASC]): rowcount = 354.9838716449557, cumulative cost = >{3133795.037494761 rows, 0.0 cpu, 0.0 io}, id = 139 > HiveProject(d_year=[$0], brand_id=[$2], brand=[$1], sum_agg=[$3]): >rowcount = 354.9838716449557, cumulative cost = {3133795.037494761 rows, >0.0 cpu, 0.0 io}, id = 137 > HiveAggregate(group=[{0, 1, 2}], agg#0=[sum($3)]): rowcount = >354.9838716449557, cumulative cost = {3133795.037494761 rows, 0.0 cpu, 0.0 >io}, id = 135 > HiveProject($f0=[$1], $f1=[$8], $f2=[$7], $f3=[$5]): rowcount = >358.53076132315454, cumulative cost = {3133795.037494761 rows, 0.0 cpu, >0.0 >io}, id = 133 > HiveJoin(condition=[=($4, $6)], joinType=[inner], >algorithm=[none], cost=[{247770.8067255299 rows, 0.0 cpu, 0.0 io}]): >rowcount = 358.53076132315454, cumulative cost = {3133795.037494761 rows, >0.0 cpu, 0.0 io}, id = 131 > HiveJoin(condition=[=($0, $3)], joinType=[inner], >algorithm=[none], cost=[{2886024.230769231 rows, 0.0 cpu, 0.0 io}]): >rowcount = 247744.7560742998, cumulative cost = {2886024.230769231 rows, >0.0 cpu, 0.0 io}, id = 124 > HiveProject(d_date_sk=[$0], d_year=[$6], d_moy=[$8]): >rowcount = 5619.2307692307695, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 >io}, id = 151 > HiveFilter(condition=[=($8, 12)]): rowcount = >5619.2307692307695, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = >148 > HiveTableScan(table=[[tpcds_small.date_dim]]): rowcount >= >73050.0, cumulative cost = {0}, id = 101 > HiveProject(ss_sold_date_sk=[$0], ss_item_sk=[$2], >ss_ext_sales_price=[$15]): rowcount = 2880405.0, cumulative cost = {0.0 >rows, 0.0 cpu, 0.0 io}, id = 122 > HiveTableScan(table=[[tpcds_small.store_sales]]): rowcount >= 2880405.0, cumulative cost = {0}, id = 104 > HiveProject(i_item_sk=[$0], i_brand_id=[$7], i_brand=[$8], >i_manufact_id=[$13]): rowcount = 26.050651230101302, cumulative cost = >{0.0 >rows, 0.0 cpu, 0.0 io}, id = 146 > HiveFilter(condition=[=($13, 436)]): rowcount = >26.050651230101302, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = >143 > HiveTableScan(table=[[tpcds_small.item]]): rowcount = >18001.0, cumulative cost = {0}, id = 107 > > >*The new join order looks like this:* > >HiveProject(d_date_sk=[$7], d_year=[$8], d_moy=[$9], ss_sold_date_sk=[$4], >ss_item_sk=[$5], ss_ext_sales_price=[$6], i_item_sk=[$0], i_brand_id=[$1], >i_brand=[$2], i_manufact_id=[$3]): rowcount = 197.5727739722679, >cumulative >cost = {2888347.3617503582 rows, 0.0 cpu, 0.0 io}, id = 194 > HiveJoin(condition=[=($7, $4)], joinType=[inner], algorithm=[none], >cost=[{7916.31109912852 rows, 0.0 cpu, 0.0 io}]): rowcount = >197.5727739722679, cumulative cost = {2888347.3617503582 rows, 0.0 cpu, >0.0 >io}, id = 193 > HiveJoin(condition=[=($5, $0)], joinType=[inner], algorithm=[none], >cost=[{2880431.05065123 rows, 0.0 cpu, 0.0 io}]): rowcount = >2297.0803298977507, cumulative cost = {2880431.05065123 rows, 0.0 cpu, 0.0 >io}, id = 192 > HiveProject(i_item_sk=[$0], i_brand_id=[$7], i_brand=[$8], >i_manufact_id=[$13]): rowcount = 26.050651230101302, cumulative cost = >{0.0 >rows, 0.0 cpu, 0.0 io}, id = 166 > HiveFilter(subset=[rel#165:Subset#7.HIVE.[]], condition=[=($13, >436)]): rowcount = 26.050651230101302, cumulative cost = {0.0 rows, 0.0 >cpu, 0.0 io}, id = 164 > HiveTableScan(subset=[rel#163:Subset#6.HIVE.[]], >table=[[tpcds_small.item]]): rowcount = 18001.0, cumulative cost = {0}, id >= 107 > HiveProject(subset=[rel#160:Subset#4.HIVE.[]], ss_sold_date_sk=[$0], >ss_item_sk=[$2], ss_ext_sales_price=[$15]): rowcount = 2880405.0, >cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 159 > HiveTableScan(subset=[rel#158:Subset#3.HIVE.[]], >table=[[tpcds_small.store_sales]]): rowcount = 2880405.0, cumulative cost >= >{0}, id = 104 > HiveProject(subset=[rel#157:Subset#2.HIVE.[]], d_date_sk=[$0], >d_year=[$6], d_moy=[$8]): rowcount = 5619.2307692307695, cumulative cost = >{0.0 rows, 0.0 cpu, 0.0 io}, id = 156 > HiveFilter(subset=[rel#155:Subset#1.HIVE.[]], condition=[=($8, >12)]): >rowcount = 5619.2307692307695, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 >io}, id = 154 > HiveTableScan(subset=[rel#153:Subset#0.HIVE.[]], >table=[[tpcds_small.date_dim]]): rowcount = 73050.0, cumulative cost = >{0}, >id = 101
