Raajay,

Let's set up a call. I would like to better understand what you are trying to achieve. It would be good to revive Hive's use of Volcano.
I am out of town till next Monday.

Thanks
John

On 10/8/15, 10:14 AM, "Raajay" <[email protected]> wrote:

>I was finally able to get the Volcano planner working with Hive by adding
>the other rules defined in "ql/optimizer/calcite/rules/" in the Hive repo :)
>
>Thanks John for the pointers; it would be good to learn about traits, so
>that we can reduce the number of active rules and improve the Volcano
>planner's efficiency.
>
>Thanks
>Raajay
>
>On Thu, Oct 8, 2015 at 11:12 AM, Raajay <[email protected]> wrote:
>
>> Thanks for the suggestions.
>>
>> 1. The current version of Calcite does not have SwapJoinRule defined;
>> I guess the equivalent now is JoinCommuteRule. Having that alone is not
>> sufficient to get the alternate join order.
>>
>> 2. What are some good pointers for learning more about what traits mean
>> and how they are used in query planning? Thanks.
>>
>> - Raajay
>>
>> On Wed, Oct 7, 2015 at 9:43 PM, John Pullokkaran <
>> [email protected]> wrote:
>>
>>> #1 You need SwapJoinRule.
>>> Actually, if you go to Hive branch 13 or the Hive CBO branch you should
>>> get to the code that used Volcano.
>>>
>>> #2 I believe it is possible, since the trait propagated from below
>>> could be different.
>>>
>>> John
>>>
>>> On 10/7/15, 3:53 PM, "Raajay" <[email protected]> wrote:
>>>
>>> >Hello,
>>> >
>>> >I am trying to optimize a TPC-DS query (#3) in Hive using the Volcano
>>> >planner. I have included snippets of the query and the pre-Volcano
>>> >optimization query plan below. HiveSort, HiveTableScan, etc. are
>>> >extensions of the Sort and TableScan relational operators defined in
>>> >Calcite. Hive by default uses the HepPlanner, whereas I wish to use
>>> >the Volcano planner.
>>> >
>>> >For this query in particular, I clear all the default rules from the
>>> >Volcano planner and include just the following two rules:
>>> >
>>> >JoinPushThroughJoinRule:right and JoinPushThroughJoinRule:left
>>> >
>>> >While executing the optimization I can observe that the "left" rule
>>> >kicks in and an alternate join order is generated. I can also see
>>> >that the cumulative cost of the new join order is less than that of
>>> >the original join order. Please find a snippet of the recursive
>>> >display of the new join order below.
>>> >
>>> >However, findBestExp() does not return a plan with the modified join
>>> >order :(
>>> >
>>> >1. Are these two rules sufficient? If not, why not, and what other
>>> >rules are required for this particular query?
>>> >
>>> >2. Is it possible for a new sub-tree, created by a rule match on a
>>> >root node, to not be placed in the same RelSubset as the root node?
>>> >If yes, will the newly generated plan still be considered while
>>> >building the cheapest plan? I ask this specifically because I found
>>> >that the new operators that were generated (HiveProject, id=194
>>> >below) were not put in the same RelSubset, but were in the same
>>> >RelSet.
>>> >
>>> >Thanks a lot for your patience in reading this long mail :) Hoping
>>> >that I get some info to get the Volcano planner going for Hive.
>>> >
>>> >Thanks
>>> >Raajay
>>> >
>>> >*The query looks like this:*
>>> >
>>> >select dt.d_year
>>> >      ,item.i_brand_id brand_id
>>> >      ,item.i_brand brand
>>> >      ,sum(ss_ext_sales_price) sum_agg
>>> > from date_dim dt
>>> >     ,store_sales
>>> >     ,item
>>> > where dt.d_date_sk = store_sales.ss_sold_date_sk
>>> >   and store_sales.ss_item_sk = item.i_item_sk
>>> >   and item.i_manufact_id = 436
>>> >   and dt.d_moy = 12
>>> > group by dt.d_year
>>> >         ,item.i_brand
>>> >         ,item.i_brand_id
>>> > order by dt.d_year
>>> >         ,sum_agg desc
>>> >         ,brand_id
>>> > limit 100;
>>> >
>>> >*The query plan before passing to the Volcano planner looks like this:*
>>> >
>>> >HiveSort(fetch=[100]): rowcount = 354.9838716449557, cumulative cost = {3133795.037494761 rows, 0.0 cpu, 0.0 io}, id = 141
>>> >  HiveSort(sort0=[$0], sort1=[$3], sort2=[$1], dir0=[ASC], dir1=[DESC], dir2=[ASC]): rowcount = 354.9838716449557, cumulative cost = {3133795.037494761 rows, 0.0 cpu, 0.0 io}, id = 139
>>> >    HiveProject(d_year=[$0], brand_id=[$2], brand=[$1], sum_agg=[$3]): rowcount = 354.9838716449557, cumulative cost = {3133795.037494761 rows, 0.0 cpu, 0.0 io}, id = 137
>>> >      HiveAggregate(group=[{0, 1, 2}], agg#0=[sum($3)]): rowcount = 354.9838716449557, cumulative cost = {3133795.037494761 rows, 0.0 cpu, 0.0 io}, id = 135
>>> >        HiveProject($f0=[$1], $f1=[$8], $f2=[$7], $f3=[$5]): rowcount = 358.53076132315454, cumulative cost = {3133795.037494761 rows, 0.0 cpu, 0.0 io}, id = 133
>>> >          HiveJoin(condition=[=($4, $6)], joinType=[inner], algorithm=[none], cost=[{247770.8067255299 rows, 0.0 cpu, 0.0 io}]): rowcount = 358.53076132315454, cumulative cost = {3133795.037494761 rows, 0.0 cpu, 0.0 io}, id = 131
>>> >            HiveJoin(condition=[=($0, $3)], joinType=[inner], algorithm=[none], cost=[{2886024.230769231 rows, 0.0 cpu, 0.0 io}]): rowcount = 247744.7560742998, cumulative cost = {2886024.230769231 rows, 0.0 cpu, 0.0 io}, id = 124
>>> >              HiveProject(d_date_sk=[$0], d_year=[$6], d_moy=[$8]): rowcount = 5619.2307692307695, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 151
>>> >                HiveFilter(condition=[=($8, 12)]): rowcount = 5619.2307692307695, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 148
>>> >                  HiveTableScan(table=[[tpcds_small.date_dim]]): rowcount = 73050.0, cumulative cost = {0}, id = 101
>>> >              HiveProject(ss_sold_date_sk=[$0], ss_item_sk=[$2], ss_ext_sales_price=[$15]): rowcount = 2880405.0, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 122
>>> >                HiveTableScan(table=[[tpcds_small.store_sales]]): rowcount = 2880405.0, cumulative cost = {0}, id = 104
>>> >            HiveProject(i_item_sk=[$0], i_brand_id=[$7], i_brand=[$8], i_manufact_id=[$13]): rowcount = 26.050651230101302, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 146
>>> >              HiveFilter(condition=[=($13, 436)]): rowcount = 26.050651230101302, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 143
>>> >                HiveTableScan(table=[[tpcds_small.item]]): rowcount = 18001.0, cumulative cost = {0}, id = 107
>>> >
>>> >*The new join order looks like this:*
>>> >
>>> >HiveProject(d_date_sk=[$7], d_year=[$8], d_moy=[$9], ss_sold_date_sk=[$4], ss_item_sk=[$5], ss_ext_sales_price=[$6], i_item_sk=[$0], i_brand_id=[$1], i_brand=[$2], i_manufact_id=[$3]): rowcount = 197.5727739722679, cumulative cost = {2888347.3617503582 rows, 0.0 cpu, 0.0 io}, id = 194
>>> >  HiveJoin(condition=[=($7, $4)], joinType=[inner], algorithm=[none], cost=[{7916.31109912852 rows, 0.0 cpu, 0.0 io}]): rowcount = 197.5727739722679, cumulative cost = {2888347.3617503582 rows, 0.0 cpu, 0.0 io}, id = 193
>>> >    HiveJoin(condition=[=($5, $0)], joinType=[inner], algorithm=[none], cost=[{2880431.05065123 rows, 0.0 cpu, 0.0 io}]): rowcount = 2297.0803298977507, cumulative cost = {2880431.05065123 rows, 0.0 cpu, 0.0 io}, id = 192
>>> >      HiveProject(i_item_sk=[$0], i_brand_id=[$7], i_brand=[$8], i_manufact_id=[$13]): rowcount = 26.050651230101302, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 166
>>> >        HiveFilter(subset=[rel#165:Subset#7.HIVE.[]], condition=[=($13, 436)]): rowcount = 26.050651230101302, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 164
>>> >          HiveTableScan(subset=[rel#163:Subset#6.HIVE.[]], table=[[tpcds_small.item]]): rowcount = 18001.0, cumulative cost = {0}, id = 107
>>> >      HiveProject(subset=[rel#160:Subset#4.HIVE.[]], ss_sold_date_sk=[$0], ss_item_sk=[$2], ss_ext_sales_price=[$15]): rowcount = 2880405.0, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 159
>>> >        HiveTableScan(subset=[rel#158:Subset#3.HIVE.[]], table=[[tpcds_small.store_sales]]): rowcount = 2880405.0, cumulative cost = {0}, id = 104
>>> >    HiveProject(subset=[rel#157:Subset#2.HIVE.[]], d_date_sk=[$0], d_year=[$6], d_moy=[$8]): rowcount = 5619.2307692307695, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 156
>>> >      HiveFilter(subset=[rel#155:Subset#1.HIVE.[]], condition=[=($8, 12)]): rowcount = 5619.2307692307695, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 154
>>> >        HiveTableScan(subset=[rel#153:Subset#0.HIVE.[]], table=[[tpcds_small.date_dim]]): rowcount = 73050.0, cumulative cost = {0}, id = 101
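For readers following the thread: the setup described above — clearing the Volcano planner's default rules and registering only the join-reordering rules — can be sketched against Calcite's Java API roughly as below. This is a sketch under assumptions, not Hive's actual code: `rootRel` is a placeholder for the relational tree Hive produces from the SQL, the surrounding planner bootstrap (trait defs, cost factory) is elided, and exact rule names and availability of `clear()` vary across Calcite versions. It is not runnable without Calcite on the classpath.

```java
// Sketch only (assumes the Calcite library is on the classpath).
import org.apache.calcite.plan.volcano.VolcanoPlanner;
import org.apache.calcite.rel.RelNode;
import org.apache.calcite.rel.rules.JoinCommuteRule;
import org.apache.calcite.rel.rules.JoinPushThroughJoinRule;

public class JoinReorderSketch {

  /** rootRel is hypothetical: the tree Hive hands over after SQL-to-rel translation. */
  public static RelNode optimize(VolcanoPlanner planner, RelNode rootRel) {
    planner.clear();  // drop whatever default rules the planner was created with

    // The two rules from the mail: rotate (A join B) join C in both directions.
    planner.addRule(JoinPushThroughJoinRule.LEFT);
    planner.addRule(JoinPushThroughJoinRule.RIGHT);

    // JoinCommuteRule (the modern equivalent of the old SwapJoinRule, per the
    // thread); push-through alone cannot reach every join order without it.
    planner.addRule(JoinCommuteRule.INSTANCE);

    // Register the root, then ask the planner for the cheapest expression it
    // found among the equivalence classes (RelSets/RelSubsets) it built.
    planner.setRoot(rootRel);
    return planner.findBestExp();
  }
}
```

Note that `findBestExp()` only picks among plans reachable in the root's RelSubset (same RelSet *and* same traits), which is why the RelSubset-vs-RelSet distinction raised in question 2 matters for whether the alternate join order is ever returned.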
