Re: Volcano Planner for TPCDS query

John Pullokkaran Wed, 07 Oct 2015 19:44:42 -0700

#1 You also need SwapJoinRule.
If you look at Hive branch 13 or the Hive CBO branch, you should find
the code that used Volcano.


#2 I believe it is possible, since the traits propagated from below
could be different.
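
To make answer #2 concrete, here is a toy Python model of the idea, not Calcite's actual classes or algorithm. It assumes only that equivalent expressions share a set, that each expression is filed under a subset keyed by its traits, and that the best plan is read from the root's subset. The trait string "HIVE.[0]" and the "converted" entry are hypothetical; the costs are the cumulative costs quoted in the original mail.

```python
from dataclasses import dataclass, field


@dataclass
class Rel:
    name: str
    traits: str   # convention plus collation, e.g. "HIVE.[]"
    cost: float   # cumulative cost in rows


@dataclass
class RelSet:
    """Toy equivalence class: every rel in it produces the same rows."""
    subsets: dict = field(default_factory=dict)  # traits -> list of Rel

    def register(self, rel):
        # A rel is filed under the subset that matches its own traits.
        self.subsets.setdefault(rel.traits, []).append(rel)

    def best(self, traits):
        # Cheapest rel filed under exactly these traits, or None if empty.
        return min(self.subsets.get(traits, []),
                   key=lambda r: r.cost, default=None)


root_traits = "HIVE.[]"
rel_set = RelSet()
rel_set.register(Rel("HiveJoin id=131", root_traits, 3133795.037494761))

# Hypothetical: the rewritten tree arrives with different traits, so it
# lands in the same set but in a different subset than the root.
rel_set.register(Rel("HiveProject id=194", "HIVE.[0]", 2888347.3617503582))
best_before = rel_set.best(root_traits)

# Hypothetical stand-in for a conversion that makes the cheaper rel
# available under the root's traits.
rel_set.register(Rel("HiveProject id=194 (converted)", root_traits,
                     2888347.3617503582))
best_after = rel_set.best(root_traits)

print(best_before.name)
print(best_after.name)
```

In this toy, the cheaper expression is only picked once something with the root's traits represents it; whether that matches what happens inside Hive's planner setup would need to be checked against the actual trait sets of the rels involved.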

John

On 10/7/15, 3:53 PM, "Raajay" wrote:

>Hello,
>
>I am trying to optimize a TPCDS query (#3) in Hive using the Volcano
>planner. I have included snippets of the query and the pre-Volcano
>optimization query plan below. HiveSort, HiveTableScan, etc. are basically
>extensions of the Sort, TableScan, etc. relational operators defined in
>Calcite. Hive by default uses the HepPlanner, whereas I wish to use the
>Volcano planner.
>
>For this query in particular, I clear all the default rules from the
>Volcano Planner and just include the following two rules:
>
>JoinPushThroughJoinRule:right and JoinPushThroughJoinRule:left
>
>
>While executing the optimization I can observe that the "left" rule
>kicks in and an alternate join order is generated. I can also see that the
>cumulative cost of the new join order is less than that of the original
>join order. Please find a snippet of the recursive display of the new join
>order below.
>
>However, findBestExp does not return a plan with the modified join order
>:(
>
>
>1. Are these two rules sufficient? If not, why? Also, what other rules are
>required for this particular query?
>
>
>2. Is it possible for a new sub-tree, created by a rule match on a root
>node, not to be put in the same RelSubset as the root node? If yes, will
>the newly generated plan be considered while building the cheapest plan? I
>ask this specifically because I found that the new operators
>(HiveProject, id=194 below) that were generated were not put in the same
>RelSubset but were in the same RelSet.
>
>
>Thanks a lot for your patience in reading this long mail :) I am hoping to
>get some info to get the Volcano planner going for Hive.
>
>Thanks
>Raajay
>
>
>
>*The query looks like this:*
>
>select  dt.d_year
>       ,item.i_brand_id brand_id
>       ,item.i_brand brand
>       ,sum(ss_ext_sales_price) sum_agg
> from  date_dim dt
>      ,store_sales
>      ,item
> where dt.d_date_sk = store_sales.ss_sold_date_sk
>   and store_sales.ss_item_sk = item.i_item_sk
>   and item.i_manufact_id = 436
>   and dt.d_moy=12
> group by dt.d_year
>      ,item.i_brand
>      ,item.i_brand_id
> order by dt.d_year
>         ,sum_agg desc
>         ,brand_id
> limit 100;
>
>
>*The query plan before passing it to the Volcano planner looks like this:*
>
>HiveSort(fetch=[100]): rowcount = 354.9838716449557, cumulative cost = {3133795.037494761 rows, 0.0 cpu, 0.0 io}, id = 141
>  HiveSort(sort0=[$0], sort1=[$3], sort2=[$1], dir0=[ASC], dir1=[DESC], dir2=[ASC]): rowcount = 354.9838716449557, cumulative cost = {3133795.037494761 rows, 0.0 cpu, 0.0 io}, id = 139
>    HiveProject(d_year=[$0], brand_id=[$2], brand=[$1], sum_agg=[$3]): rowcount = 354.9838716449557, cumulative cost = {3133795.037494761 rows, 0.0 cpu, 0.0 io}, id = 137
>      HiveAggregate(group=[{0, 1, 2}], agg#0=[sum($3)]): rowcount = 354.9838716449557, cumulative cost = {3133795.037494761 rows, 0.0 cpu, 0.0 io}, id = 135
>        HiveProject($f0=[$1], $f1=[$8], $f2=[$7], $f3=[$5]): rowcount = 358.53076132315454, cumulative cost = {3133795.037494761 rows, 0.0 cpu, 0.0 io}, id = 133
>          HiveJoin(condition=[=($4, $6)], joinType=[inner], algorithm=[none], cost=[{247770.8067255299 rows, 0.0 cpu, 0.0 io}]): rowcount = 358.53076132315454, cumulative cost = {3133795.037494761 rows, 0.0 cpu, 0.0 io}, id = 131
>            HiveJoin(condition=[=($0, $3)], joinType=[inner], algorithm=[none], cost=[{2886024.230769231 rows, 0.0 cpu, 0.0 io}]): rowcount = 247744.7560742998, cumulative cost = {2886024.230769231 rows, 0.0 cpu, 0.0 io}, id = 124
>              HiveProject(d_date_sk=[$0], d_year=[$6], d_moy=[$8]): rowcount = 5619.2307692307695, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 151
>                HiveFilter(condition=[=($8, 12)]): rowcount = 5619.2307692307695, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 148
>                  HiveTableScan(table=[[tpcds_small.date_dim]]): rowcount = 73050.0, cumulative cost = {0}, id = 101
>              HiveProject(ss_sold_date_sk=[$0], ss_item_sk=[$2], ss_ext_sales_price=[$15]): rowcount = 2880405.0, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 122
>                HiveTableScan(table=[[tpcds_small.store_sales]]): rowcount = 2880405.0, cumulative cost = {0}, id = 104
>            HiveProject(i_item_sk=[$0], i_brand_id=[$7], i_brand=[$8], i_manufact_id=[$13]): rowcount = 26.050651230101302, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 146
>              HiveFilter(condition=[=($13, 436)]): rowcount = 26.050651230101302, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 143
>                HiveTableScan(table=[[tpcds_small.item]]): rowcount = 18001.0, cumulative cost = {0}, id = 107
>
>
>*The new join order looks like this:*
>
>HiveProject(d_date_sk=[$7], d_year=[$8], d_moy=[$9], ss_sold_date_sk=[$4], ss_item_sk=[$5], ss_ext_sales_price=[$6], i_item_sk=[$0], i_brand_id=[$1], i_brand=[$2], i_manufact_id=[$3]): rowcount = 197.5727739722679, cumulative cost = {2888347.3617503582 rows, 0.0 cpu, 0.0 io}, id = 194
>  HiveJoin(condition=[=($7, $4)], joinType=[inner], algorithm=[none], cost=[{7916.31109912852 rows, 0.0 cpu, 0.0 io}]): rowcount = 197.5727739722679, cumulative cost = {2888347.3617503582 rows, 0.0 cpu, 0.0 io}, id = 193
>    HiveJoin(condition=[=($5, $0)], joinType=[inner], algorithm=[none], cost=[{2880431.05065123 rows, 0.0 cpu, 0.0 io}]): rowcount = 2297.0803298977507, cumulative cost = {2880431.05065123 rows, 0.0 cpu, 0.0 io}, id = 192
>      HiveProject(i_item_sk=[$0], i_brand_id=[$7], i_brand=[$8], i_manufact_id=[$13]): rowcount = 26.050651230101302, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 166
>        HiveFilter(subset=[rel#165:Subset#7.HIVE.[]], condition=[=($13, 436)]): rowcount = 26.050651230101302, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 164
>          HiveTableScan(subset=[rel#163:Subset#6.HIVE.[]], table=[[tpcds_small.item]]): rowcount = 18001.0, cumulative cost = {0}, id = 107
>      HiveProject(subset=[rel#160:Subset#4.HIVE.[]], ss_sold_date_sk=[$0], ss_item_sk=[$2], ss_ext_sales_price=[$15]): rowcount = 2880405.0, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 159
>        HiveTableScan(subset=[rel#158:Subset#3.HIVE.[]], table=[[tpcds_small.store_sales]]): rowcount = 2880405.0, cumulative cost = {0}, id = 104
>    HiveProject(subset=[rel#157:Subset#2.HIVE.[]], d_date_sk=[$0], d_year=[$6], d_moy=[$8]): rowcount = 5619.2307692307695, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 156
>      HiveFilter(subset=[rel#155:Subset#1.HIVE.[]], condition=[=($8, 12)]): rowcount = 5619.2307692307695, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 154
>        HiveTableScan(subset=[rel#153:Subset#0.HIVE.[]], table=[[tpcds_small.date_dim]]): rowcount = 73050.0, cumulative cost = {0}, id = 101
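
The cost comparison described in the quoted mail can be checked directly from the numbers in the two plan dumps. The Python sketch below is only arithmetic on those figures. It assumes, as the dumps suggest, that a plan's cumulative cost is the sum of its per-join costs, since every other operator in the dumps adds 0.0 rows. The variable and function names are made up for illustration.

```python
import math

# Per-join costs (in rows), keyed by the rel ids shown in the plan dumps.
original_joins = {
    131: 247770.8067255299,   # (date_dim join store_sales) join item
    124: 2886024.230769231,   # date_dim join store_sales
}
reordered_joins = {
    193: 7916.31109912852,    # (item join store_sales) join date_dim
    192: 2880431.05065123,    # item join store_sales
}

# Cumulative costs reported on the top join of each plan.
reported_original = 3133795.037494761
reported_reordered = 2888347.3617503582


def cumulative(join_costs):
    """Sum the per-join costs of one plan."""
    return sum(join_costs.values())


original_total = cumulative(original_joins)
reordered_total = cumulative(reordered_joins)

# Confirm that the summed join costs reproduce the reported totals.
sums_match = (
    math.isclose(original_total, reported_original, rel_tol=1e-9)
    and math.isclose(reordered_total, reported_reordered, rel_tol=1e-9)
)

saving = reported_original - reported_reordered
print(f"sums match reported totals: {sums_match}")
print(f"saving: {saving:.2f} rows ({100 * saving / reported_original:.2f}%)")
```

The reordered plan is cheaper mainly because its top join is far less expensive than the original top join, while the bottom joins cost nearly the same in both plans.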
