Hello,

I am trying to optimize  a TPCDS query (#3) in Hive using the Volcano
planner. I have included snippets of the query and the pre-Volcano
optimization query plan below. HiveSort, HiveTableScan, etc  are basically
extensions of Sort, TableScan Relational operators defined in calcite.
Hive by default uses the HepPlanner, where as I wish to use the Volcano
planner.

For this query in particular, I clear all the default rules from the
Volcano Planner and just include the following two rules:

JoinPushThroughJoinRule:right and JoinPushThroughJoinRule:left


While executing the optimization I am able to observe that the "left" rule
kicks in and an alternate join order in generated. I can also see that the
cumulative cost of the new join order is less than the original join
order.  Please find a snippet of the recursive display of the new join
order below.

However, findBestExp does not return a plan with the modified join order :(


1. Are these two rules sufficient ? If not, why ? Also, what other rules
required for this particular query.


2. Is it possible that a new sub-tree created upon a rule match on a root
node, to be not put in the same RelSubSet as the root node. If yes, will
the new generated plan be considered while building the cheapest plan. I
ask this question specifically because, I found that the new operators
(HiveProject, id=194 below) that were generated were not put in the same
RelSubSet but were in the same RelSet.


Thanks a lot for your patience in reading this long mail :) Hoping, that I
get some info to get Volcano Planner going for hive.

Thanks
Raajay



* The query looks like this:*

select  dt.d_year
       ,item.i_brand_id brand_id
       ,item.i_brand brand
       ,sum(ss_ext_sales_price) sum_agg
 from  date_dim dt
      ,store_sales
      ,item
 where dt.d_date_sk = store_sales.ss_sold_date_sk
   and store_sales.ss_item_sk = item.i_item_sk
   and item.i_manufact_id = 436
   and dt.d_moy=12
 group by dt.d_year
      ,item.i_brand
      ,item.i_brand_id
 order by dt.d_year
         ,sum_agg desc
         ,brand_id
 limit 100;


*The query plan before passing to Volcano planner is looks like this:*

HiveSort(fetch=[100]): rowcount = 354.9838716449557, cumulative cost =
{3133795.037494761 rows, 0.0 cpu, 0.0 io}, id = 141
  HiveSort(sort0=[$0], sort1=[$3], sort2=[$1], dir0=[ASC], dir1=[DESC],
dir2=[ASC]): rowcount = 354.9838716449557, cumulative cost =
{3133795.037494761 rows, 0.0 cpu, 0.0 io}, id = 139
    HiveProject(d_year=[$0], brand_id=[$2], brand=[$1], sum_agg=[$3]):
rowcount = 354.9838716449557, cumulative cost = {3133795.037494761 rows,
0.0 cpu, 0.0 io}, id = 137
      HiveAggregate(group=[{0, 1, 2}], agg#0=[sum($3)]): rowcount =
354.9838716449557, cumulative cost = {3133795.037494761 rows, 0.0 cpu, 0.0
io}, id = 135
        HiveProject($f0=[$1], $f1=[$8], $f2=[$7], $f3=[$5]): rowcount =
358.53076132315454, cumulative cost = {3133795.037494761 rows, 0.0 cpu, 0.0
io}, id = 133
          HiveJoin(condition=[=($4, $6)], joinType=[inner],
algorithm=[none], cost=[{247770.8067255299 rows, 0.0 cpu, 0.0 io}]):
rowcount = 358.53076132315454, cumulative cost = {3133795.037494761 rows,
0.0 cpu, 0.0 io}, id = 131
            HiveJoin(condition=[=($0, $3)], joinType=[inner],
algorithm=[none], cost=[{2886024.230769231 rows, 0.0 cpu, 0.0 io}]):
rowcount = 247744.7560742998, cumulative cost = {2886024.230769231 rows,
0.0 cpu, 0.0 io}, id = 124
              HiveProject(d_date_sk=[$0], d_year=[$6], d_moy=[$8]):
rowcount = 5619.2307692307695, cumulative cost = {0.0 rows, 0.0 cpu, 0.0
io}, id = 151
                HiveFilter(condition=[=($8, 12)]): rowcount =
5619.2307692307695, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 148
                  HiveTableScan(table=[[tpcds_small.date_dim]]): rowcount =
73050.0, cumulative cost = {0}, id = 101
              HiveProject(ss_sold_date_sk=[$0], ss_item_sk=[$2],
ss_ext_sales_price=[$15]): rowcount = 2880405.0, cumulative cost = {0.0
rows, 0.0 cpu, 0.0 io}, id = 122
                HiveTableScan(table=[[tpcds_small.store_sales]]): rowcount
= 2880405.0, cumulative cost = {0}, id = 104
            HiveProject(i_item_sk=[$0], i_brand_id=[$7], i_brand=[$8],
i_manufact_id=[$13]): rowcount = 26.050651230101302, cumulative cost = {0.0
rows, 0.0 cpu, 0.0 io}, id = 146
              HiveFilter(condition=[=($13, 436)]): rowcount =
26.050651230101302, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 143
                HiveTableScan(table=[[tpcds_small.item]]): rowcount =
18001.0, cumulative cost = {0}, id = 107


*The new join order looks like this:*

HiveProject(d_date_sk=[$7], d_year=[$8], d_moy=[$9], ss_sold_date_sk=[$4],
ss_item_sk=[$5], ss_ext_sales_price=[$6], i_item_sk=[$0], i_brand_id=[$1],
i_brand=[$2], i_manufact_id=[$3]): rowcount = 197.5727739722679, cumulative
cost = {2888347.3617503582 rows, 0.0 cpu, 0.0 io}, id = 194
  HiveJoin(condition=[=($7, $4)], joinType=[inner], algorithm=[none],
cost=[{7916.31109912852 rows, 0.0 cpu, 0.0 io}]): rowcount =
197.5727739722679, cumulative cost = {2888347.3617503582 rows, 0.0 cpu, 0.0
io}, id = 193
    HiveJoin(condition=[=($5, $0)], joinType=[inner], algorithm=[none],
cost=[{2880431.05065123 rows, 0.0 cpu, 0.0 io}]): rowcount =
2297.0803298977507, cumulative cost = {2880431.05065123 rows, 0.0 cpu, 0.0
io}, id = 192
      HiveProject(i_item_sk=[$0], i_brand_id=[$7], i_brand=[$8],
i_manufact_id=[$13]): rowcount = 26.050651230101302, cumulative cost = {0.0
rows, 0.0 cpu, 0.0 io}, id = 166
        HiveFilter(subset=[rel#165:Subset#7.HIVE.[]], condition=[=($13,
436)]): rowcount = 26.050651230101302, cumulative cost = {0.0 rows, 0.0
cpu, 0.0 io}, id = 164
          HiveTableScan(subset=[rel#163:Subset#6.HIVE.[]],
table=[[tpcds_small.item]]): rowcount = 18001.0, cumulative cost = {0}, id
= 107
      HiveProject(subset=[rel#160:Subset#4.HIVE.[]], ss_sold_date_sk=[$0],
ss_item_sk=[$2], ss_ext_sales_price=[$15]): rowcount = 2880405.0,
cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 159
        HiveTableScan(subset=[rel#158:Subset#3.HIVE.[]],
table=[[tpcds_small.store_sales]]): rowcount = 2880405.0, cumulative cost =
{0}, id = 104
    HiveProject(subset=[rel#157:Subset#2.HIVE.[]], d_date_sk=[$0],
d_year=[$6], d_moy=[$8]): rowcount = 5619.2307692307695, cumulative cost =
{0.0 rows, 0.0 cpu, 0.0 io}, id = 156
      HiveFilter(subset=[rel#155:Subset#1.HIVE.[]], condition=[=($8, 12)]):
rowcount = 5619.2307692307695, cumulative cost = {0.0 rows, 0.0 cpu, 0.0
io}, id = 154
        HiveTableScan(subset=[rel#153:Subset#0.HIVE.[]],
table=[[tpcds_small.date_dim]]): rowcount = 73050.0, cumulative cost = {0},
id = 101

Reply via email to