If it would help, I would be happy to join the call. Let me know.

Julian
> On Oct 8, 2015, at 10:26 AM, John Pullokkaran <[email protected]> wrote:
>
> Raajay,
>
> Let's set up a call.
> I would like to better understand what you are trying to achieve.
> It would be good to revive Hive's use of Volcano.
>
> I am out of town till next Monday.
>
> Thanks
> John
>
> On 10/8/15, 10:14 AM, "Raajay" <[email protected]> wrote:
>
>> I was finally able to get Volcano Planner working with Hive by adding
>> other rules defined in "ql/optimizer/calcite/rules/" in the hive repo :)
>>
>> Thanks John for the pointers; it would be good to learn about traits, so
>> that we can reduce the number of active rules to improve Volcano
>> Planner's efficiency.
>>
>> Thanks
>> Raajay
>>
>> On Thu, Oct 8, 2015 at 11:12 AM, Raajay <[email protected]> wrote:
>>
>>> Thanks for the suggestions.
>>>
>>> 1. The current version of Calcite does not have the SwapJoinRule
>>> defined. I guess the equivalent now is JoinCommuteRule. Having that is
>>> not sufficient to get the alternate join order.
>>>
>>> 2. What are some good pointers for learning more about what traits mean
>>> and how they are used in query planning? Thanks.
>>>
>>> - Raajay
>>>
>>> On Wed, Oct 7, 2015 at 9:43 PM, John Pullokkaran <
>>> [email protected]> wrote:
>>>
>>>> #1 You need SwapJoinRule.
>>>> Actually, if you go to Hive branch 13 or the Hive CBO branch you
>>>> should get to the code that used Volcano.
>>>>
>>>> #2 I believe it is possible, since the trait propagated from below
>>>> could be different.
>>>>
>>>> John
>>>>
>>>> On 10/7/15, 3:53 PM, "Raajay" <[email protected]> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> I am trying to optimize a TPC-DS query (#3) in Hive using the Volcano
>>>>> planner. I have included snippets of the query and the pre-Volcano
>>>>> optimization query plan below. HiveSort, HiveTableScan, etc. are
>>>>> basically extensions of the Sort and TableScan relational operators
>>>>> defined in Calcite.
>>>>>
>>>>> Hive by default uses the HepPlanner, whereas I wish to use the
>>>>> Volcano planner.
>>>>>
>>>>> For this query in particular, I clear all the default rules from the
>>>>> Volcano Planner and just include the following two rules:
>>>>>
>>>>> JoinPushThroughJoinRule:right and JoinPushThroughJoinRule:left
>>>>>
>>>>> While executing the optimization I am able to observe that the "left"
>>>>> rule kicks in and an alternate join order is generated. I can also
>>>>> see that the cumulative cost of the new join order is less than that
>>>>> of the original join order. Please find a snippet of the recursive
>>>>> display of the new join order below.
>>>>>
>>>>> However, findBestExp does not return a plan with the modified join
>>>>> order :(
>>>>>
>>>>> 1. Are these two rules sufficient? If not, why? Also, what other
>>>>> rules are required for this particular query?
>>>>>
>>>>> 2. Is it possible that a new sub-tree created upon a rule match on a
>>>>> root node is not put in the same RelSubset as the root node? If yes,
>>>>> will the newly generated plan be considered while building the
>>>>> cheapest plan? I ask this question specifically because I found that
>>>>> the new operators (HiveProject, id=194 below) that were generated
>>>>> were not put in the same RelSubset, but were in the same RelSet.
>>>>>
>>>>> Thanks a lot for your patience in reading this long mail :) Hoping
>>>>> that I get some info to get Volcano Planner going for Hive.
>>>>>
>>>>> Thanks
>>>>> Raajay
>>>>>
>>>>> *The query looks like this:*
>>>>>
>>>>> select dt.d_year
>>>>>        ,item.i_brand_id brand_id
>>>>>        ,item.i_brand brand
>>>>>        ,sum(ss_ext_sales_price) sum_agg
>>>>> from date_dim dt
>>>>>      ,store_sales
>>>>>      ,item
>>>>> where dt.d_date_sk = store_sales.ss_sold_date_sk
>>>>>   and store_sales.ss_item_sk = item.i_item_sk
>>>>>   and item.i_manufact_id = 436
>>>>>   and dt.d_moy = 12
>>>>> group by dt.d_year
>>>>>         ,item.i_brand
>>>>>         ,item.i_brand_id
>>>>> order by dt.d_year
>>>>>         ,sum_agg desc
>>>>>         ,brand_id
>>>>> limit 100;
>>>>>
>>>>> *The query plan before passing to the Volcano planner looks like this:*
>>>>>
>>>>> HiveSort(fetch=[100]): rowcount = 354.9838716449557, cumulative cost = {3133795.037494761 rows, 0.0 cpu, 0.0 io}, id = 141
>>>>>   HiveSort(sort0=[$0], sort1=[$3], sort2=[$1], dir0=[ASC], dir1=[DESC], dir2=[ASC]): rowcount = 354.9838716449557, cumulative cost = {3133795.037494761 rows, 0.0 cpu, 0.0 io}, id = 139
>>>>>     HiveProject(d_year=[$0], brand_id=[$2], brand=[$1], sum_agg=[$3]): rowcount = 354.9838716449557, cumulative cost = {3133795.037494761 rows, 0.0 cpu, 0.0 io}, id = 137
>>>>>       HiveAggregate(group=[{0, 1, 2}], agg#0=[sum($3)]): rowcount = 354.9838716449557, cumulative cost = {3133795.037494761 rows, 0.0 cpu, 0.0 io}, id = 135
>>>>>         HiveProject($f0=[$1], $f1=[$8], $f2=[$7], $f3=[$5]): rowcount = 358.53076132315454, cumulative cost = {3133795.037494761 rows, 0.0 cpu, 0.0 io}, id = 133
>>>>>           HiveJoin(condition=[=($4, $6)], joinType=[inner], algorithm=[none], cost=[{247770.8067255299 rows, 0.0 cpu, 0.0 io}]): rowcount = 358.53076132315454, cumulative cost = {3133795.037494761 rows, 0.0 cpu, 0.0 io}, id = 131
>>>>>             HiveJoin(condition=[=($0, $3)], joinType=[inner], algorithm=[none], cost=[{2886024.230769231 rows, 0.0 cpu, 0.0 io}]): rowcount = 247744.7560742998, cumulative cost = {2886024.230769231 rows, 0.0 cpu, 0.0 io}, id = 124
>>>>>               HiveProject(d_date_sk=[$0], d_year=[$6], d_moy=[$8]): rowcount = 5619.2307692307695, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 151
>>>>>                 HiveFilter(condition=[=($8, 12)]): rowcount = 5619.2307692307695, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 148
>>>>>                   HiveTableScan(table=[[tpcds_small.date_dim]]): rowcount = 73050.0, cumulative cost = {0}, id = 101
>>>>>               HiveProject(ss_sold_date_sk=[$0], ss_item_sk=[$2], ss_ext_sales_price=[$15]): rowcount = 2880405.0, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 122
>>>>>                 HiveTableScan(table=[[tpcds_small.store_sales]]): rowcount = 2880405.0, cumulative cost = {0}, id = 104
>>>>>             HiveProject(i_item_sk=[$0], i_brand_id=[$7], i_brand=[$8], i_manufact_id=[$13]): rowcount = 26.050651230101302, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 146
>>>>>               HiveFilter(condition=[=($13, 436)]): rowcount = 26.050651230101302, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 143
>>>>>                 HiveTableScan(table=[[tpcds_small.item]]): rowcount = 18001.0, cumulative cost = {0}, id = 107
>>>>>
>>>>> *The new join order looks like this:*
>>>>>
>>>>> HiveProject(d_date_sk=[$7], d_year=[$8], d_moy=[$9], ss_sold_date_sk=[$4], ss_item_sk=[$5], ss_ext_sales_price=[$6], i_item_sk=[$0], i_brand_id=[$1], i_brand=[$2], i_manufact_id=[$3]): rowcount = 197.5727739722679, cumulative cost = {2888347.3617503582 rows, 0.0 cpu, 0.0 io}, id = 194
>>>>>   HiveJoin(condition=[=($7, $4)], joinType=[inner], algorithm=[none], cost=[{7916.31109912852 rows, 0.0 cpu, 0.0 io}]): rowcount = 197.5727739722679, cumulative cost = {2888347.3617503582 rows, 0.0 cpu, 0.0 io}, id = 193
>>>>>     HiveJoin(condition=[=($5, $0)], joinType=[inner], algorithm=[none], cost=[{2880431.05065123 rows, 0.0 cpu, 0.0 io}]): rowcount = 2297.0803298977507, cumulative cost = {2880431.05065123 rows, 0.0 cpu, 0.0 io}, id = 192
>>>>>       HiveProject(i_item_sk=[$0], i_brand_id=[$7], i_brand=[$8], i_manufact_id=[$13]): rowcount = 26.050651230101302, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 166
>>>>>         HiveFilter(subset=[rel#165:Subset#7.HIVE.[]], condition=[=($13, 436)]): rowcount = 26.050651230101302, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 164
>>>>>           HiveTableScan(subset=[rel#163:Subset#6.HIVE.[]], table=[[tpcds_small.item]]): rowcount = 18001.0, cumulative cost = {0}, id = 107
>>>>>       HiveProject(subset=[rel#160:Subset#4.HIVE.[]], ss_sold_date_sk=[$0], ss_item_sk=[$2], ss_ext_sales_price=[$15]): rowcount = 2880405.0, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 159
>>>>>         HiveTableScan(subset=[rel#158:Subset#3.HIVE.[]], table=[[tpcds_small.store_sales]]): rowcount = 2880405.0, cumulative cost = {0}, id = 104
>>>>>     HiveProject(subset=[rel#157:Subset#2.HIVE.[]], d_date_sk=[$0], d_year=[$6], d_moy=[$8]): rowcount = 5619.2307692307695, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 156
>>>>>       HiveFilter(subset=[rel#155:Subset#1.HIVE.[]], condition=[=($8, 12)]): rowcount = 5619.2307692307695, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 154
>>>>>         HiveTableScan(subset=[rel#153:Subset#0.HIVE.[]], table=[[tpcds_small.date_dim]]): rowcount = 73050.0, cumulative cost = {0}, id = 101
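For readers following along, the setup Raajay describes (registering only the two JoinPushThroughJoinRule variants, plus the JoinCommuteRule mentioned as the successor of SwapJoinRule, and then calling findBestExp) can be sketched roughly as below. This is a minimal sketch only, assuming Calcite's core planner API; it presumes a RelNode tree has already been built against a VolcanoPlanner-backed cluster, and all Hive-specific operators, trait definitions, and cost models are omitted. It is not the actual Hive integration code.

```java
// Sketch: driving Calcite's VolcanoPlanner with only the join-reordering
// rules discussed in the thread. Requires Calcite on the classpath.
import org.apache.calcite.plan.RelOptPlanner;
import org.apache.calcite.plan.volcano.VolcanoPlanner;
import org.apache.calcite.rel.RelNode;
import org.apache.calcite.rel.rules.JoinCommuteRule;
import org.apache.calcite.rel.rules.JoinPushThroughJoinRule;

public class VolcanoJoinReorderSketch {

  /**
   * Optimizes {@code root} with only the join-reordering rules.
   * Assumes the RelNode tree was created against a cluster whose
   * planner is a VolcanoPlanner (hypothetical caller responsibility).
   */
  static RelNode optimize(RelNode root) {
    // The planner is obtained from the cluster the tree was built with;
    // a freshly constructed VolcanoPlanner carries no transformation
    // rules, so only the rules added below will fire.
    RelOptPlanner planner = root.getCluster().getPlanner();

    // The two rules from the original question: push a join through the
    // left or right input of a join below it.
    planner.addRule(JoinPushThroughJoinRule.LEFT);
    planner.addRule(JoinPushThroughJoinRule.RIGHT);

    // JoinCommuteRule (successor of the old SwapJoinRule) swaps join
    // inputs; together with push-through it reaches join orders that
    // push-through alone cannot.
    planner.addRule(JoinCommuteRule.INSTANCE);

    // Register the tree in the planner's desired traits, then search
    // the RelSet/RelSubset memo for the cheapest equivalent plan.
    RelNode registered = planner.changeTraits(root, root.getTraitSet());
    planner.setRoot(registered);
    return planner.findBestExp();
  }
}
```

One design note implied by the thread: findBestExp only considers alternatives that land in the root's RelSubset (same RelSet *and* same traits), which is why a rewrite that ends up in a sibling subset of the same RelSet, as with HiveProject id=194 above, can be cheaper yet never be returned.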
