If it would help, I would be happy to join the call. Let me know.

Julian
> On Oct 8, 2015, at 10:26 AM, John Pullokkaran <[email protected]> wrote:
>
> Raajay,
>
> Let's set up a call.
> I would like to better understand what you are trying to achieve.
> It would be good to revive Hive's use of Volcano.
>
> I am out of town till next Monday.
>
> Thanks
> John
>
> On 10/8/15, 10:14 AM, "Raajay" <[email protected]> wrote:
>
>> I was finally able to get Volcano Planner working with Hive by adding
>> other rules defined in "ql/optimizer/calcite/rules/" in the hive repo :)
>>
>> Thanks John for the pointers; it would be good to learn about traits, so
>> that we can reduce the number of active rules to improve Volcano
>> Planner's efficiency.
>>
>> Thanks
>> Raajay
>>
>> On Thu, Oct 8, 2015 at 11:12 AM, Raajay <[email protected]> wrote:
>>
>>> Thanks for the suggestions.
>>>
>>> 1. The current version of Calcite does not have the SwapJoinRule
>>> defined. I guess the equivalent now is JoinCommuteRule. Having that is
>>> not sufficient to get the alternate join order.
>>>
>>> 2. What are some good pointers for learning more about what traits mean
>>> and how they are used in query planning? Thanks.
>>>
>>> - Raajay
>>>
>>> On Wed, Oct 7, 2015 at 9:43 PM, John Pullokkaran <
>>> [email protected]> wrote:
>>>
>>>> #1 You need SwapJoinRule.
>>>> Actually, if you go to Hive branch 13 or the Hive CBO branch you
>>>> should get to the code that used Volcano.
>>>>
>>>> #2 I believe it is possible, since the trait propagated from below
>>>> could be different.
>>>>
>>>> John
>>>>
>>>> On 10/7/15, 3:53 PM, "Raajay" <[email protected]> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> I am trying to optimize a TPC-DS query (#3) in Hive using the Volcano
>>>>> planner. I have included snippets of the query and the pre-Volcano
>>>>> optimization query plan below. HiveSort, HiveTableScan, etc. are
>>>>> basically extensions of the Sort and TableScan relational operators
>>>>> defined in Calcite.
>>>>>
>>>>> Hive by default uses the HepPlanner, whereas I wish to use the
>>>>> Volcano planner.
>>>>>
>>>>> For this query in particular, I clear all the default rules from the
>>>>> Volcano Planner and just include the following two rules:
>>>>>
>>>>> JoinPushThroughJoinRule:right and JoinPushThroughJoinRule:left
>>>>>
>>>>> While executing the optimization I am able to observe that the "left"
>>>>> rule kicks in and an alternate join order is generated. I can also
>>>>> see that the cumulative cost of the new join order is less than that
>>>>> of the original join order. Please find a snippet of the recursive
>>>>> display of the new join order below.
>>>>>
>>>>> However, findBestExp does not return a plan with the modified join
>>>>> order :(
>>>>>
>>>>> 1. Are these two rules sufficient? If not, why? Also, what other
>>>>> rules are required for this particular query?
>>>>>
>>>>> 2. Is it possible that a new sub-tree created upon a rule match on a
>>>>> root node is not put in the same RelSubset as the root node? If yes,
>>>>> will the newly generated plan be considered while building the
>>>>> cheapest plan? I ask this question specifically because I found that
>>>>> the new operators (HiveProject, id=194 below) that were generated
>>>>> were not put in the same RelSubset, but were in the same RelSet.
>>>>>
>>>>> Thanks a lot for your patience in reading this long mail :) Hoping
>>>>> that I get some info to get Volcano Planner going for Hive.
>>>>>
>>>>> Thanks
>>>>> Raajay
>>>>>
>>>>> *The query looks like this:*
>>>>>
>>>>> select dt.d_year
>>>>>        ,item.i_brand_id brand_id
>>>>>        ,item.i_brand brand
>>>>>        ,sum(ss_ext_sales_price) sum_agg
>>>>> from date_dim dt
>>>>>      ,store_sales
>>>>>      ,item
>>>>> where dt.d_date_sk = store_sales.ss_sold_date_sk
>>>>>   and store_sales.ss_item_sk = item.i_item_sk
>>>>>   and item.i_manufact_id = 436
>>>>>   and dt.d_moy = 12
>>>>> group by dt.d_year
>>>>>         ,item.i_brand
>>>>>         ,item.i_brand_id
>>>>> order by dt.d_year
>>>>>         ,sum_agg desc
>>>>>         ,brand_id
>>>>> limit 100;
>>>>>
>>>>> *The query plan before passing to the Volcano planner looks like this:*
>>>>>
>>>>> HiveSort(fetch=[100]): rowcount = 354.9838716449557, cumulative cost = {3133795.037494761 rows, 0.0 cpu, 0.0 io}, id = 141
>>>>>   HiveSort(sort0=[$0], sort1=[$3], sort2=[$1], dir0=[ASC], dir1=[DESC], dir2=[ASC]): rowcount = 354.9838716449557, cumulative cost = {3133795.037494761 rows, 0.0 cpu, 0.0 io}, id = 139
>>>>>     HiveProject(d_year=[$0], brand_id=[$2], brand=[$1], sum_agg=[$3]): rowcount = 354.9838716449557, cumulative cost = {3133795.037494761 rows, 0.0 cpu, 0.0 io}, id = 137
>>>>>       HiveAggregate(group=[{0, 1, 2}], agg#0=[sum($3)]): rowcount = 354.9838716449557, cumulative cost = {3133795.037494761 rows, 0.0 cpu, 0.0 io}, id = 135
>>>>>         HiveProject($f0=[$1], $f1=[$8], $f2=[$7], $f3=[$5]): rowcount = 358.53076132315454, cumulative cost = {3133795.037494761 rows, 0.0 cpu, 0.0 io}, id = 133
>>>>>           HiveJoin(condition=[=($4, $6)], joinType=[inner], algorithm=[none], cost=[{247770.8067255299 rows, 0.0 cpu, 0.0 io}]): rowcount = 358.53076132315454, cumulative cost = {3133795.037494761 rows, 0.0 cpu, 0.0 io}, id = 131
>>>>>             HiveJoin(condition=[=($0, $3)], joinType=[inner], algorithm=[none], cost=[{2886024.230769231 rows, 0.0 cpu, 0.0 io}]): rowcount = 247744.7560742998, cumulative cost = {2886024.230769231 rows, 0.0 cpu, 0.0 io}, id = 124
>>>>>               HiveProject(d_date_sk=[$0], d_year=[$6], d_moy=[$8]): rowcount = 5619.2307692307695, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 151
>>>>>                 HiveFilter(condition=[=($8, 12)]): rowcount = 5619.2307692307695, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 148
>>>>>                   HiveTableScan(table=[[tpcds_small.date_dim]]): rowcount = 73050.0, cumulative cost = {0}, id = 101
>>>>>               HiveProject(ss_sold_date_sk=[$0], ss_item_sk=[$2], ss_ext_sales_price=[$15]): rowcount = 2880405.0, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 122
>>>>>                 HiveTableScan(table=[[tpcds_small.store_sales]]): rowcount = 2880405.0, cumulative cost = {0}, id = 104
>>>>>             HiveProject(i_item_sk=[$0], i_brand_id=[$7], i_brand=[$8], i_manufact_id=[$13]): rowcount = 26.050651230101302, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 146
>>>>>               HiveFilter(condition=[=($13, 436)]): rowcount = 26.050651230101302, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 143
>>>>>                 HiveTableScan(table=[[tpcds_small.item]]): rowcount = 18001.0, cumulative cost = {0}, id = 107
>>>>>
>>>>> *The new join order looks like this:*
>>>>>
>>>>> HiveProject(d_date_sk=[$7], d_year=[$8], d_moy=[$9], ss_sold_date_sk=[$4], ss_item_sk=[$5], ss_ext_sales_price=[$6], i_item_sk=[$0], i_brand_id=[$1], i_brand=[$2], i_manufact_id=[$3]): rowcount = 197.5727739722679, cumulative cost = {2888347.3617503582 rows, 0.0 cpu, 0.0 io}, id = 194
>>>>>   HiveJoin(condition=[=($7, $4)], joinType=[inner], algorithm=[none], cost=[{7916.31109912852 rows, 0.0 cpu, 0.0 io}]): rowcount = 197.5727739722679, cumulative cost = {2888347.3617503582 rows, 0.0 cpu, 0.0 io}, id = 193
>>>>>     HiveJoin(condition=[=($5, $0)], joinType=[inner], algorithm=[none], cost=[{2880431.05065123 rows, 0.0 cpu, 0.0 io}]): rowcount = 2297.0803298977507, cumulative cost = {2880431.05065123 rows, 0.0 cpu, 0.0 io}, id = 192
>>>>>       HiveProject(i_item_sk=[$0], i_brand_id=[$7], i_brand=[$8], i_manufact_id=[$13]): rowcount = 26.050651230101302, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 166
>>>>>         HiveFilter(subset=[rel#165:Subset#7.HIVE.[]], condition=[=($13, 436)]): rowcount = 26.050651230101302, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 164
>>>>>           HiveTableScan(subset=[rel#163:Subset#6.HIVE.[]], table=[[tpcds_small.item]]): rowcount = 18001.0, cumulative cost = {0}, id = 107
>>>>>       HiveProject(subset=[rel#160:Subset#4.HIVE.[]], ss_sold_date_sk=[$0], ss_item_sk=[$2], ss_ext_sales_price=[$15]): rowcount = 2880405.0, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 159
>>>>>         HiveTableScan(subset=[rel#158:Subset#3.HIVE.[]], table=[[tpcds_small.store_sales]]): rowcount = 2880405.0, cumulative cost = {0}, id = 104
>>>>>     HiveProject(subset=[rel#157:Subset#2.HIVE.[]], d_date_sk=[$0], d_year=[$6], d_moy=[$8]): rowcount = 5619.2307692307695, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 156
>>>>>       HiveFilter(subset=[rel#155:Subset#1.HIVE.[]], condition=[=($8, 12)]): rowcount = 5619.2307692307695, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 154
>>>>>         HiveTableScan(subset=[rel#153:Subset#0.HIVE.[]], table=[[tpcds_small.date_dim]]): rowcount = 73050.0, cumulative cost = {0}, id = 101
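For readers following along, the setup Raajay describes (registering only the two JoinPushThroughJoinRule variants, plus the JoinCommuteRule mentioned as the successor of SwapJoinRule, and then calling findBestExp) can be sketched roughly as below. This is a minimal sketch only, assuming Calcite's core planner API; it presumes a RelNode tree has already been built against a VolcanoPlanner-backed cluster, and all Hive-specific operators, trait definitions, and cost models are omitted. It is not the actual Hive integration code.

```java
// Sketch: driving Calcite's VolcanoPlanner with only the join-reordering
// rules discussed in the thread. Requires Calcite on the classpath.
import org.apache.calcite.plan.RelOptPlanner;
import org.apache.calcite.plan.volcano.VolcanoPlanner;
import org.apache.calcite.rel.RelNode;
import org.apache.calcite.rel.rules.JoinCommuteRule;
import org.apache.calcite.rel.rules.JoinPushThroughJoinRule;

public class VolcanoJoinReorderSketch {

  /**
   * Optimizes {@code root} with only the join-reordering rules.
   * Assumes the RelNode tree was created against a cluster whose
   * planner is a VolcanoPlanner (hypothetical caller responsibility).
   */
  static RelNode optimize(RelNode root) {
    // The planner is obtained from the cluster the tree was built with;
    // a freshly constructed VolcanoPlanner carries no transformation
    // rules, so only the rules added below will fire.
    RelOptPlanner planner = root.getCluster().getPlanner();

    // The two rules from the original question: push a join through the
    // left or right input of a join below it.
    planner.addRule(JoinPushThroughJoinRule.LEFT);
    planner.addRule(JoinPushThroughJoinRule.RIGHT);

    // JoinCommuteRule (successor of the old SwapJoinRule) swaps join
    // inputs; together with push-through it reaches join orders that
    // push-through alone cannot.
    planner.addRule(JoinCommuteRule.INSTANCE);

    // Register the tree in the planner's desired traits, then search
    // the RelSet/RelSubset memo for the cheapest equivalent plan.
    RelNode registered = planner.changeTraits(root, root.getTraitSet());
    planner.setRoot(registered);
    return planner.findBestExp();
  }
}
```

One design note implied by the thread: findBestExp only considers alternatives that land in the root's RelSubset (same RelSet *and* same traits), which is why a rewrite that ends up in a sibling subset of the same RelSet, as with HiveProject id=194 above, can be cheaper yet never be returned.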
