jcamachor commented on a change in pull request #1439:
URL: https://github.com/apache/hive/pull/1439#discussion_r485325338



##########
File path: ql/src/test/results/clientpositive/llap/constraints_optimization.q.out
##########
@@ -2631,13 +2631,12 @@ POSTHOOK: Input: default@customer
 POSTHOOK: Input: default@store_sales
 #### A masked pattern was here ####
 CBO PLAN:
-HiveAggregate(group=[{0}])
-  HiveJoin(condition=[=($0, $8)], joinType=[inner], algorithm=[none], cost=[not available])
-    HiveProject(c_customer_sk=[$0], c_customer_id=[$1], c_first_name=[$8], c_last_name=[$9], c_preferred_cust_flag=[$10], c_birth_country=[$14], c_login=[$15], c_email_address=[$16])
-      HiveTableScan(table=[[default, customer]], table:alias=[customer])
-    HiveProject(ss_customer_sk=[$3])
-      HiveFilter(condition=[IS NOT NULL($3)])
-        HiveTableScan(table=[[default, store_sales]], table:alias=[store_sales])
+HiveSemiJoin(condition=[=($0, $1)], joinType=[semi])

Review comment:
       This is quite neat.
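   For readers following along: the rewrite fires because the grouping column is a unique key of `customer`. A minimal HiveQL sketch (inferred from the plan above, not copied from the actual test query) of the query shape that can now collapse into a `HiveSemiJoin`:

   ```sql
   -- Sketch under the assumption that c_customer_sk is the unique key of customer:
   -- aggregating the join result on that unique key is equivalent to probing
   -- store_sales for existence, i.e. a semi-join.
   select c.c_customer_sk
   from customer c
   join store_sales ss on c.c_customer_sk = ss.ss_customer_sk
   group by c.c_customer_sk;
   ```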

##########
File path: ql/src/test/results/clientpositive/perf/tez/constraints/cbo_query23.q.out
##########
@@ -155,18 +155,19 @@ HiveAggregate(group=[{}], agg#0=[sum($0)])
                                     HiveTableScan(table=[[default, date_dim]], table:alias=[date_dim])
           HiveProject($f1=[$0])
             HiveFilter(condition=[>($2, 4)])
-              HiveProject(i_item_sk=[$1], d_date=[$0], $f2=[$2])
-                HiveAggregate(group=[{3, 4}], agg#0=[count()])
-                  HiveJoin(condition=[=($1, $4)], joinType=[inner], algorithm=[none], cost=[not available])
-                    HiveJoin(condition=[=($0, $2)], joinType=[inner], algorithm=[none], cost=[not available])
-                      HiveProject(ss_sold_date_sk=[$0], ss_item_sk=[$2])
-                        HiveFilter(condition=[IS NOT NULL($0)])
-                          HiveTableScan(table=[[default, store_sales]], table:alias=[store_sales])
-                      HiveProject(d_date_sk=[$0], d_date=[$2])
-                        HiveFilter(condition=[IN($6, 1999, 2000, 2001, 2002)])
-                          HiveTableScan(table=[[default, date_dim]], table:alias=[date_dim])
-                    HiveProject(i_item_sk=[$0], substr=[substr($4, 1, 30)])
-                      HiveTableScan(table=[[default, item]], table:alias=[item])
+              HiveProject(i_item_sk=[$3], d_date=[$1], $f2=[$2])
+                HiveJoin(condition=[=($0, $3)], joinType=[inner], algorithm=[none], cost=[not available])
+                  HiveProject(ss_item_sk=[$0], d_date=[$1], $f2=[$2])
+                    HiveAggregate(group=[{1, 3}], agg#0=[count()])

Review comment:
       Cool!
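   For reference, the change above shows the aggregate being transposed below the `item` join. A hedged sketch of the switches that appear to enable this behavior (the same ones set in the `tpch18.q` test added by this PR; the exact semantics of the `.unique` flag are my reading of the change, not confirmed here):

   ```sql
   -- hive.transpose.aggr.join enables aggregate/join transposition;
   -- the .unique variant appears to allow the transpose when the
   -- grouping is known to stay unique, as in the plan above.
   set hive.transpose.aggr.join=true;
   set hive.transpose.aggr.join.unique=true;
   ```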

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveAggregateJoinTransposeRule.java
##########
@@ -303,6 +305,90 @@ public void onMatch(RelOptRuleCall call) {
     }
   }
 
+  /**
+   * Determines whether the given grouping is unique.
+   *
+   * Consider a join which might produce non-unique rows, but whose results are later aggregated again.
+   * This method determines if there are sufficient columns in the grouping which have been present previously as unique column(s).
+   */
+  private boolean isGroupingUnique(RelNode input, ImmutableBitSet groups) {
+    if (groups.isEmpty()) {
+      return false;
+    }
+    RelMetadataQuery mq = input.getCluster().getMetadataQuery();
+    Set<ImmutableBitSet> uKeys = mq.getUniqueKeys(input);
+    for (ImmutableBitSet u : uKeys) {
+      if (groups.contains(u)) {
+        return true;
+      }
+    }
+    if (input instanceof Join) {
+      Join join = (Join) input;
+      RexBuilder rexBuilder = input.getCluster().getRexBuilder();
+      SimpleConditionInfo cond = new SimpleConditionInfo(join.getCondition(), rexBuilder);
+
+      if (cond.valid) {
+        ImmutableBitSet newGroup = groups.intersect(ImmutableBitSet.fromBitSet(cond.fields));
+        RelNode l = join.getLeft();
+        RelNode r = join.getRight();
+
+        int joinFieldCount = join.getRowType().getFieldCount();
+        int lFieldCount = l.getRowType().getFieldCount();
+
+        ImmutableBitSet groupL = newGroup.get(0, lFieldCount);
+        ImmutableBitSet groupR = newGroup.get(lFieldCount, joinFieldCount).shift(-lFieldCount);
+
+        if (isGroupingUnique(l, groupL)) {

Review comment:
       As you go down recursively, you may start finding `HepRelVertex` as the rel node. I think you hit the `instanceof Project` below only for the first Project because you create it using the builder.
   That is why I think once you have gone through the first join, you will not hit another join; you could simply call `areColumnsUnique` in these `if` clauses, which could potentially uncover new cases.
   
   I may be wrong though; I just wanted to leave you a final note to make sure it was clear what I meant.

##########
File path: ql/src/test/queries/clientpositive/tpch18.q
##########
@@ -0,0 +1,133 @@
+--! qt:dataset:tpch_0_001.customer
+--! qt:dataset:tpch_0_001.lineitem
+--! qt:dataset:tpch_0_001.nation
+--! qt:dataset:tpch_0_001.orders
+--! qt:dataset:tpch_0_001.part
+--! qt:dataset:tpch_0_001.partsupp
+--! qt:dataset:tpch_0_001.region
+--! qt:dataset:tpch_0_001.supplier
+
+
+use tpch_0_001;
+
+set hive.transpose.aggr.join=true;
+set hive.transpose.aggr.join.unique=true;
+set hive.mapred.mode=nonstrict;
+
+create view q18_tmp_cached as
+select
+       l_orderkey,
+       sum(l_quantity) as t_sum_quantity
+from
+       lineitem
+where
+       l_orderkey is not null
+group by
+       l_orderkey;
+
+
+
+explain cbo select
+c_name,
+c_custkey,
+o_orderkey,
+o_orderdate,
+o_totalprice,
+sum(l_quantity)
+from
+       customer,
+       orders,
+       q18_tmp_cached t,
+       lineitem l
+where
+c_custkey = o_custkey
+and o_orderkey = t.l_orderkey
+and o_orderkey is not null
+and t.t_sum_quantity > 300
+and o_orderkey = l.l_orderkey
+and l.l_orderkey is not null
+group by
+c_name,
+c_custkey,
+o_orderkey,
+o_orderdate,
+o_totalprice
+order by
+o_totalprice desc,
+o_orderdate
+limit 100;
+
+
+
+select 'add constraints';
+
+alter table orders add constraint pk_o primary key (o_orderkey) disable novalidate rely;
+alter table customer add constraint pk_c primary key (c_custkey) disable novalidate rely;
+

Review comment:
       Can we declare {{(L_ORDERKEY, L_LINENUMBER)}} as primary key? That way 
we will be able to verify whether HIVE-24087 is kicking in and removing the 
unnecessary join. 
   We can also try the variant with {{L_ORDERKEY NOT NULL}} constraint.
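   The two suggested variants might look like the following, mirroring the `alter table` statements already in the test; the constraint names (`pk_l`, `nn_l`) and the `bigint` column type are assumptions for illustration:

   ```sql
   -- Variant 1: composite primary key on lineitem (constraint name pk_l is hypothetical).
   alter table lineitem add constraint pk_l primary key (l_orderkey, l_linenumber) disable novalidate rely;

   -- Variant 2: NOT NULL on l_orderkey only (constraint name nn_l and bigint type assumed).
   alter table lineitem change column l_orderkey l_orderkey bigint constraint nn_l not null disable novalidate rely;
   ```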




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


