[GitHub] [hive] zabetak commented on a diff in pull request #4442: HIVE-27399: Add support for lateral views with CBO

via GitHub Mon, 11 Sep 2023 08:59:44 -0700


zabetak commented on code in PR #4442:
URL: https://github.com/apache/hive/pull/4442#discussion_r1321538915



##########
ql/src/test/results/clientpositive/llap/lateral_view_ppd.q.out:
##########
@@ -316,29 +332,36 @@ STAGE PLANS:
           filterExpr: (key = '0') (type: boolean)
           Filter Operator
             predicate: (key = '0') (type: boolean)
-            Lateral View Forward
-              Select Operator
-                expressions: value (type: string)
-                outputColumnNames: value
-                Lateral View Join Operator
-                  outputColumnNames: _col1, _col6
-                  Select Operator
-                    expressions: _col1 (type: string), _col6 (type: int)
-                    outputColumnNames: _col0, _col1
-                    ListSink
-              Select Operator
-                expressions: array(1,2,3) (type: array<int>)
-                outputColumnNames: _col0
-                UDTF Operator
-                  function name: explode
-                  Filter Operator
-                    predicate: (col > 1) (type: boolean)

Review Comment:
   The `(col > 1)` predicate is applied twice after the changes in this PR; 
once before the `Lateral View Join Operator` and once afterwards. Is this 
normal? Do we know why?
   
   Can we add the `EXPLAIN CBO` for this or an equivalent query inside 
`lateral_view_cbo.q`?



##########
ql/src/test/results/clientpositive/llap/create_view.q.out:
##########
@@ -1121,7 +1121,7 @@ Sort Columns:             []
                 
 # View Information              
 Original Query:        SELECT * FROM src LATERAL VIEW explode(array(1,2,3)) 
myTable AS myCol    
-Expanded Query:        SELECT `src`.`key`, `src`.`value`, `mytable`.`mycol` 
FROM `default`.`src` LATERAL VIEW explode(array(1,2,3)) `myTable` AS `myCol`    
    

Review Comment:
   Is it okay that the backticks were removed from the expanded query?



##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/Bug.java:
##########
@@ -88,4 +88,9 @@ public final class Bug {
    * Whether <a 
href="https://issues.apache.org/jira/browse/CALCITE-5669";>CALCITE-5669</a> is 
fixed.
    */
   public static final boolean CALCITE_5669_FIXED = false;
+
+  /**
+   * Whether <a 
href="https://issues.apache.org/jira/browse/CALCITE-5669";>CALCITE-5985</a> is 
fixed.

Review Comment:
   `s/CALCITE-5669/CALCITE-5985`



##########
ql/src/test/queries/clientpositive/lateral_view_cbo.q:
##########
@@ -0,0 +1,8 @@
+--! qt:dataset:src
+--Q1
+EXPLAIN CBO SELECT myTable.myCol FROM src 
+LATERAL VIEW explode(array(1,2,3)) myTable AS myCol;

Review Comment:
   Please include an `EXPLAIN CBO` query with `inline(array(1,2,3)` which seems 
to be handled differently.



##########
ql/src/test/queries/clientpositive/lateral_view_cbo.q:
##########
@@ -0,0 +1,8 @@
+--! qt:dataset:src

Review Comment:
   nit: We don't really need data for the plans here so we can drop the usage 
of `src`. A very simple DDL (`CREATE TABLE simple_t (col STRING)`) suffices and 
would make the test self-contained.



##########
ql/src/test/results/clientpositive/tez/tez_union_udtf.q.out:
##########
@@ -140,7 +137,7 @@ POSTHOOK: Input: default@src
 POSTHOOK: Input: default@src1
 POSTHOOK: Output: database:default
 POSTHOOK: Output: default@x
-POSTHOOK: Lineage: x.key EXPRESSION [(src)src.FieldSchema(name:key, 
type:string, comment:default), (src1)src1.FieldSchema(name:key, type:string, 
comment:default), ]

Review Comment:
   This seems lineage change seems like a regression that we may need to look 
into. Did we lose the information that the "key" column comes from the `src` 
table?



##########
ql/src/test/results/clientpositive/llap/lineage2.q.out:
##########
@@ -710,7 +710,7 @@ PREHOOK: query: select identity, ep1_id from relations
 PREHOOK: type: QUERY
 PREHOOK: Input: default@relations
 #### A masked pattern was here ####
-{"version":"1.0","engine":"tez","database":"default","hash":"436a649a0d9540e8f093f8353d86813a","queryText":"select
 identity, ep1_id from relations\n  lateral view explode(ep1_ids) nav_rel as 
ep1_id","edges":[{"sources":[2],"targets":[0],"edgeType":"PROJECTION"},{"sources":[3],"targets":[1],"expression":"nav_rel._col12","edgeType":"PROJECTION"}],"vertices":[{"id":0,"vertexType":"COLUMN","vertexId":"identity"},{"id":1,"vertexType":"COLUMN","vertexId":"ep1_id"},{"id":2,"vertexType":"COLUMN","vertexId":"default.relations.identity"},{"id":3,"vertexType":"COLUMN","vertexId":"default.relations.ep1_ids"}]}

Review Comment:
   I don't know much about how the lineage info is used but I suppose the 
change in the edge name shouldn't generate regressions. Do you have any 
thoughts about this?



##########
ql/src/test/results/clientpositive/llap/udtf_json_tuple.q.out:
##########
@@ -395,24 +398,27 @@ STAGE PLANS:
                           Lateral View Join Operator
                             outputColumnNames: _col6, _col7, _col8, _col9, 
_col10
                             Statistics: Num rows: 6 Data size: 3233 Basic 
stats: COMPLETE Column stats: COMPLETE
-                            Select Operator
-                              expressions: _col7 (type: string)
-                              outputColumnNames: _col7
-                              Statistics: Num rows: 6 Data size: 3233 Basic 
stats: COMPLETE Column stats: COMPLETE
-                              Group By Operator
-                                aggregations: count()
-                                keys: _col7 (type: string)
-                                minReductionHashAggr: 0.8333333
-                                mode: hash
-                                outputColumnNames: _col0, _col1
-                                Statistics: Num rows: 1 Data size: 8 Basic 
stats: COMPLETE Column stats: COMPLETE
-                                Reduce Output Operator
-                                  key expressions: _col0 (type: string)
-                                  null sort order: z
-                                  sort order: +
-                                  Map-reduce partition columns: _col0 (type: 
string)
+                            Filter Operator
+                              predicate: _col6 is not null (type: boolean)

Review Comment:
   A few lines above I see `Filter: c0 is not null`. Is `_col6` the same with 
`c0`? Is this new filter redundant?



##########
ql/src/test/results/clientnegative/udf_assert_true2.q.out:
##########
@@ -36,7 +36,7 @@ STAGE PLANS:
                   Limit
                     Number of rows: 2
                     Select Operator
-                      expressions: (1 + assert_true((_col6 < 2))) (type: int)
+                      expressions: (1 + UDFToInteger(assert_true((_col6 < 
2)))) (type: int)

Review Comment:
   On a second, I guess we don't have to follow up on this. I assume using 
`assert_true` with arithmetic operations would cause the same conversion even 
when lateral views are not present in the query.



##########
ql/src/test/results/clientpositive/llap/ppd_field_garbage.q.out:
##########
@@ -17,7 +17,7 @@ POSTHOOK: Input: default@test_issue
 POSTHOOK: Output: database:default
 POSTHOOK: Output: default@v_test_issue
 POSTHOOK: Lineage: v_test_issue.age EXPRESSION 
[(test_issue)test_issue.FieldSchema(name:test_c, 
type:struct<user_c:struct<age:int>>, comment:null), ]
-POSTHOOK: Lineage: v_test_issue.fileid EXPRESSION 
[(test_issue)test_issue.FieldSchema(name:fileid, type:int, comment:null), ]

Review Comment:
   I am not very familiar with lineage to comment if this change from 
`EXPRESSION` to `SIMPLE` is an improvement or a regression. I checked a bit 
`org.apache.hadoop.hive.ql.hooks.LineageInfo.DependencyType` but not really 
sure what should be the type here. Do you happen to know something more about 
it?



##########
ql/src/test/results/clientpositive/llap/reduce_deduplicate_null_keys.q.out:
##########
@@ -82,7 +82,7 @@ STAGE PLANS:
                           outputColumnNames: _col0, _col1
                           Statistics: Num rows: 1 Data size: 4 Basic stats: 
COMPLETE Column stats: COMPLETE
                           Reduce Output Operator
-                            key expressions: CASE WHEN ((_col0 = _col1)) THEN 
(1) ELSE (2) END (type: int)

Review Comment:
   I guess a result of HIVE-21152 + the fact that now we are using CBO for 
lateral views.



##########
ql/src/test/queries/clientpositive/lateral_view_cbo.q:
##########
@@ -0,0 +1,8 @@
+--! qt:dataset:src
+--Q1
+EXPLAIN CBO SELECT myTable.myCol FROM src 
+LATERAL VIEW explode(array(1,2,3)) myTable AS myCol;

Review Comment:
   Please include an `EXPLAIN CBO` query where the arguments of the table 
function are not constants:  `explode(table.array_col)`.



##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveFilterTableFunctionTransposeRule.java:
##########
@@ -0,0 +1,155 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hive.ql.optimizer.calcite.rules;
+
+import org.apache.calcite.plan.RelOptRule;
+import org.apache.calcite.plan.RelOptRuleCall;
+import org.apache.calcite.plan.RelOptUtil;
+import org.apache.calcite.rel.RelNode;
+import org.apache.calcite.rel.core.Filter;
+import org.apache.calcite.rel.metadata.RelColumnMapping;
+import org.apache.calcite.rex.RexCall;
+import org.apache.calcite.rex.RexNode;
+import org.apache.calcite.rex.RexUtil;
+import org.apache.calcite.tools.RelBuilder;
+import org.apache.calcite.tools.RelBuilderFactory;
+import org.apache.commons.collections4.CollectionUtils;
+import org.apache.hadoop.hive.ql.exec.FunctionRegistry;
+import org.apache.hadoop.hive.ql.optimizer.calcite.HiveCalciteUtil;
+import org.apache.hadoop.hive.ql.optimizer.calcite.HiveRelFactories;
+import org.apache.hadoop.hive.ql.optimizer.calcite.reloperators.HiveFilter;
+import 
org.apache.hadoop.hive.ql.optimizer.calcite.reloperators.HiveTableFunctionScan;
+
+import com.google.common.base.Preconditions;
+import com.google.common.collect.ImmutableList;
+
+import java.util.ArrayList;
+import java.util.List;
+import java.util.Set;
+
+/**
+ * Rule to transpose Filter and TableFunctionScan RelNodes
+ *
+ * We cannot use Calcite's FilterTableFunctionTransposeRule because that rule
+ * uses LogicalFilter and LogicalTableFunctionScan.  We should remove this
+ * class when CALCITE-5985 is fixed (and remove the CALCITE_5985_FIXED entry

Review Comment:
   Please add the following inside `matches` or `onMatch` method.
   ```java
   if (Bug.CALCITE_5985_FIXED) {
     throw new IllegalStateException("Class is redundant after fix is merged 
into Calcite");
   }
   ```
   This makes it easier to find the code that needs to be replaced once the 
issue is fixed.



##########
ql/src/test/results/clientpositive/llap/udtf_json_tuple.q.out:
##########
@@ -395,24 +398,27 @@ STAGE PLANS:
                           Lateral View Join Operator
                             outputColumnNames: _col6, _col7, _col8, _col9, 
_col10
                             Statistics: Num rows: 6 Data size: 3233 Basic 
stats: COMPLETE Column stats: COMPLETE
-                            Select Operator
-                              expressions: _col7 (type: string)
-                              outputColumnNames: _col7
-                              Statistics: Num rows: 6 Data size: 3233 Basic 
stats: COMPLETE Column stats: COMPLETE
-                              Group By Operator
-                                aggregations: count()
-                                keys: _col7 (type: string)
-                                minReductionHashAggr: 0.8333333
-                                mode: hash
-                                outputColumnNames: _col0, _col1
-                                Statistics: Num rows: 1 Data size: 8 Basic 
stats: COMPLETE Column stats: COMPLETE
-                                Reduce Output Operator
-                                  key expressions: _col0 (type: string)
-                                  null sort order: z
-                                  sort order: +
-                                  Map-reduce partition columns: _col0 (type: 
string)
+                            Filter Operator
+                              predicate: _col6 is not null (type: boolean)

Review Comment:
   I see the same pattern (new filter appears) in a few other .q.out files.



##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveOptimizeInlineArrayTableFunctionRule.java:
##########
@@ -124,9 +125,12 @@ public void onMatch(RelOptRuleCall call) {
     RexNode newInlineCall =
         cluster.getRexBuilder().makeCall(tfs.getRowType(), inlineCall.op, 
newArrayCall);
 
+    // Use empty listfor columnMappings. The return row type of the RelNode 
now comprises of
+    // all the fields within the UDTF, so there is no mapping from the output 
fields
+    // directly to the input fields anymore.
     final RelNode newTableFunctionScanNode = tfs.copy(tfs.getTraitSet(),
         tfs.getInputs(), newInlineCall, tfs.getElementType(), tfs.getRowType(),
-        tfs.getColumnMappings());
+        Collections.emptySet());
 
     call.transformTo(newTableFunctionScanNode);

Review Comment:
   I saw the new changes in `LateralViewPlan` and this made me thinking that an 
equivalent thing could be done as part of this rule. Something like match 
`lateral(inline(array(1,2)))` and transform to `inline(array($0,1,2))`.
   
   I don't feel strongly about it so if you prefer to keep the 
`LateralViewPlan.isInlineArray` method there I am OK with it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscr...@hive.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscr...@hive.apache.org
For additional commands, e-mail: gitbox-h...@hive.apache.org

[GitHub] [hive] zabetak commented on a diff in pull request #4442: HIVE-27399: Add support for lateral views with CBO

Reply via email to