zabetak commented on code in PR #4442:
URL: https://github.com/apache/hive/pull/4442#discussion_r1317011777
##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/translator/ASTConverter.java:
##########
@@ -621,6 +644,80 @@ private ASTNode pkFkHint(int fkTableIndex, boolean nonFkSideIsFiltered) {
}
}
+ private ASTNode createASTLateralView(TableFunctionScan tfs, Schema s,
Review Comment:
If it's static, let's declare it explicitly.
##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/translator/ASTConverter.java:
##########
@@ -383,8 +386,16 @@ private ASTNode convert() throws CalciteSemanticException {
ASTBuilder sel = ASTBuilder.construct(HiveParser.TOK_SELEXPR, "TOK_SELEXPR");
ASTNode function = buildUDTFAST(call.getOperator().getName(), children);
sel.add(function);
- for (String alias : udtf.getRowType().getFieldNames()) {
- sel.add(HiveParser.Identifier, alias);
+
+ // When we are treating a UDTF as a 'select', it is not a lateral view.
+ // In this case, it means that only the fields from the UDTF are selected
+ // out of the RelNode and placed into the SELEXPR. So we want to ignore
+ // any field in the inputRef mapping.
+ List<String> fields = udtf.getRowType().getFieldNames();
+ for (int i = 0; i < udtf.getRowType().getFieldCount(); ++i) {
+ if (!udtf.containsInputRefMapping(i)) {
Review Comment:
With the newly introduced `lateral` operator, this `if` check and the related code checking for containment may be redundant. Can we get here with more fields than necessary?
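If it is indeed redundant, the loop could presumably collapse back to something like this (a sketch, valid only if every field of the UDTF row type should then be emitted):
```java
// hypothetical simplification, assuming the lateral operator guarantees
// that no inputRef-mapped fields can reach this point
for (String alias : udtf.getRowType().getFieldNames()) {
  sel.add(HiveParser.Identifier, alias);
}
```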
##########
ql/src/test/results/clientnegative/udf_assert_true2.q.out:
##########
@@ -36,7 +36,7 @@ STAGE PLANS:
Limit
Number of rows: 2
Select Operator
- expressions: (1 + assert_true((_col6 < 2))) (type: int)
+ expressions: (1 + UDFToInteger(assert_true((_col6 < 2)))) (type: int)
Review Comment:
What caused this change? It looks a bit weird.
##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveFilterTableFunctionTransposeRule.java:
##########
@@ -0,0 +1,157 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hive.ql.optimizer.calcite.rules;
+
+import org.apache.calcite.plan.RelOptRule;
+import org.apache.calcite.plan.RelOptRuleCall;
+import org.apache.calcite.plan.RelOptUtil;
+import org.apache.calcite.rel.RelNode;
+import org.apache.calcite.rel.core.Filter;
+import org.apache.calcite.rel.metadata.RelColumnMapping;
+import org.apache.calcite.rex.RexCall;
+import org.apache.calcite.rex.RexNode;
+import org.apache.calcite.rex.RexUtil;
+import org.apache.calcite.tools.RelBuilder;
+import org.apache.calcite.tools.RelBuilderFactory;
+import org.apache.commons.collections4.CollectionUtils;
+import org.apache.hadoop.hive.ql.exec.FunctionRegistry;
+import org.apache.hadoop.hive.ql.optimizer.calcite.HiveCalciteUtil;
+import org.apache.hadoop.hive.ql.optimizer.calcite.HiveRelFactories;
+import org.apache.hadoop.hive.ql.optimizer.calcite.reloperators.HiveFilter;
+import org.apache.hadoop.hive.ql.optimizer.calcite.reloperators.HiveTableFunctionScan;
+
+import com.google.common.base.Preconditions;
+import com.google.common.collect.ImmutableList;
+
+import java.util.ArrayList;
+import java.util.List;
+import java.util.Set;
+
+/**
+ * Rule to transpose Filter and TableFunctionScan RelNodes.
+ *
+ * This differs from the FilterTableFunctionTransposeRule for a number of reasons:
+ * - We check if the filter condition is deterministic under Hive rules. We cannot
+ *   push filters that aren't deterministic.
+ * - We check if the filter only references columns that are not part of the udtf.
+ *   If the input ref in the filter references a column that is generated by the
+ *   udtf, the filter cannot be pushed through.
Review Comment:
Thanks for the analysis. Please do the following:
1. Log a CALCITE ticket reporting the problem with `Logical` operators in `FilterTableFunctionTransposeRule`.
2. Add an entry in `org.apache.hadoop.hive.ql.optimizer.calcite.Bug`.
3. Reference the new entry somewhere in the rule, similarly to how the entries are referenced elsewhere.
4. Update the Javadoc since, as far as I understand, this is the main reason we cannot use the respective Calcite rule.
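For points 2 and 3, a rough sketch of what I have in mind (`XXXXX` is a placeholder for the yet-to-be-logged CALCITE ticket number):
```java
// In org.apache.hadoop.hive.ql.optimizer.calcite.Bug:
/** Whether CALCITE-XXXXX (the Logical-operator problem in
 * FilterTableFunctionTransposeRule) is fixed. */
public static final boolean CALCITE_XXXXX_FIXED = false;

// In HiveFilterTableFunctionTransposeRule, reference the entry so the
// workaround is easy to find and remove once the upstream fix lands, e.g.:
assert !Bug.CALCITE_XXXXX_FIXED
    : "Remove this rule and use Calcite's FilterTableFunctionTransposeRule";
```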
##########
ql/src/test/queries/clientpositive/lateral_view.q:
##########
@@ -10,6 +10,11 @@ EXPLAIN SELECT myTable.* FROM src LATERAL VIEW explode(array(1,2,3)) myTable AS
 EXPLAIN SELECT myTable.myCol, myTable2.myCol2 FROM src LATERAL VIEW explode(array(1,2,3)) myTable AS myCol LATERAL VIEW explode(array('a', 'b', 'c')) myTable2 AS myCol2 LIMIT 9;
 EXPLAIN SELECT myTable2.* FROM src LATERAL VIEW explode(array(array(1,2,3))) myTable AS myCol LATERAL VIEW explode(myTable.myCol) myTable2 AS myCol2 LIMIT 3;
+EXPLAIN CBO SELECT * FROM src LATERAL VIEW explode(array(1,2,3)) myTable AS myCol SORT BY key ASC, myCol ASC LIMIT 1;
Review Comment:
If the goal of adding `EXPLAIN CBO` in every `LATERAL VIEW` query was to ensure that CBO doesn't fail, then maybe we could enforce this differently (e.g., `set hive.cbo.fallback.strategy=never`).
##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/translator/ASTConverter.java:
##########
@@ -621,6 +644,80 @@ private ASTNode pkFkHint(int fkTableIndex, boolean nonFkSideIsFiltered) {
}
}
+  private ASTNode createASTLateralView(TableFunctionScan tfs, Schema s,
+      QueryBlockInfo tableFunctionSource, String sqAlias) {
+    // The structure of the AST LATERAL VIEW will be:
+    //
+    // TOK_LATERAL_VIEW
+    //   TOK_SELECT
+    //     TOK_SELEXPR
+    //       TOK_FUNCTION
+    //         <udtf func>
+    //         ...
+    //       <col alias for function>
+    //       TOK_TABALIAS
+    //         <table alias for lateral view>
+
+    // set up the select for the parameters of the UDTF
+    List<ASTNode> children = new ArrayList<>();
+    // The UDTF function call within the table function scan will be of the form:
+    // lateral(my_udtf_func(...), $0, $1, ...). For recreating the AST, we need
+    // the inner "my_udtf_func".
+    RexCall lateralCall = (RexCall) tfs.getCall();
+    RexCall call = (RexCall) lateralCall.getOperands().get(0);
+    for (RexNode rn : call.getOperands()) {
+      ASTNode expr = rn.accept(new RexVisitor(s, rn instanceof RexLiteral,
+          select.getCluster().getRexBuilder()));
+      children.add(expr);
+    }
+    ASTNode function = buildUDTFAST(call.getOperator().getName(), children);
+
+    // Add the function to the SELEXPR
+    ASTBuilder selexpr = ASTBuilder.construct(HiveParser.TOK_SELEXPR, "TOK_SELEXPR");
+    selexpr.add(function);
+
+    // Add only the table function generated columns to the select expr for the function,
+    // skipping over the base table columns from the input side of the join.
+    int i = 0;
+    for (ColumnInfo c : s) {
+      if (i++ < tableFunctionSource.schema.size()) {
+        continue;
+      }
+      selexpr.add(HiveParser.Identifier, c.column);
+    }
+    // add the table alias for the lateral view.
+    ASTBuilder tabAlias = ASTBuilder.construct(HiveParser.TOK_TABALIAS, "TOK_TABALIAS");
+    tabAlias.add(HiveParser.Identifier, sqAlias);
+
+    // add the table alias to the SELEXPR
+    selexpr.add(tabAlias.node());
+
+    // create the SELECT clause
+    ASTBuilder sel = ASTBuilder.construct(HiveParser.TOK_SELECT, "TOK_SELECT");
+    sel.add(selexpr.node());
+
+    // place the SELECT clause under the LATERAL VIEW clause
+    ASTBuilder lateralview = ASTBuilder.construct(HiveParser.TOK_LATERAL_VIEW, "TOK_LATERAL_VIEW");
+    lateralview.add(sel.node());
+
+    // finally, add the LATERAL VIEW clause under the left side source, which is the base table.
+    lateralview.add(tableFunctionSource.ast);
+
+    return lateralview.node();
+  }
+
+  /**
+   * Check to see if we can optimize out the lateral view operators.
+   * We do not need to use the lateral view syntax if all of the fields
+   * selected out of the table scan come from the UDTF call. No join
+   * is needed because all the fields come from the table level rather
+   * than the row level.
+   */
Review Comment:
The comment seems obsolete since the method does not check any fields.
##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/translator/ASTConverter.java:
##########
@@ -577,6 +588,18 @@ private QueryBlockInfo convertSource(RelNode r) throws CalciteSemanticException
ast = ASTBuilder.subQuery(left, sqAlias);
s = new Schema((Union) r, sqAlias);
}
+ } else if (r instanceof HiveTableFunctionScan &&
Review Comment:
We can put the `instanceof` check inside the `isLateralView` method and simplify the code here to just `isLateralView(r)`.
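Something like this (a sketch; `isLateralOperator` is a placeholder name for whatever the remaining condition on the right of the `&&` currently is):
```java
private static boolean isLateralView(RelNode r) {
  return r instanceof HiveTableFunctionScan
      && isLateralOperator((HiveTableFunctionScan) r); // placeholder for the existing check
}
```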
##########
ql/src/test/results/clientpositive/llap/tablevalues.q.out:
##########
@@ -175,14 +175,27 @@ STAGE PLANS:
         TableScan
           alias: mytbl_n1
           Select Operator
-            expressions: array(struct(key,value,'A',10,key),struct(key,value,'B',20,key)) (type: array<struct<col1:string,col2:string,col3:string,col4:int,col5:string>>)
+            expressions: key (type: string)
             outputColumnNames: _col0
-            UDTF Operator
-              function name: inline
+            Lateral View Forward
               Select Operator
-                expressions: col3 (type: string), col4 (type: int), col5 (type: string)
-                outputColumnNames: _col0, _col1, _col2
-                ListSink
+                Lateral View Join Operator
+                  outputColumnNames: _col2, _col3, _col4
+                  Select Operator
+                    expressions: _col2 (type: string), _col3 (type: int), _col4 (type: string)
+                    outputColumnNames: _col0, _col1, _col2
+                    ListSink
+              Select Operator
+                expressions: array(struct('A',10,_col0),struct('B',20,_col0)) (type: array<struct<col1:string,col2:int,col3:string>>)
+                outputColumnNames: _col0
+                UDTF Operator
+                  function name: inline
+                  Lateral View Join Operator
+                    outputColumnNames: _col2, _col3, _col4
+                    Select Operator
+                      expressions: _col2 (type: string), _col3 (type: int), _col4 (type: string)
+                      outputColumnNames: _col0, _col1, _col2
+                      ListSink
Review Comment:
This seems like a regression. Probably it was caused by the introduction of
lateral in `HiveTableFunctionScan`. Do we need to revisit that choice?
##########
ql/src/java/org/apache/hadoop/hive/ql/parse/CalcitePlanner.java:
##########
@@ -1024,30 +1022,8 @@ boolean isCBOExecuted() {
   @Override
   boolean isCBOSupportedLateralView(ASTNode lateralView) {
-    // Lateral view AST has the following shape:
-    // ^(TOK_LATERAL_VIEW
-    //     ^(TOK_SELECT ^(TOK_SELEXPR ^(TOK_FUNCTION Identifier params) identifier* tableAlias)))
-    if (lateralView.getToken().getType() == HiveParser.TOK_LATERAL_VIEW_OUTER) {
-      // LATERAL VIEW OUTER not supported in CBO
-      return false;
-    }
-    // Only INLINE followed by ARRAY supported in CBO
-    ASTNode lvFunc = (ASTNode) lateralView.getChild(0).getChild(0).getChild(0);
-    String lvFuncName = lvFunc.getChild(0).getText();
-    if (lvFuncName.compareToIgnoreCase(
-        GenericUDTFInline.class.getAnnotation(Description.class).name()) != 0) {
-      return false;
-    }
-    if (lvFunc.getChildCount() != 2) {
-      return false;
-    }
Review Comment:
I believe the `if (lvFunc.getChildCount() != 2)` check was redundant anyway, since by this line we have already checked that the function is `GenericUDTFInline`, and by definition it cannot have more than one input.
Since here we are reasoning at the AST level, I believe the first child corresponds to the function name token and the second child to the token representing the input argument to the inline function.
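To illustrate (the shape is inferred from the grammar comment above, not verified against actual parser output):
```java
// AST subtree for the function in `LATERAL VIEW inline(array(...)) t AS c1, c2`:
//
// TOK_FUNCTION
//   Identifier["inline"]   <- child 0: the function name token
//   TOK_FUNCTION           <- child 1: the single array(...) argument
//     Identifier["array"]
//     ...
//
// Once the name is known to be "inline", getChildCount() == 2 follows by definition.
```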
##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveOptimizeInlineArrayTableFunctionRule.java:
##########
@@ -124,9 +125,12 @@ public void onMatch(RelOptRuleCall call) {
     RexNode newInlineCall =
         cluster.getRexBuilder().makeCall(tfs.getRowType(), inlineCall.op, newArrayCall);
+    // Use an empty set for columnMappings. The return row type of the RelNode now
+    // comprises all the fields within the UDTF, so there is no mapping from the
+    // output fields directly to the input fields anymore.
     final RelNode newTableFunctionScanNode = tfs.copy(tfs.getTraitSet(),
         tfs.getInputs(), newInlineCall, tfs.getElementType(), tfs.getRowType(),
-        tfs.getColumnMappings());
+        Collections.emptySet());
     call.transformTo(newTableFunctionScanNode);
Review Comment:
Does this rule still work as-is, now that we are wrapping the table function with the `LATERAL` operator? Do we want/need to add special cases if we have lateral + inline array?
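If a special case turns out to be necessary, perhaps something like this would do (a sketch against the rule's existing `tfs` variable; it assumes the wrapper call reports `SqlKind.LATERAL`, which should be verified against the new operator):
```java
// Unwrap a possible LATERAL wrapper before matching on inline(array(...)).
RexCall top = (RexCall) tfs.getCall();
RexCall inlineCall = top.getKind() == SqlKind.LATERAL
    ? (RexCall) top.getOperands().get(0) // the inner inline(...) call
    : top;                               // no wrapper, use the call directly
```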
##########
ql/src/java/org/apache/hadoop/hive/ql/ppd/OpProcFactory.java:
##########
@@ -504,27 +504,32 @@ public Object process(Node nd, Stack<Node> stack, NodeProcessorCtx procCtx,
       // We try to push the full Filter predicate iff:
       // - the Filter is on top of a TableScan, or
       // - the Filter is on top of a PTF (between PTF and Filter, there might be Select operators)
+      // - the Filter is on top of a LateralViewJoinOperator (the filter can be pushed through one
+      //   side of the join with the base table predicate, but not the UDTF side.)
Review Comment:
I don't fully understand this optimization. The comment says that we want to fully push the filter if it is on top of a LateralViewJoinOperator, but then it seems that we have restrictions on which side of the join we are going to push the filter. Are these restrictions enforced somewhere later on?
##########
ql/src/test/queries/clientpositive/lateral_view.q:
##########
@@ -10,6 +10,11 @@ EXPLAIN SELECT myTable.* FROM src LATERAL VIEW explode(array(1,2,3)) myTable AS
 EXPLAIN SELECT myTable.myCol, myTable2.myCol2 FROM src LATERAL VIEW explode(array(1,2,3)) myTable AS myCol LATERAL VIEW explode(array('a', 'b', 'c')) myTable2 AS myCol2 LIMIT 9;
 EXPLAIN SELECT myTable2.* FROM src LATERAL VIEW explode(array(array(1,2,3))) myTable AS myCol LATERAL VIEW explode(myTable.myCol) myTable2 AS myCol2 LIMIT 3;
+EXPLAIN CBO SELECT * FROM src LATERAL VIEW explode(array(1,2,3)) myTable AS myCol SORT BY key ASC, myCol ASC LIMIT 1;
Review Comment:
I suggested creating a new .q file (`cbo_lateral_views.q`) and adding test cases there, to put more focus on what we really want to test as part of this change. We want to verify that the CBO plan for certain types of queries is the expected one. Determining the types of queries is the whole point of having a dedicated .q file.
**Examples**
Do we care about the LIMIT clause and the actual number there? Probably not, so we don't need to add EXPLAIN CBO for that one.
Do we care about how many times we have `LATERAL VIEW explode(array(1,2,3))` in the query? Probably not, one or two should suffice.
```sql
--Q1
SELECT myTable.myCol FROM src
LATERAL VIEW explode(array(1,2,3)) myTable AS myCol;
--Q2
SELECT myTable.myCol, myTable2.myCol2 FROM src
LATERAL VIEW explode(array(1,2,3)) myTable AS myCol
LATERAL VIEW explode(array(1,2,3)) myTable2 AS myCol2;
--Q3
SELECT myTable.myCol, myTable2.myCol2, myTable3.myCol3 FROM src
LATERAL VIEW explode(array(1,2,3)) myTable AS myCol
LATERAL VIEW explode(array(1,2,3)) myTable2 AS myCol2
LATERAL VIEW explode(array(1,2,3)) myTable3 AS myCol3;
```
If the CBO plan is correct for Q1 and Q2, there is no reason for it to be wrong for Q3, Q4, etc.
Do we care about the actual table function inside the `LATERAL VIEW` clause? Probably not, we don't need to test every possible variation of `explode`, `inline`, etc.
Do we want our tables to have data, and thus use `src` and other similar tables? Probably not.
All that to say that having many EXPLAIN CBO statements does not necessarily increase test coverage, and it has some drawbacks as well. Picking representative queries and putting them in a dedicated file seems like a better alternative.
##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/translator/ASTConverter.java:
##########
@@ -577,6 +588,18 @@ private QueryBlockInfo convertSource(RelNode r) throws CalciteSemanticException
ast = ASTBuilder.subQuery(left, sqAlias);
s = new Schema((Union) r, sqAlias);
}
+ } else if (r instanceof HiveTableFunctionScan &&
Review Comment:
Maybe it suffices to check for the super-class type `TableFunctionScan`
since it is the one used in the cast just below.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.