zabetak commented on code in PR #4442:
URL: https://github.com/apache/hive/pull/4442#discussion_r1288646268
##########
ql/src/test/queries/clientpositive/lateral_view.q:
##########
@@ -10,6 +10,11 @@ EXPLAIN SELECT myTable.* FROM src LATERAL VIEW explode(array(1,2,3)) myTable AS
 EXPLAIN SELECT myTable.myCol, myTable2.myCol2 FROM src LATERAL VIEW explode(array(1,2,3)) myTable AS myCol LATERAL VIEW explode(array('a', 'b', 'c')) myTable2 AS myCol2 LIMIT 9;
 EXPLAIN SELECT myTable2.* FROM src LATERAL VIEW explode(array(array(1,2,3))) myTable AS myCol LATERAL VIEW explode(myTable.myCol) myTable2 AS myCol2 LIMIT 3;
+EXPLAIN CBO SELECT * FROM src LATERAL VIEW explode(array(1,2,3)) myTable AS myCol SORT BY key ASC, myCol ASC LIMIT 1;
Review Comment:
It is a good idea to verify the `EXPLAIN CBO` plan in the presence of lateral views. I am wondering if it would be better to identify a few representative queries and put the EXPLAIN CBO statements in a new .q file (e.g., `cbo_lateral_views.q`), going from the simpler to the more complicated queries. That would make the testing more explicit and would also reduce the changes in existing .q.out files, leading to fewer merge conflicts when backporting. WDYT?
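To make the suggestion concrete, the new file could start from the queries already touched by this diff (file name and query selection are just a sketch, not a prescription):

```sql
-- cbo_lateral_views.q (hypothetical), ordered from simple to complex:

-- single lateral view
EXPLAIN CBO SELECT myTable.* FROM src LATERAL VIEW explode(array(1,2,3)) myTable AS myCol;

-- two stacked lateral views
EXPLAIN CBO SELECT myTable.myCol, myTable2.myCol2 FROM src
  LATERAL VIEW explode(array(1,2,3)) myTable AS myCol
  LATERAL VIEW explode(array('a', 'b', 'c')) myTable2 AS myCol2 LIMIT 9;

-- lateral view referencing the column produced by a previous lateral view
EXPLAIN CBO SELECT myTable2.* FROM src
  LATERAL VIEW explode(array(array(1,2,3))) myTable AS myCol
  LATERAL VIEW explode(myTable.myCol) myTable2 AS myCol2 LIMIT 3;
```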
##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveOptimizeInlineArrayTableFunctionRule.java:
##########
@@ -124,9 +125,12 @@ public void onMatch(RelOptRuleCall call) {
     RexNode newInlineCall =
         cluster.getRexBuilder().makeCall(tfs.getRowType(), inlineCall.op, newArrayCall);
+    // Use an empty set for columnMappings. The return row type of the RelNode now comprises
+    // all the fields within the UDTF, so there is no mapping from the output fields
+    // directly to the input fields anymore.
     final RelNode newTableFunctionScanNode = tfs.copy(tfs.getTraitSet(), tfs.getInputs(), newInlineCall, tfs.getElementType(), tfs.getRowType(),
-        tfs.getColumnMappings());
+        new HashSet<>());
Review Comment:
nit: `Collections.emptySet()` is a more efficient choice unless we want to
modify the set.
##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/translator/ASTConverter.java:
##########
@@ -577,6 +588,77 @@ private QueryBlockInfo convertSource(RelNode r) throws CalciteSemanticException
         ast = ASTBuilder.subQuery(left, sqAlias);
         s = new Schema((Union) r, sqAlias);
       }
+    } else if (r instanceof HiveTableFunctionScan &&
+        !canOptimizeOutLateralView((HiveTableFunctionScan) r)) {
Review Comment:
Is this optimization something that we could defer to a follow-up ticket so that we converge and merge this PR faster? That would also make the diff and git history easier to follow.
If we want to push it here, then let me share what I was thinking with a concrete SQL example grabbed from the existing tests.
```sql
SELECT myCol from tmp_pyang_lv LATERAL VIEW explode(array(1,2,3)) myTab as myCol limit 3
```
The current plan is the following:
```
CBO PLAN:
HiveSortLimit(fetch=[3])
  HiveProject(mycol=[$5])
    HiveTableFunctionScan(invocation=[explode(ARRAY(1, 2, 3))], rowType=[RecordType(VARCHAR(2147483647) inputs, BIGINT BLOCK__OFFSET__INSIDE__FILE, VARCHAR(2147483647) INPUT__FILE__NAME, RecordType(BIGINT writeid, INTEGER bucketid, BIGINT rowid) ROW__ID, BOOLEAN ROW__IS__DELETED, INTEGER mytab.mycol)])
      HiveTableScan(table=[[default, tmp_pyang_lv]], table:alias=[tmp_pyang_lv])
```
I was thinking that we could have a CBO optimization that turns the plan
above to something like the following:
```
CBO PLAN:
HiveSortLimit(fetch=[3])
  HiveTableFunctionScan(invocation=[explode(ARRAY(1, 2, 3))], rowType=[RecordType(INTEGER mytab.mycol)])
```
##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/translator/ASTConverter.java:
##########
@@ -621,6 +644,109 @@ private ASTNode pkFkHint(int fkTableIndex, boolean nonFkSideIsFiltered) {
     }
   }
+  private ASTNode createASTLateralView(TableFunctionScan tfs, Schema s,
+      QueryBlockInfo tableFunctionSource, String sqAlias) {
+    // In the case where the RelNode is a HiveTableFunctionScan, first we check
+    // to see if we can optimize out the lateral view operator. We can optimize the
+    // operator out if only the udtf fields are grabbed out of the RelNode. If any
+    // of the base table fields need to be grabbed out, then a 'join' needs to be done
+    // and we need the lateral view.
+    //
+ //
Review Comment:
The comment seems misplaced here. Maybe it is also a bit redundant given the javadoc in the `canOptimizeOutLateralView` method.
##########
ql/src/java/org/apache/hadoop/hive/ql/parse/relnodegen/LateralViewPlan.java:
##########
@@ -114,7 +115,7 @@ public LateralViewPlan(ASTNode lateralView, RelOptCluster cluster, RelNode input
     this.lateralViewRel = HiveTableFunctionScan.create(cluster,
         TraitsUtil.getDefaultTraitSet(cluster), ImmutableList.of(inputRel), udtfCall,
-        null, retType, null);
+        null, retType, createColumnMappings(inputRel));
Review Comment:
I was thinking that there is also a third category:
3. Columns from the `inputRel` that are transformed by the UDTF and appear in the result (something that would enable the `derived` attribute in `RelColumnMapping`).
Anyway, I don't fully understand how `RelColumnMapping` is supposed to work, so I suppose we can live with this mapping for the moment.
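For the record, this is the kind of query I had in mind for the third category (the table `t` and its columns are hypothetical, not taken from the patch):

```sql
-- myCol is produced by applying the UDTF to an input column (t.arr),
-- so it is "derived" from the input rather than copied or unrelated;
-- a RelColumnMapping with derived=true could in principle describe it.
SELECT t.id, x.myCol
FROM t LATERAL VIEW explode(t.arr) x AS myCol;
```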
##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/translator/PlanModifierForASTConv.java:
##########
@@ -382,6 +387,11 @@ private static boolean validExchangeChild(HiveSortExchange sortNode) {
     return sortNode.getInput() instanceof Project;
   }
+  private static boolean validTableFunctionScanChild(HiveTableFunctionScan htfsNode) {
+    return htfsNode.getInputs().size() == 1 &&
+        (htfsNode.getInput(0) instanceof Project || htfsNode.getInput(0) instanceof HiveTableScan);
Review Comment:
Thanks for the explanation. I don't fully understand it either, but now I see the pattern, so the changes look reasonable.
##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveFilterTableFunctionTransposeRule.java:
##########
@@ -0,0 +1,157 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hive.ql.optimizer.calcite.rules;
+
+import org.apache.calcite.plan.RelOptRule;
+import org.apache.calcite.plan.RelOptRuleCall;
+import org.apache.calcite.plan.RelOptUtil;
+import org.apache.calcite.rel.RelNode;
+import org.apache.calcite.rel.core.Filter;
+import org.apache.calcite.rel.metadata.RelColumnMapping;
+import org.apache.calcite.rex.RexCall;
+import org.apache.calcite.rex.RexNode;
+import org.apache.calcite.rex.RexUtil;
+import org.apache.calcite.tools.RelBuilder;
+import org.apache.calcite.tools.RelBuilderFactory;
+import org.apache.commons.collections4.CollectionUtils;
+import org.apache.hadoop.hive.ql.exec.FunctionRegistry;
+import org.apache.hadoop.hive.ql.optimizer.calcite.HiveCalciteUtil;
+import org.apache.hadoop.hive.ql.optimizer.calcite.HiveRelFactories;
+import org.apache.hadoop.hive.ql.optimizer.calcite.reloperators.HiveFilter;
+import org.apache.hadoop.hive.ql.optimizer.calcite.reloperators.HiveTableFunctionScan;
+
+import com.google.common.base.Preconditions;
+import com.google.common.collect.ImmutableList;
+
+import java.util.ArrayList;
+import java.util.List;
+import java.util.Set;
+
+/**
+ * Rule to transpose Filter and TableFunctionScan RelNodes.
+ *
+ * This differs from the FilterTableFunctionTransposeRule for a number of reasons:
+ * - We check if the filter condition is deterministic under Hive rules. We cannot
+ *   push filters that aren't deterministic.
+ * - We check if the filter only references columns that are not part of the udtf.
+ *   If the input ref in the filter references a column that is generated by the
+ *   udtf, the filter cannot be pushed through.
Review Comment:
I get the impression that `FilterTableFunctionTransposeRule` also does the same check using column mappings.
I would also suggest isolating this transformation in a separate patch if that is possible. The reasoning is the same: less code to review, faster to merge.
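To illustrate the two checks with hypothetical queries over a table `t(id INT, arr ARRAY<INT>)` (table and columns are made up for the example):

```sql
-- Filter on a base-table column only: deterministic and independent of the
-- UDTF output, so it can be pushed below the table function scan.
SELECT id, myCol FROM t LATERAL VIEW explode(arr) x AS myCol WHERE id > 10;

-- Filter referencing the UDTF-generated column: cannot be pushed through,
-- since myCol does not exist below the table function scan.
SELECT id, myCol FROM t LATERAL VIEW explode(arr) x AS myCol WHERE myCol > 10;

-- Non-deterministic filter: cannot be pushed either, because pushing it below
-- the UDTF changes how many times rand() is evaluated per input row.
SELECT id, myCol FROM t LATERAL VIEW explode(arr) x AS myCol WHERE rand() < 0.5;
```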
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]