zabetak commented on code in PR #4442:
URL: https://github.com/apache/hive/pull/4442#discussion_r1317011777
##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/translator/ASTConverter.java:
##########
@@ -621,6 +644,80 @@ private ASTNode pkFkHint(int fkTableIndex, boolean nonFkSideIsFiltered) {
}
}
+ private ASTNode createASTLateralView(TableFunctionScan tfs, Schema s,
Review Comment:
If it's static, let's declare it explicitly.
##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/translator/ASTConverter.java:
##########
@@ -383,8 +386,16 @@ private ASTNode convert() throws CalciteSemanticException {
ASTBuilder sel = ASTBuilder.construct(HiveParser.TOK_SELEXPR, "TOK_SELEXPR");
ASTNode function = buildUDTFAST(call.getOperator().getName(), children);
sel.add(function);
- for (String alias : udtf.getRowType().getFieldNames()) {
- sel.add(HiveParser.Identifier, alias);
+
+ // When we are treating a UDTF as a 'select', it is not a lateral view.
+ // In this case, it means that only the fields from the UDTF are selected
+ // out of the RelNode and placed into the SELEXPR. So we want to ignore
+ // any field in the inputRef mapping.
+ List<String> fields = udtf.getRowType().getFieldNames();
+ for (int i = 0; i < udtf.getRowType().getFieldCount(); ++i) {
+ if (!udtf.containsInputRefMapping(i)) {
Review Comment:
With the newly introduced `lateral` operator, this `if` check and the related code checking for containment may be redundant. Can we get here with more fields than necessary?
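If it is indeed redundant, the loop could presumably collapse back to something like this (a sketch, valid only if every field of the UDTF row type should then be emitted):
```java
// hypothetical simplification, assuming the lateral operator guarantees
// that no inputRef-mapped fields can reach this point
for (String alias : udtf.getRowType().getFieldNames()) {
  sel.add(HiveParser.Identifier, alias);
}
```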
##########
ql/src/test/results/clientnegative/udf_assert_true2.q.out:
##########
@@ -36,7 +36,7 @@ STAGE PLANS:
Limit
Number of rows: 2
Select Operator
- expressions: (1 + assert_true((_col6 < 2))) (type: int)
+ expressions: (1 + UDFToInteger(assert_true((_col6 < 2)))) (type: int)
Review Comment:
What caused this change? It looks a bit weird.
##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveFilterTableFunctionTransposeRule.java:
##########
@@ -0,0 +1,157 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hive.ql.optimizer.calcite.rules;
+
+import org.apache.calcite.plan.RelOptRule;
+import org.apache.calcite.plan.RelOptRuleCall;
+import org.apache.calcite.plan.RelOptUtil;
+import org.apache.calcite.rel.RelNode;
+import org.apache.calcite.rel.core.Filter;
+import org.apache.calcite.rel.metadata.RelColumnMapping;
+import org.apache.calcite.rex.RexCall;
+import org.apache.calcite.rex.RexNode;
+import org.apache.calcite.rex.RexUtil;
+import org.apache.calcite.tools.RelBuilder;
+import org.apache.calcite.tools.RelBuilderFactory;
+import org.apache.commons.collections4.CollectionUtils;
+import org.apache.hadoop.hive.ql.exec.FunctionRegistry;
+import org.apache.hadoop.hive.ql.optimizer.calcite.HiveCalciteUtil;
+import org.apache.hadoop.hive.ql.optimizer.calcite.HiveRelFactories;
+import org.apache.hadoop.hive.ql.optimizer.calcite.reloperators.HiveFilter;
+import org.apache.hadoop.hive.ql.optimizer.calcite.reloperators.HiveTableFunctionScan;
+
+import com.google.common.base.Preconditions;
+import com.google.common.collect.ImmutableList;
+
+import java.util.ArrayList;
+import java.util.List;
+import java.util.Set;
+
+/**
+ * Rule to transpose Filter and TableFunctionScan RelNodes.
+ *
+ * This differs from the FilterTableFunctionTransposeRule for a number of reasons:
+ * - We check if the filter condition is deterministic under Hive rules. We cannot
+ *   push filters that aren't deterministic.
+ * - We check if the filter only references columns that are not part of the udtf.
+ *   If the input ref in the filter references a column that is generated by the
+ *   udtf, the filter cannot be pushed through.
Review Comment:
Thanks for the analysis. Please do the following:
1. Log a CALCITE ticket reporting the problem with `Logical` operators in `FilterTableFunctionTransposeRule`.
2. Add an entry in `org.apache.hadoop.hive.ql.optimizer.calcite.Bug`.
3. Reference the new entry somewhere in the rule, similarly to how the entries are referenced elsewhere.
4. Update the Javadoc since, as far as I understand, this is the main reason we cannot use the respective Calcite rule.
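For points 2 and 3, a rough sketch of what I have in mind (`XXXXX` is a placeholder for the yet-to-be-logged CALCITE ticket number):
```java
// In org.apache.hadoop.hive.ql.optimizer.calcite.Bug:
/** Whether CALCITE-XXXXX (the Logical-operator problem in
 * FilterTableFunctionTransposeRule) is fixed. */
public static final boolean CALCITE_XXXXX_FIXED = false;

// In HiveFilterTableFunctionTransposeRule, reference the entry so the
// workaround is easy to find and remove once the upstream fix lands, e.g.:
assert !Bug.CALCITE_XXXXX_FIXED
    : "Remove this rule and use Calcite's FilterTableFunctionTransposeRule";
```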
##########
ql/src/test/queries/clientpositive/lateral_view.q:
##########
@@ -10,6 +10,11 @@ EXPLAIN SELECT myTable.* FROM src LATERAL VIEW explode(array(1,2,3)) myTable AS
 EXPLAIN SELECT myTable.myCol, myTable2.myCol2 FROM src LATERAL VIEW explode(array(1,2,3)) myTable AS myCol LATERAL VIEW explode(array('a', 'b', 'c')) myTable2 AS myCol2 LIMIT 9;
 EXPLAIN SELECT myTable2.* FROM src LATERAL VIEW explode(array(array(1,2,3))) myTable AS myCol LATERAL VIEW explode(myTable.myCol) myTable2 AS myCol2 LIMIT 3;
+EXPLAIN CBO SELECT * FROM src LATERAL VIEW explode(array(1,2,3)) myTable AS myCol SORT BY key ASC, myCol ASC LIMIT 1;
Review Comment:
If the goal of adding `EXPLAIN CBO` in every `LATERAL VIEW` query was to ensure that CBO doesn't fail, then maybe we could enforce this differently (e.g., `set hive.cbo.fallback.strategy=never`).
##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/translator/ASTConverter.java:
##########
@@ -621,6 +644,80 @@ private ASTNode pkFkHint(int fkTableIndex, boolean nonFkSideIsFiltered) {
}
}
+  private ASTNode createASTLateralView(TableFunctionScan tfs, Schema s,
+      QueryBlockInfo tableFunctionSource, String sqAlias) {
+    // The structure of the AST LATERAL VIEW will be:
+    //
+    // TOK_LATERAL_VIEW
+    //   TOK_SELECT
+    //     TOK_SELEXPR
+    //       TOK_FUNCTION
+    //         <udtf func>
+    //         ...
+    //       <col alias for function>
+    //       TOK_TABALIAS
+    //         <table alias for lateral view>
+
+    // set up the select for the parameters of the UDTF
+    List<ASTNode> children = new ArrayList<>();
+    // The UDTF function call within the table function scan will be of the form:
+    // lateral(my_udtf_func(...), $0, $1, ...). For recreating the AST, we need
+    // the inner "my_udtf_func".
+    RexCall lateralCall = (RexCall) tfs.getCall();
+    RexCall call = (RexCall) lateralCall.getOperands().get(0);
+    for (RexNode rn : call.getOperands()) {
+      ASTNode expr = rn.accept(new RexVisitor(s, rn instanceof RexLiteral,
+          select.getCluster().getRexBuilder()));
+      children.add(expr);
+    }
+    ASTNode function = buildUDTFAST(call.getOperator().getName(), children);
+
+    // Add the function to the SELEXPR
+    ASTBuilder selexpr = ASTBuilder.construct(HiveParser.TOK_SELEXPR, "TOK_SELEXPR");
+    selexpr.add(function);
+
+    // Add only the table function generated columns to the select expr for the function,
+    // skipping over the base table columns from the input side of the join.
+    int i = 0;
+    for (ColumnInfo c : s) {
+      if (i++ < tableFunctionSource.schema.size()) {
+        continue;
+      }
+      selexpr.add(HiveParser.Identifier, c.column);
+    }
+    // add the table alias for the lateral view.
+    ASTBuilder tabAlias = ASTBuilder.construct(HiveParser.TOK_TABALIAS, "TOK_TABALIAS");
+    tabAlias.add(HiveParser.Identifier, sqAlias);
+
+    // add the table alias to the SELEXPR
+    selexpr.add(tabAlias.node());
+
+    // create the SELECT clause
+    ASTBuilder sel = ASTBuilder.construct(HiveParser.TOK_SELECT, "TOK_SELECT");
+    sel.add(selexpr.node());
+
+    // place the SELECT clause under the LATERAL VIEW clause
+    ASTBuilder lateralview = ASTBuilder.construct(HiveParser.TOK_LATERAL_VIEW, "TOK_LATERAL_VIEW");
+    lateralview.add(sel.node());
+
+    // finally, add the LATERAL VIEW clause under the left side source, which is the base table.
+    lateralview.add(tableFunctionSource.ast);
+
+    return lateralview.node();
+  }
+
+  /**
+   * Check to see if we can optimize out the lateral view operators.
+   * We do not need to use the lateral view syntax if all of the fields
+   * selected out of the table scan come from the UDTF call. No join
+   * is needed because all the fields come from the table level rather
+   * than the row level.
+   */
Review Comment:
The comment seems obsolete since the method does not check any fields.
##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/translator/ASTConverter.java:
##########
@@ -577,6 +588,18 @@ private QueryBlockInfo convertSource(RelNode r) throws CalciteSemanticException
ast = ASTBuilder.subQuery(left, sqAlias);
s = new Schema((Union) r, sqAlias);
}
+ } else if (r instanceof HiveTableFunctionScan &&
Review Comment:
We can put the `instanceof` check inside the `isLateralView` method and simplify the code here to just `isLateralView(r)`.
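Something like this (a sketch; `isLateralOperator` is a placeholder name for whatever the remaining condition on the right of the `&&` currently is):
```java
private static boolean isLateralView(RelNode r) {
  return r instanceof HiveTableFunctionScan
      && isLateralOperator((HiveTableFunctionScan) r); // placeholder for the existing check
}
```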
##########
ql/src/test/results/clientpositive/llap/tablevalues.q.out:
##########
@@ -175,14 +175,27 @@ STAGE PLANS:
         TableScan
           alias: mytbl_n1
           Select Operator
-            expressions: array(struct(key,value,'A',10,key),struct(key,value,'B',20,key)) (type: array<struct<col1:string,col2:string,col3:string,col4:int,col5:string>>)
+            expressions: key (type: string)
             outputColumnNames: _col0
-            UDTF Operator
-              function name: inline
+            Lateral View Forward
               Select Operator
-                expressions: col3 (type: string), col4 (type: int), col5 (type: string)
-                outputColumnNames: _col0, _col1, _col2
-                ListSink
+                Lateral View Join Operator
+                  outputColumnNames: _col2, _col3, _col4
+                  Select Operator
+                    expressions: _col2 (type: string), _col3 (type: int), _col4 (type: string)
+                    outputColumnNames: _col0, _col1, _col2
+                    ListSink
+              Select Operator
+                expressions: array(struct('A',10,_col0),struct('B',20,_col0)) (type: array<struct<col1:string,col2:int,col3:string>>)
+                outputColumnNames: _col0
+                UDTF Operator
+                  function name: inline
+                  Lateral View Join Operator
+                    outputColumnNames: _col2, _col3, _col4
+                    Select Operator
+                      expressions: _col2 (type: string), _col3 (type: int), _col4 (type: string)
+                      outputColumnNames: _col0, _col1, _col2
+                      ListSink
Review Comment:
This seems like a regression. Probably it was caused by the introduction of
lateral in `HiveTableFunctionScan`. Do we need to revisit that choice?
##########
ql/src/java/org/apache/hadoop/hive/ql/parse/CalcitePlanner.java:
##########
@@ -1024,30 +1022,8 @@ boolean isCBOExecuted() {
   @Override
   boolean isCBOSupportedLateralView(ASTNode lateralView) {
-    // Lateral view AST has the following shape:
-    // ^(TOK_LATERAL_VIEW
-    //     ^(TOK_SELECT ^(TOK_SELEXPR ^(TOK_FUNCTION Identifier params) identifier* tableAlias)))
-    if (lateralView.getToken().getType() == HiveParser.TOK_LATERAL_VIEW_OUTER) {
-      // LATERAL VIEW OUTER not supported in CBO
-      return false;
-    }
-    // Only INLINE followed by ARRAY supported in CBO
-    ASTNode lvFunc = (ASTNode) lateralView.getChild(0).getChild(0).getChild(0);
-    String lvFuncName = lvFunc.getChild(0).getText();
-    if (lvFuncName.compareToIgnoreCase(
-        GenericUDTFInline.class.getAnnotation(Description.class).name()) != 0) {
-      return false;
-    }
-    if (lvFunc.getChildCount() != 2) {
-      return false;
-    }
Review Comment:
I believe the `if (lvFunc.getChildCount() != 2)` check was redundant anyway, since by this line we have already checked that the function is `GenericUDTFInline`, and by definition it cannot have more than one input.
Since here we are reasoning at the AST level, I believe the first child corresponds to the function name token and the second child to the token representing the input argument to the inline function.
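To illustrate (the shape is inferred from the grammar comment above, not verified against actual parser output):
```java
// AST subtree for the function in `LATERAL VIEW inline(array(...)) t AS c1, c2`:
//
// TOK_FUNCTION
//   Identifier["inline"]   <- child 0: the function name token
//   TOK_FUNCTION           <- child 1: the single array(...) argument
//     Identifier["array"]
//     ...
//
// Once the name is known to be "inline", getChildCount() == 2 follows by definition.
```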
##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveOptimizeInlineArrayTableFunctionRule.java:
##########
@@ -124,9 +125,12 @@ public void onMatch(RelOptRuleCall call) {
     RexNode newInlineCall =
         cluster.getRexBuilder().makeCall(tfs.getRowType(), inlineCall.op, newArrayCall);
+    // Use an empty set for columnMappings. The return row type of the RelNode now
+    // comprises all the fields within the UDTF, so there is no mapping from the
+    // output fields directly to the input fields anymore.
     final RelNode newTableFunctionScanNode = tfs.copy(tfs.getTraitSet(),
         tfs.getInputs(), newInlineCall, tfs.getElementType(), tfs.getRowType(),
-        tfs.getColumnMappings());
+        Collections.emptySet());
     call.transformTo(newTableFunctionScanNode);
Review Comment:
Does this rule still work as-is, now that we are wrapping the table function with the `LATERAL` operator? Do we want/need to add special cases if we have lateral + inline array?
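If a special case turns out to be necessary, perhaps something like this would do (a sketch against the rule's existing `tfs` variable; it assumes the wrapper call reports `SqlKind.LATERAL`, which should be verified against the new operator):
```java
// Unwrap a possible LATERAL wrapper before matching on inline(array(...)).
RexCall top = (RexCall) tfs.getCall();
RexCall inlineCall = top.getKind() == SqlKind.LATERAL
    ? (RexCall) top.getOperands().get(0) // the inner inline(...) call
    : top;                               // no wrapper, use the call directly
```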
##########
ql/src/java/org/apache/hadoop/hive/ql/ppd/OpProcFactory.java:
##########
@@ -504,27 +504,32 @@ public Object process(Node nd, Stack<Node> stack, NodeProcessorCtx procCtx,
       // We try to push the full Filter predicate iff:
       // - the Filter is on top of a TableScan, or
       // - the Filter is on top of a PTF (between PTF and Filter, there might be Select operators)
+      // - the Filter is on top of a LateralViewJoinOperator (the filter can be pushed through one
+      //   side of the join with the base table predicate, but not the UDTF side.)
Review Comment:
I don't fully understand this optimization. The comment says that we want to fully push the filter if it is on top of a LateralViewJoinOperator, but then it seems that we have restrictions on which side of the join we are going to push the filter. Are these restrictions enforced somewhere later on?
##########
ql/src/test/queries/clientpositive/lateral_view.q:
##########
@@ -10,6 +10,11 @@ EXPLAIN SELECT myTable.* FROM src LATERAL VIEW explode(array(1,2,3)) myTable AS
 EXPLAIN SELECT myTable.myCol, myTable2.myCol2 FROM src LATERAL VIEW explode(array(1,2,3)) myTable AS myCol LATERAL VIEW explode(array('a', 'b', 'c')) myTable2 AS myCol2 LIMIT 9;
 EXPLAIN SELECT myTable2.* FROM src LATERAL VIEW explode(array(array(1,2,3))) myTable AS myCol LATERAL VIEW explode(myTable.myCol) myTable2 AS myCol2 LIMIT 3;
+EXPLAIN CBO SELECT * FROM src LATERAL VIEW explode(array(1,2,3)) myTable AS myCol SORT BY key ASC, myCol ASC LIMIT 1;
Review Comment:
I suggested creating a new .q file (`cbo_lateral_views.q`) and adding test cases there, to put more focus on what we really want to test as part of this change. We want to verify that the CBO plan for certain types of queries is the expected one. Determining the types of queries is the whole point of having a dedicated .q file.
**Examples**
Do we care about the LIMIT clause and the actual number there? Probably not, so we don't need to add EXPLAIN CBO for that one.
Do we care about how many times we have `LATERAL VIEW explode(array(1,2,3))` in the query? Probably not, one or two should suffice.
```sql
--Q1
SELECT myTable.myCol FROM src
LATERAL VIEW explode(array(1,2,3)) myTable AS myCol;
--Q2
SELECT myTable.myCol, myTable2.myCol2 FROM src
LATERAL VIEW explode(array(1,2,3)) myTable AS myCol
LATERAL VIEW explode(array(1,2,3)) myTable2 AS myCol2;
--Q3
SELECT myTable.myCol, myTable2.myCol2, myTable3.myCol3 FROM src
LATERAL VIEW explode(array(1,2,3)) myTable AS myCol
LATERAL VIEW explode(array(1,2,3)) myTable2 AS myCol2
LATERAL VIEW explode(array(1,2,3)) myTable3 AS myCol3;
```
If the CBO plan is correct for Q1 and Q2, there is no reason for it to be wrong for Q3, Q4, etc.
Do we care about the actual table function inside the `LATERAL VIEW` clause? Probably not, we don't need to test every possible variation of `explode`, `inline`, etc.
Do we want our tables to have data, and thus use `src` and other similar tables? Probably not.
All that to say that having many EXPLAIN CBO statements does not necessarily increase test coverage, and it has some drawbacks as well. Picking representative queries and putting them in a dedicated file seems like a better alternative.
##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/translator/ASTConverter.java:
##########
@@ -577,6 +588,18 @@ private QueryBlockInfo convertSource(RelNode r) throws CalciteSemanticException
ast = ASTBuilder.subQuery(left, sqAlias);
s = new Schema((Union) r, sqAlias);
}
+ } else if (r instanceof HiveTableFunctionScan &&
Review Comment:
Maybe it suffices to check for the super-class type `TableFunctionScan`
since it is the one used in the cast just below.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.