soumyakanti3578 commented on code in PR #5196:
URL: https://github.com/apache/hive/pull/5196#discussion_r1638520483
##########
iceberg/iceberg-handler/src/test/results/positive/delete_iceberg_mixed.q.out:
##########
@@ -84,17 +84,17 @@ Stage-4
<-Reducer 2 [CONTAINS]
File Output Operator [FS_46]
table:{"name:":"default.ice01"}
- Select Operator [SEL_44] (rows=3 width=295)
+ Select Operator [SEL_44] (rows=3 width=266)
Output:["_col0","_col1","_col2","_col3","_col4","_col5"]
- Merge Join Operator [MERGEJOIN_43] (rows=3 width=295)
+ Merge Join Operator [MERGEJOIN_43] (rows=3 width=266)
Conds:RS_57._col4=RS_63._col0(Left Semi),Output:["_col0","_col1","_col2","_col3","_col4","_col5"]
<-Map 1 [SIMPLE_EDGE] vectorized
SHUFFLE [RS_57]
PartitionCols:_col4
- Select Operator [SEL_55] (rows=5 width=295)
+ Select Operator [SEL_55] (rows=6 width=280)
Review Comment:
It looks like there should be 6 rows instead of 5.
Here's the insert statement: `insert into ice01 values (1, 'ABC'),(2, 'CBS'),(3, null),(4, 'POPI'),(5, 'AQWR'),(6, 'POIU'),(9, null),(8, 'POIKL'),(10, 'YUIO');`
And the delete predicate is `where id>4 OR id=2`, which gives us 6 rows, with ids 2, 5, 6, 9, 8, 10.
Also, I believe the row counts here are just estimates based on stats? I could be wrong, though.
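Quick sanity check of that count (a throwaway sketch, not part of the patch):
```
import java.util.Arrays;
import java.util.List;

class RowCountCheck {
  public static void main(String[] args) {
    // ids from the insert statement above
    List<Integer> ids = Arrays.asList(1, 2, 3, 4, 5, 6, 9, 8, 10);
    // the delete predicate: id > 4 OR id = 2
    long matched = ids.stream().filter(id -> id > 4 || id == 2).count();
    System.out.println(matched); // prints 6
  }
}
```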
##########
packaging/pom.xml:
##########
@@ -173,6 +173,10 @@
\Qhttps://www.eclipse.org/org/documents/epl-v10.php\E
\Qhttp://www.eclipse.org/org/documents/epl-v10.php\E
</EPL-1.0>
+ <EPL-2.0>
+    \Qhttps://raw.githubusercontent.com/locationtech/jts/master/LICENSE_EPLv2.txt\E
+ \Qhttps://www.eclipse.org/org/documents/epl-2.0/EPL-2.0.txt\E
+ </EPL-2.0>
Review Comment:
Without this, the build fails with:
`21:54:57 [ERROR] Failed to execute goal org.codehaus.mojo:license-maven-plugin:2.3.0:download-licenses (license-fetch) on project hive-packaging: Execution license-fetch of goal org.codehaus.mojo:license-maven-plugin:2.3.0:download-licenses failed: URL 'https://raw.githubusercontent.com/locationtech/jts/master/LICENSE_EPLv2.txt' should belong to licenseUrlFileName having key 'eclipse-public-license-v.-2.0-epl-2.0.txt' together with URLs 'https://www.eclipse.org/org/documents/epl-2.0/EPL-2.0.txt' -> [Help 1]`
I'm not sure if there's another way to resolve this.
##########
ql/src/java/org/apache/hadoop/hive/ql/parse/type/HiveFunctionHelper.java:
##########
@@ -322,18 +324,24 @@ public RexNode getExpression(String functionText, FunctionInfo fi,
// unix_timestamp(args) -> to_unix_timestamp(args)
calciteOp = HiveToUnixTimestampSqlOperator.INSTANCE;
}
- expr = rexBuilder.makeCall(returnType, calciteOp, inputs);
+ expr = getExpression(returnType, calciteOp, inputs);
}
if (expr instanceof RexCall && !expr.isA(SqlKind.CAST)) {
RexCall call = (RexCall) expr;
expr = rexBuilder.makeCall(returnType, call.getOperator(),
RexUtil.flatten(call.getOperands(), call.getOperator()));
}
-
return expr;
}
+ private RexNode getExpression(RelDataType returnType, SqlOperator calciteOp, List<RexNode> inputs) {
+ if (Objects.requireNonNull(calciteOp.getKind()) == SqlKind.IN) {
Review Comment:
I think the non-null check was added by mistake; it should work without it. I will remove it.
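Concretely, something like this is what I have in mind (a sketch; the IN-specific handling from the PR is elided, and the fallback mirrors the old inline `rexBuilder.makeCall` call):
```
private RexNode getExpression(RelDataType returnType, SqlOperator calciteOp, List<RexNode> inputs) {
  // SqlOperator#getKind() never returns null, so the Objects.requireNonNull wrapper can simply be dropped
  if (calciteOp.getKind() == SqlKind.IN) {
    // ... IN-specific handling from the PR, unchanged and elided here ...
  }
  return rexBuilder.makeCall(returnType, calciteOp, inputs);
}
```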
##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveFilterSortPredicates.java:
##########
@@ -198,6 +203,16 @@ public Double visitCall(RexCall call) {
return null;
}
+ if (call.getKind() == SqlKind.SEARCH) {
+ RexCall expandedCall = (RexCall) RexUtil.expandSearch(
+ new RexBuilder(new JavaTypeFactoryImpl(new HiveTypeSystemImpl())),
Review Comment:
I will do this.
##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HivePointLookupOptimizerRule.java:
##########
@@ -179,7 +178,7 @@ protected HivePointLookupOptimizerRule(
this.minNumORClauses = minNumORClauses;
}
- public RexNode analyzeRexNode(RexBuilder rexBuilder, RexNode condition) {
+ public static RexNode analyzeRexNode(RexBuilder rexBuilder, RexNode condition, int minNumORClauses) {
Review Comment:
Within `HivePointLookupOptimizerRule#analyzeRexNode` we try to merge IN expressions and create BETWEEN/NOT BETWEEN where possible.
Right now, with Calcite 1.33, there is no way to create these expressions directly, as both IN and BETWEEN are replaced with SEARCH.
In a few places in this PR we therefore convert SEARCH back to IN and BETWEEN. Whenever `sarg.isPoints()` returns false, we can use `analyzeRexNode` to convert back to BETWEEN or NOT BETWEEN, and for that I had to make it static. Here's a snippet from `ASTConverter`, for example:
```
case SEARCH:
  ASTNode astNode = call.getOperands().get(0).accept(this);
  astNodeLst.add(astNode);
  RexLiteral literal = (RexLiteral) call.operands.get(1);
  Sarg<?> sarg = Objects.requireNonNull(literal.getValueAs(Sarg.class), "Sarg");
  int minOrClauses = SessionState.getSessionConf().getIntVar(HiveConf.ConfVars.HIVE_POINT_LOOKUP_OPTIMIZER_MIN);
  // convert Sarg to IN when they are points.
  if (sarg.isPoints()) {
    // just expand SEARCH to ORs when point count is less than HIVE_POINT_LOOKUP_OPTIMIZER_MIN
    if (sarg.pointCount < minOrClauses) {
      return visitCall((RexCall) call.accept(RexUtil.searchShuttle(rexBuilder, null, -1)));
    }
    // else convert to IN
    for (Range<?> range : sarg.rangeSet.asRanges()) {
      astNodeLst.add(visitLiteral((RexLiteral) rexBuilder.makeLiteral(
          range.lowerEndpoint(), literal.getType(), true, true)));
    }
    ASTNode inAST = SqlFunctionConverter.buildAST(HiveIn.INSTANCE, astNodeLst, call.getType());
    if (sarg.nullAs == RexUnknownAs.UNKNOWN) {
      return inAST;
    } else if (sarg.nullAs == RexUnknownAs.TRUE) {
      ASTNode isNull = visitCall((RexCall) rexBuilder.makeCall(SqlStdOperatorTable.IS_NULL, call.getOperands().get(0)));
      return SqlFunctionConverter.buildAST(SqlStdOperatorTable.OR, Arrays.asList(isNull, inAST), call.getType());
    } else {
      ASTNode isNotNull = visitCall((RexCall) rexBuilder.makeCall(SqlStdOperatorTable.IS_NOT_NULL, call.getOperands().get(0)));
      return SqlFunctionConverter.buildAST(SqlStdOperatorTable.AND, Arrays.asList(isNotNull, inAST), call.getType());
    }
  // Expand SEARCH operator
  } else {
    return visitCall((RexCall) HivePointLookupOptimizerRule.analyzeRexNode(
        rexBuilder,
        call.accept(RexUtil.searchShuttle(rexBuilder, null, -1)),
        minOrClauses));
  }
```
##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/HiveCalciteUtil.java:
##########
@@ -656,7 +658,9 @@ public static ImmutableList<RexNode> getPredsNotPushedAlready(Set<String> predic
}
// Build map to not convert multiple times, further remove already included predicates
Map<String,RexNode> stringToRexNode = Maps.newLinkedHashMap();
+ RexSimplify simplify = new RexSimplify(inp.getCluster().getRexBuilder(), RelOptPredicateList.EMPTY, RexUtil.EXECUTOR);
for (RexNode r : predsToPushDown) {
+ r = simplify.simplify(r);
Review Comment:
Without this we run into a StackOverflowError: https://github.com/apache/hive/pull/5196/commits/7f97e5aea3fd3370350a29606853c1c381792a4c
I checked again and we still hit the issue without it. Maybe we need to think of an alternative solution, given that I also had to exclude `HivePreFilteringRule` from a test because of a stack overflow.
##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/jdbc/HiveJdbcImplementor.java:
##########
@@ -128,6 +129,11 @@ public HiveJdbcImplementor(SqlDialect dialect, JavaTypeFactory typeFactory) {
return result(join, leftResult, rightResult);
}
+public Result visit(JdbcTableScan scan) {
+ return result(scan.jdbcTable.tableName(),
+ ImmutableList.of(Clause.FROM), scan, null);
+}
+
Review Comment:
This is required because `visit(JdbcTableScan scan)` was removed from Calcite in https://issues.apache.org/jira/browse/CALCITE-4640 (https://github.com/apache/calcite/pull/2426/files#diff-f118416cd59a4ba4e96ffcfa0d2b496dd2b372c6f5be01f7bb4d46996b9175d1).
We get a lot of errors without it.
##########
ql/src/test/queries/clientpositive/cardinality_preserving_join_opt2.q:
##########
@@ -2,6 +2,7 @@ set hive.support.concurrency=true;
set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
set hive.strict.checks.cartesian.product=false;
set hive.stats.fetch.column.stats=true;
+set hive.cbo.rule.exclusion.regex=HivePreFilteringRule;
Review Comment:
No, this should definitely not be excluded in production. Unfortunately, without the exclusion this test runs for a long time and then throws a StackOverflowError.
I excluded it just so that we can get past this error and get a green run. If possible, please look into this test failure, as I wasn't able to resolve it:
http://ci.hive.apache.org/job/hive-precommit/job/PR-5196/27/testReport/junit/org.apache.hadoop.hive.cli.split3/TestMiniLlapLocalCliDriver/Testing___split_22___PostProcess___testCliDriver_cardinality_preserving_join_opt2_/
##########
ql/src/test/org/apache/hadoop/hive/ql/optimizer/calcite/stats/TestFilterSelectivityEstimator.java:
##########
@@ -508,8 +492,19 @@ public void testComputeRangePredicateSelectivityBetweenWithNULLS() {
public void testComputeRangePredicateSelectivityNotBetweenWithNULLS() {
doReturn((double) 20).when(tableMock).getRowCount();
doReturn(Collections.singletonList(stats)).when(tableMock).getColStat(Collections.singletonList(0));
- RexNode filter = REX_BUILDER.makeCall(SqlStdOperatorTable.BETWEEN, boolTrue, inputRef0, int1, int3);
+ RexNode filter = makeNotBetween(inputRef0, int1, int3);
FilterSelectivityEstimator estimator = new FilterSelectivityEstimator(scan, mq);
Assert.assertEquals(0.55, estimator.estimateSelectivity(filter), DELTA);
}
+
+ private RexNode makeNotBetween(RexNode inputRef, RexNode left, RexNode right) {
+ RexNode withOr = REX_BUILDER.makeCall(
+ SqlStdOperatorTable.OR,
+ REX_BUILDER.makeCall(SqlStdOperatorTable.LESS_THAN, inputRef, left),
+ REX_BUILDER.makeCall(SqlStdOperatorTable.GREATER_THAN, inputRef, right)
+ );
+ RexSimplify simplify = new RexSimplify(REX_BUILDER, RelOptPredicateList.EMPTY, RexUtil.EXECUTOR);
+
+ return simplify.simplify(withOr);
Review Comment:
With what you have suggested, it creates a RexCall like `NOT(SEARCH...)`, which goes to `computeNotEqualitySelectivity` in `FilterSelectivityEstimator`. maxNDV is computed there, but 1.0 is always returned when the value is less than 1. That gives a wrong result even if we add an else-if block in `getMaxNDV`:
```
else if (op.isA(SqlKind.SEARCH)) {
tmpNDV = op.accept(this);
if (tmpNDV == null) {
return null;
}
if (tmpNDV > maxNDV) {
maxNDV = tmpNDV;
}
} else {
irv = new InputReferencedVisitor();
irv.apply(op);
...
...
```
With my patch, it creates a RexCall with the SEARCH operator, without the NOT, and that matches `case SEARCH` directly.
Moreover, we cannot create a BETWEEN with `makeBetween` when `left > right`, so this test would fail with your suggestion:
```
public void testComputeRangePredicateSelectivityNotBetweenRightLowerThanLeft() {
  RexNode filter = makeNotBetween(inputRef0, int5, int3);
  FilterSelectivityEstimator estimator = new FilterSelectivityEstimator(scan, mq);
  Assert.assertEquals(1, estimator.estimateSelectivity(filter), DELTA);
}
```
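For contrast, the two constructions side by side (a sketch reusing the test's existing `REX_BUILDER`, `inputRef0`, `int1`, `int3` fields):
```
// suggested shape: wrap makeBetween in NOT -> roughly NOT(SEARCH($0, Sarg[[1..3]])),
// which FilterSelectivityEstimator routes to computeNotEqualitySelectivity
RexNode suggested = REX_BUILDER.makeCall(SqlStdOperatorTable.NOT,
    REX_BUILDER.makeBetween(inputRef0, int1, int3));

// shape produced by makeNotBetween above -> roughly SEARCH($0, Sarg[(-inf..1), (3..+inf)]),
// which lands directly in the `case SEARCH` branch
RexNode fromPatch = makeNotBetween(inputRef0, int1, int3);
```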
Let me know what you think about this.
##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java:
##########
@@ -171,7 +176,36 @@ public Double visitCall(RexCall call) {
break;
}
- default:
+ case SEARCH:
+ RexLiteral literal = (RexLiteral) call.operands.get(1);
+ Sarg<?> sarg = Objects.requireNonNull(literal.getValueAs(Sarg.class), "Sarg");
+ RexBuilder rexBuilder = new RexBuilder(new JavaTypeFactoryImpl(new HiveTypeSystemImpl()));
Review Comment:
I will change this.
##########
ql/src/test/org/apache/hadoop/hive/ql/io/TestSymlinkTextInputFormat.java:
##########
@@ -157,7 +157,7 @@ public void testCombine() throws Exception {
drv.run(loadFileCommand);
- String cmd = "select key*1 from " + tblName;
+ String cmd = "select key*2 from " + tblName;
Review Comment:
I saw that the final logical plan with Calcite 1.33 is a bit more reduced than what was produced with 1.25:
```
1.33 - select key*1 ...
DEBUG [main] translator.PlanModifierForASTConv: Final plan after modifier
HiveProject(_c0=[$0])
  HiveTableScan(table=[[default, text_symlink_text]], table:alias=[text_symlink_text])
1.25 - select key*1 ...
DEBUG [main] translator.PlanModifierForASTConv: Final plan after modifier
HiveProject(_c0=[*($0, 1)])
  HiveTableScan(table=[[default, text_symlink_text]], table:alias=[text_symlink_text])
```
I think `key * 1` is getting reduced to just `key` in 1.33. If we change it
to `key*2` then the test passes:
```
1.33 - select key*2 ...
DEBUG [main] translator.PlanModifierForASTConv: Final plan after modifier
HiveProject(_c0=[*($0, 2)])
  HiveTableScan(table=[[default, text_symlink_text]], table:alias=[text_symlink_text])
```
Since this test is about the symlink text input format and is not specifically testing `key * 1`, I changed it to `key * 2`.
Moreover, in 1.25, when we run the same test with `select key ...`, it fails with the same error.
##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/reloperators/HiveFilter.java:
##########
@@ -72,6 +78,22 @@ public HiveFilter(RelOptCluster cluster, RelTraitSet traits, RelNode child, RexN
super(cluster, TraitsUtil.getDefaultTraitSet(cluster), child, condition);
this.correlationInfos = new CorrelationInfoSupplier(getCondition());
}
+
+ @Override
+ public RelWriter explainTerms(RelWriter pw) {
+ // Remove this method after upgrading Calcite to 1.35+
Review Comment:
Is it a good idea to write project properties with `properties-maven-plugin` to `ql/src/main/resources/`, and read the Calcite version from there to compare? That plugin is already in use in `itests`.
If we just define a constant CALCITE_VERSION = 1.35, the check will never throw once the actual Calcite version moves past it, because the code never sees the real version. But if we read the version directly from the properties file, it will throw an exception and remind us to remove this override.
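Roughly what I'm picturing (a sketch only; the resource name `calcite-version.properties` and the `calcite.version` key are hypothetical and would be whatever `properties-maven-plugin` is configured to write):
```
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

final class CalciteVersionGuard {

  private CalciteVersionGuard() {
  }

  /** Fails loudly once the build moves past Calcite 1.35, as a reminder to drop the override. */
  static void checkStillOnCalcite134OrLower() {
    Properties props = new Properties();
    // hypothetical resource written at build time by properties-maven-plugin
    try (InputStream in = CalciteVersionGuard.class.getResourceAsStream("/calcite-version.properties")) {
      if (in == null) {
        return; // resource not packaged, skip the check
      }
      props.load(in);
    } catch (IOException e) {
      return; // cannot read the file, skip the check
    }
    String version = props.getProperty("calcite.version", "0.0");
    // naive major.minor parse, fine for a sketch
    String[] parts = version.split("\\.");
    int major = Integer.parseInt(parts[0]);
    int minor = parts.length > 1 ? Integer.parseInt(parts[1]) : 0;
    if (major > 1 || (major == 1 && minor >= 35)) {
      throw new IllegalStateException(
          "Calcite is now " + version + "; remove HiveFilter#explainTerms and this check");
    }
  }
}
```
That way the check starts failing on the first build against a newer Calcite instead of silently going stale.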