soumyakanti3578 commented on code in PR #5196:
URL: https://github.com/apache/hive/pull/5196#discussion_r1638520483
##########
iceberg/iceberg-handler/src/test/results/positive/delete_iceberg_mixed.q.out:
##########
@@ -84,17 +84,17 @@ Stage-4
<-Reducer 2 [CONTAINS]
File Output Operator [FS_46]
table:{"name:":"default.ice01"}
- Select Operator [SEL_44] (rows=3 width=295)
+ Select Operator [SEL_44] (rows=3 width=266)
Output:["_col0","_col1","_col2","_col3","_col4","_col5"]
- Merge Join Operator [MERGEJOIN_43] (rows=3 width=295)
+ Merge Join Operator [MERGEJOIN_43] (rows=3 width=266)
Conds:RS_57._col4=RS_63._col0(Left Semi),Output:["_col0","_col1","_col2","_col3","_col4","_col5"]
<-Map 1 [SIMPLE_EDGE] vectorized
SHUFFLE [RS_57]
PartitionCols:_col4
- Select Operator [SEL_55] (rows=5 width=295)
+ Select Operator [SEL_55] (rows=6 width=280)
Review Comment:
It looks like there should be 6 rows instead of 5.
Here's the insert statement: `insert into ice01 values (1, 'ABC'),(2, 'CBS'),(3, null),(4, 'POPI'),(5, 'AQWR'),(6, 'POIU'),(9, null),(8, 'POIKL'),(10, 'YUIO');`
And the delete predicate is `where id>4 OR id=2`, which gives us 6 rows, with ids 2, 5, 6, 9, 8, 10.
Also, I believe the row counts here are just estimates based on stats? I could be wrong, though.
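Quick sanity check of that count (a throwaway sketch, not part of the patch):
```
import java.util.Arrays;
import java.util.List;

class RowCountCheck {
  public static void main(String[] args) {
    // ids from the insert statement above
    List<Integer> ids = Arrays.asList(1, 2, 3, 4, 5, 6, 9, 8, 10);
    // the delete predicate: id > 4 OR id = 2
    long matched = ids.stream().filter(id -> id > 4 || id == 2).count();
    System.out.println(matched); // prints 6
  }
}
```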
##########
packaging/pom.xml:
##########
@@ -173,6 +173,10 @@
\Qhttps://www.eclipse.org/org/documents/epl-v10.php\E
\Qhttp://www.eclipse.org/org/documents/epl-v10.php\E
</EPL-1.0>
+ <EPL-2.0>
+    \Qhttps://raw.githubusercontent.com/locationtech/jts/master/LICENSE_EPLv2.txt\E
+ \Qhttps://www.eclipse.org/org/documents/epl-2.0/EPL-2.0.txt\E
+ </EPL-2.0>
Review Comment:
Without this, the build fails with:
`21:54:57 [ERROR] Failed to execute goal org.codehaus.mojo:license-maven-plugin:2.3.0:download-licenses (license-fetch) on project hive-packaging: Execution license-fetch of goal org.codehaus.mojo:license-maven-plugin:2.3.0:download-licenses failed: URL 'https://raw.githubusercontent.com/locationtech/jts/master/LICENSE_EPLv2.txt' should belong to licenseUrlFileName having key 'eclipse-public-license-v.-2.0-epl-2.0.txt' together with URLs 'https://www.eclipse.org/org/documents/epl-2.0/EPL-2.0.txt' -> [Help 1]`
I'm not sure if there's another way to resolve this.
##########
ql/src/java/org/apache/hadoop/hive/ql/parse/type/HiveFunctionHelper.java:
##########
@@ -322,18 +324,24 @@ public RexNode getExpression(String functionText, FunctionInfo fi,
// unix_timestamp(args) -> to_unix_timestamp(args)
calciteOp = HiveToUnixTimestampSqlOperator.INSTANCE;
}
- expr = rexBuilder.makeCall(returnType, calciteOp, inputs);
+ expr = getExpression(returnType, calciteOp, inputs);
}
if (expr instanceof RexCall && !expr.isA(SqlKind.CAST)) {
RexCall call = (RexCall) expr;
expr = rexBuilder.makeCall(returnType, call.getOperator(),
RexUtil.flatten(call.getOperands(), call.getOperator()));
}
-
return expr;
}
+ private RexNode getExpression(RelDataType returnType, SqlOperator calciteOp, List<RexNode> inputs) {
+ if (Objects.requireNonNull(calciteOp.getKind()) == SqlKind.IN) {
Review Comment:
I think the non-null check was added by mistake; it should work without it. I will remove it.
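Concretely, something like this is what I have in mind (a sketch; the IN-specific handling from the PR is elided, and the fallback mirrors the old inline `rexBuilder.makeCall` call):
```
private RexNode getExpression(RelDataType returnType, SqlOperator calciteOp, List<RexNode> inputs) {
  // SqlOperator#getKind() never returns null, so the Objects.requireNonNull wrapper can simply be dropped
  if (calciteOp.getKind() == SqlKind.IN) {
    // ... IN-specific handling from the PR, unchanged and elided here ...
  }
  return rexBuilder.makeCall(returnType, calciteOp, inputs);
}
```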
##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveFilterSortPredicates.java:
##########
@@ -198,6 +203,16 @@ public Double visitCall(RexCall call) {
return null;
}
+ if (call.getKind() == SqlKind.SEARCH) {
+ RexCall expandedCall = (RexCall) RexUtil.expandSearch(
+ new RexBuilder(new JavaTypeFactoryImpl(new HiveTypeSystemImpl())),
Review Comment:
I will do this.
##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HivePointLookupOptimizerRule.java:
##########
@@ -179,7 +178,7 @@ protected HivePointLookupOptimizerRule(
this.minNumORClauses = minNumORClauses;
}
- public RexNode analyzeRexNode(RexBuilder rexBuilder, RexNode condition) {
+ public static RexNode analyzeRexNode(RexBuilder rexBuilder, RexNode condition, int minNumORClauses) {
Review Comment:
Within `HivePointLookupOptimizerRule#analyzeRexNode` we try to merge IN expressions and create BETWEEN/NOT BETWEEN where possible.
Right now, with Calcite 1.33, there is no way to create these expressions directly, as both IN and BETWEEN are replaced with SEARCH.
In a few places in this PR we therefore convert SEARCH back to IN and BETWEEN. Whenever `sarg.isPoints()` returns false, we can use `analyzeRexNode` to convert back to BETWEEN or NOT BETWEEN, and for that I had to make it static. Here's a snippet from `ASTConverter`, for example:
```
case SEARCH:
  ASTNode astNode = call.getOperands().get(0).accept(this);
  astNodeLst.add(astNode);
  RexLiteral literal = (RexLiteral) call.operands.get(1);
  Sarg<?> sarg = Objects.requireNonNull(literal.getValueAs(Sarg.class), "Sarg");
  int minOrClauses = SessionState.getSessionConf().getIntVar(HiveConf.ConfVars.HIVE_POINT_LOOKUP_OPTIMIZER_MIN);
  // convert Sarg to IN when they are points.
  if (sarg.isPoints()) {
    // just expand SEARCH to ORs when point count is less than HIVE_POINT_LOOKUP_OPTIMIZER_MIN
    if (sarg.pointCount < minOrClauses) {
      return visitCall((RexCall) call.accept(RexUtil.searchShuttle(rexBuilder, null, -1)));
    }
    // else convert to IN
    for (Range<?> range : sarg.rangeSet.asRanges()) {
      astNodeLst.add(visitLiteral((RexLiteral) rexBuilder.makeLiteral(
          range.lowerEndpoint(), literal.getType(), true, true)));
    }
    ASTNode inAST = SqlFunctionConverter.buildAST(HiveIn.INSTANCE, astNodeLst, call.getType());
    if (sarg.nullAs == RexUnknownAs.UNKNOWN) {
      return inAST;
    } else if (sarg.nullAs == RexUnknownAs.TRUE) {
      ASTNode isNull = visitCall((RexCall) rexBuilder.makeCall(SqlStdOperatorTable.IS_NULL, call.getOperands().get(0)));
      return SqlFunctionConverter.buildAST(SqlStdOperatorTable.OR, Arrays.asList(isNull, inAST), call.getType());
    } else {
      ASTNode isNotNull = visitCall((RexCall) rexBuilder.makeCall(SqlStdOperatorTable.IS_NOT_NULL, call.getOperands().get(0)));
      return SqlFunctionConverter.buildAST(SqlStdOperatorTable.AND, Arrays.asList(isNotNull, inAST), call.getType());
    }
  // Expand SEARCH operator
  } else {
    return visitCall((RexCall) HivePointLookupOptimizerRule.analyzeRexNode(
        rexBuilder,
        call.accept(RexUtil.searchShuttle(rexBuilder, null, -1)),
        minOrClauses));
  }
```
##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/HiveCalciteUtil.java:
##########
@@ -656,7 +658,9 @@ public static ImmutableList<RexNode> getPredsNotPushedAlready(Set<String> predic
}
// Build map to not convert multiple times, further remove already included predicates
Map<String,RexNode> stringToRexNode = Maps.newLinkedHashMap();
+ RexSimplify simplify = new RexSimplify(inp.getCluster().getRexBuilder(), RelOptPredicateList.EMPTY, RexUtil.EXECUTOR);
for (RexNode r : predsToPushDown) {
+ r = simplify.simplify(r);
Review Comment:
Without this we run into a StackOverflowError: https://github.com/apache/hive/pull/5196/commits/7f97e5aea3fd3370350a29606853c1c381792a4c
I checked again and we still hit the issue without it. Maybe we need to think of an alternative solution, given that I also had to exclude `HivePreFilteringRule` from a test because of a stack overflow.
##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/jdbc/HiveJdbcImplementor.java:
##########
@@ -128,6 +129,11 @@ public HiveJdbcImplementor(SqlDialect dialect, JavaTypeFactory typeFactory) {
return result(join, leftResult, rightResult);
}
+public Result visit(JdbcTableScan scan) {
+ return result(scan.jdbcTable.tableName(),
+ ImmutableList.of(Clause.FROM), scan, null);
+}
+
Review Comment:
This is required because `visit(JdbcTableScan scan)` was removed from Calcite in https://issues.apache.org/jira/browse/CALCITE-4640 (https://github.com/apache/calcite/pull/2426/files#diff-f118416cd59a4ba4e96ffcfa0d2b496dd2b372c6f5be01f7bb4d46996b9175d1).
We get a lot of errors without it.
##########
ql/src/test/queries/clientpositive/cardinality_preserving_join_opt2.q:
##########
@@ -2,6 +2,7 @@ set hive.support.concurrency=true;
set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
set hive.strict.checks.cartesian.product=false;
set hive.stats.fetch.column.stats=true;
+set hive.cbo.rule.exclusion.regex=HivePreFilteringRule;
Review Comment:
No, this should definitely not be excluded in production. Unfortunately, without the exclusion this test runs for a long time and then throws a StackOverflowError.
I excluded it just so that we can get past this error and get a green run. If possible, please look into this test failure, as I wasn't able to resolve it:
http://ci.hive.apache.org/job/hive-precommit/job/PR-5196/27/testReport/junit/org.apache.hadoop.hive.cli.split3/TestMiniLlapLocalCliDriver/Testing___split_22___PostProcess___testCliDriver_cardinality_preserving_join_opt2_/
##########
ql/src/test/org/apache/hadoop/hive/ql/optimizer/calcite/stats/TestFilterSelectivityEstimator.java:
##########
@@ -508,8 +492,19 @@ public void testComputeRangePredicateSelectivityBetweenWithNULLS() {
public void testComputeRangePredicateSelectivityNotBetweenWithNULLS() {
doReturn((double) 20).when(tableMock).getRowCount();
doReturn(Collections.singletonList(stats)).when(tableMock).getColStat(Collections.singletonList(0));
- RexNode filter = REX_BUILDER.makeCall(SqlStdOperatorTable.BETWEEN, boolTrue, inputRef0, int1, int3);
+ RexNode filter = makeNotBetween(inputRef0, int1, int3);
FilterSelectivityEstimator estimator = new FilterSelectivityEstimator(scan, mq);
Assert.assertEquals(0.55, estimator.estimateSelectivity(filter), DELTA);
}
+
+ private RexNode makeNotBetween(RexNode inputRef, RexNode left, RexNode right) {
+ RexNode withOr = REX_BUILDER.makeCall(
+ SqlStdOperatorTable.OR,
+ REX_BUILDER.makeCall(SqlStdOperatorTable.LESS_THAN, inputRef, left),
+ REX_BUILDER.makeCall(SqlStdOperatorTable.GREATER_THAN, inputRef, right)
+ );
+ RexSimplify simplify = new RexSimplify(REX_BUILDER, RelOptPredicateList.EMPTY, RexUtil.EXECUTOR);
+
+ return simplify.simplify(withOr);
Review Comment:
With what you have suggested, it creates a RexCall like `NOT(SEARCH...)`, which goes to `computeNotEqualitySelectivity` in `FilterSelectivityEstimator`. maxNDV is computed there, but 1.0 is always returned when the value is less than 1. That gives a wrong result even if we add an else-if block in `getMaxNDV`:
```
else if (op.isA(SqlKind.SEARCH)) {
tmpNDV = op.accept(this);
if (tmpNDV == null) {
return null;
}
if (tmpNDV > maxNDV) {
maxNDV = tmpNDV;
}
} else {
irv = new InputReferencedVisitor();
irv.apply(op);
...
...
```
With my patch, it creates a RexCall with the SEARCH operator, without the NOT, and that matches `case SEARCH` directly.
Moreover, we cannot create a BETWEEN with `makeBetween` when `left > right`, so this test would fail with your suggestion:
```
public void testComputeRangePredicateSelectivityNotBetweenRightLowerThanLeft() {
  RexNode filter = makeNotBetween(inputRef0, int5, int3);
  FilterSelectivityEstimator estimator = new FilterSelectivityEstimator(scan, mq);
  Assert.assertEquals(1, estimator.estimateSelectivity(filter), DELTA);
}
```
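For contrast, the two constructions side by side (a sketch reusing the test's existing `REX_BUILDER`, `inputRef0`, `int1`, `int3` fields):
```
// suggested shape: wrap makeBetween in NOT -> roughly NOT(SEARCH($0, Sarg[[1..3]])),
// which FilterSelectivityEstimator routes to computeNotEqualitySelectivity
RexNode suggested = REX_BUILDER.makeCall(SqlStdOperatorTable.NOT,
    REX_BUILDER.makeBetween(inputRef0, int1, int3));

// shape produced by makeNotBetween above -> roughly SEARCH($0, Sarg[(-inf..1), (3..+inf)]),
// which lands directly in the `case SEARCH` branch
RexNode fromPatch = makeNotBetween(inputRef0, int1, int3);
```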
Let me know what you think about this.
##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java:
##########
@@ -171,7 +176,36 @@ public Double visitCall(RexCall call) {
break;
}
- default:
+ case SEARCH:
+ RexLiteral literal = (RexLiteral) call.operands.get(1);
+ Sarg<?> sarg = Objects.requireNonNull(literal.getValueAs(Sarg.class), "Sarg");
+ RexBuilder rexBuilder = new RexBuilder(new JavaTypeFactoryImpl(new HiveTypeSystemImpl()));
Review Comment:
I will change this.
##########
ql/src/test/org/apache/hadoop/hive/ql/io/TestSymlinkTextInputFormat.java:
##########
@@ -157,7 +157,7 @@ public void testCombine() throws Exception {
drv.run(loadFileCommand);
- String cmd = "select key*1 from " + tblName;
+ String cmd = "select key*2 from " + tblName;
Review Comment:
I saw that the final logical plan with Calcite 1.33 is a bit more reduced than what was produced with 1.25:
```
1.33 - select key*1 ...
DEBUG [main] translator.PlanModifierForASTConv: Final plan after modifier
HiveProject(_c0=[$0])
  HiveTableScan(table=[[default, text_symlink_text]], table:alias=[text_symlink_text])
1.25 - select key*1 ...
DEBUG [main] translator.PlanModifierForASTConv: Final plan after modifier
HiveProject(_c0=[*($0, 1)])
  HiveTableScan(table=[[default, text_symlink_text]], table:alias=[text_symlink_text])
```
I think `key * 1` is getting reduced to just `key` in 1.33. If we change it
to `key*2` then the test passes:
```
1.33 - select key*2 ...
DEBUG [main] translator.PlanModifierForASTConv: Final plan after modifier
HiveProject(_c0=[*($0, 2)])
  HiveTableScan(table=[[default, text_symlink_text]], table:alias=[text_symlink_text])
```
Since this test is about the symlink text input format and is not specifically testing `key * 1`, I changed it to `key * 2`.
Moreover, in 1.25, when we run the same test with `select key ...`, it fails with the same error.
##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/reloperators/HiveFilter.java:
##########
@@ -72,6 +78,22 @@ public HiveFilter(RelOptCluster cluster, RelTraitSet traits, RelNode child, RexN
super(cluster, TraitsUtil.getDefaultTraitSet(cluster), child, condition);
this.correlationInfos = new CorrelationInfoSupplier(getCondition());
}
+
+ @Override
+ public RelWriter explainTerms(RelWriter pw) {
+ // Remove this method after upgrading Calcite to 1.35+
Review Comment:
Is it a good idea to write project properties with `properties-maven-plugin` to `ql/src/main/resources/`, and read the Calcite version from there to compare? That plugin is already in use in `itests`.
If we just define a constant CALCITE_VERSION = 1.35, the check will never throw once the actual Calcite version moves past it, because the code never sees the real version. But if we read the version directly from the properties file, it will throw an exception and remind us to remove this override.
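Roughly what I'm picturing (a sketch only; the resource name `calcite-version.properties` and the `calcite.version` key are hypothetical and would be whatever `properties-maven-plugin` is configured to write):
```
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

final class CalciteVersionGuard {

  private CalciteVersionGuard() {
  }

  /** Fails loudly once the build moves past Calcite 1.35, as a reminder to drop the override. */
  static void checkStillOnCalcite134OrLower() {
    Properties props = new Properties();
    // hypothetical resource written at build time by properties-maven-plugin
    try (InputStream in = CalciteVersionGuard.class.getResourceAsStream("/calcite-version.properties")) {
      if (in == null) {
        return; // resource not packaged, skip the check
      }
      props.load(in);
    } catch (IOException e) {
      return; // cannot read the file, skip the check
    }
    String version = props.getProperty("calcite.version", "0.0");
    // naive major.minor parse, fine for a sketch
    String[] parts = version.split("\\.");
    int major = Integer.parseInt(parts[0]);
    int minor = parts.length > 1 ? Integer.parseInt(parts[1]) : 0;
    if (major > 1 || (major == 1 && minor >= 35)) {
      throw new IllegalStateException(
          "Calcite is now " + version + "; remove HiveFilter#explainTerms and this check");
    }
  }
}
```
That way the check starts failing on the first build against a newer Calcite instead of silently going stale.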