zabetak commented on code in PR #6293:
URL: https://github.com/apache/hive/pull/6293#discussion_r2871457543


##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java:
##########
@@ -218,6 +218,27 @@ private boolean isRemovableCast(RexNode exp, HiveTableScan tableScan) {
       return false;
     }
 
+    SqlTypeName targetType = cast.getType().getSqlTypeName();
+    SqlTypeName sourceType = op0.getType().getSqlTypeName();
+
+    switch (sourceType) {
+    case TINYINT, SMALLINT, INTEGER, BIGINT:
+      // additional checks are needed
+      break;
+    case FLOAT, DOUBLE, DECIMAL, TIMESTAMP, DATE:
+      return true;

Review Comment:
   I don't understand why we don't need additional checks to remove a cast when 
the source data type is one of these. Could you please add a comment explaining 
why it is ok to return true and exit early in this case?



##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java:
##########
@@ -218,6 +218,27 @@ private boolean isRemovableCast(RexNode exp, HiveTableScan tableScan) {
       return false;
     }
 
+    SqlTypeName targetType = cast.getType().getSqlTypeName();
+    SqlTypeName sourceType = op0.getType().getSqlTypeName();
+
+    switch (sourceType) {
+    case TINYINT, SMALLINT, INTEGER, BIGINT:
+      // additional checks are needed
+      break;
+    case FLOAT, DOUBLE, DECIMAL, TIMESTAMP, DATE:
+      return true;
+    default:
+      // unknown type, do not remove the cast
+      return false;
+    }
+
+    // If the source type is completely within the target type, the cast is lossless
+    Range<Float> targetRange = getRangeOfType(cast.getType(), BoundType.CLOSED, BoundType.CLOSED);

Review Comment:
   Is there anything preventing the cast type from being a STRING, CHAR, VARCHAR, or another unsupported type? We want to avoid hitting an `IllegalStateException` in this case.
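One defensive pattern would be to gate the range computation behind an explicit whitelist. A minimal standalone sketch (the class, method name, and whitelist below are illustrative assumptions, not the Hive API):

```java
import java.util.Set;

public class CastTargetGuard {
    // Hypothetical whitelist mirroring the types a range computation like
    // getRangeOfType can handle; anything else (STRING, CHAR, VARCHAR, ...)
    // bails out early instead of reaching an IllegalStateException deeper in.
    static final Set<String> RANGE_COMPUTABLE = Set.of(
        "TINYINT", "SMALLINT", "INTEGER", "BIGINT",
        "FLOAT", "DOUBLE", "DECIMAL", "TIMESTAMP", "DATE");

    static boolean isRangeComputable(String sqlTypeName) {
        return RANGE_COMPUTABLE.contains(sqlTypeName);
    }

    public static void main(String[] args) {
        System.out.println(isRangeComputable("BIGINT"));  // true
        System.out.println(isRangeComputable("VARCHAR")); // false
    }
}
```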



##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java:
##########
@@ -218,6 +218,27 @@ private boolean isRemovableCast(RexNode exp, HiveTableScan tableScan) {
       return false;
     }
 
+    SqlTypeName targetType = cast.getType().getSqlTypeName();
+    SqlTypeName sourceType = op0.getType().getSqlTypeName();
+
+    switch (sourceType) {
+    case TINYINT, SMALLINT, INTEGER, BIGINT:
+      // additional checks are needed
+      break;
+    case FLOAT, DOUBLE, DECIMAL, TIMESTAMP, DATE:
+      return true;
+    default:
+      // unknown type, do not remove the cast
+      return false;
+    }
+
+    // If the source type is completely within the target type, the cast is lossless
+    Range<Float> targetRange = getRangeOfType(cast.getType(), BoundType.CLOSED, BoundType.CLOSED);
+    Range<Float> sourceRange = getRangeOfType(op0.getType(), BoundType.CLOSED, BoundType.CLOSED);
+    if (sourceRange.equals(targetRange.intersection(sourceRange))) {
+      return true;
+    }

Review Comment:
   The `intersection` method throws an `IllegalArgumentException` if the ranges are disjoint. Is it guaranteed that the ranges are always connected at this stage?
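For closed intervals, the connectedness condition that Guava's `Range.intersection` requires (and that `Range.isConnected` checks) reduces to a simple overlap-or-touch test. A self-contained sketch with plain floats:

```java
public class RangeConnected {
    // Two closed intervals [lo1, hi1] and [lo2, hi2] are connected iff they
    // overlap or touch; Guava's Range.intersection throws
    // IllegalArgumentException when this does not hold.
    static boolean isConnected(float lo1, float hi1, float lo2, float hi2) {
        return lo1 <= hi2 && lo2 <= hi1;
    }

    public static void main(String[] args) {
        // a TINYINT-like type range vs. a predicate range inside it: connected
        System.out.println(isConnected(-128f, 127f, 0f, 100f));   // true
        // a predicate range entirely above the type range: disjoint
        System.out.println(isConnected(-128f, 127f, 200f, 300f)); // false
    }
}
```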



##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java:
##########
@@ -218,6 +218,27 @@ private boolean isRemovableCast(RexNode exp, HiveTableScan tableScan) {
       return false;
     }
 
+    SqlTypeName targetType = cast.getType().getSqlTypeName();
+    SqlTypeName sourceType = op0.getType().getSqlTypeName();
+
+    switch (sourceType) {
+    case TINYINT, SMALLINT, INTEGER, BIGINT:
+      // additional checks are needed
+      break;
+    case FLOAT, DOUBLE, DECIMAL, TIMESTAMP, DATE:
+      return true;
+    default:
+      // unknown type, do not remove the cast
+      return false;
+    }
+
+    // If the source type is completely within the target type, the cast is lossless

Review Comment:
   Do we need stats to determine if a cast is lossless? If not, we could possibly move this logic before the check for empty column stats.



##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java:
##########
@@ -226,43 +247,41 @@ private boolean isRemovableCast(RexNode exp, HiveTableScan tableScan) {
       return false;
     }
 
-    SqlTypeName type = cast.getType().getSqlTypeName();
 
-    double min;
-    double max;
-    switch (type) {
-    case TINYINT, SMALLINT, INTEGER, BIGINT:
-      min = ((Number) type.getLimit(false, SqlTypeName.Limit.OVERFLOW, false, -1, -1)).doubleValue();
-      max = ((Number) type.getLimit(true, SqlTypeName.Limit.OVERFLOW, false, -1, -1)).doubleValue();
-      break;
-    case TIMESTAMP, DATE:
-      min = Long.MIN_VALUE;
-      max = Long.MAX_VALUE;
-      break;
-    case FLOAT:
-      min = -Float.MAX_VALUE;
-      max = Float.MAX_VALUE;
-      break;
-    case DOUBLE, DECIMAL:
-      min = -Double.MAX_VALUE;
-      max = Double.MAX_VALUE;
-      break;
-    default:
-      // unknown type, do not remove the cast
-      return false;
-    }
     // are all values of the input column accepted by the cast?
+    double min = ((Number) targetType.getLimit(false, SqlTypeName.Limit.OVERFLOW, false, -1, -1)).doubleValue();
+    double max = ((Number) targetType.getLimit(true, SqlTypeName.Limit.OVERFLOW, false, -1, -1)).doubleValue();
     return min < colRange.minValue.doubleValue() && colRange.maxValue.doubleValue() < max;
   }
 
   /**
-   * Get the range of values that are rounded to valid values of a DECIMAL type.
+   * Get the range of values that are rounded to valid values of a type.
    *
-   * @param type the DECIMAL type
+   * @param type the type
    * @param lowerBound the lower bound type of the result
    * @param upperBound the upper bound type of the result
    * @return the range of the type
    */
+  private static Range<Float> getRangeOfType(RelDataType type, BoundType lowerBound, BoundType upperBound) {
+    switch (type.getSqlTypeName()) {
+    // in case of integer types,
+    case TINYINT:
+      return Range.closed(-128.99998f, 127.99999f);
+    case SMALLINT:
+      return Range.closed(-32768.996f, 32767.998f);
+    case INTEGER:
+      return Range.closed(-2.1474836E9f, 2.1474836E9f);
+    case BIGINT, DATE, TIMESTAMP:
+      return Range.closed(-9.223372E18f, 9.223372E18f);
+    case DECIMAL:
+      return getRangeOfDecimalType(type, lowerBound, upperBound);

Review Comment:
   For every other type we return a closed `Range` no matter the input bound arguments. Why can't we do the same for `DECIMAL`?
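The reason DECIMAL is bound-sensitive presumably comes from its rounding: assuming Hive's cast to DECIMAL rounds half-up (which the `adjust = 5 * 10^-(scale + 1)` term in the hunk below suggests), values strictly inside a half-step band still land on the boundary. A standalone illustration with `BigDecimal`:

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

public class DecimalRounding {
    public static void main(String[] args) {
        // With scale 1, the half-step is 0.05: any value less than 0.05 away
        // from a scale-1 boundary rounds onto it, which is why the DECIMAL
        // range endpoints must be widened by that adjustment on each side.
        System.out.println(new BigDecimal("10.04").setScale(1, RoundingMode.HALF_UP));  // 10.0
        System.out.println(new BigDecimal("10.05").setScale(1, RoundingMode.HALF_UP));  // 10.1
        System.out.println(new BigDecimal("-10.05").setScale(1, RoundingMode.HALF_UP)); // -10.1
    }
}
```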



##########
ql/src/test/org/apache/hadoop/hive/ql/optimizer/calcite/stats/TestFilterSelectivityEstimator.java:
##########
@@ -591,6 +604,35 @@ public void testComputeRangePredicateSelectivityNotBetweenWithNULLS() {
     Assert.assertEquals(0.2, estimator.estimateSelectivity(filter), DELTA);
   }
 
+  @Test
+  public void testRangePredicateCastInteger() {
+    // use VALUES2, even if the tested types cannot represent its values
+    // we're only interested in whether the cast to a smaller integer type results in the default selectivity

Review Comment:
   Do we have/need tests where a narrowing cast does not return the default 
selectivity cause we are able to determine via the stats that the range of 
values in the column are all accepted by the cast?



##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java:
##########
@@ -295,21 +314,54 @@ private static Range<Float> getRangeOfDecimalType(RelDataType type, BoundType lo
    * @param typeRange the boundaries of the type range
    * @return the adjusted boundary
    */
-  private static Range<Float> adjustRangeToDecimalType(Range<Float> predicateRange, RelDataType type,
+  private static Range<Float> adjustRangeToType(Range<Float> predicateRange, RelDataType type,
       Range<Float> typeRange) {
-    float adjust = (float) (5 * Math.pow(10, -(type.getScale() + 1)));
-    // the resulting value of +- adjust would be rounded up, so in some cases we need to use Math.nextDown
     boolean lowerInclusive = BoundType.CLOSED.equals(predicateRange.lowerBoundType());
     boolean upperInclusive = BoundType.CLOSED.equals(predicateRange.upperBoundType());
-    float adjusted1 = lowerInclusive ? predicateRange.lowerEndpoint() - adjust
-        : Math.nextDown(predicateRange.lowerEndpoint() + adjust);
-    float adjusted2 = upperInclusive ? Math.nextDown(predicateRange.upperEndpoint() + adjust)
-        : predicateRange.upperEndpoint() - adjust;
-    float lower = Math.max(adjusted1, typeRange.lowerEndpoint());
-    float upper = Math.min(adjusted2, typeRange.upperEndpoint());
-    // the boundaries might result in an invalid range (e.g., left > right)
-    // in that case the predicate does not select anything, and we return an empty range
-    return makeRange(lower, predicateRange.lowerBoundType(), upper, predicateRange.upperBoundType());
+    switch (type.getSqlTypeName()) {
+    case TINYINT, SMALLINT, INTEGER, BIGINT: {
+      // when casting a floating point, its values are rounded towards 0
+      // i.e, 10.99 is rounded to 10, and -10.99 is rounded to -10
+      // to take this into account, the predicate range is transformed in the following ways
+      // [10.0, 15.0] -> [10, 15.99999]
+      // (10.0, 15.0) -> [11, 14.99999]
+      // [10.2, 15.2] -> [11, 15.99999]
+      // (10.2, 15.2) -> [11, 15.99999]
+
+      // [-15.0, -10.0] -> [-15.9999, -10]
+      // (-15.0, -10.0) -> [-14.9999, -11]
+      // [-15.2, -10.2] -> [-15.9999, -11]
+      // (-15.2, -10.2) -> [-15.9999, -11]
+
+      // normalize the range to make the formulas easier
+      Range<Float> range = convertRangeToClosedOpen(predicateRange);
+      Range<Float> typeClosedOpen = convertRangeToClosedOpen(typeRange);

Review Comment:
   Reusing the same variable would make the similarity with the DECIMAL handling below more evident.
   ```java
   typeRange = convertRangeToClosedOpen(typeRange);
   ```
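The rounding described in the quoted comments matches Java's own narrowing conversion from floating point, which truncates toward zero (a standalone demonstration, assuming Hive's cast follows the same semantics):

```java
public class TruncationDemo {
    public static void main(String[] args) {
        System.out.println((int) 10.99f);  // 10: fractional part dropped
        System.out.println((int) -10.99f); // -10: toward zero, not floor (-11)
        System.out.println((long) 15.0f);  // 15: integral values unchanged
    }
}
```

This is why `[10.2, 15.2]` widens upward to just below 16: every source value in `[15.0, 16.0)` truncates to 15.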



##########
ql/src/test/org/apache/hadoop/hive/ql/optimizer/calcite/stats/TestFilterSelectivityEstimator.java:
##########
@@ -657,8 +692,8 @@ public void testRangePredicateWithCast2() {
 
     // the cast would apply a modulo operation to the values outside the range of the cast
     // so instead a default selectivity should be returned
-    checkSelectivity(1 / 3.f, lt(cast("f_numeric", TINYINT), literalFloat(100)));
-    checkSelectivity(1 / 3.f, lt(cast("f_numeric", TINYINT), literalFloat(100)));
+    checkSelectivity(1 / 3.f, lt(cast("f_integer", TINYINT), literalFloat(100)));
+    checkSelectivity(1 / 3.f, lt(cast("f_integer", TINYINT), literalFloat(100)));

Review Comment:
    Don't we cover this case already in `testRangePredicateCastInteger`?
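For reference, the "modulo operation" the test comment refers to corresponds to Java's narrowing primitive conversion, which keeps only the low-order bits (assuming Hive's TINYINT cast behaves like Java's `byte` narrowing, as the comment implies):

```java
public class NarrowingWrap {
    public static void main(String[] args) {
        // Values outside TINYINT's [-128, 127] wrap modulo 256 when narrowed,
        // so a cast like CAST(200 AS TINYINT) cannot simply be dropped.
        System.out.println((byte) 200); // -56  (200 - 256)
        System.out.println((byte) 130); // -126 (130 - 256)
        System.out.println((byte) 100); // 100  (in range: unchanged)
    }
}
```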



##########
ql/src/test/org/apache/hadoop/hive/ql/optimizer/calcite/stats/TestFilterSelectivityEstimator.java:
##########
@@ -726,7 +761,85 @@ public void testRangePredicateOnDateWithCast() {
   }
 
   @Test
-  public void testBetweenWithCastDecimal2s1() {
+  public void testBetweenWithCastToTinyIntCheckRounding() {
+    useFieldWithValues("f_numeric", VALUES3, KLL3);
+    float total = VALUES3.length;
+    float universe = 10; // the number of values that "survive" the cast
+    RexNode cast = cast("f_numeric", TINYINT);
+    // check rounding of positive numbers
+    checkBetweenSelectivity(3, universe, total, cast, 0, 10);
+    checkBetweenSelectivity(3, universe, total, cast, 0, 10.9f);

Review Comment:
   This does not seem like a valid BETWEEN expression. I have the impression 
that we can't compare a float with an int type directly. Some kind of type 
conversion/alignment should be performed and since we are casting one side to 
TINYINT then the other side (literal) should be casted as well so we can never 
end-up with `10.9f`.
   
   I think we should drop this kind of invalid expressions from the tests since 
we don't want to unrelated failures in the future. To be checked if the code 
can be simplified as well after knowing that these expressions cannot appear.



##########
ql/src/test/org/apache/hadoop/hive/ql/optimizer/calcite/stats/TestFilterSelectivityEstimator.java:
##########
@@ -726,7 +761,85 @@ public void testRangePredicateOnDateWithCast() {
   }
 
   @Test
-  public void testBetweenWithCastDecimal2s1() {
+  public void testBetweenWithCastToTinyIntCheckRounding() {
+    useFieldWithValues("f_numeric", VALUES3, KLL3);
+    float total = VALUES3.length;
+    float universe = 10; // the number of values that "survive" the cast
+    RexNode cast = cast("f_numeric", TINYINT);
+    // check rounding of positive numbers
+    checkBetweenSelectivity(3, universe, total, cast, 0, 10);
+    checkBetweenSelectivity(3, universe, total, cast, 0, 10.9f);
+    checkBetweenSelectivity(4, universe, total, cast, 0, 11);
+    checkBetweenSelectivity(4, universe, total, cast, 10, 20);
+    checkBetweenSelectivity(1, universe, total, cast, 10.9999f, 20);
+    checkBetweenSelectivity(1, universe, total, cast, 11, 20);
+
+    // check rounding of negative numbers
+    checkBetweenSelectivity(4, universe, total, cast, -20, -10);
+    checkBetweenSelectivity(1, universe, total, cast, -20, -10.9f);

Review Comment:
   Invalid?



##########
ql/src/test/org/apache/hadoop/hive/ql/optimizer/calcite/stats/TestFilterSelectivityEstimator.java:
##########
@@ -726,7 +761,85 @@ public void testRangePredicateOnDateWithCast() {
   }
 
   @Test
-  public void testBetweenWithCastDecimal2s1() {
+  public void testBetweenWithCastToTinyIntCheckRounding() {
+    useFieldWithValues("f_numeric", VALUES3, KLL3);
+    float total = VALUES3.length;
+    float universe = 10; // the number of values that "survive" the cast
+    RexNode cast = cast("f_numeric", TINYINT);
+    // check rounding of positive numbers
+    checkBetweenSelectivity(3, universe, total, cast, 0, 10);
+    checkBetweenSelectivity(3, universe, total, cast, 0, 10.9f);
+    checkBetweenSelectivity(4, universe, total, cast, 0, 11);
+    checkBetweenSelectivity(4, universe, total, cast, 10, 20);
+    checkBetweenSelectivity(1, universe, total, cast, 10.9999f, 20);
+    checkBetweenSelectivity(1, universe, total, cast, 11, 20);
+
+    // check rounding of negative numbers
+    checkBetweenSelectivity(4, universe, total, cast, -20, -10);
+    checkBetweenSelectivity(1, universe, total, cast, -20, -10.9f);
+    checkBetweenSelectivity(1, universe, total, cast, -20, -11);
+    checkBetweenSelectivity(3, universe, total, cast, -10, 0);
+    checkBetweenSelectivity(3, universe, total, cast, -10.9999f, 0);

Review Comment:
   Invalid?



##########
ql/src/test/org/apache/hadoop/hive/ql/optimizer/calcite/stats/TestFilterSelectivityEstimator.java:
##########
@@ -591,6 +604,35 @@ public void testComputeRangePredicateSelectivityNotBetweenWithNULLS() {
     Assert.assertEquals(0.2, estimator.estimateSelectivity(filter), DELTA);
   }
 
+  @Test
+  public void testRangePredicateCastInteger() {

Review Comment:
   I like this test covering all combinations of casts between INTEGER types. We also have tests with combinations between DECIMAL types. I am wondering if we should generalize even further and have tests covering every combination of types supported in the range predicates. Having the same predicate for all possible combinations of casts and column types would add more coverage and make the tests more systematic.



##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java:
##########


Review Comment:
   Not only DECIMAL type.



##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java:
##########
@@ -295,21 +314,54 @@ private static Range<Float> getRangeOfDecimalType(RelDataType type, BoundType lo
    * @param typeRange the boundaries of the type range
    * @return the adjusted boundary
    */
-  private static Range<Float> adjustRangeToDecimalType(Range<Float> predicateRange, RelDataType type,
+  private static Range<Float> adjustRangeToType(Range<Float> predicateRange, RelDataType type,
       Range<Float> typeRange) {
-    float adjust = (float) (5 * Math.pow(10, -(type.getScale() + 1)));
-    // the resulting value of +- adjust would be rounded up, so in some cases we need to use Math.nextDown
     boolean lowerInclusive = BoundType.CLOSED.equals(predicateRange.lowerBoundType());
     boolean upperInclusive = BoundType.CLOSED.equals(predicateRange.upperBoundType());
-    float adjusted1 = lowerInclusive ? predicateRange.lowerEndpoint() - adjust
-        : Math.nextDown(predicateRange.lowerEndpoint() + adjust);
-    float adjusted2 = upperInclusive ? Math.nextDown(predicateRange.upperEndpoint() + adjust)
-        : predicateRange.upperEndpoint() - adjust;
-    float lower = Math.max(adjusted1, typeRange.lowerEndpoint());
-    float upper = Math.min(adjusted2, typeRange.upperEndpoint());
-    // the boundaries might result in an invalid range (e.g., left > right)
-    // in that case the predicate does not select anything, and we return an empty range
-    return makeRange(lower, predicateRange.lowerBoundType(), upper, predicateRange.upperBoundType());
+    switch (type.getSqlTypeName()) {
+    case TINYINT, SMALLINT, INTEGER, BIGINT: {
+      // when casting a floating point, its values are rounded towards 0
+      // i.e, 10.99 is rounded to 10, and -10.99 is rounded to -10
+      // to take this into account, the predicate range is transformed in the following ways
+      // [10.0, 15.0] -> [10, 15.99999]
+      // (10.0, 15.0) -> [11, 14.99999]
+      // [10.2, 15.2] -> [11, 15.99999]
+      // (10.2, 15.2) -> [11, 15.99999]
+
+      // [-15.0, -10.0] -> [-15.9999, -10]
+      // (-15.0, -10.0) -> [-14.9999, -11]
+      // [-15.2, -10.2] -> [-15.9999, -11]
+      // (-15.2, -10.2) -> [-15.9999, -11]
+
+      // normalize the range to make the formulas easier
+      Range<Float> range = convertRangeToClosedOpen(predicateRange);
+      Range<Float> typeClosedOpen = convertRangeToClosedOpen(typeRange);
+      float rangeLower = (range.lowerEndpoint() >= 0 ? (float) Math.ceil(range.lowerEndpoint())
+          : Math.nextUp(-(float) Math.ceil(Math.nextUp(-range.lowerEndpoint()))));
+      float rangeUpper = range.upperEndpoint() >= 0 ? Math.nextDown((float) Math.ceil(range.upperEndpoint()))
+          : Math.nextUp((float) -Math.ceil(-range.upperEndpoint()));
+      float lower = Math.max(typeClosedOpen.lowerEndpoint(), rangeLower);
+      float upper = Math.min(typeClosedOpen.upperEndpoint(), rangeUpper);

Review Comment:
   Inverting the arguments of min and max will also bring the code closer to the DECIMAL handling block below.
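Plugging concrete endpoints into the `rangeLower`/`rangeUpper` formulas from the hunk above reproduces the `[10.2, 15.2] -> [11, ~16)` mapping described in the comments (a standalone re-derivation with plain floats, not the Hive code itself):

```java
public class IntegerRangeAdjust {
    public static void main(String[] args) {
        // closed-open normalized predicate range [10.2, 15.2)
        float lo = 10.2f, hi = 15.2f;
        float rangeLower = lo >= 0 ? (float) Math.ceil(lo)
            : Math.nextUp(-(float) Math.ceil(Math.nextUp(-lo)));
        float rangeUpper = hi >= 0 ? Math.nextDown((float) Math.ceil(hi))
            : Math.nextUp((float) -Math.ceil(-hi));
        System.out.println(rangeLower); // 11.0
        System.out.println(rangeUpper); // just below 16.0
    }
}
```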



##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java:
##########
@@ -367,11 +419,9 @@ private double computeRangePredicateSelectivity(RexCall call, SqlKind op) {
     int inputRefOpIndex = 1 - literalOpIdx;
     RexNode node = operands.get(inputRefOpIndex);
     if (isRemovableCast(node, scan)) {
-      if (node.getType().getSqlTypeName() == SqlTypeName.DECIMAL) {
-        Range<Float> rangeOfDecimalType =
-            getRangeOfDecimalType(node.getType(), boundaries.lowerBoundType(), boundaries.upperBoundType());
-        boundaries = adjustRangeToDecimalType(boundaries, node.getType(), rangeOfDecimalType);
-      }
+      Range<Float> rangeOfDecimalType =

Review Comment:
   Not necessarily a DECIMAL. Rename?



##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java:
##########
@@ -295,21 +314,54 @@ private static Range<Float> getRangeOfDecimalType(RelDataType type, BoundType lo
    * @param typeRange the boundaries of the type range
    * @return the adjusted boundary
    */
-  private static Range<Float> adjustRangeToDecimalType(Range<Float> predicateRange, RelDataType type,
+  private static Range<Float> adjustRangeToType(Range<Float> predicateRange, RelDataType type,
       Range<Float> typeRange) {
-    float adjust = (float) (5 * Math.pow(10, -(type.getScale() + 1)));
-    // the resulting value of +- adjust would be rounded up, so in some cases we need to use Math.nextDown
     boolean lowerInclusive = BoundType.CLOSED.equals(predicateRange.lowerBoundType());
     boolean upperInclusive = BoundType.CLOSED.equals(predicateRange.upperBoundType());
-    float adjusted1 = lowerInclusive ? predicateRange.lowerEndpoint() - adjust
-        : Math.nextDown(predicateRange.lowerEndpoint() + adjust);
-    float adjusted2 = upperInclusive ? Math.nextDown(predicateRange.upperEndpoint() + adjust)
-        : predicateRange.upperEndpoint() - adjust;
-    float lower = Math.max(adjusted1, typeRange.lowerEndpoint());
-    float upper = Math.min(adjusted2, typeRange.upperEndpoint());
-    // the boundaries might result in an invalid range (e.g., left > right)
-    // in that case the predicate does not select anything, and we return an empty range
-    return makeRange(lower, predicateRange.lowerBoundType(), upper, predicateRange.upperBoundType());
+    switch (type.getSqlTypeName()) {
+    case TINYINT, SMALLINT, INTEGER, BIGINT: {
+      // when casting a floating point, its values are rounded towards 0
+      // i.e, 10.99 is rounded to 10, and -10.99 is rounded to -10
+      // to take this into account, the predicate range is transformed in the following ways
+      // [10.0, 15.0] -> [10, 15.99999]
+      // (10.0, 15.0) -> [11, 14.99999]
+      // [10.2, 15.2] -> [11, 15.99999]
+      // (10.2, 15.2) -> [11, 15.99999]
+
+      // [-15.0, -10.0] -> [-15.9999, -10]
+      // (-15.0, -10.0) -> [-14.9999, -11]
+      // [-15.2, -10.2] -> [-15.9999, -11]
+      // (-15.2, -10.2) -> [-15.9999, -11]
+
+      // normalize the range to make the formulas easier
+      Range<Float> range = convertRangeToClosedOpen(predicateRange);
+      Range<Float> typeClosedOpen = convertRangeToClosedOpen(typeRange);
+      float rangeLower = (range.lowerEndpoint() >= 0 ? (float) Math.ceil(range.lowerEndpoint())
+          : Math.nextUp(-(float) Math.ceil(Math.nextUp(-range.lowerEndpoint()))));
+      float rangeUpper = range.upperEndpoint() >= 0 ? Math.nextDown((float) Math.ceil(range.upperEndpoint()))
+          : Math.nextUp((float) -Math.ceil(-range.upperEndpoint()));
+      float lower = Math.max(typeClosedOpen.lowerEndpoint(), rangeLower);
+      float upper = Math.min(typeClosedOpen.upperEndpoint(), rangeUpper);
+      return makeRange(lower, BoundType.CLOSED, upper, BoundType.OPEN);
+    }
+    case DECIMAL: {
+      float adjust = (float) (5 * Math.pow(10, -(type.getScale() + 1)));
+      // the resulting value of +- adjust would be rounded up, so in some cases we need to use Math.nextDown
+      float adjusted1 = lowerInclusive ? predicateRange.lowerEndpoint() - adjust
+          : Math.nextDown(predicateRange.lowerEndpoint() + adjust);
+      float adjusted2 = upperInclusive ? Math.nextDown(predicateRange.upperEndpoint() + adjust)
+          : predicateRange.upperEndpoint() - adjust;
+      float lower = Math.max(adjusted1, typeRange.lowerEndpoint());
+      float upper = Math.min(adjusted2, typeRange.upperEndpoint());
+      // the boundaries might result in an invalid range (e.g., left > right)
+      // in that case the predicate does not select anything, and we return an empty range

Review Comment:
   It seems that invalid ranges can be created whenever `makeRange` is called. Since that method was added specifically for this purpose, this comment should be moved there, or possibly dropped if redundant.



##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java:
##########


Review Comment:
   The method has "special" handling for almost all data types (with a few exceptions), so the doc seems a bit obsolete. We can either drop it or update it to reflect what happens for each type.



##########
ql/src/test/org/apache/hadoop/hive/ql/optimizer/calcite/stats/TestFilterSelectivityEstimator.java:
##########
@@ -603,13 +645,6 @@ public void testRangePredicateWithCast() {
     checkSelectivity(1 / 13f, lt(cast("f_numeric", TINYINT), int2));
     checkSelectivity(5 / 13f, gt(cast("f_numeric", TINYINT), int2));
     checkSelectivity(8 / 13f, le(cast("f_numeric", TINYINT), int2));
-
-    // check some types
-    checkSelectivity(3 / 13.f, ge(cast("f_numeric", INTEGER), int5));
-    checkSelectivity(3 / 13.f, ge(cast("f_numeric", SMALLINT), int5));
-    checkSelectivity(3 / 13.f, ge(cast("f_numeric", BIGINT), int5));
-    checkSelectivity(3 / 13.f, ge(cast("f_numeric", FLOAT), int5));
-    checkSelectivity(3 / 13.f, ge(cast("f_numeric", DOUBLE), int5));

Review Comment:
   Why did we remove these tests? Do we test somewhere else the cast from 
numeric to the other types?



##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java:
##########
@@ -295,21 +314,54 @@ private static Range<Float> getRangeOfDecimalType(RelDataType type, BoundType lo
    * @param typeRange the boundaries of the type range
    * @return the adjusted boundary
    */
-  private static Range<Float> adjustRangeToDecimalType(Range<Float> predicateRange, RelDataType type,
+  private static Range<Float> adjustRangeToType(Range<Float> predicateRange, RelDataType type,
       Range<Float> typeRange) {
-    float adjust = (float) (5 * Math.pow(10, -(type.getScale() + 1)));
-    // the resulting value of +- adjust would be rounded up, so in some cases we need to use Math.nextDown
     boolean lowerInclusive = BoundType.CLOSED.equals(predicateRange.lowerBoundType());
     boolean upperInclusive = BoundType.CLOSED.equals(predicateRange.upperBoundType());
-    float adjusted1 = lowerInclusive ? predicateRange.lowerEndpoint() - adjust
-        : Math.nextDown(predicateRange.lowerEndpoint() + adjust);
-    float adjusted2 = upperInclusive ? Math.nextDown(predicateRange.upperEndpoint() + adjust)
-        : predicateRange.upperEndpoint() - adjust;
-    float lower = Math.max(adjusted1, typeRange.lowerEndpoint());
-    float upper = Math.min(adjusted2, typeRange.upperEndpoint());
-    // the boundaries might result in an invalid range (e.g., left > right)
-    // in that case the predicate does not select anything, and we return an empty range
-    return makeRange(lower, predicateRange.lowerBoundType(), upper, predicateRange.upperBoundType());
+    switch (type.getSqlTypeName()) {
+    case TINYINT, SMALLINT, INTEGER, BIGINT: {
+      // when casting a floating point, its values are rounded towards 0
+      // i.e, 10.99 is rounded to 10, and -10.99 is rounded to -10
+      // to take this into account, the predicate range is transformed in the following ways
+      // [10.0, 15.0] -> [10, 15.99999]
+      // (10.0, 15.0) -> [11, 14.99999]
+      // [10.2, 15.2] -> [11, 15.99999]
+      // (10.2, 15.2) -> [11, 15.99999]
+
+      // [-15.0, -10.0] -> [-15.9999, -10]
+      // (-15.0, -10.0) -> [-14.9999, -11]
+      // [-15.2, -10.2] -> [-15.9999, -11]
+      // (-15.2, -10.2) -> [-15.9999, -11]
+
+      // normalize the range to make the formulas easier
+      Range<Float> range = convertRangeToClosedOpen(predicateRange);
+      Range<Float> typeClosedOpen = convertRangeToClosedOpen(typeRange);
+      float rangeLower = (range.lowerEndpoint() >= 0 ? (float) Math.ceil(range.lowerEndpoint())
+          : Math.nextUp(-(float) Math.ceil(Math.nextUp(-range.lowerEndpoint()))));
+      float rangeUpper = range.upperEndpoint() >= 0 ? Math.nextDown((float) Math.ceil(range.upperEndpoint()))
+          : Math.nextUp((float) -Math.ceil(-range.upperEndpoint()));
+      float lower = Math.max(typeClosedOpen.lowerEndpoint(), rangeLower);
+      float upper = Math.min(typeClosedOpen.upperEndpoint(), rangeUpper);
+      return makeRange(lower, BoundType.CLOSED, upper, BoundType.OPEN);
+    }
+    case DECIMAL: {
+      float adjust = (float) (5 * Math.pow(10, -(type.getScale() + 1)));
+      // the resulting value of +- adjust would be rounded up, so in some cases we need to use Math.nextDown
+      float adjusted1 = lowerInclusive ? predicateRange.lowerEndpoint() - adjust
+          : Math.nextDown(predicateRange.lowerEndpoint() + adjust);
+      float adjusted2 = upperInclusive ? Math.nextDown(predicateRange.upperEndpoint() + adjust)
+          : predicateRange.upperEndpoint() - adjust;
+      float lower = Math.max(adjusted1, typeRange.lowerEndpoint());
+      float upper = Math.min(adjusted2, typeRange.upperEndpoint());
+      // the boundaries might result in an invalid range (e.g., left > right)
+      // in that case the predicate does not select anything, and we return an 
empty range
+      return makeRange(lower, predicateRange.lowerBoundType(), upper, 
predicateRange.upperBoundType());
+    }
+    case TIMESTAMP, DATE:
+      return predicateRange;
+    default:
+      return typeRange.intersection(predicateRange);

Review Comment:
   Is it ok to skip the check to verify that ranges are connected?



##########
ql/src/test/org/apache/hadoop/hive/ql/optimizer/calcite/stats/TestFilterSelectivityEstimator.java:
##########
@@ -726,7 +761,85 @@ public void testRangePredicateOnDateWithCast() {
   }
 
   @Test
-  public void testBetweenWithCastDecimal2s1() {
+  public void testBetweenWithCastToTinyIntCheckRounding() {

Review Comment:
   If the comment about the validity of some tests holds, then this test potentially becomes irrelevant or mergeable with `testBetweenWithCastToTinyInt`.



##########
ql/src/test/org/apache/hadoop/hive/ql/optimizer/calcite/stats/TestFilterSelectivityEstimator.java:
##########
@@ -726,7 +761,85 @@ public void testRangePredicateOnDateWithCast() {
   }
 
   @Test
-  public void testBetweenWithCastDecimal2s1() {
+  public void testBetweenWithCastToTinyIntCheckRounding() {
+    useFieldWithValues("f_numeric", VALUES3, KLL3);
+    float total = VALUES3.length;
+    float universe = 10; // the number of values that "survive" the cast
+    RexNode cast = cast("f_numeric", TINYINT);
+    // check rounding of positive numbers
+    checkBetweenSelectivity(3, universe, total, cast, 0, 10);
+    checkBetweenSelectivity(3, universe, total, cast, 0, 10.9f);
+    checkBetweenSelectivity(4, universe, total, cast, 0, 11);
+    checkBetweenSelectivity(4, universe, total, cast, 10, 20);
+    checkBetweenSelectivity(1, universe, total, cast, 10.9999f, 20);

Review Comment:
   Invalid?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

