asolimando commented on code in PR #3137:
URL: https://github.com/apache/hive/pull/3137#discussion_r1043383390
##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java:
##########
@@ -167,6 +178,109 @@ public Double visitCall(RexCall call) {
return selectivity;
}
+  private double computeRangePredicateSelectivity(RexCall call, SqlKind op) {
+    final boolean isLiteralLeft = call.getOperands().get(0).getKind().equals(SqlKind.LITERAL);
+    final boolean isLiteralRight = call.getOperands().get(1).getKind().equals(SqlKind.LITERAL);
+    final boolean isInputRefLeft = call.getOperands().get(0).getKind().equals(SqlKind.INPUT_REF);
+    final boolean isInputRefRight = call.getOperands().get(1).getKind().equals(SqlKind.INPUT_REF);
+
+    if (childRel instanceof HiveTableScan && isLiteralLeft != isLiteralRight && isInputRefLeft != isInputRefRight) {
+      final HiveTableScan t = (HiveTableScan) childRel;
+      final int inputRefIndex = ((RexInputRef) call.getOperands().get(isInputRefLeft ? 0 : 1)).getIndex();
+      final List<ColStatistics> colStats = t.getColStat(Collections.singletonList(inputRefIndex));
+
+      if (!colStats.isEmpty() && isHistogramAvailable(colStats.get(0))) {
+        final KllFloatsSketch kll = KllFloatsSketch.heapify(Memory.wrap(colStats.get(0).getHistogram()));
+        final Object boundValueObject = ((RexLiteral) call.getOperands().get(isLiteralLeft ? 0 : 1)).getValue();
+        final SqlTypeName typeName = call.getOperands().get(isInputRefLeft ? 0 : 1).getType().getSqlTypeName();
+        float value = extractLiteral(typeName, boundValueObject);
+        boolean closedBound = op.equals(SqlKind.LESS_THAN_OR_EQUAL) || op.equals(SqlKind.GREATER_THAN_OR_EQUAL);
+
+        double selectivity;
+        if (op.equals(SqlKind.LESS_THAN_OR_EQUAL) || op.equals(SqlKind.LESS_THAN)) {
+          selectivity = closedBound ? lessThanOrEqualSelectivity(kll, value) : lessThanSelectivity(kll, value);
+        } else {
+          selectivity = closedBound ? greaterThanOrEqualSelectivity(kll, value) : greaterThanSelectivity(kll, value);
+        }
+
+        // selectivity does not account for null values: we multiply by the number of non-null values (getN) and
+        // divide by the total (non-null + null values) to get the overall selectivity
+        return kll.getN() * selectivity / t.getTable().getRowCount();
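The null adjustment in the `return` statement above amounts to rescaling the sketch-based selectivity, which only covers non-null values, by the non-null fraction of the table. A minimal stand-alone sketch of that arithmetic, with a hypothetical class name and an exact scan over the values standing in for the KLL rank query (not the actual Hive or DataSketches code):

```java
// Hypothetical illustration of the null adjustment in the patch above.
// An exact scan over the non-null values stands in for the KLL sketch's
// rank estimate; the result is then rescaled by nonNullCount / totalRowCount,
// mirroring kll.getN() * selectivity / rowCount.
public class RangeSelectivitySketch {

    /** Fraction of non-null values strictly below the bound (stand-in for lessThanSelectivity). */
    static double lessThanSelectivity(float[] nonNullValues, float bound) {
        long hits = 0;
        for (float v : nonNullValues) {
            if (v < bound) {
                hits++;
            }
        }
        return (double) hits / nonNullValues.length;
    }

    /** Overall selectivity over the whole table, nulls included: a null row never satisfies col &lt; bound. */
    static double overallSelectivity(float[] nonNullValues, long totalRowCount, float bound) {
        // nonNullValues.length plays the role of kll.getN()
        return nonNullValues.length * lessThanSelectivity(nonNullValues, bound) / totalRowCount;
    }

    public static void main(String[] args) {
        // 8 non-null values plus 2 nulls -> 10 rows in total
        float[] prices = {1f, 2f, 3f, 4f, 5f, 6f, 7f, 8f};
        long totalRows = 10;

        // Among non-null values, 4 of 8 are below 5 (selectivity 0.5);
        // over the whole table that is 8 * 0.5 / 10 = 0.4
        System.out.println(overallSelectivity(prices, totalRows, 5f));  // 0.4
    }
}
```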
Review Comment:
You mean a filter `col > null`? In that case, we bail out well before reaching this code, since the predicate is identified as always `false`:
```
DEBUG : Shutting down query EXPLAIN CBO COST SELECT * FROM sample t1 JOIN sample t2 ON (t1.id = t2.id) WHERE t1.price2 < null
+----------------------------------------------------+
|                      Explain                       |
+----------------------------------------------------+
| CBO PLAN:                                          |
| HiveValues(tuples=[[]]): rowcount = 1.0, cumulative cost = {1.0 rows, 1.0 cpu, 0.0 io}, id = 243 |
|                                                    |
+----------------------------------------------------+
```
If you are thinking of a specific row (say row 10) having `col = null`, then that's a runtime matter: the selectivity of the filter is computed "globally" for the column and the given predicate. It only says that, given the stats we have, some fraction of the rows, say 0.5 (50%), will be filtered out.
Selectivity is used to estimate how many rows will survive the filtering step, i.e., to derive the `rowCount`; it is not a runtime concept.
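To make the "not a runtime concept" point concrete, here is a hypothetical sketch of the planner-side arithmetic (class and method names are made up for illustration): the estimated output cardinality is simply selectivity times the input row count, with no per-row null checks involved.

```java
// Hypothetical planner-side arithmetic: selectivity scales the child's
// estimated row count. No individual row (null or not) is ever inspected here.
public class FilterRowCountEstimate {

    static double estimateFilteredRowCount(double inputRowCount, double selectivity) {
        return selectivity * inputRowCount;
    }

    public static void main(String[] args) {
        // If the stats say 50% of rows pass the range predicate, a 1000-row
        // scan is estimated to emit 500 rows, whatever the actual data holds.
        System.out.println(estimateFilteredRowCount(1000.0, 0.5));  // 500.0
    }
}
```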
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]