[GitHub] [hive] asolimando commented on a diff in pull request #3137: HIVE-26221: Add histogram-based column statistics

GitBox Fri, 09 Dec 2022 00:25:54 -0800


asolimando commented on code in PR #3137:
URL: https://github.com/apache/hive/pull/3137#discussion_r1044198784



##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java:
##########
@@ -167,6 +178,109 @@ public Double visitCall(RexCall call) {
     return selectivity;
   }
 
+  private double computeRangePredicateSelectivity(RexCall call, SqlKind op) {
+    final boolean isLiteralLeft = call.getOperands().get(0).getKind().equals(SqlKind.LITERAL);
+    final boolean isLiteralRight = call.getOperands().get(1).getKind().equals(SqlKind.LITERAL);
+    final boolean isInputRefLeft = call.getOperands().get(0).getKind().equals(SqlKind.INPUT_REF);
+    final boolean isInputRefRight = call.getOperands().get(1).getKind().equals(SqlKind.INPUT_REF);
+
+    if (childRel instanceof HiveTableScan && isLiteralLeft != isLiteralRight && isInputRefLeft != isInputRefRight) {
+      final HiveTableScan t = (HiveTableScan) childRel;
+      final int inputRefIndex = ((RexInputRef) call.getOperands().get(isInputRefLeft ? 0 : 1)).getIndex();
+      final List<ColStatistics> colStats = t.getColStat(Collections.singletonList(inputRefIndex));
+
+      if (!colStats.isEmpty() && isHistogramAvailable(colStats.get(0))) {
+        final KllFloatsSketch kll = KllFloatsSketch.heapify(Memory.wrap(colStats.get(0).getHistogram()));
+        final Object boundValueObject = ((RexLiteral) call.getOperands().get(isLiteralLeft ? 0 : 1)).getValue();
+        final SqlTypeName typeName = call.getOperands().get(isInputRefLeft ? 0 : 1).getType().getSqlTypeName();
+        float value = extractLiteral(typeName, boundValueObject);
+        boolean closedBound = op.equals(SqlKind.LESS_THAN_OR_EQUAL) || op.equals(SqlKind.GREATER_THAN_OR_EQUAL);
+
+        double selectivity;
+        if (op.equals(SqlKind.LESS_THAN_OR_EQUAL) || op.equals(SqlKind.LESS_THAN)) {
+          selectivity = closedBound ? lessThanOrEqualSelectivity(kll, value) : lessThanSelectivity(kll, value);
+        } else {
+          selectivity = closedBound ? greaterThanOrEqualSelectivity(kll, value) : greaterThanSelectivity(kll, value);
+        }
+
+        // selectivity does not account for null values, we multiply for the number of non-null values (getN) and we
+        // divide by the total (non-null + null values) to get the overall selectivity
+        return kll.getN() * selectivity / t.getTable().getRowCount();

Review Comment:
   Sorry, I wasn't clear; let me restate it so that we are sure we are on the same page.
   
   KLL ignores `null`s (that is, `kll.update(null)` is a no-op), so `kll.getN()` in your case would be `3`, and the selectivity would be `1/3`.
   
   In Hive, however, we need to account for nulls, so we compute `3 * 1/3` to get the number of rows that remain after filtering (only `1` in this case), then we divide by the total number of rows, nulls included, which is `5` (`t.getTable().getRowCount()`), so that the result is correctly `1/5`.
   
   If you want, I can try to improve the comment as well; I understand it's a bit tricky.
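   
   To make the arithmetic concrete, here is a minimal standalone sketch of the null adjustment. It is not the code in the PR: the class `NullAwareSelectivity`, the method `overallSelectivity`, and its parameter names are hypothetical, and the guard for an empty table is an assumption of this sketch. The inputs stand in for `kll.getN()`, the sketch-based selectivity, and `t.getTable().getRowCount()`.

```java
public final class NullAwareSelectivity {

  private NullAwareSelectivity() {
  }

  /**
   * Rescales a selectivity computed over non-null values only so that it
   * applies to all rows, nulls included.
   *
   * @param nonNullCount       number of non-null values (stand-in for kll.getN())
   * @param nonNullSelectivity fraction of non-null values matching the predicate
   * @param totalRowCount      total rows, nulls included (stand-in for getRowCount())
   */
  public static double overallSelectivity(long nonNullCount, double nonNullSelectivity, double totalRowCount) {
    if (totalRowCount <= 0) {
      // Assumption of this sketch: an empty table yields selectivity 0.
      return 0.0;
    }
    // Rows surviving the filter: nulls never satisfy a range predicate.
    final double matchingRows = nonNullCount * nonNullSelectivity;
    return matchingRows / totalRowCount;
  }

  public static void main(String[] args) {
    // Example from the comment: 5 rows, 3 of them non-null, 1 of those matching.
    System.out.println(overallSelectivity(3, 1.0 / 3.0, 5));
  }
}
```

   With the numbers above, the intermediate `matchingRows` is `1` and the final result is `1/5`, while using the sketch selectivity on its own would overestimate it as `1/3`.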



