asolimando commented on code in PR #3137:
URL: https://github.com/apache/hive/pull/3137#discussion_r1043383390
##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java:
##########
@@ -167,6 +178,109 @@ public Double visitCall(RexCall call) {
return selectivity;
}
+  private double computeRangePredicateSelectivity(RexCall call, SqlKind op) {
+    final boolean isLiteralLeft = call.getOperands().get(0).getKind().equals(SqlKind.LITERAL);
+    final boolean isLiteralRight = call.getOperands().get(1).getKind().equals(SqlKind.LITERAL);
+    final boolean isInputRefLeft = call.getOperands().get(0).getKind().equals(SqlKind.INPUT_REF);
+    final boolean isInputRefRight = call.getOperands().get(1).getKind().equals(SqlKind.INPUT_REF);
+
+    if (childRel instanceof HiveTableScan && isLiteralLeft != isLiteralRight && isInputRefLeft != isInputRefRight) {
+      final HiveTableScan t = (HiveTableScan) childRel;
+      final int inputRefIndex = ((RexInputRef) call.getOperands().get(isInputRefLeft ? 0 : 1)).getIndex();
+      final List<ColStatistics> colStats = t.getColStat(Collections.singletonList(inputRefIndex));
+
+      if (!colStats.isEmpty() && isHistogramAvailable(colStats.get(0))) {
+        final KllFloatsSketch kll = KllFloatsSketch.heapify(Memory.wrap(colStats.get(0).getHistogram()));
+        final Object boundValueObject = ((RexLiteral) call.getOperands().get(isLiteralLeft ? 0 : 1)).getValue();
+        final SqlTypeName typeName = call.getOperands().get(isInputRefLeft ? 0 : 1).getType().getSqlTypeName();
+        float value = extractLiteral(typeName, boundValueObject);
+        boolean closedBound = op.equals(SqlKind.LESS_THAN_OR_EQUAL) || op.equals(SqlKind.GREATER_THAN_OR_EQUAL);
+
+        double selectivity;
+        if (op.equals(SqlKind.LESS_THAN_OR_EQUAL) || op.equals(SqlKind.LESS_THAN)) {
+          selectivity = closedBound ? lessThanOrEqualSelectivity(kll, value) : lessThanSelectivity(kll, value);
+        } else {
+          selectivity = closedBound ? greaterThanOrEqualSelectivity(kll, value) : greaterThanSelectivity(kll, value);
+        }
+
+        // selectivity does not account for null values: we multiply by the number of non-null values (getN) and
+        // divide by the total (non-null + null values) to get the overall selectivity
+        return kll.getN() * selectivity / t.getTable().getRowCount();
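The null adjustment in the `return` statement above amounts to rescaling the sketch-based selectivity, which only covers non-null values, by the non-null fraction of the table. A minimal stand-alone sketch of that arithmetic, with a hypothetical class name and an exact scan over the values standing in for the KLL rank query (not the actual Hive or DataSketches code):

```java
// Hypothetical illustration of the null adjustment in the patch above.
// An exact scan over the non-null values stands in for the KLL sketch's
// rank estimate; the result is then rescaled by nonNullCount / totalRowCount,
// mirroring kll.getN() * selectivity / rowCount.
public class RangeSelectivitySketch {

    /** Fraction of non-null values strictly below the bound (stand-in for lessThanSelectivity). */
    static double lessThanSelectivity(float[] nonNullValues, float bound) {
        long hits = 0;
        for (float v : nonNullValues) {
            if (v < bound) {
                hits++;
            }
        }
        return (double) hits / nonNullValues.length;
    }

    /** Overall selectivity over the whole table, nulls included: a null row never satisfies col &lt; bound. */
    static double overallSelectivity(float[] nonNullValues, long totalRowCount, float bound) {
        // nonNullValues.length plays the role of kll.getN()
        return nonNullValues.length * lessThanSelectivity(nonNullValues, bound) / totalRowCount;
    }

    public static void main(String[] args) {
        // 8 non-null values plus 2 nulls -> 10 rows in total
        float[] prices = {1f, 2f, 3f, 4f, 5f, 6f, 7f, 8f};
        long totalRows = 10;

        // Among non-null values, 4 of 8 are below 5 (selectivity 0.5);
        // over the whole table that is 8 * 0.5 / 10 = 0.4
        System.out.println(overallSelectivity(prices, totalRows, 5f));  // 0.4
    }
}
```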
Review Comment:
You mean a filter `col > null`? In that case, we bail out well before reaching this code, since the predicate is identified as always `false`:
```
DEBUG : Shutting down query EXPLAIN CBO COST SELECT * FROM sample t1 JOIN sample t2 ON (t1.id = t2.id) WHERE t1.price2 < null
+----------------------------------------------------+
|                      Explain                       |
+----------------------------------------------------+
| CBO PLAN:                                          |
| HiveValues(tuples=[[]]): rowcount = 1.0, cumulative cost = {1.0 rows, 1.0 cpu, 0.0 io}, id = 243 |
|                                                    |
+----------------------------------------------------+
```
If you are thinking of a specific row (say row 10) having `col = null`, then that's a runtime matter: the selectivity of the filter is computed "globally" for the column and the given predicate. It only says that, given the stats we have, some fraction of the rows, say 0.5 (50%), will be filtered out.
Selectivity is used to estimate how many rows will survive the filtering step, i.e., to derive the `rowCount`; it is not a runtime concept.
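To make the "not a runtime concept" point concrete, here is a hypothetical sketch of the planner-side arithmetic (class and method names are made up for illustration): the estimated output cardinality is simply selectivity times the input row count, with no per-row null checks involved.

```java
// Hypothetical planner-side arithmetic: selectivity scales the child's
// estimated row count. No individual row (null or not) is ever inspected here.
public class FilterRowCountEstimate {

    static double estimateFilteredRowCount(double inputRowCount, double selectivity) {
        return selectivity * inputRowCount;
    }

    public static void main(String[] args) {
        // If the stats say 50% of rows pass the range predicate, a 1000-row
        // scan is estimated to emit 500 rows, whatever the actual data holds.
        System.out.println(estimateFilteredRowCount(1000.0, 0.5));  // 500.0
    }
}
```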
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]