[
https://issues.apache.org/jira/browse/HIVE-26221?focusedWorklogId=832102&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-832102
]
ASF GitHub Bot logged work on HIVE-26221:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 08/Dec/22 14:57
Start Date: 08/Dec/22 14:57
Worklog Time Spent: 10m
Work Description: asolimando commented on code in PR #3137:
URL: https://github.com/apache/hive/pull/3137#discussion_r1043447078
##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java:
##########
@@ -834,6 +844,36 @@ private long evaluateBetweenExpr(Statistics stats, ExprNodeDesc pred, long currN
       return currNumRows;
     }
+    try {
+      if (comparisonExpression instanceof ExprNodeColumnDesc) {
+        final ExprNodeColumnDesc columnDesc = (ExprNodeColumnDesc) comparisonExpression;
+        ColStatistics cs = stats.getColumnStatisticsFromColName(columnDesc.getColumn());
+        if (FilterSelectivityEstimator.isHistogramAvailable(cs)) {
+          final KllFloatsSketch kll = KllFloatsSketch.heapify(Memory.wrap(cs.getHistogram()));
+          final String colTypeLowerCase = columnDesc.getTypeString().toLowerCase();
+          final String leftValueString = leftExpression instanceof ExprNodeConstantDesc
+              ? ((ExprNodeConstantDesc) leftExpression).getValue().toString() : leftExpression.getExprString();
+          final String rightValueString = rightExpression instanceof ExprNodeConstantDesc
+              ? ((ExprNodeConstantDesc) rightExpression).getValue().toString() : rightExpression.getExprString();
+          final float leftValue = extractFloatFromLiteralValue(colTypeLowerCase, leftValueString);
+          final float rightValue = extractFloatFromLiteralValue(colTypeLowerCase, rightValueString);
+          if (invert) {
+            // column < leftValue OR column > rightValue
+            if (rightValue < leftValue) {
+              return kll.getN();
+            }
+            return Math.round(kll.getN() * (lessThanSelectivity(kll, leftValue) + greaterThanSelectivity(kll, rightValue)));
+          }
+          // if they are equal we can't handle it here, it becomes an equality predicate
+          if (Float.compare(leftValue, rightValue) != 0) {
+            return Math.round(kll.getN() * FilterSelectivityEstimator.betweenSelectivity(kll, leftValue, rightValue));
+          }
+        }
+      }
+    } catch(IllegalArgumentException e) {
Review Comment:
In theory, `Timestamp.valueOf()` and `Date.valueOf()` only throw
`IllegalArgumentException`, while the parsers for all the numeric data types
throw `NumberFormatException`, which extends `IllegalArgumentException`.
However, `Float.parseFloat()`, `Double.parseDouble()`, `new BigDecimal()`,
`Timestamp.valueOf()` and `Date.valueOf()` throw a `NullPointerException` when
the input string is `null`. We can check for that and throw an
`IllegalArgumentException` instead.
Since we can always fall back to the standard computation as a safety net, I
have nothing against catching a more general exception, but in that case I
think `catch (RuntimeException e) {...}` is the better choice, so we don't
catch and ignore things like `InterruptedException`, which would be pretty bad.
Anyway, if they are not in the method signature, they must inherit from
`RuntimeException`, so I think it's safe.
I will update the unit tests to cover this case too.
EDIT: in the end I think it's better to catch `RuntimeException`, which covers
both `IllegalArgumentException` and `NullPointerException`. I have added some
debug logs saying what's happening and updated the unit tests to reflect the change.
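To make the exception reasoning above concrete, here is a minimal, self-contained sketch. It is not the code from the PR: `SelectivityFallbackSketch`, `parseLiteral` and `estimateRows` are hypothetical names, and the "estimate" is a toy that assumes values uniformly spread over [0, 100]. It only illustrates that `NumberFormatException` is an `IllegalArgumentException`, that a `null` literal can be turned into one explicitly, and that `catch (RuntimeException e)` covers both cases while leaving checked exceptions such as `InterruptedException` alone.

```java
import java.math.BigDecimal;

public class SelectivityFallbackSketch {

  /** Hypothetical stand-in for a literal-parsing helper; not the actual Hive method. */
  static float parseLiteral(String typeLowerCase, String value) {
    if (value == null) {
      // The JDK numeric parsers below would throw NullPointerException on null,
      // so the null case is mapped to IllegalArgumentException explicitly.
      throw new IllegalArgumentException("null literal for type " + typeLowerCase);
    }
    switch (typeLowerCase) {
      case "int":
      case "bigint":
        return Long.parseLong(value); // NumberFormatException on malformed input
      case "float":
      case "double":
        return Float.parseFloat(value); // NumberFormatException on malformed input
      case "decimal":
        return new BigDecimal(value).floatValue(); // NumberFormatException on malformed input
      default:
        throw new IllegalArgumentException("unsupported type: " + typeLowerCase);
    }
  }

  /**
   * Toy estimate for "col < literal", assuming values uniform in [0, 100].
   * Any unchecked exception falls back to the caller-supplied default.
   */
  static long estimateRows(String type, String literal, long totalRows, long fallbackRows) {
    try {
      float v = parseLiteral(type, literal);
      float selectivity = Math.max(0f, Math.min(1f, v / 100f));
      return Math.round((double) totalRows * selectivity);
    } catch (RuntimeException e) {
      // Covers IllegalArgumentException (incl. NumberFormatException) and
      // NullPointerException; checked exceptions are not caught here.
      return fallbackRows;
    }
  }

  public static void main(String[] args) {
    System.out.println(estimateRows("int", "50", 1000, 333));
    System.out.println(estimateRows("int", "abc", 1000, 333));
    System.out.println(estimateRows("int", null, 1000, 333));
  }
}
```

Catching `RuntimeException` here is a deliberate trade-off: the fallback value is always available, so a failed parse should degrade the estimate rather than fail the whole computation.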
Issue Time Tracking
-------------------
Worklog Id: (was: 832102)
Time Spent: 7h 10m (was: 7h)
> Add histogram-based column statistics
> -------------------------------------
>
> Key: HIVE-26221
> URL: https://issues.apache.org/jira/browse/HIVE-26221
> Project: Hive
> Issue Type: Improvement
> Components: CBO, Metastore, Statistics
> Affects Versions: 4.0.0-alpha-2
> Reporter: Alessandro Solimando
> Assignee: Alessandro Solimando
> Priority: Major
> Labels: pull-request-available
> Time Spent: 7h 10m
> Remaining Estimate: 0h
>
> Hive does not support histogram statistics, which are particularly useful for
> skewed data (which is very common in practice) and range predicates.
> Hive's current selectivity estimation for range predicates is based on a
> hard-coded value of 1/3 (see
> [FilterSelectivityEstimator.java#L138-L144|https://github.com/apache/hive/blob/56c336268ea8c281d23c22d89271af37cb7e2572/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java#L138-L144]).
> The current proposal aims at integrating histograms as an additional column
> statistic, stored in the Hive metastore at the table (or partition) level.
> The main requirements for histogram integration are the following:
> * efficiency: the approach must scale and support billions of rows
> * merge-ability: partition-level histograms have to be merged to form
> table-level histograms
> * explicit and configurable trade-off between memory footprint and accuracy
> Hive already integrates a [KLL data
> sketches|https://datasketches.apache.org/docs/KLL/KLLSketch.html] UDAF.
> Datasketches are small, stateful programs that process massive data-streams
> and can provide approximate answers, with mathematical guarantees, to
> computationally difficult queries orders-of-magnitude faster than
> traditional, exact methods.
> We propose to use KLL, and more specifically the cumulative distribution
> function (CDF), as the underlying data structure for our histogram statistics.
> The current proposal targets numeric data types (float, integer and numeric
> families) and temporal data types (date and timestamp).
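As a rough illustration of how a CDF yields range selectivities, the sketch below uses an exact empirical CDF over a small sorted array in place of a KLL sketch. It is an assumption-laden stand-in, not the KLL API or Hive's implementation: `CdfSelectivitySketch` and its methods are hypothetical names, and a real sketch would return approximate ranks with error bounds rather than exact counts.

```java
import java.util.Arrays;

public class CdfSelectivitySketch {
  private final float[] sorted;

  CdfSelectivitySketch(float[] values) {
    if (values.length == 0) {
      throw new IllegalArgumentException("at least one value is required");
    }
    this.sorted = values.clone();
    Arrays.sort(this.sorted);
  }

  /** Number of values strictly less than v (binary search for the lower bound). */
  private int countLess(float v) {
    int lo = 0, hi = sorted.length;
    while (lo < hi) {
      int mid = (lo + hi) >>> 1;
      if (sorted[mid] < v) lo = mid + 1; else hi = mid;
    }
    return lo;
  }

  /** Number of values less than or equal to v (binary search for the upper bound). */
  private int countLessOrEqual(float v) {
    int lo = 0, hi = sorted.length;
    while (lo < hi) {
      int mid = (lo + hi) >>> 1;
      if (sorted[mid] <= v) lo = mid + 1; else hi = mid;
    }
    return lo;
  }

  /** Selectivity of "col < v". */
  double lessThan(float v) {
    return (double) countLess(v) / sorted.length;
  }

  /** Selectivity of "col > v". */
  double greaterThan(float v) {
    return 1.0 - (double) countLessOrEqual(v) / sorted.length;
  }

  /** Selectivity of "col BETWEEN lo AND hi" (both bounds inclusive). */
  double between(float lo, float hi) {
    if (hi < lo) return 0.0; // empty range
    return (double) (countLessOrEqual(hi) - countLess(lo)) / sorted.length;
  }

  /** Selectivity of "col NOT BETWEEN lo AND hi", i.e. col < lo OR col > hi. */
  double notBetween(float lo, float hi) {
    if (hi < lo) return 1.0; // the excluded range is empty, so every row qualifies
    return lessThan(lo) + greaterThan(hi);
  }

  public static void main(String[] args) {
    // Skewed data: most values are 1, with a long tail.
    CdfSelectivitySketch cdf =
        new CdfSelectivitySketch(new float[] {1, 1, 1, 1, 1, 1, 1, 1, 50, 100});
    System.out.println(cdf.between(40f, 100f));
    System.out.println(cdf.notBetween(40f, 100f));
  }
}
```

On this skewed sample, `col BETWEEN 40 AND 100` selects 2 of the 10 values, whereas a fixed 1/3 guess would be the same no matter how the data is distributed.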
--
This message was sent by Atlassian Jira
(v8.20.10#820010)