okumin commented on code in PR #6244:
URL: https://github.com/apache/hive/pull/6244#discussion_r2767210302
##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java:
##########
@@ -2030,6 +2030,18 @@ public Object process(Node nd, Stack<Node> stack,
NodeProcessorCtx procCtx,
}
}
+ // NDV=0 means join key statistics are unavailable - fall back to
joinFactor heuristic
+ if (allSatisfyPreCondition) {
+ for (int pos = 0; pos < parents.size(); pos++) {
+ ReduceSinkOperator parent = (ReduceSinkOperator)
jop.getParentOperators().get(pos);
+ List<String> keyExprs =
StatsUtils.getQualifedReducerKeyNames(parent.getConf().getOutputKeyColumnNames());
+ if (!satisfyPrecondition(parent.getStatistics(), keyExprs)) {
+ allSatisfyPreCondition = false;
+ break;
+ }
+ }
+ }
Review Comment:
Note: This is probably ok but I want to check it again
##########
ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFWhen.java:
##########
@@ -144,6 +148,53 @@ static class WhenStatEstimator implements StatEstimator {
@Override
public Optional<ColStatistics> estimate(List<ColStatistics> argStats) {
+ return estimate(argStats, null);
+ }
+
+ @Override
+ public Optional<ColStatistics> estimate(List<ColStatistics> argStats,
List<ExprNodeDesc> argExprs) {
Review Comment:
I'm clarifying my understanding. Please let me know if I'm overlooking
something.
Let's assume the number of distinct values of `col_2` is 2, that of
`col_100` is 100, and that of `col_999` is 999.
The true derived NDV of the following expression is 3. The original
implementation returns 1, and this implementation returns 3.
```sql
CASE
WHEN category BETWEEN 0 AND 4 THEN 'CODE_00'
WHEN category BETWEEN 5 AND 9 THEN 'CODE_01'
ELSE 'CODE_ELSE'
END
```
That of this is 2. The original implementation returns 1, and this
implementation returns 2.
```sql
CASE
WHEN category BETWEEN 0 AND 4 THEN 'CODE_00'
WHEN category BETWEEN 5 AND 9 THEN 'CODE_01'
ELSE 'CODE_01'
END
```
That of this is 100, 101, or 102. The original implementation returns 100,
and this implementation returns 100.
```sql
CASE
WHEN category BETWEEN 0 AND 4 THEN 'CODE_00'
WHEN category BETWEEN 5 AND 9 THEN 'CODE_01'
ELSE col_100
END
```
That of this is 999 ~ 1100. The original implementation returns 999, and
this implementation returns 999.
```sql
CASE
WHEN category BETWEEN 0 AND 4 THEN 'CODE_00'
WHEN category BETWEEN 5 AND 9 THEN col_999
ELSE col_100
END
```
That of this is 6 ~ 8. The original implementation returns 2, and this
implementation returns 2.
```sql
CASE
WHEN category BETWEEN 0 AND 4 THEN 'CODE_00'
WHEN category BETWEEN 5 AND 9 THEN 'CODE_01'
WHEN category BETWEEN 10 AND 14 THEN 'CODE_02'
WHEN category BETWEEN 15 AND 19 THEN 'CODE_03'
WHEN category BETWEEN 20 AND 24 THEN 'CODE_04'
WHEN category BETWEEN 25 AND 29 THEN 'CODE_05'
ELSE col_2
END
```
I'd say the current patch does not introduce worse estimation in any case.
##########
ql/src/java/org/apache/hadoop/hive/ql/plan/Statistics.java:
##########
@@ -256,7 +256,12 @@ public void addToColumnStats(List<ColStatistics> colStats)
{
updatedCS = columnStats.get(key);
updatedCS.setAvgColLen(Math.max(updatedCS.getAvgColLen(),
cs.getAvgColLen()));
updatedCS.setNumNulls(StatsUtils.safeAdd(updatedCS.getNumNulls(),
cs.getNumNulls()));
- updatedCS.setCountDistint(Math.max(updatedCS.getCountDistint(),
cs.getCountDistint()));
+ if(updatedCS.getCountDistint() > 0 && cs.getCountDistint() > 0) {
Review Comment:
```suggestion
if (updatedCS.getCountDistint() > 0 && cs.getCountDistint() > 0)
{
```
##########
ql/src/java/org/apache/hadoop/hive/ql/stats/estimator/StatEstimator.java:
##########
@@ -40,4 +41,18 @@ public interface StatEstimator {
* @return {@link ColStatistics} estimate for the actual UDF.
*/
public Optional<ColStatistics> estimate(List<ColStatistics> argStats);
+
+ /**
+ * Computes the output statistics of the actual UDF with access to original
expressions.
+ *
+ * This method provides access to the original expression nodes, allowing
estimators to make
+ * more accurate estimates when expressions have special properties (e.g.,
all constants).
+ *
+ * @param argStats the statistics for every argument of the UDF
+ * @param argExprs the original expression nodes for every argument of the
UDF
+ * @return {@link ColStatistics} estimate for the actual UDF.
+ */
+ default Optional<ColStatistics> estimate(List<ColStatistics> argStats,
List<ExprNodeDesc> argExprs) {
+ return estimate(argStats);
+ }
Review Comment:
I guess we can satisfy the requirements for the current use case without
adding a new method. We may obtain the required information via
`GenericUDF#initialize` if it's been initialized. If not initialized, we will
probably need this method. This example materializes a constant at compile-time.
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFRound.java
##########
ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFWhen.java:
##########
Review Comment:
Can we simplify the implementation and handle a few more general cases? This
is an idea I'm not obsessed with.
```java
@Override
public ObjectInspector initialize(ObjectInspector[] arguments) throws
UDFArgumentTypeException {
...
numberOfDistinctConstants = ???
}
static class WhenStatEstimator implements StatEstimator {
@Override
public Optional<ColStatistics> estimate(List<ColStatistics> argStats) {
...
var statistics = combiner.getResult();
if (statistics.getCountDistint() > 0 && numberOfDistinctConstants >
statistics.getCountDistint()) {
statistics.setCountDistinct(numberOfDistinctConstants);
}
return statistics;
}
}
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]