zabetak commented on code in PR #6418:
URL: https://github.com/apache/hive/pull/6418#discussion_r3322968964
##########
ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java:
##########
@@ -823,9 +823,11 @@ public static ColStatistics
getColStatistics(ColumnStatisticsObj cso, String col
if (numTrues == 0 && numFalses == 0) {
// All NULL column - no non-null distinct values
cs.setCountDistint(0);
+ cs.setConst(true);
} else if (numTrues == 0 || numFalses == 0) {
// One value type confirmed absent (=0), other is present (>0) or
unknown (<0)
cs.setCountDistint(1);
Review Comment:
Let's assume that `numTrues = 0` and `numFalses = -1` then we will do
`setCountDistinct(1)` which I guess is not correct or at least not consistent
with what happens above where we set it to zero. I guess we have to rewrite the
`if` statement which might also simplify the boolean condition in `cs.setConst`.
##########
ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java:
##########
@@ -823,9 +823,11 @@ public static ColStatistics
getColStatistics(ColumnStatisticsObj cso, String col
if (numTrues == 0 && numFalses == 0) {
// All NULL column - no non-null distinct values
cs.setCountDistint(0);
+ cs.setConst(true);
} else if (numTrues == 0 || numFalses == 0) {
// One value type confirmed absent (=0), other is present (>0) or
unknown (<0)
cs.setCountDistint(1);
+ cs.setConst(csd.getBooleanStats().getNumNulls() == 0 && (numTrues > 0
|| numFalses > 0));
} else {
// Both != 0: either both present (>0), both unknown (<0), or one
present + one unknown
Review Comment:
If both are unknown then why countDistinct is set to 2? Possibly
out-of-scope for the current PR but looks like a bug although it may never
happen in practice. I don't see any prod code setting trues/false to negative
values. The comment in conjunction with the code is a bit confusing.
##########
ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java:
##########
@@ -823,9 +823,11 @@ public static ColStatistics
getColStatistics(ColumnStatisticsObj cso, String col
if (numTrues == 0 && numFalses == 0) {
// All NULL column - no non-null distinct values
cs.setCountDistint(0);
+ cs.setConst(true);
Review Comment:
Not sure if this is correct. If the table is empty then naturally we will go
into this if statement and mark the column as constant. I am a bit skeptical if
we we should use the `setConst` method inside this method.
##########
ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java:
##########
@@ -2112,7 +2115,7 @@ private static List<Long>
extractNDVGroupingColumns(List<ColStatistics> colStats
for (ColStatistics cs : colStats) {
if (cs != null) {
long ndv = cs.getCountDistint();
- if (cs.getNumNulls() > 0) {
+ if (cs.getNumNulls() > 0 && (ndv > 0 || cs.isConst())) {
ndv = StatsUtils.safeAdd(ndv, 1);
}
Review Comment:
#### Part A
Setting NDV to 1 when we have constants seems like a reasonable and safe
improvement in the accuracy of estimations. I would be OK to approve the PR
with just this change.
```java
if (cs.isConst()) {
ndv = 1;
}
```
#### Part B
The benefit from from guarding the +1 increment with `ndv > 0` is more
subtle.
```java
if (ndv > 0 && cs.getNumNulls() > 0) {
ndv = StatsUtils.safeAdd(ndv, 1);
}
```
First of all, if we have stats and if numNulls is greater than zero then
very likely NDV would also be greater than zero.
Secondly, NDV == 0 and numNulls > 0 can be a valid use-case and it suffices
to modify your previous example just a bit to demonstrate how the new code can
lead to underestimation.
|Column|NDV|numNulls|
|--|--|--|
|product_id|100,000|0|
|funnel_step|0| 480,000,000|
|device_type|0|450,000,000|
##########
ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java:
##########
@@ -2112,7 +2115,7 @@ private static List<Long>
extractNDVGroupingColumns(List<ColStatistics> colStats
for (ColStatistics cs : colStats) {
if (cs != null) {
long ndv = cs.getCountDistint();
- if (cs.getNumNulls() > 0) {
+ if (cs.getNumNulls() > 0 && (ndv > 0 || cs.isConst())) {
ndv = StatsUtils.safeAdd(ndv, 1);
}
Review Comment:
So for the Part B, I think we can discuss a bit more about the "Sources of
the (NDV=0, numNulls>0)" under HIVE-29556 and depending on the frequency and
gravity of the underestimation we can decide how to proceed.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]