zabetak commented on code in PR #6359:
URL: https://github.com/apache/hive/pull/6359#discussion_r2994697977
##########
ql/src/java/org/apache/hadoop/hive/ql/stats/estimator/PessimisticStatCombiner.java:
##########
@@ -41,9 +42,14 @@ public void add(ColStatistics stat) {
if (stat.getAvgColLen() > result.getAvgColLen()) {
result.setAvgColLen(stat.getAvgColLen());
}
- if (stat.getCountDistint() > result.getCountDistint()) {
- result.setCountDistint(stat.getCountDistint());
+ // NDV=0 is "unknown" only if the stat is NOT a constant.
+ // Constants with NDV=0 (e.g., NULL) are "known zero", not unknown.
+ if ((result.getCountDistint() == 0 && !result.isConst()) ||
(stat.getCountDistint() == 0 && !stat.isConst())) {
+ result.setCountDistint(0);
Review Comment:
I checked a bit the code and it appears that nulls should not add up to NDV.
For example:
```sql
SELECT CASE WHEN cond=1 THEN NULL WHEN cond=2 THEN 'A' ELSE 'B' END x FROM t
```
NDV(x) = 2
NUM_NULLS(x) = |t|
so it might be better to keep things as are and don't modify
`buildColStatForConstant`.
In fact, everything would be simpler if we could use -1 for NDV to declare
unknown as it happens for the other stats. This is probably a bigger change to
digest so let's not go into this direction for now.
I understand the risk that summing NDV with zero can lead to bad estimations
but it is still an improvement against the current state of the code (at least
it does not worsen the situation). We could also be a bit more defensive and
say that if numNulls >= 0 then NDV is valid and safe to use (not unknown).
Still if you consider this to be very risky then I am OK to revert to your
original proposal of adding a new boolean field in `ColStatistics` object for
marking constants. Leaving the final choice to you :)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]