zabetak commented on code in PR #6418:
URL: https://github.com/apache/hive/pull/6418#discussion_r3322968964


##########
ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java:
##########
@@ -823,9 +823,11 @@ public static ColStatistics 
getColStatistics(ColumnStatisticsObj cso, String col
       if (numTrues == 0 && numFalses == 0) {
         // All NULL column - no non-null distinct values
         cs.setCountDistint(0);
+        cs.setConst(true);
       } else if (numTrues == 0 || numFalses == 0) {
         // One value type confirmed absent (=0), other is present (>0) or 
unknown (<0)
         cs.setCountDistint(1);

Review Comment:
   Let's assume that `numTrues = 0` and `numFalses = -1` then we will do 
`setCountDistinct(1)` which I guess is not correct or at least not consistent 
with what happens above where we set it to zero. I guess we have to rewrite the 
`if` statement which might also simplify the boolean condition in `cs.setConst`.



##########
ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java:
##########
@@ -823,9 +823,11 @@ public static ColStatistics 
getColStatistics(ColumnStatisticsObj cso, String col
       if (numTrues == 0 && numFalses == 0) {
         // All NULL column - no non-null distinct values
         cs.setCountDistint(0);
+        cs.setConst(true);
       } else if (numTrues == 0 || numFalses == 0) {
         // One value type confirmed absent (=0), other is present (>0) or 
unknown (<0)
         cs.setCountDistint(1);
+        cs.setConst(csd.getBooleanStats().getNumNulls() == 0 && (numTrues > 0 
|| numFalses > 0));
       } else {
         // Both != 0: either both present (>0), both unknown (<0), or one 
present + one unknown

Review Comment:
   If both are unknown then why countDistinct is set to 2? Possibly 
out-of-scope for the current PR but looks like a bug although it may never 
happen in practice. I don't see any prod code setting trues/false to negative 
values. The comment in conjunction with the code is a bit confusing.



##########
ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java:
##########
@@ -823,9 +823,11 @@ public static ColStatistics 
getColStatistics(ColumnStatisticsObj cso, String col
       if (numTrues == 0 && numFalses == 0) {
         // All NULL column - no non-null distinct values
         cs.setCountDistint(0);
+        cs.setConst(true);

Review Comment:
   Not sure if this is correct. If the table is empty then naturally we will go 
into this if statement and mark the column as constant. I am a bit skeptical if 
we we should use the `setConst` method inside this method.



##########
ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java:
##########
@@ -2112,7 +2115,7 @@ private static List<Long> 
extractNDVGroupingColumns(List<ColStatistics> colStats
     for (ColStatistics cs : colStats) {
       if (cs != null) {
         long ndv = cs.getCountDistint();
-        if (cs.getNumNulls() > 0) {
+        if (cs.getNumNulls() > 0 && (ndv > 0 || cs.isConst())) {
           ndv = StatsUtils.safeAdd(ndv, 1);
         }

Review Comment:
   #### Part A
   Setting NDV to 1 when we have constants seems like a reasonable and safe 
improvement in the accuracy of estimations. I would be OK to approve the PR 
with just this change.
   ```java
   if (cs.isConst()) {
     ndv = 1;
   }
   ```
   #### Part B
   The benefit from from guarding the +1 increment with `ndv > 0` is more 
subtle.
   ```java
    if (ndv > 0 && cs.getNumNulls() > 0) {
     ndv = StatsUtils.safeAdd(ndv, 1);
   }
   ```
   First of all, if we have stats and if numNulls is greater than zero then 
very likely NDV would also be greater than zero.
   Secondly, NDV == 0 and numNulls > 0 can be a valid use-case and it suffices 
to modify your previous example just a bit to demonstrate how the new code can 
lead to underestimation.
   |Column|NDV|numNulls|
   |--|--|--|
   |product_id|100,000|0|
   |funnel_step|0|      480,000,000|
   |device_type|0|450,000,000|



##########
ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java:
##########
@@ -2112,7 +2115,7 @@ private static List<Long> 
extractNDVGroupingColumns(List<ColStatistics> colStats
     for (ColStatistics cs : colStats) {
       if (cs != null) {
         long ndv = cs.getCountDistint();
-        if (cs.getNumNulls() > 0) {
+        if (cs.getNumNulls() > 0 && (ndv > 0 || cs.isConst())) {
           ndv = StatsUtils.safeAdd(ndv, 1);
         }

Review Comment:
   So for the Part B, I think we can discuss a bit more about the "Sources of 
the (NDV=0, numNulls>0)" under HIVE-29556 and depending on the frequency and 
gravity of the underestimation we can decide how to proceed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to