konstantinb commented on code in PR #6418:
URL: https://github.com/apache/hive/pull/6418#discussion_r3243043704


##########
ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java:
##########
@@ -2112,7 +2115,7 @@ private static List<Long> 
extractNDVGroupingColumns(List<ColStatistics> colStats
     for (ColStatistics cs : colStats) {
       if (cs != null) {
         long ndv = cs.getCountDistint();
-        if (cs.getNumNulls() > 0) {
+        if (cs.getNumNulls() > 0 && (ndv > 0 || cs.isConst())) {
           ndv = StatsUtils.safeAdd(ndv, 1);
         }

Review Comment:
   Good point. The `+1` accounts for NULL being its own GROUP BY bucket — 
that's why `numNulls > 0` gates it for the non-const case. The `|| 
cs.isConst()` clause was specifically to handle the all-NULL
     const case (`countDistint=0, numNulls>0`) where we still want one bucket.
   
     But you're right that `ndv = 1` is the cleaner way to express it. The 
current form works only because all three `setConst(true)` sites today satisfy 
the invariant `!(countDistint>0 && numNulls>0)` — that's implicit and would 
silently give the wrong answer (2 buckets) if some future code path marked a 
column const while preserving both a non-null distinct count and nulls. I'll 
switch to:
   
     ```java
     if (cs.isConst()) {
       ndv = 1;
     } else if (ndv > 0 && cs.getNumNulls() > 0) {
       ndv = StatsUtils.safeAdd(ndv, 1);
     }



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to