zabetak commented on code in PR #6359:
URL: https://github.com/apache/hive/pull/6359#discussion_r3033018561


##########
ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java:
##########
@@ -2109,7 +2111,10 @@ private static List<Long> 
extractNDVGroupingColumns(List<ColStatistics> colStats
     for (ColStatistics cs : colStats) {
       if (cs != null) {
         long ndv = cs.getCountDistint();
-        if (cs.getNumNulls() > 0) {
+        // NDV needs to be adjusted if a column has a known NDV along with 
NULL values
+        // or if a column happens to be "const NULL"
+        if ((ndv > 0 && cs.getNumNulls() > 0) ||
+            (ndv == 0 && !cs.isEstimated() && cs.getNumNulls() == 
parentStats.getNumRows())) {

Review Comment:
   CASE WHEN statements can appear everywhere in the query plan and the changes 
in the combiner make this stats better. I understand that there is some 
correlation with grouping columns but not all queries have a GROUP BY clause 
thus I feel we could separate the two problems. 
   
   I am not opposed in changing extractNDVGroupingColumns but was thinking that 
its probably better to do it in a separate PR. If we can't, then option 2 seems 
more appealing although I couldn't find out why this +1 logic was introduced in 
the first place.
   
   Overall, I trust your judgement since you spend much more time on the topic 
so I am happy to let you decide how you want to move forward.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to