konstantinb commented on code in PR #6359:
URL: https://github.com/apache/hive/pull/6359#discussion_r2988884907


##########
ql/src/java/org/apache/hadoop/hive/ql/stats/estimator/PessimisticStatCombiner.java:
##########
@@ -41,9 +42,14 @@ public void add(ColStatistics stat) {
     if (stat.getAvgColLen() > result.getAvgColLen()) {
       result.setAvgColLen(stat.getAvgColLen());
     }
-    if (stat.getCountDistint() > result.getCountDistint()) {
-      result.setCountDistint(stat.getCountDistint());
+    // NDV=0 is "unknown" only if the stat is NOT a constant.
+    // Constants with NDV=0 (e.g., NULL) are "known zero", not unknown.
+    if ((result.getCountDistint() == 0 && !result.isConst()) || (stat.getCountDistint() == 0 && !stat.isConst())) {
+      result.setCountDistint(0);

Review Comment:
   @zabetak, this is the most complicated problem to solve, in my opinion. The 
following code: 
https://github.com/apache/hive/blob/931d4bb62b26de699240c816df439e00644e3dcb/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java#L1634
 is very explicit in assigning an NDV of 0 to NULL constants and 1 to non-NULL 
constants. At the same time, an NDV of 0 for a source column is typically 
used to indicate that the column's NDV is "unknown", which could matter 
a lot for large tables. Therefore, simply "summing" an NDV of 0 introduces even 
bigger mis-estimations.
   
   To compensate for the "0 NDV" NULL constant, the following code: 
https://github.com/apache/hive/blob/931d4bb62b26de699240c816df439e00644e3dcb/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java#L2113
 applies a "+1" adjustment when computing the NDVs of a GROUP BY. 
   
   I am thinking of modifying buildColStatForConstant() to treat NULL values as 
regular constants and checking whether we run into any significant side 
effects. If you have any additional thoughts on the subject, I would greatly 
appreciate hearing them. Thank you!
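
   For illustration, the two meanings of NDV=0 discussed above can be sketched as follows. This is a minimal, self-contained sketch and not the PR's code: `Stat`, `isUnknownNdv`, and `combineNdv` are hypothetical names standing in for `ColStatistics` and the combiner, and taking the max in the known case is an assumption borrowed from the lines the hunk removes, since the rest of the change is not shown here.

```java
public class NdvCombineSketch {

    /** Hypothetical stand-in for ColStatistics: only the two fields discussed here. */
    static final class Stat {
        final long ndv;
        final boolean isConst;

        Stat(long ndv, boolean isConst) {
            this.ndv = ndv;
            this.isConst = isConst;
        }
    }

    /** NDV=0 means "unknown" only when the stat does not describe a constant. */
    static boolean isUnknownNdv(Stat s) {
        return s.ndv == 0 && !s.isConst;
    }

    /**
     * Combines two NDVs pessimistically. If either side is unknown, the
     * combined NDV is reported as 0 (unknown). A constant with NDV=0
     * (for example NULL) is a known zero and does not poison the result.
     */
    static long combineNdv(Stat result, Stat stat) {
        if (isUnknownNdv(result) || isUnknownNdv(stat)) {
            return 0;
        }
        // Assumption: max mirrors the pre-change behaviour in the removed lines.
        return Math.max(result.ndv, stat.ndv);
    }

    public static void main(String[] args) {
        Stat unknownColumn = new Stat(0, false); // column without NDV statistics
        Stat nullConstant = new Stat(0, true);   // NULL constant: known zero
        Stat column = new Stat(5, false);        // column with 5 distinct values

        System.out.println(combineNdv(unknownColumn, column)); // prints 0
        System.out.println(combineNdv(nullConstant, column));  // prints 5
    }
}
```

   If NULL constants were instead given an NDV of 1 like other constants, the `isConst` check in this sketch would no longer be needed to tell the two cases apart.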



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

