Re: [PR] HIVE-29368: More accurate pessimistic stats combining [hive]

via GitHub Thu, 02 Apr 2026 02:39:26 -0700


zabetak commented on code in PR #6359:
URL: https://github.com/apache/hive/pull/6359#discussion_r3016466506



##########
ql/src/java/org/apache/hadoop/hive/ql/stats/estimator/StatEstimator.java:
##########
@@ -39,5 +40,19 @@ public interface StatEstimator {
    * @param argStats the statistics for every argument of the UDF
    * @return {@link ColStatistics} estimate for the actual UDF.
    */
-  public Optional<ColStatistics> estimate(List<ColStatistics> argStats);
+  default Optional<ColStatistics> estimate(List<ColStatistics> argStats) {
+    throw new UnsupportedOperationException("This estimator requires 
parentStats");
+  }

Review Comment:
   Do we need to change this to default?



##########
ql/src/test/results/clientpositive/llap/vectorized_stats.q.out:
##########
@@ -1207,13 +1207,13 @@ STAGE PLANS:
                       minReductionHashAggr: 0.99
                       mode: hash
                       outputColumnNames: _col0
-                      Statistics: Num rows: 1 Data size: 40 Basic stats: 
COMPLETE Column stats: COMPLETE
+                      Statistics: Num rows: 6144 Data size: 183480 Basic 
stats: COMPLETE Column stats: COMPLETE

Review Comment:
   Thanks for the explanation. Since the new estimate is not always better 
maybe we should rediscuss the change in extractNDVGroupingColumns or maybe 
address HIVE-29534 first.



##########
ql/src/java/org/apache/hadoop/hive/ql/stats/estimator/PessimisticStatCombiner.java:
##########
@@ -21,16 +21,26 @@
 import java.util.Optional;
 
 import org.apache.hadoop.hive.ql.plan.ColStatistics;
+import org.apache.hadoop.hive.ql.stats.StatsUtils;
 
 /**
  * Combines {@link ColStatistics} objects to provide the most pessimistic 
estimate.
  */
 public class PessimisticStatCombiner {
 
+  private final long numRows;
   private boolean inited;
+  private boolean hasUnknownNDV;
   private ColStatistics result;
 
+  public PessimisticStatCombiner(long numRows) {
+    this.numRows = numRows;
+  }
+
   public void add(ColStatistics stat) {
+    // NDV==0 means unknown, unless it's a NULL constant (numNulls == numRows)
+    hasUnknownNDV = hasUnknownNDV || (stat.getCountDistint() == 0 && 
stat.getNumNulls() != numRows);
+

Review Comment:
   Note that if in the CASE statement you have one branch with unknown NDV and 
another that is the NULL constant the result of the combiner will appear to 
other consumers as a NULL constant according to the criteria specified here. 
Probably an edge case but still wanted to highlight that its hard to cover 
everything from the moment that NDV = 0 is a valid value.



##########
ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java:
##########
@@ -2109,7 +2111,10 @@ private static List<Long> 
extractNDVGroupingColumns(List<ColStatistics> colStats
     for (ColStatistics cs : colStats) {
       if (cs != null) {
         long ndv = cs.getCountDistint();
-        if (cs.getNumNulls() > 0) {
+        // NDV needs to be adjusted if a column has a known NDV along with 
NULL values
+        // or if a column happens to be "const NULL"
+        if ((ndv > 0 && cs.getNumNulls() > 0) ||
+            (ndv == 0 && !cs.isEstimated() && cs.getNumNulls() == 
parentStats.getNumRows())) {

Review Comment:
   Another reason why I suggest to defer this on another patch is cause I am 
not fully convinced if we should actually do a +1 on the NDV for nulls. If we 
continue on the assumption that is used elsewhere that null is not a value and 
thus does not contribute to NDV then we shouldn't modify the NDV; possibly we 
should set the `numNulls` to 1.



##########
ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java:
##########
@@ -2109,7 +2111,10 @@ private static List<Long> 
extractNDVGroupingColumns(List<ColStatistics> colStats
     for (ColStatistics cs : colStats) {
       if (cs != null) {
         long ndv = cs.getCountDistint();
-        if (cs.getNumNulls() > 0) {
+        // NDV needs to be adjusted if a column has a known NDV along with 
NULL values
+        // or if a column happens to be "const NULL"
+        if ((ndv > 0 && cs.getNumNulls() > 0) ||
+            (ndv == 0 && !cs.isEstimated() && cs.getNumNulls() == 
parentStats.getNumRows())) {

Review Comment:
   It seems that this change led to more .q.out changes and addresses an NDV 
issue on a slightly different place.  Would it be possible to defer this to 
another patch/Jira so that we converge and merge this PR sooner. I understand 
the intention of the change but I am a bit reluctant to modify multiple things 
as part of the same change.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] HIVE-29368: More accurate pessimistic stats combining [hive]

Reply via email to