Re: [PR] HIVE-29368: More accurate pessimistic stats combining [hive]

via GitHub Wed, 25 Mar 2026 05:13:02 -0700


zabetak commented on code in PR #6359:
URL: https://github.com/apache/hive/pull/6359#discussion_r2986830609



##########
ql/src/java/org/apache/hadoop/hive/ql/plan/ColStatistics.java:
##########
@@ -31,6 +31,7 @@ public class ColStatistics {
   private boolean isPrimaryKey;
   private boolean isEstimated;
   private boolean isFilteredColumn;
+  private boolean isConst;

Review Comment:
   Can we avoid this new indicator? If we add a new field it means that we need 
to keep them up to date in various places where `ColStatistics` are used so I 
would prefer to avoid this if possible.



##########
ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java:
##########
@@ -1578,7 +1578,7 @@ public static ColStatistics 
getColStatisticsFromExpression(HiveConf conf, Statis
             csList.add(cs);
           }
           if (csList.size() == engfd.getChildren().size()) {
-            Optional<ColStatistics> res = se.estimate(csList);
+            Optional<ColStatistics> res = se.estimate(csList, numRows);

Review Comment:
   A `GenericUDF` can never have an NDV > numRows so instead of putting the 
responsibility to the implementors of `StatEstimator`  we could enforce the 
bound here. This would avoid the changes in the existing APIs and make the code 
more robust since even if the estimator returns an invalid NDV (e.g., greater 
than the numRows) it wouldn't mess up everything else.  



##########
ql/src/test/results/clientpositive/llap/list_bucket_dml_6.q.out:
##########
@@ -1,13 +1,17 @@
-PREHOOK: query: create table list_bucketing_dynamic_part_n3 (key String, value 
String) 
-    partitioned by (ds String, hr String) 
+hive.merge.mapfiles=true
+hive.merge.mapredfiles=false
+hive.merge.tezfiles=false
+hive.merge.smallfiles.avgsize=16000000

Review Comment:
   These props shouldn't be here and that's why the CI fails as well. Most 
likely a bad copy-paster in the wrong place.



##########
ql/src/test/queries/clientpositive/list_bucket_dml_6.q:
##########
@@ -92,6 +94,3 @@ select * from list_bucketing_dynamic_part_n3 where key = 
'484' and value = 'val_
 select * from list_bucketing_dynamic_part_n3 where key = '484' and value = 
'val_484';
 select * from srcpart where ds = '2008-04-08' and key = '484' and value = 
'val_484';
 
--- clean up
-drop table list_bucketing_dynamic_part_n3;
-

Review Comment:
   Unrelated changes and whitespace fixes, although minor, increase conflicts 
and backfor efforts so it's better to keep them apart and avoid them if 
possible.



##########
ql/src/java/org/apache/hadoop/hive/ql/stats/estimator/StatEstimator.java:
##########
@@ -24,20 +24,53 @@
 import org.apache.hadoop.hive.ql.plan.ColStatistics;
 
 /**
- * Enables statistics related computation on UDFs
+ * Enables statistics related computation on UDFs.
+ *
+ * <p>This interface provides two default implementations:
+ * <ul>
+ *   <li>{@link #estimate(List)} - clones the first argument's statistics 
(suitable for most UDFs)</li>
+ *   <li>{@link #estimate(List, long)} - calls estimate(List) and caps NDV at 
numRows</li>
+ * </ul>
+ *
+ * <p>UDFs that simply pass through statistics (like LOWER, UPPER) can use the 
defaults.
+ * UDFs that combine statistics (like IF, WHEN, COALESCE) should override 
{@link #estimate(List)}.
  */
 public interface StatEstimator {
 
   /**
    * Computes the output statistics of the actual UDF.
    *
-   * The estimator should return with a preferably overestimated {@link 
ColStatistics} object if possible.
-   * The actual estimation logic may decide to not give an estimation; it 
should return with {@link Optional#empty()}.
+   * <p>The default implementation clones the first argument's statistics, 
which is suitable
+   * for most UDFs that don't significantly alter the statistical properties 
of their input.
+   *
+   * <p>Override this method for UDFs that combine multiple inputs (like IF, 
WHEN, COALESCE)
+   * or significantly transform the data.
+   *
+   * @param argStats the statistics for every argument of the UDF
+   * @return {@link ColStatistics} estimate for the actual UDF, or empty if 
estimation is not possible
+   */
+  default Optional<ColStatistics> estimate(List<ColStatistics> argStats) {
+    if (argStats.isEmpty()) {
+      return Optional.empty();
+    }
+    return Optional.of(argStats.get(0).clone());
+  }

Review Comment:
   There is no reason to add a default implementation for this method. Leaving 
things as they are does not break anything.



##########
ql/src/test/queries/clientpositive/ndv_case_const.q:
##########
@@ -0,0 +1,27 @@
+CREATE TABLE t (cond INT, c2 STRING, c100 STRING);

Review Comment:
   As far as I understand this set of tests are mainly targeting the 
`Statistics: Num rows` entry of the `Group By Operator`. It may be worth adding 
a comment in here to clarify the intention of these tests so that if/when 
.q.out changes in the future the reviewer can quickly understand what these 
tests are supposed to check.



##########
ql/src/java/org/apache/hadoop/hive/ql/stats/estimator/PessimisticStatCombiner.java:
##########
@@ -41,9 +42,14 @@ public void add(ColStatistics stat) {
     if (stat.getAvgColLen() > result.getAvgColLen()) {
       result.setAvgColLen(stat.getAvgColLen());
     }
-    if (stat.getCountDistint() > result.getCountDistint()) {
-      result.setCountDistint(stat.getCountDistint());
+    // NDV=0 is "unknown" only if the stat is NOT a constant.
+    // Constants with NDV=0 (e.g., NULL) are "known zero", not unknown.
+    if ((result.getCountDistint() == 0 && !result.isConst()) || 
(stat.getCountDistint() == 0 && !stat.isConst())) {
+      result.setCountDistint(0);

Review Comment:
   Is this special handling here only relevant for NULL constants? For the sake 
of simplicity could we just skip/ignore stats with unknown NDV? In other words, 
if NDV is zero just let it be and just proceed with the summation (which 
effectively is a noop).
   
   As a side note, I am not sure why the count distinct of a NULL constant 
should be zero and not one. It's just another constant after all.



##########
ql/src/java/org/apache/hadoop/hive/ql/stats/estimator/StatEstimator.java:
##########
@@ -24,20 +24,53 @@
 import org.apache.hadoop.hive.ql.plan.ColStatistics;
 
 /**
- * Enables statistics related computation on UDFs
+ * Enables statistics related computation on UDFs.
+ *
+ * <p>This interface provides two default implementations:
+ * <ul>
+ *   <li>{@link #estimate(List)} - clones the first argument's statistics 
(suitable for most UDFs)</li>
+ *   <li>{@link #estimate(List, long)} - calls estimate(List) and caps NDV at 
numRows</li>
+ * </ul>
+ *
+ * <p>UDFs that simply pass through statistics (like LOWER, UPPER) can use the 
defaults.
+ * UDFs that combine statistics (like IF, WHEN, COALESCE) should override 
{@link #estimate(List)}.
  */
 public interface StatEstimator {
 
   /**
    * Computes the output statistics of the actual UDF.
    *
-   * The estimator should return with a preferably overestimated {@link 
ColStatistics} object if possible.
-   * The actual estimation logic may decide to not give an estimation; it 
should return with {@link Optional#empty()}.
+   * <p>The default implementation clones the first argument's statistics, 
which is suitable
+   * for most UDFs that don't significantly alter the statistical properties 
of their input.
+   *
+   * <p>Override this method for UDFs that combine multiple inputs (like IF, 
WHEN, COALESCE)
+   * or significantly transform the data.
+   *
+   * @param argStats the statistics for every argument of the UDF
+   * @return {@link ColStatistics} estimate for the actual UDF, or empty if 
estimation is not possible
+   */
+  default Optional<ColStatistics> estimate(List<ColStatistics> argStats) {
+    if (argStats.isEmpty()) {
+      return Optional.empty();
+    }
+    return Optional.of(argStats.get(0).clone());
+  }
+
+  /**
+   * Computes the output statistics of the actual UDF, ensuring NDV does not 
exceed numRows.

Review Comment:
   If we want to enforce that NDV does not exceed the numRows then its better 
to do it outside the estimator than put the burden on each implementation that 
we may not even have control over.



##########
ql/src/test/queries/clientpositive/list_bucket_dml_6.q:
##########
@@ -1,10 +1,12 @@
 --! qt:dataset:srcpart
+-- this ensures consistent split file counts between localhost & CI runs
+set tez.grouping.split-count=1;

Review Comment:
   This seems to address a flakiness issue that is independent of the PR. If 
that's the case then better to address it in a separate Jira ticket with a 
dedicated PR. Mixing flakiness fixes with core changes creates various 
complications especially during backports to other branches.



##########
ql/src/java/org/apache/hadoop/hive/ql/stats/estimator/StatEstimator.java:
##########
@@ -24,20 +24,53 @@
 import org.apache.hadoop.hive.ql.plan.ColStatistics;
 
 /**
- * Enables statistics related computation on UDFs
+ * Enables statistics related computation on UDFs.
+ *
+ * <p>This interface provides two default implementations:
+ * <ul>
+ *   <li>{@link #estimate(List)} - clones the first argument's statistics 
(suitable for most UDFs)</li>
+ *   <li>{@link #estimate(List, long)} - calls estimate(List) and caps NDV at 
numRows</li>
+ * </ul>
+ *
+ * <p>UDFs that simply pass through statistics (like LOWER, UPPER) can use the 
defaults.
+ * UDFs that combine statistics (like IF, WHEN, COALESCE) should override 
{@link #estimate(List)}.

Review Comment:
   Since the respective methods have Javadoc the overview here is not strictly 
necessary and appears a bit repetitive.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] HIVE-29368: More accurate pessimistic stats combining [hive]

Reply via email to