okumin commented on code in PR #6244:
URL: https://github.com/apache/hive/pull/6244#discussion_r2767210302


##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java:
##########
@@ -2030,6 +2030,18 @@ public Object process(Node nd, Stack<Node> stack, 
NodeProcessorCtx procCtx,
         }
       }
 
+      // NDV=0 means join key statistics are unavailable - fall back to 
joinFactor heuristic
+      if (allSatisfyPreCondition) {
+        for (int pos = 0; pos < parents.size(); pos++) {
+          ReduceSinkOperator parent = (ReduceSinkOperator) 
jop.getParentOperators().get(pos);
+          List<String> keyExprs = 
StatsUtils.getQualifedReducerKeyNames(parent.getConf().getOutputKeyColumnNames());
+          if (!satisfyPrecondition(parent.getStatistics(), keyExprs)) {
+            allSatisfyPreCondition = false;
+            break;
+          }
+        }
+      }

Review Comment:
   Note: This is probably ok but I want to check it again



##########
ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFWhen.java:
##########
@@ -144,6 +148,53 @@ static class WhenStatEstimator implements StatEstimator {
 
     @Override
     public Optional<ColStatistics> estimate(List<ColStatistics> argStats) {
+      return estimate(argStats, null);
+    }
+
+    @Override
+    public Optional<ColStatistics> estimate(List<ColStatistics> argStats, 
List<ExprNodeDesc> argExprs) {

Review Comment:
   I'm clarifying my understanding. Please let me know if I'm overlooking 
something.
   Let's assume the number of distinct values of `col_2` is 2, that of 
`col_100` is 100, and that of `col_999` is 999.
   
   The true derived NDV of the following expression is 3. The original 
implementation returns 1, and this implementation returns 3.
   
   ```sql
   CASE
     WHEN category BETWEEN 0 AND 4 THEN 'CODE_00'
     WHEN category BETWEEN 5 AND 9 THEN 'CODE_01'
     ELSE 'CODE_ELSE'
   END
   ```
   
   That of this is 2. The original implementation returns 1, and this 
implementation returns 2.
   
   ```sql
   CASE
     WHEN category BETWEEN 0 AND 4 THEN 'CODE_00'
     WHEN category BETWEEN 5 AND 9 THEN 'CODE_01'
     ELSE 'CODE_01'
   END
   ```
   
   That of this is 100, 101, or 102. The original implementation returns 100, 
and this implementation returns 100.
   
   ```sql
   CASE
     WHEN category BETWEEN 0 AND 4 THEN 'CODE_00'
     WHEN category BETWEEN 5 AND 9 THEN 'CODE_01'
     ELSE col_100
   END
   ```
   
   That of this is 999 ~ 1100. The original implementation returns 999, and 
this implementation returns 999.
   
   ```sql
   CASE
     WHEN category BETWEEN 0 AND 4 THEN 'CODE_00'
     WHEN category BETWEEN 5 AND 9 THEN col_999
     ELSE col_100
   END
   ```
   
   That of this is 6 ~ 8. The original implementation returns 2, and this 
implementation returns 2.
   
   ```sql
   CASE
       WHEN category BETWEEN 0 AND 4 THEN 'CODE_00'
       WHEN category BETWEEN 5 AND 9 THEN 'CODE_01'
       WHEN category BETWEEN 10 AND 14 THEN 'CODE_02'
       WHEN category BETWEEN 15 AND 19 THEN 'CODE_03'
       WHEN category BETWEEN 20 AND 24 THEN 'CODE_04'
       WHEN category BETWEEN 25 AND 29 THEN 'CODE_05'
     ELSE col_2
   END
   ```
   
   I'd say the current patch does not introduce worse estimation in any case.



##########
ql/src/java/org/apache/hadoop/hive/ql/plan/Statistics.java:
##########
@@ -256,7 +256,12 @@ public void addToColumnStats(List<ColStatistics> colStats) 
{
             updatedCS = columnStats.get(key);
             updatedCS.setAvgColLen(Math.max(updatedCS.getAvgColLen(), 
cs.getAvgColLen()));
             updatedCS.setNumNulls(StatsUtils.safeAdd(updatedCS.getNumNulls(), 
cs.getNumNulls()));
-            updatedCS.setCountDistint(Math.max(updatedCS.getCountDistint(), 
cs.getCountDistint()));
+            if(updatedCS.getCountDistint() > 0 && cs.getCountDistint() > 0) {

Review Comment:
   ```suggestion
               if (updatedCS.getCountDistint() > 0 && cs.getCountDistint() > 0) 
{
   ```



##########
ql/src/java/org/apache/hadoop/hive/ql/stats/estimator/StatEstimator.java:
##########
@@ -40,4 +41,18 @@ public interface StatEstimator {
    * @return {@link ColStatistics} estimate for the actual UDF.
    */
   public Optional<ColStatistics> estimate(List<ColStatistics> argStats);
+
+  /**
+   * Computes the output statistics of the actual UDF with access to original 
expressions.
+   *
+   * This method provides access to the original expression nodes, allowing 
estimators to make
+   * more accurate estimates when expressions have special properties (e.g., 
all constants).
+   *
+   * @param argStats the statistics for every argument of the UDF
+   * @param argExprs the original expression nodes for every argument of the 
UDF
+   * @return {@link ColStatistics} estimate for the actual UDF.
+   */
+  default Optional<ColStatistics> estimate(List<ColStatistics> argStats, 
List<ExprNodeDesc> argExprs) {
+    return estimate(argStats);
+  }

Review Comment:
   I guess we can satisfy the requirements for the current use case without 
adding a new method. We may obtain the required information via 
`GenericUDF#initialize` if it's been initialized. If not initialized, we will 
probably need this method. This example materializes a constant at compile-time.
   
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFRound.java



##########
ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFWhen.java:
##########


Review Comment:
   Can we simplify the implementation and handle a few more general cases? This 
is an idea I'm not obsessed with.
   
   ```java
     @Override
     public ObjectInspector initialize(ObjectInspector[] arguments) throws 
UDFArgumentTypeException {
       ...
       numberOfDistinctConstants = ???
     }
   
       static class WhenStatEstimator implements StatEstimator {
   
       @Override
       public Optional<ColStatistics> estimate(List<ColStatistics> argStats) {
         ...
         var statistics = combiner.getResult();
         if (statistics.getCountDistint() > 0 && numberOfDistinctConstants > 
statistics.getCountDistint()) {
           statistics.setCountDistinct(numberOfDistinctConstants);
         }
         return statistics;
       }
     }
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to