Re: [PR] [improvement](fe) Infer set operator distinct by NDV [doris]

via GitHub Wed, 17 Jun 2026 04:41:08 -0700


morrySnow commented on code in PR #64618:
URL: https://github.com/apache/doris/pull/64618#discussion_r3427799964



##########
fe/fe-core/src/main/java/org/apache/doris/nereids/rules/rewrite/InferSetOperatorDistinct.java:
##########
@@ -77,4 +96,82 @@ private boolean rejectNLJ(Plan plan) {
         }
         return true;
     }
+
+    private boolean shouldGenerateAggregateByNdv(Plan plan, List<? extends 
NamedExpression> groupKeys) {
+        Statistics stats = plan.getStats();
+        if (stats == null) {
+            stats = plan.accept(derive, new StatsDerive.DeriveContext());

Review Comment:
   **Potential NPE**: When `stats` is null, the code falls back to 
`plan.accept(derive, ...)` to derive stats on-the-fly. However, the result of 
`plan.accept(...)` is **never checked for null** before `stats.getRowCount()` 
is called on line 105. If stats derivation fails and returns null (which can 
happen for certain plan node types), this will throw an NPE.
   
   **Suggestion**: Add a null check after derivation:
   ```java
   if (stats == null) {
       stats = plan.accept(derive, new StatsDerive.DeriveContext());
       if (stats == null) {
           return false;
       }
   }
   ```



##########
fe/fe-core/src/main/java/org/apache/doris/nereids/rules/rewrite/InferSetOperatorDistinct.java:
##########
@@ -77,4 +96,82 @@ private boolean rejectNLJ(Plan plan) {
         }
         return true;
     }
+
+    private boolean shouldGenerateAggregateByNdv(Plan plan, List<? extends 
NamedExpression> groupKeys) {
+        Statistics stats = plan.getStats();
+        if (stats == null) {
+            stats = plan.accept(derive, new StatsDerive.DeriveContext());
+        }
+        if (stats.getRowCount() <= 0) {
+            return false;
+        }
+
+        List<ColumnStatistic> lower = new ArrayList<>();
+        List<ColumnStatistic> medium = new ArrayList<>();
+        List<ColumnStatistic> high = new ArrayList<>();
+
+        List<ColumnStatistic>[] cards = new List[] { lower, medium, high };
+
+        for (NamedExpression key : groupKeys) {
+            ColumnStatistic colStats = 
ExpressionEstimation.INSTANCE.estimate(key, stats);
+            if (colStats.isUnKnown) {
+                return false;
+            }
+            if (stats.getRowCount() * 0.9 <= colStats.ndv) {
+                return false;
+            }
+            cards[groupByCardinality(colStats, 
stats.getRowCount())].add(colStats);
+        }
+
+        double lowerCartesian = 1.0;
+        for (ColumnStatistic colStats : lower) {
+            lowerCartesian = lowerCartesian * colStats.ndv;
+        }
+
+        // Same NDV heuristic as EagerAggRewriter#checkStats, but kept local 
because set-op
+        // local distinct and eager aggregation have different optimization 
boundaries.
+        double lowerUpper = Math.max(stats.getRowCount() / 20, 1);
+        lowerUpper = Math.pow(lowerUpper, Math.max(lower.size() / 2, 1));
+
+        if (high.isEmpty() && (lower.size() + medium.size()) <= 2) {
+            return true;
+        }
+
+        if (high.isEmpty() && medium.isEmpty()) {
+            if (lower.size() == 1 && lowerCartesian * 20 <= 
stats.getRowCount()) {
+                return true;
+            } else if (lower.size() == 2 && lowerCartesian * 7 <= 
stats.getRowCount()) {
+                return true;
+            } else if (lower.size() <= 3 && lowerCartesian * 20 <= 
stats.getRowCount()
+                    && lowerCartesian < lowerUpper) {
+                return true;
+            } else {
+                return false;
+            }
+        }
+
+        if (high.size() >= 2 || medium.size() > 2 || (high.size() == 1 && 
!medium.isEmpty())) {
+            return false;
+        }
+
+        double lowerCartesianLowerBound = stats.getRowCount() / 
LOWER_AGGREGATE_EFFECT_COEFFICIENT;
+        if (high.size() + medium.size() == 1 && lower.size() <= 2
+                && lowerCartesian <= lowerCartesianLowerBound) {
+            return true;
+        }
+
+        return false;
+    }
+
+    private int groupByCardinality(ColumnStatistic colStats, double rowCount) {
+        if (rowCount == 0 || colStats.ndv * 
MEDIUM_AGGREGATE_EFFECT_COEFFICIENT > rowCount) {
+            return 2;
+        } else if (colStats.ndv * MEDIUM_AGGREGATE_EFFECT_COEFFICIENT <= 
rowCount
+                && colStats.ndv * LOW_AGGREGATE_EFFECT_COEFFICIENT > rowCount) 
{
+            return 1;
+        } else if (colStats.ndv * LOW_AGGREGATE_EFFECT_COEFFICIENT <= 
rowCount) {
+            return 0;
+        }

Review Comment:
   **Unreachable code**: The `return 2;` at line 174 (after the `if-else 
if-else if` chain) is **unreachable**. All three branches of the `if-else 
if-else if` chain at lines 167-173 cover every possible case by the third 
branch (`else if (ndv * 1000 <= rowCount)`). The `return 2;` fallthrough is 
dead code. Remove it or replace with an `assert false`.



##########
fe/fe-core/src/main/java/org/apache/doris/nereids/jobs/executor/Rewriter.java:
##########
@@ -605,7 +605,7 @@ public class Rewriter extends AbstractBatchJobExecutor {
                         topic("",
                                 cascadesContext -> 
cascadesContext.rewritePlanContainsTypes(SetOperation.class),
                                 topDown(new MergeOneRowRelationIntoUnion()),
-                                costBased(topDown(new 
InferSetOperatorDistinct())),
+                                topDown(new InferSetOperatorDistinct()),

Review Comment:
   **Loss of hint support**: Changing from `costBased(topDown(...))` to 
`topDown(...)` means the `use_INFER_SET_OPERATOR_DISTINCT` and 
`use_NO_INFER_SET_OPERATOR_DISTINCT` hints are now **silently ignored**. 
   
   Previously, `CostBasedRewriteJob.checkRuleHint()` (line 117 of 
CostBasedRewriteJob.java) checked for these hints and could force the rule on 
or off. With a plain `topDown`, there is no hint mechanism.
   
   **Users who relied on `use_NO_INFER_SET_OPERATOR_DISTINCT` to disable the 
rule will now find their hint has no effect.** If this is intentional, the hint 
should be deprecated and the PR description should explicitly document this 
behavioral change.



##########
fe/fe-core/src/main/java/org/apache/doris/nereids/rules/rewrite/InferSetOperatorDistinct.java:
##########
@@ -77,4 +96,82 @@ private boolean rejectNLJ(Plan plan) {
         }
         return true;
     }
+
+    private boolean shouldGenerateAggregateByNdv(Plan plan, List<? extends 
NamedExpression> groupKeys) {
+        Statistics stats = plan.getStats();
+        if (stats == null) {
+            stats = plan.accept(derive, new StatsDerive.DeriveContext());
+        }
+        if (stats.getRowCount() <= 0) {
+            return false;
+        }
+
+        List<ColumnStatistic> lower = new ArrayList<>();
+        List<ColumnStatistic> medium = new ArrayList<>();
+        List<ColumnStatistic> high = new ArrayList<>();
+
+        List<ColumnStatistic>[] cards = new List[] { lower, medium, high };
+
+        for (NamedExpression key : groupKeys) {
+            ColumnStatistic colStats = 
ExpressionEstimation.INSTANCE.estimate(key, stats);
+            if (colStats.isUnKnown) {
+                return false;
+            }
+            if (stats.getRowCount() * 0.9 <= colStats.ndv) {
+                return false;
+            }
+            cards[groupByCardinality(colStats, 
stats.getRowCount())].add(colStats);
+        }
+
+        double lowerCartesian = 1.0;
+        for (ColumnStatistic colStats : lower) {
+            lowerCartesian = lowerCartesian * colStats.ndv;
+        }
+
+        // Same NDV heuristic as EagerAggRewriter#checkStats, but kept local 
because set-op

Review Comment:
   **Missing documentation for NDV heuristic**: The comment at line 131-132 
notes this follows `EagerAggRewriter#checkStats`, but the magic constants and 
branching logic are not explained:
   - Why `lowerCartesian * 20 <= rowCount` for `lower.size() == 1` but `* 7` 
for `lower.size() == 2`?
   - Why `lowerUpper = pow(max(rowCount/20, 1), max(lower.size()/2, 1))`?
   - Why `lowerCartesianLowerBound = rowCount / 10000`?
   
   These thresholds encode empirical tuning decisions about when deduplication 
is beneficial. Without comments explaining the rationale, future maintainers 
can't adjust them safely. **Consider adding brief inline comments for each 
threshold** (as `EagerAggRewriter` does at lines 634-637).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] [improvement](fe) Infer set operator distinct by NDV [doris]

Reply via email to