morrySnow commented on code in PR #64618:
URL: https://github.com/apache/doris/pull/64618#discussion_r3427799964
##########
fe/fe-core/src/main/java/org/apache/doris/nereids/rules/rewrite/InferSetOperatorDistinct.java:
##########
@@ -77,4 +96,82 @@ private boolean rejectNLJ(Plan plan) {
}
return true;
}
+
+ private boolean shouldGenerateAggregateByNdv(Plan plan, List<? extends
NamedExpression> groupKeys) {
+ Statistics stats = plan.getStats();
+ if (stats == null) {
+ stats = plan.accept(derive, new StatsDerive.DeriveContext());
Review Comment:
**Potential NPE**: When `stats` is null, the code falls back to
`plan.accept(derive, ...)` to derive stats on-the-fly. However, the result of
`plan.accept(...)` is **never checked for null** before `stats.getRowCount()`
is called on line 105. If stats derivation fails and returns null (which can
happen for certain plan node types), this will throw an NPE.
**Suggestion**: Add a null check after derivation:
```java
if (stats == null) {
stats = plan.accept(derive, new StatsDerive.DeriveContext());
if (stats == null) {
return false;
}
}
```
##########
fe/fe-core/src/main/java/org/apache/doris/nereids/rules/rewrite/InferSetOperatorDistinct.java:
##########
@@ -77,4 +96,82 @@ private boolean rejectNLJ(Plan plan) {
}
return true;
}
+
+ private boolean shouldGenerateAggregateByNdv(Plan plan, List<? extends
NamedExpression> groupKeys) {
+ Statistics stats = plan.getStats();
+ if (stats == null) {
+ stats = plan.accept(derive, new StatsDerive.DeriveContext());
+ }
+ if (stats.getRowCount() <= 0) {
+ return false;
+ }
+
+ List<ColumnStatistic> lower = new ArrayList<>();
+ List<ColumnStatistic> medium = new ArrayList<>();
+ List<ColumnStatistic> high = new ArrayList<>();
+
+ List<ColumnStatistic>[] cards = new List[] { lower, medium, high };
+
+ for (NamedExpression key : groupKeys) {
+ ColumnStatistic colStats =
ExpressionEstimation.INSTANCE.estimate(key, stats);
+ if (colStats.isUnKnown) {
+ return false;
+ }
+ if (stats.getRowCount() * 0.9 <= colStats.ndv) {
+ return false;
+ }
+ cards[groupByCardinality(colStats,
stats.getRowCount())].add(colStats);
+ }
+
+ double lowerCartesian = 1.0;
+ for (ColumnStatistic colStats : lower) {
+ lowerCartesian = lowerCartesian * colStats.ndv;
+ }
+
+ // Same NDV heuristic as EagerAggRewriter#checkStats, but kept local
because set-op
+ // local distinct and eager aggregation have different optimization
boundaries.
+ double lowerUpper = Math.max(stats.getRowCount() / 20, 1);
+ lowerUpper = Math.pow(lowerUpper, Math.max(lower.size() / 2, 1));
+
+ if (high.isEmpty() && (lower.size() + medium.size()) <= 2) {
+ return true;
+ }
+
+ if (high.isEmpty() && medium.isEmpty()) {
+ if (lower.size() == 1 && lowerCartesian * 20 <=
stats.getRowCount()) {
+ return true;
+ } else if (lower.size() == 2 && lowerCartesian * 7 <=
stats.getRowCount()) {
+ return true;
+ } else if (lower.size() <= 3 && lowerCartesian * 20 <=
stats.getRowCount()
+ && lowerCartesian < lowerUpper) {
+ return true;
+ } else {
+ return false;
+ }
+ }
+
+ if (high.size() >= 2 || medium.size() > 2 || (high.size() == 1 &&
!medium.isEmpty())) {
+ return false;
+ }
+
+ double lowerCartesianLowerBound = stats.getRowCount() /
LOWER_AGGREGATE_EFFECT_COEFFICIENT;
+ if (high.size() + medium.size() == 1 && lower.size() <= 2
+ && lowerCartesian <= lowerCartesianLowerBound) {
+ return true;
+ }
+
+ return false;
+ }
+
+ private int groupByCardinality(ColumnStatistic colStats, double rowCount) {
+ if (rowCount == 0 || colStats.ndv *
MEDIUM_AGGREGATE_EFFECT_COEFFICIENT > rowCount) {
+ return 2;
+ } else if (colStats.ndv * MEDIUM_AGGREGATE_EFFECT_COEFFICIENT <=
rowCount
+ && colStats.ndv * LOW_AGGREGATE_EFFECT_COEFFICIENT > rowCount)
{
+ return 1;
+ } else if (colStats.ndv * LOW_AGGREGATE_EFFECT_COEFFICIENT <=
rowCount) {
+ return 0;
+ }
Review Comment:
**Unreachable code**: The `return 2;` at line 174 (after the `if-else
if-else if` chain) is **unreachable**. All three branches of the `if-else
if-else if` chain at lines 167-173 cover every possible case by the third
branch (`else if (ndv * 1000 <= rowCount)`). The `return 2;` fallthrough is
dead code. Remove it or replace with an `assert false`.
##########
fe/fe-core/src/main/java/org/apache/doris/nereids/jobs/executor/Rewriter.java:
##########
@@ -605,7 +605,7 @@ public class Rewriter extends AbstractBatchJobExecutor {
topic("",
cascadesContext ->
cascadesContext.rewritePlanContainsTypes(SetOperation.class),
topDown(new MergeOneRowRelationIntoUnion()),
- costBased(topDown(new
InferSetOperatorDistinct())),
+ topDown(new InferSetOperatorDistinct()),
Review Comment:
**Loss of hint support**: Changing from `costBased(topDown(...))` to
`topDown(...)` means the `use_INFER_SET_OPERATOR_DISTINCT` and
`use_NO_INFER_SET_OPERATOR_DISTINCT` hints are now **silently ignored**.
Previously, `CostBasedRewriteJob.checkRuleHint()` (line 117 of
CostBasedRewriteJob.java) checked for these hints and could force the rule on
or off. With a plain `topDown`, there is no hint mechanism.
**Users who relied on `use_NO_INFER_SET_OPERATOR_DISTINCT` to disable the
rule will now find their hint has no effect.** If this is intentional, the hint
should be deprecated and the PR description should explicitly document this
behavioral change.
##########
fe/fe-core/src/main/java/org/apache/doris/nereids/rules/rewrite/InferSetOperatorDistinct.java:
##########
@@ -77,4 +96,82 @@ private boolean rejectNLJ(Plan plan) {
}
return true;
}
+
+ private boolean shouldGenerateAggregateByNdv(Plan plan, List<? extends
NamedExpression> groupKeys) {
+ Statistics stats = plan.getStats();
+ if (stats == null) {
+ stats = plan.accept(derive, new StatsDerive.DeriveContext());
+ }
+ if (stats.getRowCount() <= 0) {
+ return false;
+ }
+
+ List<ColumnStatistic> lower = new ArrayList<>();
+ List<ColumnStatistic> medium = new ArrayList<>();
+ List<ColumnStatistic> high = new ArrayList<>();
+
+ List<ColumnStatistic>[] cards = new List[] { lower, medium, high };
+
+ for (NamedExpression key : groupKeys) {
+ ColumnStatistic colStats =
ExpressionEstimation.INSTANCE.estimate(key, stats);
+ if (colStats.isUnKnown) {
+ return false;
+ }
+ if (stats.getRowCount() * 0.9 <= colStats.ndv) {
+ return false;
+ }
+ cards[groupByCardinality(colStats,
stats.getRowCount())].add(colStats);
+ }
+
+ double lowerCartesian = 1.0;
+ for (ColumnStatistic colStats : lower) {
+ lowerCartesian = lowerCartesian * colStats.ndv;
+ }
+
+ // Same NDV heuristic as EagerAggRewriter#checkStats, but kept local
because set-op
Review Comment:
**Missing documentation for NDV heuristic**: The comment at line 131-132
notes this follows `EagerAggRewriter#checkStats`, but the magic constants and
branching logic are not explained:
- Why `lowerCartesian * 20 <= rowCount` for `lower.size() == 1` but `* 7`
for `lower.size() == 2`?
- Why `lowerUpper = pow(max(rowCount/20, 1), max(lower.size()/2, 1))`?
- Why `lowerCartesianLowerBound = rowCount / 10000`?
These thresholds encode empirical tuning decisions about when deduplication
is beneficial. Without comments explaining the rationale, future maintainers
can't adjust them safely. **Consider adding brief inline comments for each
threshold** (as `EagerAggRewriter` does at lines 634-637).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]