Re: [PR] [opt](dphyper)optimize dphyper's statistics calculation [doris]

via GitHub Wed, 01 Jul 2026 02:56:53 -0700


morrySnow commented on code in PR #64559:
URL: https://github.com/apache/doris/pull/64559#discussion_r3504925488



##########
fe/fe-core/src/main/java/org/apache/doris/statistics/Statistics.java:
##########
@@ -121,10 +121,16 @@ public void normalizeColumnStatistics() {
     }
 
     public void normalizeColumnStatistics(double inputRowCount) {
-        normalizeColumnStatistics(this.rowCount, false);
+        normalizeColumnStatistics(inputRowCount, expressionToColumnStats, 
false);
     }
 
     public void normalizeColumnStatistics(double inputRowCount, boolean 
isNumNullsDecreaseByProportion) {
+        normalizeColumnStatistics(inputRowCount, expressionToColumnStats, 
isNumNullsDecreaseByProportion);
+    }
+
+    public void normalizeColumnStatistics(double inputRowCount,
+            Map<Expression, ColumnStatistic> inputExpressionToColumnStats,

Review Comment:
   **🐛 `inputExpressionToColumnStats` parameter is never used**
   
   The new 3-arg `normalizeColumnStatistics(double, Map<Expression, 
ColumnStatistic>, boolean)` accepts `inputExpressionToColumnStats` but the 
method body exclusively iterates `this.expressionToColumnStats` (line 132: `for 
(Entry<Expression, ColumnStatistic> entry : 
expressionToColumnStats.entrySet())`).
   
   All 9+ callers in `FilterEstimation.java` and `JoinEstimation.java` pass 
`inputStats.columnStatistics()` as this argument, which is silently discarded. 
The same issue applies to the internal call from 
`withRowCountAndEnforceValid()` which passes `this.expressionToColumnStats`.
   
   **Impact**: If this parameter was intended to normalize against a different 
column set (e.g., the parent's column statistics before the current operation), 
the missing logic means column statistics normalization is operating on the 
wrong column map. If the parameter is genuinely unused, it should be removed to 
avoid confusing future maintainers.



##########
fe/fe-core/src/main/java/org/apache/doris/nereids/jobs/cascades/OptimizeGroupExpressionJob.java:
##########
@@ -93,9 +93,7 @@ private List<Rule> getExplorationRules(CascadesContext 
cascadesContext) {
     private List<Rule> getJoinRules() {
         boolean isDisableJoinReorder = 
context.getCascadesContext().getConnectContext().getSessionVariable()
                 .isDisableJoinReorder()
-                || context.getCascadesContext().isLeadingDisableJoinReorder()
-                || 
context.getCascadesContext().getMemo().getGroupExpressionsSize() > 
context.getCascadesContext()
-                
.getConnectContext().getSessionVariable().memoMaxGroupExpressionSize;
+                || context.getCascadesContext().isLeadingDisableJoinReorder();

Review Comment:
   **⚠️ Memo group expression size safety guard removed without replacement**
   
   The condition `memo.getGroupExpressionsSize() > memoMaxGroupExpressionSize` 
was removed from `getJoinRules()`. This guard previously disabled join reorder 
rules when the memo grew too large (default threshold: 10000), preventing 
optimization-time OOM for queries with large search spaces.
   
   Combined with the `dphyperLimit` increase from 1000 → 2600 in 
`SessionVariable.java`, this removes a critical safety valve. For queries with 
many joins, the optimizer may now enumerate far more plans without any 
size-based bail-out.
   
   **Suggestion**: Either keep the guard or provide an alternative bounding 
mechanism (e.g., limit on number of enumerated plans) and document why the 
removal is safe.



##########
fe/fe-core/src/main/java/org/apache/doris/nereids/cost/CostModel.java:
##########
@@ -566,8 +566,9 @@ public Cost visitPhysicalNestedLoopJoin(
         if (leftStatistics.getRowCount() < 10 * rightStatistics.getRowCount()) 
{
             nljPenalty = Math.min(leftStatistics.getRowCount(), 
rightStatistics.getRowCount());
         }
+        nljPenalty = Math.max(nljPenalty, 1.0);
         return Cost.of(context.getSessionVariable(),
-                leftStatistics.getRowCount() * rightStatistics.getRowCount(),
+                leftStatistics.getRowCount() * rightStatistics.getRowCount() * 
nljPenalty,

Review Comment:
   **⚠️ NLJ CPU cost now multiplied by penalty factor**
   
   Old: `leftStatistics.getRowCount() * rightStatistics.getRowCount()` (no 
penalty)
   New: `leftStatistics.getRowCount() * rightStatistics.getRowCount() * 
nljPenalty`
   
   When `leftRowCount < 10 * rightRowCount`, `nljPenalty = 
Math.min(leftRowCount, rightRowCount)`. For large tables, this can be in the 
millions, making the CPU cost `O(N³)` instead of `O(N²)`. This may severely 
over-penalize NLJ plans that would actually be efficient (e.g., small left side 
probing large right side with index), biasing the optimizer toward potentially 
worse hash join plans.
   
   **Suggestion**: Verify this is intentional. If the goal is to correct 
over-optimistic NLJ costs, consider a more moderate penalty factor (e.g., 
`log(nljPenalty)` or a capped multiplier).



##########
fe/fe-core/src/main/java/org/apache/doris/nereids/memo/GroupExpression.java:
##########
@@ -344,6 +344,16 @@ public void setEstOutputRowCount(double estOutputRowCount) 
{
         this.estOutputRowCount = estOutputRowCount;
     }
 
+    public double getEstOutputRowCount() {
+        return estOutputRowCount;

Review Comment:
   **⚠️ `clearCostState()` doesn't reset `estOutputRowCount`**
   
   This method clears `cost`, `lowestCostTable`, and `requestPropertiesMap` but 
leaves `estOutputRowCount` untouched (default `-1`, set via 
`setEstOutputRowCount()`).
   
   After `clearCostState()` is called in 
`MemoStatsAndCostRecomputer.recomputeGroupExpressionCost()`, any code that 
reads `getEstOutputRowCount()` before it's re-set will see the *old* row count 
from the prior optimization pass. If the re-estimation code path ever skips 
calling `setEstOutputRowCount()` for a particular expression, the stale value 
persists silently with no indicator.
   
   **Suggestion**: Reset `estOutputRowCount = -1` in `clearCostState()` or 
rename the method to clarify it only clears cost-related state.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] [opt](dphyper)optimize dphyper's statistics calculation [doris]

Reply via email to