Re: [PR] [CALCITE-7010] The well-known count bug [calcite]

via GitHub Sat, 21 Jun 2025 06:36:22 -0700


iwanttobepowerful commented on code in PR #4375:
URL: https://github.com/apache/calcite/pull/4375#discussion_r2159980897



##########
core/src/main/java/org/apache/calcite/sql2rel/RelDecorrelator.java:
##########
@@ -782,11 +788,198 @@ protected RexNode removeCorrelationExpr(
 
     RelNode newRel = relBuilder.build();
 
+    for (AggregateCall aggCall : rel.getAggCallList()) {
+      if (aggCall.getAggregation() instanceof SqlCountAggFunction) {
+        parentPropagatesNullValues = false;
+        break;
+      }
+    }
+
+    if (rel.getGroupType() == Aggregate.Group.SIMPLE
+        && rel.getGroupSet().isEmpty()
+        && !frame.corDefOutputs.isEmpty()
+        && !parentPropagatesNullValues) {
+      newRel = rewriteScalarAggregate(rel, newRel, outputMap, corDefOutputs);
+    }
+
     // Aggregate does not change input ordering so corVars will be
     // located at the same position as the input newProject.
     return register(rel, newRel, outputMap, corDefOutputs);
   }
 
+  /**
+   * Special case where the group by is static (i.e., aggregation functions 
without group by).
+   *
+   * <p>Background:
+   *   For the query:
+   *     SELECT SUM(salary), COUNT(name) FROM A;
+   *   When table A is empty, it returns [null, 0].
+   *   But for
+   *     SELECT SUM(salary), COUNT(name) FROM A group by id
+   *   When table A is empty, it returns empty. This causes result mismatch.
+   * In the general decorrelation framework, we add corVar as an additional 
groupKey to
+   * rewrite Correlate as JOIN. (See the code above for details) This means 
that when the input
+   * is empty, the result produced using a JOIN is incorrect.
+   *
+   * <p>We refer to this situation as: `The well-known count bug`,
+   * More details about this issue: Optimization of Nested SQL Queries 
Revisited
+   * (https://dl.acm.org/doi/pdf/10.1145/38714.38723)
+   *
+   * <p>To handle this situation, we using a LEFT JOIN to ensure that an 
output is always produced.
+   *
+   * <p>Given the SQL:
+   *   SELECT deptno FROM dept d
+   *     WHERE 0 = (SELECT COUNT(*) FROM emp e WHERE d.deptno = e.deptno)
+   * Corresponding plan:
+   *    LogicalProject(DEPTNO=[$0])
+   *      LogicalCorrelate(correlation=[$cor0], joinType=[inner], 
requiredColumns=[{0}])
+   *        LogicalProject(DEPTNO=[$0])
+   *          LogicalTableScan(table=[[scott, DEPT]])
+   *        LogicalProject(cs=[true])
+   *          LogicalFilter(condition=[=(0, $0)])
+   *            LogicalAggregate(group=[{}], EXPR$0=[COUNT()])
+   *              LogicalFilter(condition=[=($cor0.DEPTNO, $7)])
+   *                LogicalTableScan(table=[[scott, EMP]])
+   *
+   * <p>Rewriting this as:
+   *   SELECT d.deptno FROM dept d
+   *     JOIN (
+   *         SELECT true, e.deptno FROM emp e WHERE e.deptno IS NOT NULL
+   *         GROUP BY e.deptno HAVING COUNT(*) = 0
+   *     ) AS d0 ON d.deptno = d0.deptno
+   * produces an incorrect result.
+   * Corresponding plan:
+   *    LogicalProject(DEPTNO=[$0])
+   *      LogicalJoin(condition=[=($0, $2)], joinType=[inner])
+   *        LogicalProject(DEPTNO=[$0])
+   *          LogicalTableScan(table=[[scott, DEPT]])
+   *        LogicalProject(cs=[true], DEPTNO=[$0])
+   *          LogicalFilter(condition=[=(0, $1)])
+   *            LogicalAggregate(group=[{0}], EXPR$0=[COUNT()]) // corresponds 
to {@code oldRel}
+   *              LogicalProject(DEPTNO=[$7])
+   *                LogicalFilter(condition=[IS NOT NULL($7)])
+   *                  LogicalTableScan(table=[[scott, EMP]])
+   *  We can clearly observe that due to the presence of the GROUP BY clause,
+   *  COUNT(*) = 0 will never evaluate to true, since rows with zero records 
won't appear
+   *  in the GROUP BY results. This produced incorrect results.
+   *
+   * <p>Rewrite Aggregate as:
+   *   SELECT d.deptno FROM dept d
+   *     JOIN (
+   *          SELECT true AS cs, deptno
+   *          FROM (
+   *              SELECT d2.deptno,
+   *                     CASE WHEN cnt0 IS NOT NULL THEN cnt0 ELSE 0 END AS cnt
+   *              FROM (
+   *                  SELECT d2.deptno, COUNT(e.empno) AS cnt0
+   *                  FROM
+   *                    (SELECT deptno FROM dept GROUP BY deptno) d2
+   *                  LEFT JOIN (
+   *                      SELECT * FROM emp WHERE deptno IS NOT NULL ) e
+   *                  ON d2.deptno IS NOT DISTINCT FROM e.deptno
+   *                  GROUP BY d2.deptno
+   *              ) AS left_join
+   *          ) AS case_count
+   *          WHERE cnt = 0
+   *     ) AS d0 ON d.deptno = d0.deptno
+   * Corresponding plan:
+   * [01] LogicalProject(DEPTNO=[$0])
+   * [02]   LogicalJoin(condition=[=($0, $2)], joinType=[inner])
+   * [03]     LogicalProject(DEPTNO=[$0])
+   * [04]       LogicalTableScan(table=[[scott, DEPT]])
+   * [05]     LogicalProject(cs=[true], DEPTNO=[$0])
+   * [06]       LogicalFilter(condition=[=(0, $1)])
+   * [07]         LogicalProject(DEPTNO=[$0], EXPR$0=[CASE(IS NOT NULL($2), 
$2, 0)])
+   * [08]           LogicalJoin(condition=[IS NOT DISTINCT FROM($0, $1)], 
joinType=[left])
+   * [09]             LogicalAggregate(group=[{0}])
+   * [10]               LogicalProject(DEPTNO=[$0])
+   * [11]                 LogicalTableScan(table=[[scott, DEPT]])
+   * [12]             LogicalAggregate(group=[{0}], EXPR$0=[COUNT()])

Review Comment:
   Does this include the optimization of aggregation pushdown? From the 
perspective of intuitiveness and ensuring a one-to-one correspondence between 
the code and the execution plan, I think the plan posted here should be the one 
without the aggregation pushdown optimization.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] [CALCITE-7010] The well-known count bug [calcite]

Reply via email to