Re: [PR] [CALCITE-7010] The well-known count bug [calcite]

via GitHub Tue, 17 Jun 2025 14:00:48 -0700


mihaibudiu commented on code in PR #4375:
URL: https://github.com/apache/calcite/pull/4375#discussion_r2152962977



##########
core/src/main/java/org/apache/calcite/sql2rel/RelDecorrelator.java:
##########
@@ -760,6 +766,177 @@ protected RexNode removeCorrelationExpr(
 
     RelNode newRel = relBuilder.build();
 
+    for (AggregateCall aggCall : rel.getAggCallList()) {
+      if (aggCall.getAggregation() instanceof SqlCountAggFunction) {
+        parentPropagatesNullValues = false;
+        break;
+      }
+    }
+
+    // Special case where the group by is static (i.e., aggregation functions 
without group by).
+    //
+    // Background:
+    //   For the query:
+    //     SELECT SUM(salary), COUNT(name) FROM A;
+    //   When table A is empty, it returns [null, 0].
+    //   But for
+    //     SELECT SUM(salary), COUNT(name) FROM A group by id
+    //   When table A is empty, it returns empty. This causes result mismatch.
+    // In the general decorrelation framework, we add corVar as an additional 
groupKey to
+    // rewrite Correlate as JOIN. (See the code above for details) This means 
that when the input
+    // is empty, the default return value of aggregate functions without a 
GROUP BY clause
+    // is ignored. This leads to inconsistent results.
+    //
+    // We refer to this situation as: `The well-known count bug`,
+    // More details about this issue: Optimization of Nested SQL Queries 
Revisited
+    // (https://dl.acm.org/doi/pdf/10.1145/38714.38723)
+    //
+    // To handle this situation, we ensure aggregated result output through 
pre-join

Review Comment:
   the terminology of pre-join is confusing; perhaps you mean you are using a 
LEFT JOIN to ensure that an output is always produced.



##########
core/src/main/java/org/apache/calcite/sql2rel/RelDecorrelator.java:
##########
@@ -760,6 +766,177 @@ protected RexNode removeCorrelationExpr(
 
     RelNode newRel = relBuilder.build();
 
+    for (AggregateCall aggCall : rel.getAggCallList()) {
+      if (aggCall.getAggregation() instanceof SqlCountAggFunction) {
+        parentPropagatesNullValues = false;
+        break;
+      }
+    }
+
+    // Special case where the group by is static (i.e., aggregation functions 
without group by).
+    //
+    // Background:
+    //   For the query:
+    //     SELECT SUM(salary), COUNT(name) FROM A;
+    //   When table A is empty, it returns [null, 0].
+    //   But for
+    //     SELECT SUM(salary), COUNT(name) FROM A group by id
+    //   When table A is empty, it returns empty. This causes result mismatch.
+    // In the general decorrelation framework, we add corVar as an additional 
groupKey to
+    // rewrite Correlate as JOIN. (See the code above for details) This means 
that when the input
+    // is empty, the default return value of aggregate functions without a 
GROUP BY clause
+    // is ignored. This leads to inconsistent results.
+    //
+    // We refer to this situation as: `The well-known count bug`,
+    // More details about this issue: Optimization of Nested SQL Queries 
Revisited
+    // (https://dl.acm.org/doi/pdf/10.1145/38714.38723)
+    //
+    // To handle this situation, we ensure aggregated result output through 
pre-join
+    //
+    // Given the SQL:
+    //   SELECT deptno FROM dept d
+    //     WHERE 0 = ( SELECT COUNT(*) FROM emp e WHERE d.deptno = e.deptno)
+    // Corresponding plan:
+    //    LogicalProject(DEPTNO=[$0])
+    //      LogicalCorrelate(correlation=[$cor0], joinType=[inner], 
requiredColumns=[{0}])
+    //        LogicalProject(DEPTNO=[$0])
+    //          LogicalTableScan(table=[[scott, DEPT]])
+    //        LogicalProject(cs=[true])
+    //          LogicalFilter(condition=[=(0, $0)])
+    //            LogicalAggregate(group=[{}], EXPR$0=[COUNT()])
+    //              LogicalFilter(condition=[=($cor0.DEPTNO, $7)])
+    //                LogicalTableScan(table=[[scott, EMP]])
+    //
+    // The regular rewrite as:
+    //   SELECT d.deptno FROM dept d
+    //     JOIN (
+    //         SELECT true, e.deptno FROM emp e WHERE e.deptno IS NOT NULL
+    //         GROUP BY e.deptno HAVING COUNT(*) = 0
+    //     ) AS d0 ON d.deptno = d0.deptno
+    // Corresponding plan:
+    //    LogicalProject(DEPTNO=[$0])
+    //      LogicalJoin(condition=[=($0, $2)], joinType=[inner])
+    //        LogicalProject(DEPTNO=[$0])
+    //          LogicalTableScan(table=[[scott, DEPT]])
+    //        LogicalProject(cs=[true], DEPTNO=[$0])
+    //          LogicalFilter(condition=[=(0, $1)])
+    //            LogicalAggregate(group=[{0}], EXPR$0=[COUNT()])
+    //              LogicalProject(DEPTNO=[$7])
+    //                LogicalFilter(condition=[IS NOT NULL($7)])
+    //                  LogicalTableScan(table=[[scott, EMP]])
+    //  We can clearly observe that due to the presence of the GROUP BY clause,
+    //  COUNT(*) = 0 will never evaluate to true, since rows with zero records 
won't appear
+    //  in the GROUP BY results. This contradicts our intended semantics.
+    //
+    // Rewrite Aggregate as:
+    //   SELECT d.deptno FROM dept d
+    //     JOIN (
+    //          SELECT true AS cs, d2.deptno

Review Comment:
   had to remove this "d2" to make this program compile



##########
core/src/main/java/org/apache/calcite/sql2rel/RelDecorrelator.java:
##########
@@ -760,6 +766,177 @@ protected RexNode removeCorrelationExpr(
 
     RelNode newRel = relBuilder.build();
 
+    for (AggregateCall aggCall : rel.getAggCallList()) {
+      if (aggCall.getAggregation() instanceof SqlCountAggFunction) {
+        parentPropagatesNullValues = false;
+        break;
+      }
+    }
+
+    // Special case where the group by is static (i.e., aggregation functions 
without group by).
+    //
+    // Background:
+    //   For the query:
+    //     SELECT SUM(salary), COUNT(name) FROM A;
+    //   When table A is empty, it returns [null, 0].
+    //   But for
+    //     SELECT SUM(salary), COUNT(name) FROM A group by id
+    //   When table A is empty, it returns empty. This causes result mismatch.
+    // In the general decorrelation framework, we add corVar as an additional 
groupKey to
+    // rewrite Correlate as JOIN. (See the code above for details) This means 
that when the input
+    // is empty, the default return value of aggregate functions without a 
GROUP BY clause
+    // is ignored. This leads to inconsistent results.
+    //
+    // We refer to this situation as: `The well-known count bug`,
+    // More details about this issue: Optimization of Nested SQL Queries 
Revisited
+    // (https://dl.acm.org/doi/pdf/10.1145/38714.38723)
+    //
+    // To handle this situation, we ensure aggregated result output through 
pre-join
+    //
+    // Given the SQL:
+    //   SELECT deptno FROM dept d
+    //     WHERE 0 = ( SELECT COUNT(*) FROM emp e WHERE d.deptno = e.deptno)
+    // Corresponding plan:
+    //    LogicalProject(DEPTNO=[$0])
+    //      LogicalCorrelate(correlation=[$cor0], joinType=[inner], 
requiredColumns=[{0}])
+    //        LogicalProject(DEPTNO=[$0])
+    //          LogicalTableScan(table=[[scott, DEPT]])
+    //        LogicalProject(cs=[true])
+    //          LogicalFilter(condition=[=(0, $0)])
+    //            LogicalAggregate(group=[{}], EXPR$0=[COUNT()])
+    //              LogicalFilter(condition=[=($cor0.DEPTNO, $7)])
+    //                LogicalTableScan(table=[[scott, EMP]])
+    //
+    // The regular rewrite as:

Review Comment:
   Rewriting this as:
   ...
   produces an incorrect result.



##########
core/src/main/java/org/apache/calcite/sql2rel/RelDecorrelator.java:
##########
@@ -760,6 +766,177 @@ protected RexNode removeCorrelationExpr(
 
     RelNode newRel = relBuilder.build();
 
+    for (AggregateCall aggCall : rel.getAggCallList()) {
+      if (aggCall.getAggregation() instanceof SqlCountAggFunction) {
+        parentPropagatesNullValues = false;
+        break;
+      }
+    }
+
+    // Special case where the group by is static (i.e., aggregation functions 
without group by).
+    //
+    // Background:
+    //   For the query:
+    //     SELECT SUM(salary), COUNT(name) FROM A;
+    //   When table A is empty, it returns [null, 0].
+    //   But for
+    //     SELECT SUM(salary), COUNT(name) FROM A group by id
+    //   When table A is empty, it returns empty. This causes result mismatch.
+    // In the general decorrelation framework, we add corVar as an additional 
groupKey to
+    // rewrite Correlate as JOIN. (See the code above for details) This means 
that when the input
+    // is empty, the default return value of aggregate functions without a 
GROUP BY clause
+    // is ignored. This leads to inconsistent results.
+    //
+    // We refer to this situation as: `The well-known count bug`,
+    // More details about this issue: Optimization of Nested SQL Queries 
Revisited
+    // (https://dl.acm.org/doi/pdf/10.1145/38714.38723)
+    //
+    // To handle this situation, we ensure aggregated result output through 
pre-join
+    //
+    // Given the SQL:
+    //   SELECT deptno FROM dept d
+    //     WHERE 0 = ( SELECT COUNT(*) FROM emp e WHERE d.deptno = e.deptno)
+    // Corresponding plan:
+    //    LogicalProject(DEPTNO=[$0])
+    //      LogicalCorrelate(correlation=[$cor0], joinType=[inner], 
requiredColumns=[{0}])
+    //        LogicalProject(DEPTNO=[$0])
+    //          LogicalTableScan(table=[[scott, DEPT]])
+    //        LogicalProject(cs=[true])
+    //          LogicalFilter(condition=[=(0, $0)])
+    //            LogicalAggregate(group=[{}], EXPR$0=[COUNT()])
+    //              LogicalFilter(condition=[=($cor0.DEPTNO, $7)])
+    //                LogicalTableScan(table=[[scott, EMP]])
+    //
+    // The regular rewrite as:
+    //   SELECT d.deptno FROM dept d
+    //     JOIN (
+    //         SELECT true, e.deptno FROM emp e WHERE e.deptno IS NOT NULL
+    //         GROUP BY e.deptno HAVING COUNT(*) = 0
+    //     ) AS d0 ON d.deptno = d0.deptno
+    // Corresponding plan:
+    //    LogicalProject(DEPTNO=[$0])
+    //      LogicalJoin(condition=[=($0, $2)], joinType=[inner])
+    //        LogicalProject(DEPTNO=[$0])
+    //          LogicalTableScan(table=[[scott, DEPT]])
+    //        LogicalProject(cs=[true], DEPTNO=[$0])
+    //          LogicalFilter(condition=[=(0, $1)])
+    //            LogicalAggregate(group=[{0}], EXPR$0=[COUNT()])
+    //              LogicalProject(DEPTNO=[$7])
+    //                LogicalFilter(condition=[IS NOT NULL($7)])
+    //                  LogicalTableScan(table=[[scott, EMP]])
+    //  We can clearly observe that due to the presence of the GROUP BY clause,
+    //  COUNT(*) = 0 will never evaluate to true, since rows with zero records 
won't appear
+    //  in the GROUP BY results. This contradicts our intended semantics.
+    //
+    // Rewrite Aggregate as:
+    //   SELECT d.deptno FROM dept d
+    //     JOIN (
+    //          SELECT true AS cs, d2.deptno
+    //          FROM (
+    //              SELECT d2.deptno,
+    //                     CASE WHEN cnt0 IS NOT NULL THEN cnt0 ELSE 0 END AS 
cnt
+    //              FROM (
+    //                  SELECT d2.deptno, COUNT(e.empno) AS cnt0
+    //                  FROM
+    //                    (SELECT deptno FROM dept GROUP BY deptno) d2
+    //                  LEFT JOIN (
+    //                      SELECT * FROM emp WHERE deptno IS NOT NULL ) e
+    //                  ON d2.deptno IS NOT DISTINCT FROM e.deptno
+    //                  GROUP BY d2.deptno
+    //              ) AS left_join
+    //          ) AS case_count
+    //          WHERE cnt = 0
+    //     ) AS d0 ON d.deptno = d0.deptno
+    // Corresponding plan:
+    //    LogicalProject(DEPTNO=[$0])
+    //      LogicalJoin(condition=[=($0, $2)], joinType=[inner])
+    //        LogicalProject(DEPTNO=[$0])
+    //          LogicalTableScan(table=[[scott, DEPT]])
+    //        LogicalProject(cs=[true], DEPTNO=[$0])
+    //          LogicalFilter(condition=[=(0, $1)])
+    //            LogicalProject(DEPTNO=[$0], EXPR$0=[CASE(IS NOT NULL($2), 
$2, 0)])
+    //              LogicalJoin(condition=[IS NOT DISTINCT FROM($0, $1)], 
joinType=[left])
+    //                LogicalAggregate(group=[{0}]
+    //                  LogicalProject(DEPTNO=[$0])
+    //                    LogicalTableScan(table=[[scott, DEPT]])
+    //                LogicalAggregate(group=[{0}], EXPR$0=[COUNT()])
+    //                  LogicalProject(DEPTNO=[$7])
+    //                    LogicalFilter(condition=[IS NOT NULL($7)])
+    //                      LogicalTableScan(table=[[scott, EMP]])
+    //
+    // Here we perform an early join, preserving all possible CorVar sets from 
the outer scope
+    // and their corresponding aggregation results. This ensures that for any 
row from the left
+    // input of the Correlation, there is always an aggregation result 
available for join output.
+    //
+    // Implementation based on: Improving Unnesting of Complex Queries
+    // 
(https://dl.gi.de/server/api/core/bitstreams/c1918e8c-6a87-4da2-930a-bfed289f2388/content)
+    if (rel.getGroupType() == Aggregate.Group.SIMPLE
+        && rel.getGroupSet().isEmpty()
+        && !frame.corDefOutputs.isEmpty()
+        && !parentPropagatesNullValues) {
+      final Pair<CorrelationId, Frame> outerFramePair = 
requireNonNull(this.frameStack.peek());
+      final Frame outFrame = outerFramePair.right;
+      RexBuilder rexBuilder = relBuilder.getRexBuilder();
+
+      int groupKeySize = (int) corDefOutputs.keySet().stream()
+          .filter(a -> a.corr.equals(outerFramePair.left))
+          .count();
+      List<RelDataTypeField> newRelFields = newRel.getRowType().getFieldList();
+      ImmutableBitSet.Builder corFieldBuilder = ImmutableBitSet.builder();
+
+      // Here we record the mapping between the original index and the new 
project.
+      // For the count, we map it as `case when x is null then 0 else x`.
+      final Map<Integer, RexNode> newProjectMap = new HashMap<>();
+      final List<RexNode> conditions = new ArrayList<>();
+      for (Map.Entry<CorDef, Integer> corDefOutput : corDefOutputs.entrySet()) 
{
+        CorDef corDef = corDefOutput.getKey();
+        Integer corIndex = corDefOutput.getValue();
+        if (corDef.corr.equals(outerFramePair.left)) {
+          int newIdx = 
requireNonNull(outFrame.oldToNewOutputs.get(corDef.field));
+          corFieldBuilder.set(newIdx);
+
+          RelDataType type = 
outFrame.r.getRowType().getFieldList().get(newIdx).getType();
+          RexNode left = new RexInputRef(corFieldBuilder.cardinality() - 1, 
type);
+          newProjectMap.put(corIndex + groupKeySize, left);
+          conditions.add(
+              relBuilder.isNotDistinctFrom(left,
+                  new RexInputRef(corIndex + groupKeySize,
+                      newRelFields.get(corIndex).getType())));
+        }
+      }
+
+      ImmutableBitSet groupSet = corFieldBuilder.build();
+      final RelNode join = relBuilder.push(outFrame.r)

Review Comment:
   you can annotate the code here with the line from the above plan that is 
being constructed
   



##########
core/src/main/java/org/apache/calcite/sql2rel/RelDecorrelator.java:
##########
@@ -760,6 +766,177 @@ protected RexNode removeCorrelationExpr(
 
     RelNode newRel = relBuilder.build();
 
+    for (AggregateCall aggCall : rel.getAggCallList()) {
+      if (aggCall.getAggregation() instanceof SqlCountAggFunction) {
+        parentPropagatesNullValues = false;
+        break;
+      }
+    }
+
+    // Special case where the group by is static (i.e., aggregation functions 
without group by).
+    //
+    // Background:
+    //   For the query:
+    //     SELECT SUM(salary), COUNT(name) FROM A;
+    //   When table A is empty, it returns [null, 0].
+    //   But for
+    //     SELECT SUM(salary), COUNT(name) FROM A group by id
+    //   When table A is empty, it returns empty. This causes result mismatch.
+    // In the general decorrelation framework, we add corVar as an additional 
groupKey to
+    // rewrite Correlate as JOIN. (See the code above for details) This means 
that when the input
+    // is empty, the default return value of aggregate functions without a 
GROUP BY clause
+    // is ignored. This leads to inconsistent results.
+    //
+    // We refer to this situation as: `The well-known count bug`,
+    // More details about this issue: Optimization of Nested SQL Queries 
Revisited
+    // (https://dl.acm.org/doi/pdf/10.1145/38714.38723)
+    //
+    // To handle this situation, we ensure aggregated result output through 
pre-join
+    //
+    // Given the SQL:
+    //   SELECT deptno FROM dept d
+    //     WHERE 0 = ( SELECT COUNT(*) FROM emp e WHERE d.deptno = e.deptno)
+    // Corresponding plan:
+    //    LogicalProject(DEPTNO=[$0])
+    //      LogicalCorrelate(correlation=[$cor0], joinType=[inner], 
requiredColumns=[{0}])
+    //        LogicalProject(DEPTNO=[$0])
+    //          LogicalTableScan(table=[[scott, DEPT]])
+    //        LogicalProject(cs=[true])
+    //          LogicalFilter(condition=[=(0, $0)])
+    //            LogicalAggregate(group=[{}], EXPR$0=[COUNT()])
+    //              LogicalFilter(condition=[=($cor0.DEPTNO, $7)])
+    //                LogicalTableScan(table=[[scott, EMP]])
+    //
+    // The regular rewrite as:
+    //   SELECT d.deptno FROM dept d
+    //     JOIN (
+    //         SELECT true, e.deptno FROM emp e WHERE e.deptno IS NOT NULL
+    //         GROUP BY e.deptno HAVING COUNT(*) = 0
+    //     ) AS d0 ON d.deptno = d0.deptno
+    // Corresponding plan:
+    //    LogicalProject(DEPTNO=[$0])
+    //      LogicalJoin(condition=[=($0, $2)], joinType=[inner])
+    //        LogicalProject(DEPTNO=[$0])
+    //          LogicalTableScan(table=[[scott, DEPT]])
+    //        LogicalProject(cs=[true], DEPTNO=[$0])
+    //          LogicalFilter(condition=[=(0, $1)])
+    //            LogicalAggregate(group=[{0}], EXPR$0=[COUNT()])
+    //              LogicalProject(DEPTNO=[$7])
+    //                LogicalFilter(condition=[IS NOT NULL($7)])
+    //                  LogicalTableScan(table=[[scott, EMP]])
+    //  We can clearly observe that due to the presence of the GROUP BY clause,
+    //  COUNT(*) = 0 will never evaluate to true, since rows with zero records 
won't appear
+    //  in the GROUP BY results. This contradicts our intended semantics.

Review Comment:
   you can't really "contradict" the semantics; you just produce an incorrect 
result.
   (you are not respecting the semantics; and it is not "our" semantics, it is 
the SQL semantics).



##########
core/src/main/java/org/apache/calcite/sql2rel/RelDecorrelator.java:
##########
@@ -760,6 +766,177 @@ protected RexNode removeCorrelationExpr(
 
     RelNode newRel = relBuilder.build();
 
+    for (AggregateCall aggCall : rel.getAggCallList()) {
+      if (aggCall.getAggregation() instanceof SqlCountAggFunction) {
+        parentPropagatesNullValues = false;
+        break;
+      }
+    }
+
+    // Special case where the group by is static (i.e., aggregation functions 
without group by).
+    //
+    // Background:
+    //   For the query:
+    //     SELECT SUM(salary), COUNT(name) FROM A;
+    //   When table A is empty, it returns [null, 0].
+    //   But for
+    //     SELECT SUM(salary), COUNT(name) FROM A group by id
+    //   When table A is empty, it returns empty. This causes result mismatch.
+    // In the general decorrelation framework, we add corVar as an additional 
groupKey to
+    // rewrite Correlate as JOIN. (See the code above for details) This means 
that when the input
+    // is empty, the default return value of aggregate functions without a 
GROUP BY clause
+    // is ignored. This leads to inconsistent results.
+    //
+    // We refer to this situation as: `The well-known count bug`,
+    // More details about this issue: Optimization of Nested SQL Queries 
Revisited
+    // (https://dl.acm.org/doi/pdf/10.1145/38714.38723)
+    //
+    // To handle this situation, we ensure aggregated result output through 
pre-join
+    //
+    // Given the SQL:
+    //   SELECT deptno FROM dept d
+    //     WHERE 0 = ( SELECT COUNT(*) FROM emp e WHERE d.deptno = e.deptno)
+    // Corresponding plan:
+    //    LogicalProject(DEPTNO=[$0])
+    //      LogicalCorrelate(correlation=[$cor0], joinType=[inner], 
requiredColumns=[{0}])
+    //        LogicalProject(DEPTNO=[$0])
+    //          LogicalTableScan(table=[[scott, DEPT]])
+    //        LogicalProject(cs=[true])
+    //          LogicalFilter(condition=[=(0, $0)])
+    //            LogicalAggregate(group=[{}], EXPR$0=[COUNT()])
+    //              LogicalFilter(condition=[=($cor0.DEPTNO, $7)])
+    //                LogicalTableScan(table=[[scott, EMP]])
+    //
+    // The regular rewrite as:
+    //   SELECT d.deptno FROM dept d
+    //     JOIN (
+    //         SELECT true, e.deptno FROM emp e WHERE e.deptno IS NOT NULL
+    //         GROUP BY e.deptno HAVING COUNT(*) = 0
+    //     ) AS d0 ON d.deptno = d0.deptno
+    // Corresponding plan:
+    //    LogicalProject(DEPTNO=[$0])
+    //      LogicalJoin(condition=[=($0, $2)], joinType=[inner])
+    //        LogicalProject(DEPTNO=[$0])
+    //          LogicalTableScan(table=[[scott, DEPT]])
+    //        LogicalProject(cs=[true], DEPTNO=[$0])
+    //          LogicalFilter(condition=[=(0, $1)])
+    //            LogicalAggregate(group=[{0}], EXPR$0=[COUNT()])
+    //              LogicalProject(DEPTNO=[$7])
+    //                LogicalFilter(condition=[IS NOT NULL($7)])
+    //                  LogicalTableScan(table=[[scott, EMP]])
+    //  We can clearly observe that due to the presence of the GROUP BY clause,
+    //  COUNT(*) = 0 will never evaluate to true, since rows with zero records 
won't appear
+    //  in the GROUP BY results. This contradicts our intended semantics.
+    //
+    // Rewrite Aggregate as:
+    //   SELECT d.deptno FROM dept d
+    //     JOIN (
+    //          SELECT true AS cs, d2.deptno
+    //          FROM (
+    //              SELECT d2.deptno,
+    //                     CASE WHEN cnt0 IS NOT NULL THEN cnt0 ELSE 0 END AS 
cnt
+    //              FROM (
+    //                  SELECT d2.deptno, COUNT(e.empno) AS cnt0
+    //                  FROM
+    //                    (SELECT deptno FROM dept GROUP BY deptno) d2
+    //                  LEFT JOIN (
+    //                      SELECT * FROM emp WHERE deptno IS NOT NULL ) e
+    //                  ON d2.deptno IS NOT DISTINCT FROM e.deptno
+    //                  GROUP BY d2.deptno
+    //              ) AS left_join
+    //          ) AS case_count
+    //          WHERE cnt = 0
+    //     ) AS d0 ON d.deptno = d0.deptno
+    // Corresponding plan:
+    //    LogicalProject(DEPTNO=[$0])

Review Comment:
   I suggest numbering these lines - see below.



##########
core/src/main/java/org/apache/calcite/sql2rel/RelDecorrelator.java:
##########
@@ -760,6 +766,177 @@ protected RexNode removeCorrelationExpr(
 
     RelNode newRel = relBuilder.build();
 
+    for (AggregateCall aggCall : rel.getAggCallList()) {
+      if (aggCall.getAggregation() instanceof SqlCountAggFunction) {
+        parentPropagatesNullValues = false;
+        break;
+      }
+    }
+
+    // Special case where the group by is static (i.e., aggregation functions 
without group by).
+    //
+    // Background:
+    //   For the query:
+    //     SELECT SUM(salary), COUNT(name) FROM A;
+    //   When table A is empty, it returns [null, 0].
+    //   But for
+    //     SELECT SUM(salary), COUNT(name) FROM A group by id
+    //   When table A is empty, it returns empty. This causes result mismatch.
+    // In the general decorrelation framework, we add corVar as an additional 
groupKey to
+    // rewrite Correlate as JOIN. (See the code above for details) This means 
that when the input
+    // is empty, the default return value of aggregate functions without a 
GROUP BY clause
+    // is ignored. This leads to inconsistent results.
+    //
+    // We refer to this situation as: `The well-known count bug`,
+    // More details about this issue: Optimization of Nested SQL Queries 
Revisited
+    // (https://dl.acm.org/doi/pdf/10.1145/38714.38723)
+    //
+    // To handle this situation, we ensure aggregated result output through 
pre-join
+    //
+    // Given the SQL:
+    //   SELECT deptno FROM dept d
+    //     WHERE 0 = ( SELECT COUNT(*) FROM emp e WHERE d.deptno = e.deptno)
+    // Corresponding plan:
+    //    LogicalProject(DEPTNO=[$0])
+    //      LogicalCorrelate(correlation=[$cor0], joinType=[inner], 
requiredColumns=[{0}])
+    //        LogicalProject(DEPTNO=[$0])
+    //          LogicalTableScan(table=[[scott, DEPT]])
+    //        LogicalProject(cs=[true])
+    //          LogicalFilter(condition=[=(0, $0)])
+    //            LogicalAggregate(group=[{}], EXPR$0=[COUNT()])
+    //              LogicalFilter(condition=[=($cor0.DEPTNO, $7)])
+    //                LogicalTableScan(table=[[scott, EMP]])
+    //
+    // The regular rewrite as:
+    //   SELECT d.deptno FROM dept d
+    //     JOIN (
+    //         SELECT true, e.deptno FROM emp e WHERE e.deptno IS NOT NULL
+    //         GROUP BY e.deptno HAVING COUNT(*) = 0
+    //     ) AS d0 ON d.deptno = d0.deptno
+    // Corresponding plan:
+    //    LogicalProject(DEPTNO=[$0])
+    //      LogicalJoin(condition=[=($0, $2)], joinType=[inner])
+    //        LogicalProject(DEPTNO=[$0])
+    //          LogicalTableScan(table=[[scott, DEPT]])
+    //        LogicalProject(cs=[true], DEPTNO=[$0])
+    //          LogicalFilter(condition=[=(0, $1)])
+    //            LogicalAggregate(group=[{0}], EXPR$0=[COUNT()])
+    //              LogicalProject(DEPTNO=[$7])
+    //                LogicalFilter(condition=[IS NOT NULL($7)])
+    //                  LogicalTableScan(table=[[scott, EMP]])
+    //  We can clearly observe that due to the presence of the GROUP BY clause,
+    //  COUNT(*) = 0 will never evaluate to true, since rows with zero records 
won't appear
+    //  in the GROUP BY results. This contradicts our intended semantics.
+    //
+    // Rewrite Aggregate as:
+    //   SELECT d.deptno FROM dept d
+    //     JOIN (
+    //          SELECT true AS cs, d2.deptno
+    //          FROM (
+    //              SELECT d2.deptno,
+    //                     CASE WHEN cnt0 IS NOT NULL THEN cnt0 ELSE 0 END AS 
cnt

Review Comment:
   so this rewrites NULL (empty set) to 0?



##########
core/src/main/java/org/apache/calcite/sql2rel/RelDecorrelator.java:
##########
@@ -760,6 +766,177 @@ protected RexNode removeCorrelationExpr(
 
     RelNode newRel = relBuilder.build();
 
+    for (AggregateCall aggCall : rel.getAggCallList()) {
+      if (aggCall.getAggregation() instanceof SqlCountAggFunction) {
+        parentPropagatesNullValues = false;
+        break;
+      }
+    }
+
+    // Special case where the group by is static (i.e., aggregation functions 
without group by).
+    //
+    // Background:
+    //   For the query:
+    //     SELECT SUM(salary), COUNT(name) FROM A;
+    //   When table A is empty, it returns [null, 0].
+    //   But for
+    //     SELECT SUM(salary), COUNT(name) FROM A group by id
+    //   When table A is empty, it returns empty. This causes result mismatch.
+    // In the general decorrelation framework, we add corVar as an additional 
groupKey to
+    // rewrite Correlate as JOIN. (See the code above for details) This means 
that when the input
+    // is empty, the default return value of aggregate functions without a 
GROUP BY clause

Review Comment:
   I would not say that the result is ignored; there simply is no result.
   I would change the comment to say: thus the result produced using a JOIN is 
incorrect.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] [CALCITE-7010] The well-known count bug [calcite]

Reply via email to