Re: [PR] [CALCITE-5465] Rule of AGGREGATE_EXPAND_DISTINCT_AGGREGATES produces an incorrect plan when sql has distinct agg-call with rollup [calcite]

via GitHub Tue, 18 Nov 2025 17:07:39 -0800


mihaibudiu commented on code in PR #4636:
URL: https://github.com/apache/calcite/pull/4636#discussion_r2540079659



##########
core/src/test/resources/org/apache/calcite/test/RelOptRulesTest.xml:
##########
@@ -3283,11 +3285,13 @@ LogicalAggregate(group=[{0, 1}], groups=[[{0, 1}, {0}, 
{}]], EXPR$2=[COUNT(DISTI
     </Resource>
     <Resource name="planAfter">
       <![CDATA[
-LogicalAggregate(group=[{0, 1}], groups=[[{0, 1}, {0}, {}]], EXPR$2=[COUNT($2) 
FILTER $4], EXPR$3=[MIN($3) FILTER $5])
-  LogicalProject(DEPTNO=[$0], JOB=[$1], ENAME=[$2], EXPR$3=[$3], $g_0=[=($4, 
0)], $g_1=[=($4, 1)])
-    LogicalAggregate(group=[{0, 1, 2}], groups=[[{0, 1, 2}, {0, 1}]], 
EXPR$3=[SUM($3)], $g=[GROUPING($0, $1, $2)])
-      LogicalProject(DEPTNO=[$7], JOB=[$2], ENAME=[$1], SAL=[$5])
-        LogicalTableScan(table=[[CATALOG, SALES, EMP]])
+LogicalProject(DEPTNO=[$0], JOB=[$1], EXPR$2=[CAST(CASE(=($11, 0), $2, =($11, 
1), $3, =($11, 3), $4, null:BIGINT)):BIGINT NOT NULL], EXPR$3=[CASE(=($11, 0), 
$5, =($11, 1), $6, =($11, 3), $7, null:INTEGER)])
+  LogicalFilter(condition=[OR(AND(=($11, 0), >($8, 0)), AND(=($11, 1), >($9, 
0)), =($11, 3))])
+    LogicalAggregate(group=[{0, 1}], groups=[[{0, 1}, {0}, {}]], 
EXPR$2=[COUNT($2) FILTER $4], EXPR$2=[COUNT($2) FILTER $6], EXPR$2=[COUNT($2) 
FILTER $8], EXPR$3=[MIN($3) FILTER $5], EXPR$3=[MIN($3) FILTER $7], 
EXPR$3=[MIN($3) FILTER $9], $g_present_0=[COUNT() FILTER $5], 
$g_present_1=[COUNT() FILTER $7], $g_present_2=[COUNT() FILTER $9], 
$g_final=[GROUPING($0, $1)])

Review Comment:
   why are there multiple fields with the same name in this aggregate?
   



##########
core/src/main/java/org/apache/calcite/rel/rules/AggregateExpandDistinctAggregatesRule.java:
##########
@@ -425,39 +426,141 @@ private static RelBuilder 
convertSingletonDistinct(RelBuilder relBuilder,
     return relBuilder;
   }
 
+  /**
+   * Rewrite aggregates that use GROUPING SETS. High-level steps:
+   *
+   * <ol>
+   * <li>
+   *   Build the set of grouping keys (<code>groupSets</code>) required by the 
rewrite.
+   *   This includes: the original <code>groupSet</code>, each declared 
grouping set from
+   *   the SQL, and any additional grouping-set variants required to expand 
DISTINCT
+   *   aggregates (that is, the union of distinct-argument columns with each 
grouping set,
+   *   plus any filter columns).
+   * </li>
+   * <li>
+   *   Compute the <code>fullGroupSet</code> (the union of all bits) and run a 
bottom
+   *   aggregate over that set to produce a <code>GROUPING(...)</code> value 
and
+   *   per-variant boolean marker columns.
+   * </li>
+   * <li>
+   *   From the bottom result, construct upper aggregates that select the 
appropriate
+   *   values for each declared grouping set (using the grouping id and 
presence
+   *   indicators to route rows correctly).
+   * </li>
+   * <li>
+   *   Finally, apply a filter to remove internal-only rows that do not 
contribute to
+   *   any declared grouping-set result.
+   * </li>
+   * </ol>
+   *
+   * <p>Example (ROLLUP + DISTINCT):
+   * <ul>
+   * <li>SQL = {@code
+   *   SELECT deptno, COUNT(DISTINCT sal)
+   *   FROM emp
+   *   GROUP BY ROLLUP(deptno)
+   * }.
+   * </li>
+   * <li>Remarks and high-level steps performed by the rewrite:
+   *   <ol>
+   *   <li>Expand ROLLUP: {@code ROLLUP(deptno)} produces two grouping sets:
+   *       (deptno) and (). The rewrite must produce results for both the 
per-department
+   *       counts and the grand-total.</li>
+   *   <li>Enumerate distinct grouping-set combinations: because the aggregate 
is
+   *       {@code COUNT(DISTINCT sal)}, for each grouping set we must consider 
the
+   *       distinct-argument unioned with the grouping set. That yields the
+   *       grouping-set combinations {@code {deptno, sal}} (for the (deptno)
+   *       grouping set) and {@code {sal}} (for the empty grouping set).</li>
+   *   <li>Compute {@code fullGroupSet}: take the union of all bits that 
appear in
+   *       any grouping set or distinct-arg combination. In this example the
+   *       {@code fullGroupSet} is {@code {deptno, sal}}. A bottom aggregate 
runs
+   *       over the {@code fullGroupSet} to materialize rows at the granularity
+   *       needed (for example per-(deptno, sal) and per-(sal)) and to produce
+   *       a {@code GROUPING(...)} value and boolean markers identifying which
+   *       declared grouping-set(s) each row contributes to (see
+   *       {@code distinctFilterArgMap} and the code that builds {@code 
filters}).</li>
+   *   <li>Project boolean selectors: convert the numeric {@code GROUPING(...)}
+   *       result into boolean selector columns (and combine with any filter
+   *       arguments). These selectors indicate whether a bottom-row should be
+   *       counted for a particular declared grouping set. The code building 
the
+   *       selector projection appears where {@code nodeZ} is used and aliased
+   *       into {@code $g_<value>} (and variants with filter suffixes).</li>
+   *   <li>Build upper aggregates: for each declared grouping set the rewrite
+   *       creates an upper aggregate that groups by the remapped top-group key
+   *       and uses the boolean selectors to pick the rows that belong to that
+   *       grouping set. For the example:
+   *     <ul>
+   *       <li>For grouping set {@code (deptno)}: group by {@code deptno} and 
count
+   *       distinct {@code sal} using only rows where the selector for
+   *       {@code {deptno, sal}} is true.</li>
+   *       <li>For grouping set {@code ()}: compute the grand total
+   *       {@code COUNT(DISTINCT sal)} by counting distinct {@code sal} over 
rows
+   *       where the selector for {@code {sal}} is true.</li>
+   *     </ul>
+   *   The code that builds these upper aggregates is where {@code 
upperAggCalls}
+   *   and {@code aggCallOrdinals} are populated.
+   *   </li>
+   *   <li>Filter out internal rows: some bottom rows exist only to make
+   *       internal-only grouping-set combinations possible; they never
+   *       contribute to any original grouping-set result. The rewrite filters
+   *       those rows using the conditions assembled in {@code 
keepConditions}.</li>
+   *   </ol>
+   * </li>
+   * </ul>

Review Comment:
   this is a clear improvement, but my suggestion was to take a pair of plans 
from one of the test you wrote, e.g.:
   ```
   LogicalAggregate(group=[{0, 1}], groups=[[{0, 1}, {0}, {}]], 
EXPR$2=[COUNT(DISTINCT $2)])
     LogicalProject(DEPTNO=[$7], JOB=[$2], ENAME=[$1])
       LogicalTableScan(table=[[CATALOG, SALES, EMP]])
   ```
   and
   
   ```
   1 LogicalProject(DEPTNO=[$0], JOB=[$1], EXPR$2=[CAST(CASE(=($8, 0), $2, 
=($8, 1), $3, =($8, 3), $4, null:BIGINT)):BIGINT NOT NULL])
   2  LogicalFilter(condition=[OR(AND(=($8, 0), >($5, 0)), AND(=($8, 1), >($6, 
0)), =($8, 3))])
   3    LogicalAggregate(group=[{0, 1}], groups=[[{0, 1}, {0}, {}]], 
EXPR$2=[COUNT($2) FILTER $3], EXPR$2=[COUNT($2) FILTER $5], EXPR$2=[COUNT($2) 
FILTER $7], $g_present_0=[COUNT() FILTER $4], $g_present_1=[COUNT() FILTER $6], 
$g_present_2=[COUNT() FILTER $8], $g_final=[GROUPING($0, $1)])
   4      LogicalProject(DEPTNO=[$0], JOB=[$1], ENAME=[$2], $g_0=[=($3, 0)], 
$g_1=[=($3, 1)], $g_2=[=($3, 2)], $g_3=[=($3, 3)], $g_6=[=($3, 6)], $g_7=[=($3, 
7)])
   5        LogicalAggregate(group=[{0, 1, 2}], groups=[[{0, 1, 2}, {0, 1}, {0, 
2}, {0}, {2}, {}]], $g=[GROUPING($0, $1, $2)])
   6          LogicalProject(DEPTNO=[$7], JOB=[$2], ENAME=[$1])
   7            LogicalTableScan(table=[[CATALOG, SALES, EMP]])
   ```
   
   and explain in the code as you go along: "here I am building line 7 from the 
result plan, which has the following role"...



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] [CALCITE-5465] Rule of AGGREGATE_EXPAND_DISTINCT_AGGREGATES produces an incorrect plan when sql has distinct agg-call with rollup [calcite]

Reply via email to