mihaibudiu commented on code in PR #4636:
URL: https://github.com/apache/calcite/pull/4636#discussion_r2540079659
##########
core/src/test/resources/org/apache/calcite/test/RelOptRulesTest.xml:
##########
@@ -3283,11 +3285,13 @@ LogicalAggregate(group=[{0, 1}], groups=[[{0, 1}, {0},
{}]], EXPR$2=[COUNT(DISTI
</Resource>
<Resource name="planAfter">
<![CDATA[
-LogicalAggregate(group=[{0, 1}], groups=[[{0, 1}, {0}, {}]], EXPR$2=[COUNT($2)
FILTER $4], EXPR$3=[MIN($3) FILTER $5])
- LogicalProject(DEPTNO=[$0], JOB=[$1], ENAME=[$2], EXPR$3=[$3], $g_0=[=($4,
0)], $g_1=[=($4, 1)])
- LogicalAggregate(group=[{0, 1, 2}], groups=[[{0, 1, 2}, {0, 1}]],
EXPR$3=[SUM($3)], $g=[GROUPING($0, $1, $2)])
- LogicalProject(DEPTNO=[$7], JOB=[$2], ENAME=[$1], SAL=[$5])
- LogicalTableScan(table=[[CATALOG, SALES, EMP]])
+LogicalProject(DEPTNO=[$0], JOB=[$1], EXPR$2=[CAST(CASE(=($11, 0), $2, =($11,
1), $3, =($11, 3), $4, null:BIGINT)):BIGINT NOT NULL], EXPR$3=[CASE(=($11, 0),
$5, =($11, 1), $6, =($11, 3), $7, null:INTEGER)])
+ LogicalFilter(condition=[OR(AND(=($11, 0), >($8, 0)), AND(=($11, 1), >($9,
0)), =($11, 3))])
+ LogicalAggregate(group=[{0, 1}], groups=[[{0, 1}, {0}, {}]],
EXPR$2=[COUNT($2) FILTER $4], EXPR$2=[COUNT($2) FILTER $6], EXPR$2=[COUNT($2)
FILTER $8], EXPR$3=[MIN($3) FILTER $5], EXPR$3=[MIN($3) FILTER $7],
EXPR$3=[MIN($3) FILTER $9], $g_present_0=[COUNT() FILTER $5],
$g_present_1=[COUNT() FILTER $7], $g_present_2=[COUNT() FILTER $9],
$g_final=[GROUPING($0, $1)])
Review Comment:
why are there multiple fields with the same name in this aggregate?
##########
core/src/main/java/org/apache/calcite/rel/rules/AggregateExpandDistinctAggregatesRule.java:
##########
@@ -425,39 +426,141 @@ private static RelBuilder
convertSingletonDistinct(RelBuilder relBuilder,
return relBuilder;
}
+ /**
+ * Rewrite aggregates that use GROUPING SETS. High-level steps:
+ *
+ * <ol>
+ * <li>
+ * Build the set of grouping keys (<code>groupSets</code>) required by the
rewrite.
+ * This includes: the original <code>groupSet</code>, each declared
grouping set from
+ * the SQL, and any additional grouping-set variants required to expand
DISTINCT
+ * aggregates (that is, the union of distinct-argument columns with each
grouping set,
+ * plus any filter columns).
+ * </li>
+ * <li>
+ * Compute the <code>fullGroupSet</code> (the union of all bits) and run a
bottom
+ * aggregate over that set to produce a <code>GROUPING(...)</code> value
and
+ * per-variant boolean marker columns.
+ * </li>
+ * <li>
+ * From the bottom result, construct upper aggregates that select the
appropriate
+ * values for each declared grouping set (using the grouping id and
presence
+ * indicators to route rows correctly).
+ * </li>
+ * <li>
+ * Finally, apply a filter to remove internal-only rows that do not
contribute to
+ * any declared grouping-set result.
+ * </li>
+ * </ol>
+ *
+ * <p>Example (ROLLUP + DISTINCT):
+ * <ul>
+ * <li>SQL = {@code
+ * SELECT deptno, COUNT(DISTINCT sal)
+ * FROM emp
+ * GROUP BY ROLLUP(deptno)
+ * }.
+ * </li>
+ * <li>Remarks and high-level steps performed by the rewrite:
+ * <ol>
+ * <li>Expand ROLLUP: {@code ROLLUP(deptno)} produces two grouping sets:
+ * (deptno) and (). The rewrite must produce results for both the
per-department
+ * counts and the grand-total.</li>
+ * <li>Enumerate distinct grouping-set combinations: because the aggregate
is
+ * {@code COUNT(DISTINCT sal)}, for each grouping set we must consider
the
+ * distinct-argument unioned with the grouping set. That yields the
+ * grouping-set combinations {@code {deptno, sal}} (for the (deptno)
+ * grouping set) and {@code {sal}} (for the empty grouping set).</li>
+ * <li>Compute {@code fullGroupSet}: take the union of all bits that
appear in
+ * any grouping set or distinct-arg combination. In this example the
+ * {@code fullGroupSet} is {@code {deptno, sal}}. A bottom aggregate
runs
+ * over the {@code fullGroupSet} to materialize rows at the granularity
+ * needed (for example per-(deptno, sal) and per-(sal)) and to produce
+ * a {@code GROUPING(...)} value and boolean markers identifying which
+ * declared grouping-set(s) each row contributes to (see
+ * {@code distinctFilterArgMap} and the code that builds {@code
filters}).</li>
+ * <li>Project boolean selectors: convert the numeric {@code GROUPING(...)}
+ * result into boolean selector columns (and combine with any filter
+ * arguments). These selectors indicate whether a bottom-row should be
+ * counted for a particular declared grouping set. The code building
the
+ * selector projection appears where {@code nodeZ} is used and aliased
+ * into {@code $g_<value>} (and variants with filter suffixes).</li>
+ * <li>Build upper aggregates: for each declared grouping set the rewrite
+ * creates an upper aggregate that groups by the remapped top-group key
+ * and uses the boolean selectors to pick the rows that belong to that
+ * grouping set. For the example:
+ * <ul>
+ * <li>For grouping set {@code (deptno)}: group by {@code deptno} and
count
+ * distinct {@code sal} using only rows where the selector for
+ * {@code {deptno, sal}} is true.</li>
+ * <li>For grouping set {@code ()}: compute the grand total
+ * {@code COUNT(DISTINCT sal)} by counting distinct {@code sal} over
rows
+ * where the selector for {@code {sal}} is true.</li>
+ * </ul>
+ * The code that builds these upper aggregates is where {@code
upperAggCalls}
+ * and {@code aggCallOrdinals} are populated.
+ * </li>
+ * <li>Filter out internal rows: some bottom rows exist only to make
+ * internal-only grouping-set combinations possible; they never
+ * contribute to any original grouping-set result. The rewrite filters
+ * those rows using the conditions assembled in {@code
keepConditions}.</li>
+ * </ol>
+ * </li>
+ * </ul>
Review Comment:
this is a clear improvement, but my suggestion was to take a pair of plans
from one of the test you wrote, e.g.:
```
LogicalAggregate(group=[{0, 1}], groups=[[{0, 1}, {0}, {}]],
EXPR$2=[COUNT(DISTINCT $2)])
LogicalProject(DEPTNO=[$7], JOB=[$2], ENAME=[$1])
LogicalTableScan(table=[[CATALOG, SALES, EMP]])
```
and
```
1 LogicalProject(DEPTNO=[$0], JOB=[$1], EXPR$2=[CAST(CASE(=($8, 0), $2,
=($8, 1), $3, =($8, 3), $4, null:BIGINT)):BIGINT NOT NULL])
2 LogicalFilter(condition=[OR(AND(=($8, 0), >($5, 0)), AND(=($8, 1), >($6,
0)), =($8, 3))])
3 LogicalAggregate(group=[{0, 1}], groups=[[{0, 1}, {0}, {}]],
EXPR$2=[COUNT($2) FILTER $3], EXPR$2=[COUNT($2) FILTER $5], EXPR$2=[COUNT($2)
FILTER $7], $g_present_0=[COUNT() FILTER $4], $g_present_1=[COUNT() FILTER $6],
$g_present_2=[COUNT() FILTER $8], $g_final=[GROUPING($0, $1)])
4 LogicalProject(DEPTNO=[$0], JOB=[$1], ENAME=[$2], $g_0=[=($3, 0)],
$g_1=[=($3, 1)], $g_2=[=($3, 2)], $g_3=[=($3, 3)], $g_6=[=($3, 6)], $g_7=[=($3,
7)])
5 LogicalAggregate(group=[{0, 1, 2}], groups=[[{0, 1, 2}, {0, 1}, {0,
2}, {0}, {2}, {}]], $g=[GROUPING($0, $1, $2)])
6 LogicalProject(DEPTNO=[$7], JOB=[$2], ENAME=[$1])
7 LogicalTableScan(table=[[CATALOG, SALES, EMP]])
```
and explain in the code as you go along: "here I am building line 7 from the
result plan, which has the following role"...
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]