yonatan-sevenai opened a new issue, #21374:
URL: https://github.com/apache/datafusion/issues/21374
### Describe the bug
When the SQL unparser encounters a Projection → Limit → Aggregate (or
Projection → Sort → Aggregate) plan shape where the Aggregate's aliases are
inlined (no intermediate Projection between
Limit and Aggregate), it emits the aggregate expressions twice — once in
the outer SELECT and once inside a spurious derived subquery.
This happens because:
1. reconstruct_select_statement reaches through the Limit/Sort via
`find_agg_node_within_select` to claim the Aggregate, setting the SELECT items
and GROUP BY
2. When recursion then hits the Limit/Sort arm, it sees `already_projected
== true` and wraps everything below in a derived subquery
3. The Aggregate ends up emitted in both the outer SELECT and inside the
derived subquery
PR with fix coming momentarily
### To Reproduce
1. Parser-roundtrip query (works correctly because the parser inserts an
intermediate Projection)
```
SELECT __agg_0 AS "min(j1_id)", __agg_1 AS "max(j1_id)"
FROM (
SELECT min(j1_rename) AS __agg_0, max(j1_rename) AS __agg_1
FROM (SELECT j1_id AS j1_rename FROM j1) AS bla
LIMIT 20
)
```
This roundtrips correctly because the parser produces a plan with a
Projection between the Limit and
Aggregate:
```
Projection: __agg_0 AS min(j1_id), __agg_1 AS max(j1_id)
Limit: skip=0, fetch=20
Projection: min(bla.j1_rename) AS __agg_0, max(bla.j1_rename) AS
__agg_1 ← parser inserts this
Aggregate: groupBy=[[]], aggr=[[min(bla.j1_rename),
max(bla.j1_rename)]]
SubqueryAlias: bla
```
2. Manually constructed plan (exposes the bug)
When the aliases are inlined into the Aggregate (no intermediate
Projection), as optimizers or plan builders may produce:
```
Projection: __agg_0 AS max1(j1_id), __agg_1 AS max2(j1_id)
Limit: skip=0, fetch=20
Aggregate: groupBy=[[]], aggr=[[max(bla.j1_rename) AS __agg_0,
max(bla.j1_rename) AS __agg_1]]
SubqueryAlias: bla
Projection: j1.j1_id AS j1_rename
TableScan: j1
```
Actual output:
```
SELECT max(bla.j1_rename) AS "max1(j1_id)", max(bla.j1_rename) AS
"max2(j1_id)"
FROM (SELECT max(bla.j1_rename) AS __agg_0, max(bla.j1_rename) AS __agg_1
FROM (SELECT j1.j1_id AS j1_rename FROM j1) AS bla LIMIT 20)
```
The aggregate is emitted twice and the outer SELECT references
bla.j1_rename which is out of scope.
3. Same bug with Sort instead of Limit
```
Projection: __agg_0 AS max1(j1_id)
Sort: __agg_0 ASC
Aggregate: groupBy=[[]], aggr=[[max(bla.j1_rename) AS __agg_0]]
SubqueryAlias: bla
```
Produces the same spurious derived subquery pattern.
### Expected behavior
The unparser should emit a single SELECT statement:
```
SELECT max(bla.j1_rename) AS "max1(j1_id)", max(bla.j1_rename) AS
"max2(j1_id)"
FROM (SELECT j1.j1_id AS j1_rename FROM j1) AS bla
LIMIT 20
```
When the Projection claims an Aggregate by reaching through a Limit or
Sort, those intermediate nodes should be folded into the same SELECT statement
rather than triggering a derived
subquery.
### Additional context
The fix is to detect in the Projection arm when
`reconstruct_select_statement `found an Aggregate and the Projection's direct
child is a Limit or Sort. In that case, fold the Limit/Sort
clauses (LIMIT/OFFSET/ORDER BY) into the current query and recurse into
the Limit/Sort's child, skipping the node so it never reaches the
already_projected guard.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]