alamb commented on code in PR #22009:
URL: https://github.com/apache/datafusion/pull/22009#discussion_r3195889606
##########
datafusion/optimizer/src/eliminate_limit.rs:
##########
@@ -210,7 +210,11 @@ mod tests {
Sort: test.a ASC NULLS LAST, fetch=3
Limit: skip=0, fetch=2
Aggregate: groupBy=[[test.a]], aggr=[[sum(test.b)]]
- TableScan: test
+ RightSemi Join: test.a = test.a
Review Comment:
Did you consider adding a `LIMIT` directly to the GroupByHash operator? I am
not sure how much extra complexity that is, but you could probably model it
similarly to a `SUM(a FILTER key IN <hash table>)`
Though it might add a lot more complexity 🤔
Or we could implement a special GroupBy operator, similar to what
@avantgardnerio implemented for GROUP BY with a limit
(https://github.com/apache/datafusion/tree/main/datafusion/physical-plan/src/aggregates/topk)
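To make the first idea concrete, here is a rough sketch of a hash aggregation that stops admitting new groups once it holds `limit` keys. It assumes there is no `ORDER BY`, so any `limit` groups are a valid answer. `limited_group_sum` is a hypothetical standalone helper over `(key, value)` pairs, not a DataFusion API, and it ignores batching, nulls, and multiple aggregates:

```rust
use std::collections::HashMap;

/// Hypothetical helper (not part of DataFusion): computes SUM(value) per key,
/// but keeps at most `limit` groups. Rows whose key was not admitted are
/// dropped, which behaves like `SUM(a FILTER key IN <hash table>)`.
fn limited_group_sum(rows: &[(i64, i64)], limit: usize) -> HashMap<i64, i64> {
    let mut groups: HashMap<i64, i64> = HashMap::with_capacity(limit);
    for &(key, value) in rows {
        if let Some(sum) = groups.get_mut(&key) {
            // Key already admitted: keep accumulating so its sum stays complete.
            *sum += value;
        } else if groups.len() < limit {
            // Still room: admit this key as a new group.
            groups.insert(key, value);
        }
        // Otherwise the table is full and the row belongs to an unadmitted
        // key, so it is skipped.
    }
    groups
}
```

The whole input still has to be scanned so that the sums of the admitted groups are complete, but the hash table never grows past `limit` entries. For example, calling it on `[(1, 10), (2, 20), (3, 30), (1, 5)]` with `limit = 2` keeps only keys 1 and 2.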
##########
datafusion/optimizer/src/push_down_limit.rs:
##########
@@ -237,6 +252,99 @@ fn transformed_limit(
Ok(Transformed::yes(make_limit(skip, fetch, Arc::new(input))))
}
+/// Rewrite `LIMIT K (GROUP BY keys, aggs)` into a key preselection followed
Review Comment:
I would probably describe this more like:
```sql
SELECT aggs(...) GROUP BY keys LIMIT k
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]