Re: [PR] Optimize group by with limit [datafusion]

via GitHub Wed, 06 May 2026 06:48:16 -0700


alamb commented on code in PR #22009:
URL: https://github.com/apache/datafusion/pull/22009#discussion_r3195889606



##########
datafusion/optimizer/src/eliminate_limit.rs:
##########
@@ -210,7 +210,11 @@ mod tests {
           Sort: test.a ASC NULLS LAST, fetch=3
             Limit: skip=0, fetch=2
               Aggregate: groupBy=[[test.a]], aggr=[[sum(test.b)]]
-                TableScan: test
+                RightSemi Join: test.a = test.a

Review Comment:
   Did you consider adding a `LIMIT` directly to the GroupByHash operator? I am 
not sure how much extra complexity that would be, but you could probably model it 
similarly to `SUM(a) FILTER (WHERE key IN <hash table>)`
   
   Though it might add a lot more complexity 🤔 
   
   Or we could implement a special GroupBy operator, similar to what 
@avantgardnerio implemented for GROUP BY with a limit 
(https://github.com/apache/datafusion/tree/main/datafusion/physical-plan/src/aggregates/topk)
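
To make the first idea concrete, here is a minimal sketch of a limit applied inside a hash aggregation. It is not DataFusion code: `sum_group_by_with_limit` is a hypothetical helper that uses only the standard library, and it assumes a single integer key, a `SUM` aggregate, and no `ORDER BY`, so any `k` groups are an acceptable result. Once the hash table holds `k` keys, rows with unseen keys are dropped, while rows for admitted keys keep updating their sums, which is the "filter on key in hash table" behavior.

```rust
use std::collections::HashMap;

/// Hypothetical helper (not a DataFusion API): computes
/// `SELECT key, SUM(value) ... GROUP BY key LIMIT k` over in-memory rows.
/// Assumes no ORDER BY, so the first k distinct keys seen are kept.
fn sum_group_by_with_limit(rows: &[(i64, i64)], k: usize) -> HashMap<i64, i64> {
    let mut sums: HashMap<i64, i64> = HashMap::new();
    for &(key, value) in rows {
        if let Some(sum) = sums.get_mut(&key) {
            // Key already admitted: keep aggregating so its sum stays exact.
            *sum += value;
        } else if sums.len() < k {
            // Table still has room: admit the new key.
            sums.insert(key, value);
        }
        // Otherwise the key is not in the full table, so the row is skipped.
    }
    sums
}
```

With rows `(1, 10), (2, 20), (3, 30), (1, 5), (2, 1), (4, 7)` and `k = 2`, only keys 1 and 2 are admitted, and their sums still include the later rows for those keys. Which keys are kept depends on input order, so this only applies when the limit sits directly on the aggregate with no ordering requirement in between.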



##########
datafusion/optimizer/src/push_down_limit.rs:
##########
@@ -237,6 +252,99 @@ fn transformed_limit(
     Ok(Transformed::yes(make_limit(skip, fetch, Arc::new(input))))
 }
 
+/// Rewrite `LIMIT K (GROUP BY keys, aggs)` into a key preselection followed

Review Comment:
   I would probably describe this more like:
   ```sql
   SELECT aggs(...) GROUP BY keys LIMIT k
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


