alamb opened a new issue, #6899:
URL: https://github.com/apache/arrow-datafusion/issues/6899
### Is your feature request related to a problem or challenge?
DataFusion optimizes queries like `... ORDER BY value LIMIT 10` by only
keeping the top 10 ("limit") rows when sorting which is great!
Another common pattern (that we also have in IOx)
(https://github.com/influxdata/influxdb_iox/pull/8187/files#r1257834347) is
queries like the following to select the top N values "per partition"
```sql
SELECT ...
ROW_NUMBER() OVER (PARTITION BY value1, ORDER BY value2) as rn
WHERE
rn < 10
```
Currently the plan will be something like:
```
Filter(rn < 10)
WindowExec(ROW_NUMBER...)
Sort(value1, value2)
```
The problem with this plan is that it will sort (and copy) the ENTIRE input
even when the query only needs the first 10 rows of each partition
### Describe the solution you'd like
It would be awesome to optimize this case somehow so that it did not need to
sort the entire input (and somehow could only keep the top N values per
partition). I am not sure how easy this would be to do for sorting
### Describe alternatives you've considered
Maybe we could at least teach the window operator to only emit the top N
values per partition if there was a row number predicate at at least save some
of that work -- the sort would still be required, but at least the window
operator would do less work
### Additional context
_No response_
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]