[GitHub] [arrow-datafusion] alamb opened a new issue, #7196: Optimize "LIMIT" queries

via GitHub Fri, 04 Aug 2023 11:40:24 -0700


alamb opened a new issue, #7196:
URL: https://github.com/apache/arrow-datafusion/issues/7196


   ### Is your feature request related to a problem or challenge?
   
   This pattern is common:
   
   ```
   SELECT c1, c2
   FROM t
   ORDER BY c3
   LIMIT 10
   ```
   
   For example, we have queries in IOx like the following (this is the same pattern @NGA-TRAN describes in https://github.com/apache/arrow-datafusion/issues/7162):
   
   ```
   SELECT tag, value1, ...
   FROM t
   WHERE other_column = 'foo'
   ORDER BY time
   LIMIT 10
   ```
   
   
   ### Describe the solution you'd like
   
   
   If the data *IS NOT* already sorted, DataFusion currently produces a plan like:
   
   ```
   LIMIT(fetch=10)
     SORT(sort_exprs=[c3] fetch=10)
       SCAN(...)
   ```
   
   The Sort can take partial advantage of the fetch, and it will do better after @gruuya's change in https://github.com/apache/arrow-datafusion/pull/7180
   
   We can probably do better still with a special operator like the following 
that uses some specialized structure (perhaps some type of heap)
   
   ```
   TOPK(fetch=10, sort_exprs=[c3])
       SCAN(...)
   ```
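
To make the heap idea concrete, here is a minimal sketch in Python (not DataFusion code) of a bounded max-heap that keeps only the `k` smallest rows seen so far. The names `top_k`, `_Entry`, `batches`, and `sort_key` are hypothetical and chosen only for illustration; a real operator would work on Arrow record batches rather than Python lists.

```python
import heapq
from typing import Any, Callable, Iterable, List


class _Entry:
    """Heap entry with inverted ordering, so heapq (a min-heap) keeps the
    row with the LARGEST sort key at index 0."""

    __slots__ = ("key", "row")

    def __init__(self, key: Any, row: Any) -> None:
        self.key = key
        self.row = row

    def __lt__(self, other: "_Entry") -> bool:
        return self.key > other.key


def top_k(batches: Iterable[Iterable[Any]], k: int,
          sort_key: Callable[[Any], Any]) -> List[Any]:
    """Return the k rows with the smallest sort_key, in ascending order.

    Memory use is O(k) regardless of input size; each row costs at most
    O(log k) heap work. The order of rows with equal keys is unspecified.
    """
    if k <= 0:
        return []
    heap: List[_Entry] = []
    for batch in batches:
        for row in batch:
            key = sort_key(row)
            if len(heap) < k:
                heapq.heappush(heap, _Entry(key, row))
            elif key < heap[0].key:
                # The new row beats the current worst row: swap it in.
                heapq.heapreplace(heap, _Entry(key, row))
    return [e.row for e in sorted(heap, key=lambda e: e.key)]
```

Usage, mirroring `ORDER BY c3 LIMIT 3` over two input batches:

```python
batches = [
    [{"c1": "a", "c3": 7}, {"c1": "b", "c3": 3}, {"c1": "c", "c3": 9}],
    [{"c1": "d", "c3": 1}, {"c1": "e", "c3": 8}, {"c1": "f", "c3": 2}],
]
rows = top_k(batches, 3, sort_key=lambda r: r["c3"])
```

Unlike a full sort, most rows are rejected after a single comparison against the heap root, so the input never has to be buffered in full.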
   
   ### Describe alternatives you've considered
   
   If the data is already sorted the right way, DataFusion can just read the first N values and stop, as described in https://github.com/apache/arrow-datafusion/issues/7162
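
As a rough illustration of the "read the first N values and stop" case, the following Python sketch (again, not DataFusion code) pulls rows lazily from a stream of batches that is assumed to be already sorted, and stops requesting batches once `n` rows have been produced. The names `limit_sorted` and `scan` are hypothetical.

```python
from itertools import islice
from typing import Any, Iterable, List


def limit_sorted(batches: Iterable[Iterable[Any]], n: int) -> List[Any]:
    """Return the first n rows of an input that is ALREADY sorted.

    Batches are consumed lazily, so no batch beyond the one containing
    the n-th row is ever requested from the source.
    """
    rows = (row for batch in batches for row in batch)
    return list(islice(rows, n))


def scan(num_batches: int, batch_size: int, log: List[int]):
    """Hypothetical sorted source; records which batches were read."""
    for i in range(num_batches):
        log.append(i)
        yield [i * batch_size + j for j in range(batch_size)]


read_log: List[int] = []
first_five = limit_sorted(scan(1000, 4, read_log), 5)
# read_log now lists only the batches that were actually produced.
```

No sort and no heap are needed here; the cost depends on N, not on the size of the table.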
   
   
   ### Additional context
   
   _No response_


