zhuqi-lucas opened a new issue, #21317:
URL: https://github.com/apache/datafusion/issues/21317

   **Is your feature request related to a problem or challenge?**
   
   Currently sort pushdown reorders **files** by min/max statistics to achieve 
sort elimination. But within each file, row groups are read in their original 
order (or reversed via `reverse_row_groups`). If row groups within a file have 
non-overlapping ranges, reordering them by statistics could further optimize 
scan order.
   
   **Describe the solution you'd like**
   
   Pass the desired sort order into the `FileOpener` and have it re-sort row 
groups based on their min/max statistics to match the scan's desired ordering. 
This would be especially effective with morselized scans where TopK queries 
could terminate after reading a single row group.
   
   For example, a file with 4 row groups:
   ```
   RG1: min=100, max=200
   RG2: min=1,   max=50
   RG3: min=300, max=400
   RG4: min=51,  max=99
   ```
   
   For `ORDER BY col ASC LIMIT 10`, reordering to `[RG2, RG4, RG1, RG3]` lets 
TopK find the smallest values first and potentially skip later row groups 
entirely via dynamic filters.
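
The reordering in this example can be sketched in a few lines of Rust. This is only an illustration of the idea, not DataFusion or parquet code: `RowGroupStats`, `scan_order`, `ranges_are_disjoint_asc`, and `can_skip_asc` are hypothetical names, the sort column is assumed to be a single `i64` column with exact min/max statistics, and row groups are identified by their 0-based position in the file.

```rust
/// Min/max statistics of the sort column for one row group.
/// Hypothetical type for illustration; real Parquet metadata is typed per column.
#[derive(Debug, Clone, Copy)]
struct RowGroupStats {
    min: i64,
    max: i64,
}

/// Returns row group positions (0-based) in the order they should be scanned.
/// ASC scans start with the smallest `min`; DESC scans start with the largest `max`.
fn scan_order(stats: &[RowGroupStats], descending: bool) -> Vec<usize> {
    let mut order: Vec<usize> = (0..stats.len()).collect();
    if descending {
        order.sort_by(|&a, &b| stats[b].max.cmp(&stats[a].max));
    } else {
        order.sort_by_key(|&i| stats[i].min);
    }
    order
}

/// Checks whether, in the given ascending scan order, every row group's range
/// ends strictly before the next one begins (the ranges do not overlap).
fn ranges_are_disjoint_asc(stats: &[RowGroupStats], order: &[usize]) -> bool {
    order
        .windows(2)
        .all(|pair| stats[pair[0]].max < stats[pair[1]].min)
}

/// For an ascending TopK, a row group cannot contribute once its smallest value
/// exceeds the current threshold (assumed here to be the largest value still
/// held in the TopK heap, as a dynamic filter would track it).
fn can_skip_asc(rg: &RowGroupStats, threshold: i64) -> bool {
    rg.min > threshold
}
```

With the four row groups above, `scan_order(&stats, false)` yields `[1, 3, 0, 2]`, which corresponds to `[RG2, RG4, RG1, RG3]`. Assuming RG2 holds at least 10 rows, the TopK threshold after reading it is at most 50, so `can_skip_asc` returns true for RG4, RG1, and RG3.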
   
   **Additional context**
   
   This was suggested by @adriangb in 
https://github.com/apache/datafusion/pull/21182#discussion_r3000108065:
   
   > I do think there's one more trick we could have up our sleeves: instead of 
only reversing row group orders we could pass the desired sort order into the 
opener and have it re-sort the row groups based on stats to try to match the 
scan's desired ordering. This might be especially effective once we have 
morselized scans since we could terminate after a single row group for TopK 
queries.
   
   Related code: 
[`ParquetSource::try_pushdown_sort`](https://github.com/apache/datafusion/blob/main/datafusion/datasource-parquet/src/source.rs)
 currently supports only a reverse scan (`reverse_row_groups=true`); it could 
be extended to reorder row groups based on their statistics.
   
   Parent issue: https://github.com/apache/datafusion/issues/17348


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

