zhuqi-lucas opened a new issue, #21317: URL: https://github.com/apache/datafusion/issues/21317
**Is your feature request related to a problem or challenge?**

Currently sort pushdown reorders **files** by min/max statistics to achieve sort elimination. But within each file, row groups are read in their original order (or reversed via `reverse_row_groups`). If row groups within a file have non-overlapping ranges, reordering them by statistics could further optimize scan order.

**Describe the solution you'd like**

Pass the desired sort order into the `FileOpener` and have it re-sort row groups based on their min/max statistics to match the scan's desired ordering. This would be especially effective with morselized scans where TopK queries could terminate after reading a single row group.

For example, a file with 4 row groups:

```
RG1: min=100, max=200
RG2: min=1, max=50
RG3: min=300, max=400
RG4: min=51, max=99
```

For `ORDER BY col ASC LIMIT 10`, reordering to `[RG2, RG4, RG1, RG3]` lets TopK find the smallest values first and potentially skip later row groups entirely via dynamic filters.

**Additional context**

This was suggested by @adriangb in https://github.com/apache/datafusion/pull/21182#discussion_r3000108065:

> I do think there's one more trick we could have up our sleeves: instead of only reversing row group orders we could pass the desired sort order into the opener and have it re-sort the row groups based on stats to try to match the scan's desired ordering. This might be especially effective once we have morselized scans since we could terminate after a single row group for TopK queries.

Related code: [`ParquetSource::try_pushdown_sort`](https://github.com/apache/datafusion/blob/main/datafusion/datasource-parquet/src/source.rs) currently only supports reverse scan (`reverse_row_groups=true`) and could be extended to do statistics-based row group reordering.

Parent issue: https://github.com/apache/datafusion/issues/17348
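To make the proposed reordering concrete, here is a minimal, self-contained Rust sketch of the idea. It is not DataFusion code: `RowGroupStats`, `reorder_row_groups`, and `ranges_are_disjoint` are hypothetical names invented for illustration, and the sort column is assumed to be a single `i64` column with exact min/max statistics available for every row group.

```rust
/// Hypothetical per-row-group statistics for the sort column.
/// This is an illustrative type, not part of DataFusion or the parquet crate.
#[derive(Debug, Clone)]
struct RowGroupStats {
    /// Position of the row group within the file (0-based).
    index: usize,
    min: i64,
    max: i64,
}

/// Returns row group indices in the order they could be read.
/// Ascending scans read the smallest `min` first;
/// descending scans read the largest `max` first.
fn reorder_row_groups(stats: &[RowGroupStats], descending: bool) -> Vec<usize> {
    let mut sorted: Vec<&RowGroupStats> = stats.iter().collect();
    if descending {
        sorted.sort_by(|a, b| b.max.cmp(&a.max));
    } else {
        sorted.sort_by(|a, b| a.min.cmp(&b.min));
    }
    sorted.iter().map(|rg| rg.index).collect()
}

/// Returns true when no two row group ranges overlap, i.e. the case
/// the issue describes as benefiting most from reordering.
fn ranges_are_disjoint(stats: &[RowGroupStats]) -> bool {
    let mut sorted: Vec<&RowGroupStats> = stats.iter().collect();
    sorted.sort_by_key(|rg| rg.min);
    sorted.windows(2).all(|pair| pair[0].max < pair[1].min)
}
```

With the four row groups from the example above (RG1 to RG4 stored at indices 0 to 3), `reorder_row_groups(&stats, false)` yields `[1, 3, 0, 2]`, which corresponds to `[RG2, RG4, RG1, RG3]`. A real implementation would additionally have to handle missing or inexact statistics, nulls, and multi-column sort keys, which this sketch deliberately ignores.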
