adriangb opened a new issue, #19724: URL: https://github.com/apache/datafusion/issues/19724
`FileScanConfig::try_pushdown_sort` could support re-sorting or re-arranging the `FileGroup`s themselves using min/max statistics to satisfy the query's preferred sort order. This is described in section 5.3 of [Pruning in Snowflake: Working Smarter, Not Harder](https://arxiv.org/pdf/2504.11540).

Some considerations are:

- If we start rebuilding groups, what should the parallelism be? On the one hand, it would make sense to try to match the original parallelism. On the other hand, that may not be possible (e.g. if we can only satisfy the sort ordering by making groups `[[f1, f2, f3], [f4]]`, maybe it's worth it to have lopsided groups, or fewer or more groups) or even optimal (in a TopK query, reduced parallelism can lead to faster queries if we end up scanning only 1 group or even 1 file, since all of the work opening the others is wasted effort; this is also known as `ProgressiveEval` and is discussed in https://github.com/apache/datafusion/issues/15191).
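
To make the regrouping idea concrete, here is a minimal standalone sketch. It does not use DataFusion's actual types or APIs: `FileStats` and `group_by_min_max` are hypothetical names invented for illustration, and the sort column is assumed to be a single `i64` with known min/max per file. It packs files first-fit into groups so that, within a group, each file's range starts at or after the previous file's range ends.

```rust
/// Hypothetical stand-in for a file plus the min/max statistics of its
/// sort column (not a DataFusion type).
#[derive(Debug)]
struct FileStats {
    name: &'static str,
    min: i64,
    max: i64,
}

/// Packs files into groups such that, inside each group, the files are
/// ordered and their [min, max] ranges do not overlap. Reading the files
/// of one group back to back then yields data in ascending order, provided
/// each file is itself sorted.
fn group_by_min_max(mut files: Vec<FileStats>) -> Vec<Vec<FileStats>> {
    // Visit files in ascending order of their minimum value.
    files.sort_by_key(|f| (f.min, f.max));

    let mut groups: Vec<Vec<FileStats>> = Vec::new();
    for file in files {
        // First-fit: reuse the first group whose last file ends at or
        // before the point where this file starts.
        let slot = groups
            .iter()
            .position(|g| g.last().map_or(true, |last| last.max <= file.min));
        match slot {
            Some(i) => groups[i].push(file),
            // No existing group can take the file without overlap.
            None => groups.push(vec![file]),
        }
    }
    groups
}
```

For example, with `f1 = [0, 10]`, `f2 = [11, 20]`, `f3 = [21, 30]` and `f4 = [5, 25]`, this produces the lopsided grouping `[[f1, f2, f3], [f4]]` mentioned above. The number of groups falls out of how the file ranges overlap rather than from a target parallelism, which is exactly the trade-off this issue raises.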
