[I] Re-sort file groups in FileScanConfig to satisfy ordering requirements [datafusion]

via GitHub Fri, 09 Jan 2026 14:39:12 -0800


adriangb opened a new issue, #19724:
URL: https://github.com/apache/datafusion/issues/19724


   `FileScanConfig::try_pushdown_sort` could support re-sorting or re-arranging 
the `FileGroup`s themselves using min/max statistics to satisfy the query's 
preferred sort order.
   
   This is described in section 5.3 of [Pruning in Snowflake: Working Smarter, 
Not Harder](https://arxiv.org/pdf/2504.11540).
   
   Some considerations are:
   - If we start rebuilding groups, what should the parallelism be? On the one 
hand, it would make sense to try to match the original parallelism. On the other 
hand, that may not be possible: if we can only satisfy the sort ordering by 
making groups `[[f1, f2, f3], [f4]]`, it may be worth having lopsided groups, 
or fewer or more groups. Matching the original parallelism may not even be 
optimal: in a TopK query, reduced parallelism can lead to faster queries if we 
end up scanning only 1 group or even 1 file, since all of the work opening the 
others is wasted effort. This is also known as `ProgressiveEval` and is 
discussed in https://github.com/apache/datafusion/issues/15191.
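
   To make the regrouping idea concrete, here is a minimal sketch, not 
DataFusion's actual API. `FileRange` and `regroup_by_range` are hypothetical 
names, and the sketch assumes each file is already sorted internally on a 
single `i64` sort key with known min/max statistics. It sorts files by their 
minimum value and appends each file to the first group whose last file ends at 
or before the new file starts, opening a new group only when the ranges 
overlap.

```rust
/// Hypothetical stand-in for a file plus its min/max statistics on the sort key.
#[derive(Debug, Clone, PartialEq)]
struct FileRange {
    name: &'static str,
    min: i64,
    max: i64,
}

/// Rebuild groups so that, within each group, file ranges do not overlap and
/// appear in ascending order. Reading a group's files back to back then yields
/// sorted output, assuming every file is itself sorted on the key.
/// The number of groups produced depends on the overlap between files, not on
/// any target parallelism.
fn regroup_by_range(mut files: Vec<FileRange>) -> Vec<Vec<FileRange>> {
    // Order candidates by their minimum value (ties broken by maximum).
    files.sort_by_key(|f| (f.min, f.max));

    let mut groups: Vec<Vec<FileRange>> = Vec::new();
    for file in files {
        // Find the first group whose last file ends at or before this file starts.
        let slot = groups
            .iter()
            .position(|g| g.last().map_or(true, |last| last.max <= file.min));
        match slot {
            Some(i) => groups[i].push(file),
            // Overlaps with the tail of every existing group: open a new one.
            None => groups.push(vec![file]),
        }
    }
    groups
}
```

   With files covering `f1 = [0, 10]`, `f2 = [11, 20]`, `f3 = [21, 30]` and 
`f4 = [5, 15]`, this produces the lopsided `[[f1, f2, f3], [f4]]` layout 
mentioned above, which shows how the resulting group count can differ from the 
original parallelism.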


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


