Re: [I] feat: global file reorder in shared work queue for TopK optimization [datafusion]

via GitHub Mon, 20 Apr 2026 07:05:21 -0700


zhuqi-lucas commented on issue #21733:
URL: https://github.com/apache/datafusion/issues/21733#issuecomment-4281453035


   Thanks @alamb! Totally agree on having a generic API rather than hardcoding 
the sort heuristic.
   
   I'm thinking something like a `FileReorderStrategy` trait:
   
   ```rust
   /// Strategy for reordering files in the shared work queue
   /// to maximize dynamic filter efficiency.
   pub trait FileReorderStrategy: Send + Sync + Debug {
       /// Reorder the files before placing them in the shared work queue.
       /// The default is a no-op (original order preserved).
       fn reorder(&self, files: Vec<PartitionedFile>) -> Vec<PartitionedFile> {
           files
       }
   }
   ```
   
   Then `SharedWorkSource` accepts an optional strategy, and different 
heuristics can be plugged in:
   - **TopK sort**: reorder by sort column min/max statistics (most important 
case)
   - **Filter selectivity**: reorder by estimated selectivity from file-level 
statistics (files likely to be fully pruned go last)
   - **Future**: any data-source-specific heuristic
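   
   To make the TopK case concrete, here is a rough sketch of a strategy for 
   `ORDER BY col ASC LIMIT k`. It is not the actual implementation: the 
   `PartitionedFile` below is a simplified stand-in with hypothetical 
   `sort_min` / `sort_max` fields (the real type carries statistics 
   differently), and `TopKAscending` and `file` are illustrative names. 
   Files with the smallest minimum go first so the dynamic filter threshold 
   can tighten early, and files without statistics go last.

   ```rust
   use std::fmt::Debug;

   /// Simplified stand-in for DataFusion's `PartitionedFile` (assumption).
   #[derive(Debug, Clone, PartialEq)]
   pub struct PartitionedFile {
       pub path: String,
       /// Hypothetical min/max of the sort column from file-level statistics.
       pub sort_min: Option<i64>,
       pub sort_max: Option<i64>,
   }

   pub trait FileReorderStrategy: Send + Sync + Debug {
       fn reorder(&self, files: Vec<PartitionedFile>) -> Vec<PartitionedFile>;
   }

   /// Hypothetical strategy for an ascending TopK query.
   #[derive(Debug)]
   pub struct TopKAscending;

   impl FileReorderStrategy for TopKAscending {
       fn reorder(&self, mut files: Vec<PartitionedFile>) -> Vec<PartitionedFile> {
           // Stable sort: smallest min first, ties broken by max,
           // files with no statistics pushed to the end.
           files.sort_by_key(|f| (f.sort_min.is_none(), f.sort_min, f.sort_max));
           files
       }
   }

   /// Small helper to build a stand-in file for the example.
   fn file(path: &str, min: Option<i64>, max: Option<i64>) -> PartitionedFile {
       PartitionedFile { path: path.to_string(), sort_min: min, sort_max: max }
   }

   fn main() {
       let files = vec![
           file("c.parquet", Some(200), Some(300)),
           file("nostats.parquet", None, None),
           file("a.parquet", Some(0), Some(100)),
           file("b.parquet", Some(100), Some(200)),
       ];
       let ordered = TopKAscending.reorder(files);
       let paths: Vec<&str> = ordered.iter().map(|f| f.path.as_str()).collect();
       println!("{:?}", paths);
   }
   ```

   A descending TopK would sort by the maximum in reverse instead. Either 
   way the strategy only changes the order in which files are read, never 
   which files are read.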
   
   The strategy could be created by `DataSource::create_sibling_state()` or 
passed through `FileScanConfig`, since the data source knows what optimization 
makes sense for the query.
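   
   For the wiring, a minimal sketch of a queue that takes an optional 
   strategy might look like the following. `SharedWorkQueue`, `next_file`, 
   and `ReverseOrder` are hypothetical names for illustration only, and 
   `PartitionedFile` is again a simplified stand-in. This does not reflect 
   the actual `SharedWorkSource` API.

   ```rust
   use std::collections::VecDeque;
   use std::fmt::Debug;
   use std::sync::{Arc, Mutex};

   /// Simplified stand-in for DataFusion's `PartitionedFile` (assumption).
   #[derive(Debug, Clone, PartialEq)]
   pub struct PartitionedFile {
       pub path: String,
   }

   pub trait FileReorderStrategy: Send + Sync + Debug {
       fn reorder(&self, files: Vec<PartitionedFile>) -> Vec<PartitionedFile>;
   }

   /// Toy strategy, used only to show that the hook is applied.
   #[derive(Debug)]
   pub struct ReverseOrder;

   impl FileReorderStrategy for ReverseOrder {
       fn reorder(&self, mut files: Vec<PartitionedFile>) -> Vec<PartitionedFile> {
           files.reverse();
           files
       }
   }

   /// Hypothetical shared queue that sibling partitions pull files from.
   #[derive(Debug)]
   pub struct SharedWorkQueue {
       files: Mutex<VecDeque<PartitionedFile>>,
   }

   impl SharedWorkQueue {
       /// Apply the strategy once, up front; `None` keeps the original order.
       pub fn new(
           files: Vec<PartitionedFile>,
           strategy: Option<Arc<dyn FileReorderStrategy>>,
       ) -> Self {
           let files = match strategy {
               Some(s) => s.reorder(files),
               None => files,
           };
           Self { files: Mutex::new(VecDeque::from(files)) }
       }

       /// Hand out the next file to whichever partition asks first.
       pub fn next_file(&self) -> Option<PartitionedFile> {
           self.files.lock().unwrap().pop_front()
       }
   }

   fn main() {
       let files = vec![
           PartitionedFile { path: "a.parquet".to_string() },
           PartitionedFile { path: "b.parquet".to_string() },
       ];
       let strategy: Arc<dyn FileReorderStrategy> = Arc::new(ReverseOrder);
       let queue = SharedWorkQueue::new(files, Some(strategy));
       while let Some(f) = queue.next_file() {
           println!("{}", f.path);
       }
   }
   ```

   Reordering once at construction keeps the per-file pull path unchanged, 
   so partitions that do not care about ordering pay nothing extra.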
   
   Will start with the TopK sort case as the first implementation and design 
the API to be extensible for others.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

