zhuqi-lucas commented on issue #21733:
URL: https://github.com/apache/datafusion/issues/21733#issuecomment-4281453035
Thanks @alamb! Totally agree on having a generic API rather than hardcoding
the sort heuristic.
I'm thinking something like a `FileReorderStrategy` trait:
```rust
use std::fmt::Debug;

/// Strategy for reordering files in the shared work queue
/// to maximize dynamic filter efficiency.
pub trait FileReorderStrategy: Send + Sync + Debug {
    /// Reorder the files before placing them in the shared work queue.
    /// The default is a no-op (original order preserved).
    fn reorder(&self, files: Vec<PartitionedFile>) -> Vec<PartitionedFile> {
        files
    }
}
```
Then `SharedWorkSource` accepts an optional strategy, and different
heuristics can be plugged in:
- **TopK sort**: reorder by sort column min/max statistics (most important
case)
- **Filter selectivity**: reorder by estimated selectivity from file-level
statistics (files likely to be fully pruned go last)
- **Future**: any data-source-specific heuristic
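To make the TopK case concrete, here is a rough sketch of what such a strategy might look like. It is only an illustration under assumptions: `FileStats` is a hypothetical stand-in for `PartitionedFile` that carries a single `i64` sort-column min/max, `TopKSortReorder` is a made-up name, and the ordering rule (smallest min first for ascending, largest max first for descending, files without statistics last) is one plausible heuristic rather than a settled design.

```rust
use std::cmp::Ordering;
use std::fmt::Debug;

/// Hypothetical stand-in for `PartitionedFile`: only the sort-column
/// min/max statistics that this sketch needs.
#[derive(Debug, Clone, PartialEq)]
pub struct FileStats {
    pub path: String,
    pub min: Option<i64>,
    pub max: Option<i64>,
}

/// Same shape as the proposed trait, but over the stand-in type.
pub trait FileReorderStrategy: Send + Sync + Debug {
    fn reorder(&self, files: Vec<FileStats>) -> Vec<FileStats>;
}

/// Assumed TopK heuristic: read the files most likely to contain the
/// top rows first so the dynamic filter threshold tightens early.
#[derive(Debug)]
pub struct TopKSortReorder {
    /// `true` for `ORDER BY col DESC`, `false` for `ASC`.
    pub descending: bool,
}

impl FileReorderStrategy for TopKSortReorder {
    fn reorder(&self, mut files: Vec<FileStats>) -> Vec<FileStats> {
        // Stable sort: files with equal keys keep their original order.
        files.sort_by(|a, b| {
            // ASC compares min values, DESC compares max values.
            let (ka, kb) = if self.descending {
                (a.max, b.max)
            } else {
                (a.min, b.min)
            };
            match (ka, kb) {
                (Some(x), Some(y)) => {
                    if self.descending {
                        y.cmp(&x)
                    } else {
                        x.cmp(&y)
                    }
                }
                // Files without statistics go last.
                (Some(_), None) => Ordering::Less,
                (None, Some(_)) => Ordering::Greater,
                (None, None) => Ordering::Equal,
            }
        });
        files
    }
}
```

A real implementation would need to compare typed statistics rather than `i64` and handle multi-column sort keys, which this sketch leaves out.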
The strategy could be created by `DataSource::create_sibling_state()` or
passed through `FileScanConfig`, since the data source knows what optimization
makes sense for the query.
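For the plumbing side, a minimal sketch of a work queue that takes an optional strategy could look like the following. Everything here is simplified and assumed: this `SharedWorkSource` is a toy queue of file paths, not the actual struct, and `new`, `next_file`, and `ReverseOrder` are hypothetical names used only to show the strategy being applied once at construction time.

```rust
use std::collections::VecDeque;
use std::fmt::Debug;
use std::sync::{Arc, Mutex};

/// Same shape as the proposed trait, but over plain file paths.
pub trait FileReorderStrategy: Send + Sync + Debug {
    fn reorder(&self, files: Vec<String>) -> Vec<String>;
}

/// Toy stand-in for the shared work queue (the real type may differ).
#[derive(Debug)]
pub struct SharedWorkSource {
    queue: Mutex<VecDeque<String>>,
}

impl SharedWorkSource {
    /// Applies the strategy once, before any partition pulls work.
    /// `None` keeps the original file order.
    pub fn new(
        files: Vec<String>,
        strategy: Option<Arc<dyn FileReorderStrategy>>,
    ) -> Self {
        let files = match strategy {
            Some(s) => s.reorder(files),
            None => files,
        };
        Self {
            queue: Mutex::new(files.into()),
        }
    }

    /// Hands out the next file to whichever partition asks first.
    pub fn next_file(&self) -> Option<String> {
        self.queue.lock().unwrap().pop_front()
    }
}

/// Trivial strategy used only to show the plumbing.
#[derive(Debug)]
pub struct ReverseOrder;

impl FileReorderStrategy for ReverseOrder {
    fn reorder(&self, mut files: Vec<String>) -> Vec<String> {
        files.reverse();
        files
    }
}
```

Taking `Option<Arc<dyn FileReorderStrategy>>` keeps the no-strategy path identical to today's behavior, and the data source stays free to decide which strategy, if any, to construct.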
I'll start with the TopK sort case as the first implementation and design
the API so that other heuristics can be added later.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]