alamb commented on code in PR #21580:
URL: https://github.com/apache/datafusion/pull/21580#discussion_r3088558848
##########
benchmarks/queries/sort_pushdown_inexact/q1.sql:
##########
@@ -0,0 +1,8 @@
+-- Inexact path: TopK + DESC LIMIT on ASC-declared file.
Review Comment:
Perhaps we can add the benchmark queries as a separate PR to make it easier
to run automated benchmarks against this code
##########
datafusion/datasource-parquet/src/opener.rs:
##########
@@ -1223,9 +1230,20 @@ impl RowGroupsPrunedParquetOpen {
// Prepare the access plan (extract row groups and row selection)
let mut prepared_plan = access_plan.prepare(rg_metadata)?;
- // Potentially reverse the access plan for performance.
- // See `ParquetSource::try_pushdown_sort` for the rationale.
- if prepared.reverse_row_groups {
+ // Reorder row groups by statistics if sort order is known.
Review Comment:
I wonder if this would be easier to understand if it were part of
"preparing" an access plan rather than something that was applied as a post
pass
##########
datafusion/datasource-parquet/src/opener.rs:
##########
@@ -154,6 +155,10 @@ pub(super) struct ParquetMorselizer {
pub max_predicate_cache_size: Option<usize>,
/// Whether to read row groups in reverse order
pub reverse_row_groups: bool,
+ /// Optional sort order used to reorder row groups by their min/max
statistics.
Review Comment:
This would also avoid having to plumb the option through so much (aka only
pass down an AccessPlanOptimizer rather than a bunch of different flags)
##########
datafusion/datasource-parquet/src/opener.rs:
##########
@@ -154,6 +155,10 @@ pub(super) struct ParquetMorselizer {
pub max_predicate_cache_size: Option<usize>,
/// Whether to read row groups in reverse order
pub reverse_row_groups: bool,
+ /// Optional sort order used to reorder row groups by their min/max
statistics.
Review Comment:
Given we already have `reverse_row_groups` and there may be other per-row
group optimizations, I wonder if we should make some sort of more general
purpose API
Maybe something like
```rust
trait AccessPlanOptimizer {
fn optimize(&self, plan: AccessPlan) -> AccessPlan;
}
```
And then we could test the various optimization passes together without
having to actually run a query plan 🤔
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]