zhuqi-lucas commented on code in PR #21580:
URL: https://github.com/apache/datafusion/pull/21580#discussion_r3071325657
##########
datafusion/datasource-parquet/src/access_plan.rs:
##########
@@ -377,6 +382,106 @@ impl PreparedAccessPlan {
})
}
+ /// Reorder row groups by their min statistics for the given sort order.
+ ///
+ /// This helps TopK queries find optimal values first. For ASC sort,
+ /// row groups with the smallest min values come first. For DESC sort,
+ /// row groups with the largest min values come first.
+ ///
+ /// Gracefully skips reordering when:
+ /// - There is a row_selection (too complex to remap)
+ /// - 0 or 1 row groups (nothing to reorder)
+ /// - Sort expression is not a simple column reference
+ /// - Statistics are unavailable
+ pub(crate) fn reorder_by_statistics(
Review Comment:
Thanks @Dandandan for the review! That's a great extension. The
reorder_by_statistics method is generic enough to take any LexOrdering, so it
isn't tied to TopK specifically. Extending this to GROUP BY
should be a matter of:
1. Computing a preferred row group (RG) ordering from the grouping keys in the
aggregate planner
2. Passing it through to ParquetSource::sort_order_for_reorder
Happy to track this as a follow-up issue. I'll open one after this PR
lands.
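
For context, the reordering idea itself can be sketched roughly as below. This is a simplified standalone illustration, not the code in this PR: `reorder_by_min`, `Direction`, the `has_row_selection` flag, and the plain `i64` min values are all hypothetical stand-ins for the real `LexOrdering` and Parquet statistics types. It only mirrors the behaviour described in the doc comment, including the cases where reordering is skipped.

```rust
use std::cmp::Reverse;

/// Hypothetical sort direction, standing in for the sort options
/// that a real ordering would carry.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Direction {
    Asc,
    Desc,
}

/// Returns the row group indices reordered by their min statistic,
/// or `None` when reordering is skipped.
///
/// `mins[i]` is the (assumed) min value of row group `i`, or `None`
/// when statistics are unavailable for that row group.
fn reorder_by_min(
    row_groups: &[usize],
    mins: &[Option<i64>],
    has_row_selection: bool,
    dir: Direction,
) -> Option<Vec<usize>> {
    // Skip when a row selection is present or there is nothing to reorder.
    if has_row_selection || row_groups.len() <= 1 {
        return None;
    }

    // Pair each row group with its min value; give up if any is missing.
    let mut keyed: Vec<(usize, i64)> = row_groups
        .iter()
        .map(|&rg| mins.get(rg).copied().flatten().map(|m| (rg, m)))
        .collect::<Option<Vec<_>>>()?;

    // Stable sort: ASC puts the smallest min first, DESC the largest min first.
    match dir {
        Direction::Asc => keyed.sort_by_key(|&(_, m)| m),
        Direction::Desc => keyed.sort_by_key(|&(_, m)| Reverse(m)),
    }

    Some(keyed.into_iter().map(|(rg, _)| rg).collect())
}
```

In this sketch a GROUP BY caller would only need to supply a direction per grouping key; everything else stays the same, which is why the extension looks small.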
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]