Dandandan commented on code in PR #21580:
URL: https://github.com/apache/datafusion/pull/21580#discussion_r3071295010
##########
datafusion/datasource-parquet/src/access_plan.rs:
##########
@@ -377,6 +382,106 @@ impl PreparedAccessPlan {
})
}
+ /// Reorder row groups by their min statistics for the given sort order.
+ ///
+ /// This helps TopK queries find optimal values first. For ASC sort,
+ /// row groups with the smallest min values come first. For DESC sort,
+ /// row groups with the largest min values come first.
+ ///
+ /// Gracefully skips reordering when:
+ /// - There is a row_selection (too complex to remap)
+ /// - 0 or 1 row groups (nothing to reorder)
+ /// - Sort expression is not a simple column reference
+ /// - Statistics are unavailable
+ pub(crate) fn reorder_by_statistics(
Review Comment:
I think @adriangb had the great idea to also order by grouping keys which can
* reduce cardinality within partitions (partition-local state can be smaller)
* allow for better cache locality (row groups with more equal keys are
grouped together)
Doesn't have to be in this PR but perhaps we can think about how it fits in.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]