zhuqi-lucas commented on code in PR #21956:
URL: https://github.com/apache/datafusion/pull/21956#discussion_r3252443486
##########
datafusion/datasource-parquet/src/source.rs:
##########
@@ -581,9 +615,63 @@ impl FileSource for ParquetSource {
encryption_factory: self.get_encryption_factory_with_config(),
max_predicate_cache_size: self.max_predicate_cache_size(),
reverse_row_groups: self.reverse_row_groups,
+ sort_order_for_reorder: self.sort_order_for_reorder.clone(),
}))
}
+ /// Reorder the files in the shared work queue so the most
+ /// "promising" files are read first, matching the strategy of
+ /// `PreparedAccessPlan::reorder_by_statistics` at the row-group
+ /// level: key off the file's `min`, and let the sort direction
+ /// follow the request (ASC by `min` for ASC requests, DESC by
+ /// `min` for DESC requests).
+ ///
+ /// Keeping both layers consistent matters because they share the
+ /// same convergence story for TopK's dynamic filter: file `i`'s
+ /// `min` is a lower bound on every row group inside it, so the
+ /// order chosen here is a natural prefix of the order
+ /// `reorder_by_statistics` will produce within each file.
+ ///
+ /// Files missing statistics sort to the end so present-stats
+ /// files run first.
+ fn reorder_files(
Review Comment:
Thanks @adriangb, good point; addressed in the latest PR.
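
For anyone following the thread, the ordering described in the doc comment above can be sketched roughly as below. This is only a minimal illustration, not the code in the PR: `FileEntry`, its `min: Option<i64>` field, and `reorder_files_sketch` are hypothetical names, and the real statistics are not plain integers.

```rust
use std::cmp::Ordering;

/// Hypothetical stand-in for a file and the `min` statistic of its sort column.
#[derive(Debug, Clone, PartialEq)]
struct FileEntry {
    path: String,
    /// `None` models a file whose statistics are missing.
    min: Option<i64>,
}

/// Sort by `min` in the requested direction (ASC by `min` for ASC requests,
/// DESC by `min` for DESC requests). Files without statistics go to the end
/// in both cases. `sort_by` is stable, so ties keep their original order.
fn reorder_files_sketch(files: &mut [FileEntry], descending: bool) {
    files.sort_by(|a, b| match (a.min, b.min) {
        (Some(x), Some(y)) => {
            if descending {
                y.cmp(&x)
            } else {
                x.cmp(&y)
            }
        }
        // A file with statistics always sorts before one without.
        (Some(_), None) => Ordering::Less,
        (None, Some(_)) => Ordering::Greater,
        (None, None) => Ordering::Equal,
    });
}
```

The missing-stats case is handled in the comparator rather than by reversing an ascending sort, because a plain reverse would move those files to the front for DESC requests.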
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]