adriangb commented on code in PR #21956:
URL: https://github.com/apache/datafusion/pull/21956#discussion_r3250229819
##########
datafusion/datasource-parquet/src/source.rs:
##########
@@ -581,9 +615,63 @@ impl FileSource for ParquetSource {
encryption_factory: self.get_encryption_factory_with_config(),
max_predicate_cache_size: self.max_predicate_cache_size(),
reverse_row_groups: self.reverse_row_groups,
+ sort_order_for_reorder: self.sort_order_for_reorder.clone(),
}))
}
+ /// Reorder the files in the shared work queue so the most
+ /// "promising" files are read first, matching the strategy of
+ /// `PreparedAccessPlan::reorder_by_statistics` at the row-group
+ /// level: key off the file's `min`, and let the sort direction
+ /// follow the request (ASC by `min` for ASC requests, DESC by
+ /// `min` for DESC requests).
+ ///
+ /// Keeping both layers consistent matters because they share the
+ /// same convergence story for TopK's dynamic filter: file `i`'s
+ /// `min` is a lower bound on every row group inside it, so the
+ /// order chosen here is a natural prefix of the order
+ /// `reorder_by_statistics` will produce within each file.
+ ///
+ /// Files missing statistics sort to the end so present-stats
+ /// files run first.
+ fn reorder_files(
Review Comment:
Not required in this PR, but I wonder whether we could move some of these
helpers out into a `sort.rs` module in the same package (or similar) to keep
`source.rs` simpler.
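
For readers following along, the ordering rule described in the doc comment above could be sketched as a standalone helper, roughly the kind of thing that might live in such a `sort.rs`. This is only an illustration under stated assumptions: `FileEntry`, its `min` field, and `reorder_files_by_min` are hypothetical names, the statistic is simplified to an `i64`, and none of this is the PR's actual code.

```rust
use std::cmp::Ordering;

/// Hypothetical stand-in for a file in the work queue together with the
/// `min` statistic of its sort column (simplified to i64 for illustration).
#[derive(Debug, Clone, PartialEq)]
struct FileEntry {
    path: String,
    /// `None` models a file whose statistics are missing.
    min: Option<i64>,
}

/// Order files by `min`: ascending for ASC requests, descending for DESC
/// requests. Files without statistics sort to the end in both directions.
fn reorder_files_by_min(files: &mut [FileEntry], descending: bool) {
    // `sort_by` is stable, so files with equal keys keep their original order.
    files.sort_by(|a, b| match (a.min, b.min) {
        (Some(x), Some(y)) => {
            if descending {
                y.cmp(&x)
            } else {
                x.cmp(&y)
            }
        }
        // A file with stats always comes before a file without stats.
        (Some(_), None) => Ordering::Less,
        (None, Some(_)) => Ordering::Greater,
        (None, None) => Ordering::Equal,
    });
}
```

Handling the missing-stats case outside the direction switch is what keeps those files last for both ASC and DESC requests; simply reversing an ascending sort would move them to the front.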
##########
datafusion/core/tests/dataframe/mod.rs:
##########
@@ -3268,7 +3268,7 @@ async fn union_with_mix_of_presorted_and_explicitly_resorted_inputs_with_reparti
UnionExec
DataSourceExec: file_groups={1 group: [[{testdata}/alltypes_tiny_pages.parquet]]}, projection=[id], output_ordering=[id@0 ASC NULLS LAST], file_type=parquet
SortExec: expr=[id@0 ASC NULLS LAST], preserve_partitioning=[false]
- DataSourceExec: file_groups={1 group: [[{testdata}/alltypes_tiny_pages.parquet]]}, projection=[id], file_type=parquet
+ DataSourceExec: file_groups={1 group: [[{testdata}/alltypes_tiny_pages.parquet]]}, projection=[id], file_type=parquet, sort_order_for_reorder=[id@0 ASC NULLS LAST]
Review Comment:
Maybe `inexact_output_ordering`?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]