gene-bordegaray commented on code in PR #22657:
URL: https://github.com/apache/datafusion/pull/22657#discussion_r3401722093
##########
datafusion/catalog-listing/src/table.rs:
##########
@@ -690,12 +715,45 @@ impl ListingTable {
/// Get the list of files for a scan as well as the file level statistics.
/// The list is grouped to let the execution plan know how the files should
/// be distributed to different threads / executors.
+ ///
+ /// If [`ListingOptions::output_partitioning`] is set, the returned file
+ /// groups preserve that declared partition count, including empty trailing
+ /// groups when needed, rather than using
+ /// [`ListingOptions::target_partitions`].
Review Comment:
Addressed with tighter docs on ListingOptions::target_partitions /
output_partitioning. The current behavior is that output_partitioning is
authoritative when set, and target_partitions is only used when no output
partitioning is declared.
##########
datafusion/catalog-listing/src/table.rs:
##########
@@ -505,12 +506,31 @@ impl TableProvider for ListingTable {
// at the same time. This is because the limit should be applied after
the filters are applied.
let statistic_file_limit = if filters.is_empty() { limit } else { None
};
+ let declared_output_partitioning = if partition_filters.is_empty() {
+ self.options.output_partitioning.clone()
+ } else {
+ // Partition pruning can remove files before grouping. Without a
+ // stable file-to-declared-partition mapping, regrouping the
+ // remaining files could shift them into the wrong partition index.
+ None
Review Comment:
Addressed. For declared output partitioning, listing now keeps the full file
set through file-group assignment and applies partition filters within each
declared group afterward. That preserves the declared partition indexes while
still pruning files.
##########
datafusion/catalog-listing/src/table.rs:
##########
@@ -505,12 +506,31 @@ impl TableProvider for ListingTable {
// at the same time. This is because the limit should be applied after
the filters are applied.
let statistic_file_limit = if filters.is_empty() { limit } else { None
};
+ let declared_output_partitioning = if partition_filters.is_empty() {
+ self.options.output_partitioning.clone()
+ } else {
+ // Partition pruning can remove files before grouping. Without a
+ // stable file-to-declared-partition mapping, regrouping the
+ // remaining files could shift them into the wrong partition index.
+ None
+ };
+ let target_partitions = declared_output_partitioning
+ .as_ref()
+ .map(Partitioning::partition_count)
+ .unwrap_or(self.options.target_partitions);
+
Review Comment:
Addressed. with_output_partitioning aligns target_partitions from the
declared partition count, and scan planning validates the final file group
count against output_partitioning.partition_count() before building the scan.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]