zhuqi-lucas commented on code in PR #21182:
URL: https://github.com/apache/datafusion/pull/21182#discussion_r3001780400
##########
datafusion/datasource/src/file_scan_config.rs:
##########
@@ -1289,13 +1384,281 @@ impl FileScanConfig {
new_config.file_source = new_file_source;
- // Phase 1: Clear output_ordering for Inexact
- // (we're only reversing row groups, not guaranteeing perfect ordering)
- if !is_exact {
+ // Sort files within groups by statistics when not reversing
+ let all_non_overlapping = if !reverse_file_groups {
+ if let Some(sort_order) = LexOrdering::new(order.iter().cloned()) {
+ let projected_schema = new_config.projected_schema()?;
+ let projection_indices = new_config
+ .file_source
+ .projection()
+ .as_ref()
+ .and_then(|p| ordered_column_indices_from_projection(p));
+ let result = Self::sort_files_within_groups_by_statistics(
+ &new_config.file_groups,
+ &sort_order,
+ &projected_schema,
+ projection_indices.as_deref(),
+ );
+ new_config.file_groups = result.file_groups;
+ result.all_non_overlapping
+ } else {
+ false
+ }
+ } else {
+ // When reversing, files are already reversed above. We skip
+ // statistics-based sorting here because it would undo the
reversal.
+ // Note: reverse path is always Inexact, so all_non_overlapping
+ // is not used (is_exact is false).
+ false
+ };
+
+ if is_exact && all_non_overlapping {
+ // Truly exact: within-file ordering guaranteed and files are
non-overlapping.
+ //
+ // When there are multiple groups, redistribute files using
consecutive
+ // assignment so that each group remains non-overlapping AND
groups are
+ // ordered relative to each other. This enables:
+ // - No SortExec per partition (files in each group are sorted &
non-overlapping)
+ // - SPM cheaply merges ordered streams (O(n) merge)
+ // - Parallel I/O across partitions
+ //
+ // Before (bin-packing may interleave):
+ // Group 0: [file_01(1-10), file_03(21-30)] ← gap, interleaved
with group 1
+ // Group 1: [file_02(11-20), file_04(31-40)]
Review Comment:
After further analysis, I am considering removing the redistribution logic
entirely. The three benefits listed in the comment are not actually unique to
redistribution:
1. **No SortExec per partition** — true regardless of redistribution, as
long as files within each group are non-overlapping
2. **SPM cheaply merges ordered streams** — SPM is O(n) merge whether groups
are interleaved or consecutive
3. **Parallel I/O across partitions** — actually *better* with interleaved
groups, since SPM alternates pulling from both partitions, keeping both I/O
streams active
The only real difference is that consecutive assignment makes each
partition's file reads more sequential (fewer open/close alternations). But
interleaved groups give better I/O parallelism because both partitions are
actively scanning simultaneously.
Given the marginal benefit vs added complexity (new function + tests), I
think we should remove `redistribute_files_across_groups_by_statistics` and
just keep the core optimization: **per-partition sort elimination via
statistics-based non-overlapping detection**.
What do you think?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]