adriangb commented on code in PR #22493:
URL: https://github.com/apache/datafusion/pull/22493#discussion_r3294820727


##########
datafusion/datasource/src/file_scan_config/mod.rs:
##########
@@ -973,9 +978,44 @@ impl DataSource for FileScanConfig {
                 }
             }
             SortOrderPushdownResult::Inexact { inner } => {
-                Ok(SortOrderPushdownResult::Inexact {
-                    inner: Arc::new(self.rebuild_with_source(inner, false, 
order)?),
-                })
+                let mut config = self.rebuild_with_source(inner, false, 
order)?;
+                // `rebuild_with_source` reorders files by stats; if the
+                // post-sort files are non-overlapping AND the request now
+                // validates against the new file groups, `output_ordering`
+                // is preserved and we can upgrade back to Exact. This
+                // restores the sort-elimination behaviour that lived in
+                // the `Unsupported` → `try_sort_file_groups_by_statistics`
+                // path before #21956 routed `column_in_file_schema` cases
+                // here.
+                if config.output_ordering.is_empty() {
+                    Ok(SortOrderPushdownResult::Inexact {
+                        inner: Arc::new(config),
+                    })
+                } else {

Review Comment:
   I wonder if we can use some sort of early return / early break to avoid the 
indentation for the `else` branch.



##########
datafusion/datasource/src/file_scan_config/sort_pushdown.rs:
##########
@@ -138,31 +138,74 @@ impl FileScanConfig {
             false
         };
 
-        if is_exact && all_non_overlapping {
-            // Truly exact: within-file ordering guaranteed and files are 
non-overlapping.
-            // Keep output_ordering so SortExec can be eliminated for each 
partition.
-            //
-            // We intentionally do NOT redistribute files across groups here.
-            // The planning-phase bin-packing may interleave file ranges 
across groups:
-            //
-            //   Group 0: [f1(1-10), f3(21-30)]   ← interleaved with group 1
-            //   Group 1: [f2(11-20), f4(31-40)]
-            //
-            // This interleaving is actually beneficial because SPM pulls from 
both
-            // partitions concurrently, keeping parallel I/O active:
-            //
-            //   SPM: pull P0 [1-10] → pull P1 [11-20] → pull P0 [21-30] → 
pull P1 [31-40]
-            //        ^^^^^^^^^^^^     ^^^^^^^^^^^^
-            //        both partitions scanning files simultaneously
-            //
-            // If we were to redistribute files consecutively:
-            //   Group 0: [f1(1-10), f2(11-20)]   ← all values < group 1
-            //   Group 1: [f3(21-30), f4(31-40)]
-            //
-            // SPM would read ALL of group 0 first (values always smaller), 
then group 1.
-            // This degrades to single-threaded sequential I/O — the other 
partition
-            // sits idle the entire time, losing the parallelism benefit.
+        // Decide whether to keep `output_ordering` (i.e. let the outer
+        // pushdown report `Exact` and drop `SortExec`).
+        //
+        // Two paths can produce a keep:
+        //
+        //   1. `is_exact && all_non_overlapping`: the source already had
+        //      validated ordering and the post-sort files still don't
+        //      overlap — Exact carries through unchanged.
+        //
+        //   2. `!is_exact && all_non_overlapping`: source returned
+        //      `Inexact` because pre-sort `validated_output_ordering()`
+        //      stripped the declaration (files were listed out of order
+        //      on disk). After our stats-based sort the files are now
+        //      non-overlapping — re-validate against the new file
+        //      groups and, if it passes, upgrade back to Exact so the
+        //      outer wrapper drops the `SortExec`. Without this, the
+        //      `Inexact` branch stayed Inexact even when reorder
+        //      restored a perfectly valid ordering, leaving an
+        //      unnecessary `SortExec` above the source (regression
+        //      after #21956's `column_in_file_schema` signal pushed
+        //      this scenario into the Inexact branch instead of the
+        //      `try_sort_file_groups_by_statistics` fallback).
+        //
+        // We intentionally do NOT redistribute files across groups here.
+        // The planning-phase bin-packing may interleave file ranges across 
groups:
+        //
+        //   Group 0: [f1(1-10), f3(21-30)]   ← interleaved with group 1
+        //   Group 1: [f2(11-20), f4(31-40)]
+        //
+        // This interleaving is actually beneficial because SPM pulls from both
+        // partitions concurrently, keeping parallel I/O active.
+        let keep_ordering = if all_non_overlapping {
+            if is_exact {

Review Comment:
   I would personally flatten this into a match:
   
   ```rust
   match (is_exact, all_non_overlapping) {
     (true, true) => ...,
     ...
   }
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to