zhuqi-lucas opened a new pull request, #22493:
URL: https://github.com/apache/datafusion/pull/22493

   ## Which issue does this PR close?
   
   Regression introduced by #21956 — single-partition multi-file scans with
   `WITH ORDER` (or inferred parquet `sorting_columns` metadata) plus files
   listed out of order on disk no longer have their `SortExec` eliminated,
   even when the stats-based file reorder produces a non-overlapping layout.
   This was the central case PR #21182 was designed to fix.
   
   ## Rationale for this change
   
   In the Phase 2 sort pushdown design (#21182), when a file source returned
   `Unsupported` because `validated_output_ordering()` stripped its declared
   ordering, the outer `FileScanConfig::try_pushdown_sort` invoked a fallback
   (`try_sort_file_groups_by_statistics`) that reordered files by min/max
   stats, re-validated the ordering against the new file groups, and could
   upgrade the result back to `Exact` — dropping the outer `SortExec`.
   
   PR #21956 added the `column_in_file_schema` signal so that plain-column
   sort requests would return `Inexact` (with `sort_order_for_reorder` set)
   instead of `Unsupported`. That enabled runtime per-RG reorder, but it
   also pulled the typical wrong-file-order case out of the
   `Unsupported`-fallback re-validation path. The Inexact branch of
   `rebuild_with_source` always strips `output_ordering`, so even when the
   post-sort file groups became non-overlapping and the declared ordering
   would re-validate, the outer wrapper returned `Inexact` and PushdownSort
   kept the `SortExec`.
   
   The SLT comment on test 6.1 (`# … → SortExec eliminated`) still describes
   the pre-#21956 behaviour, but the recorded expectation was updated to
   match the post-#21956 plan where SortExec stayed — a silent regression.
   
   ## What changes are included in this PR?
   
   ### `rebuild_with_source`
   
   When `is_exact=false` but the stats-based file sort produced
   `all_non_overlapping=true`, re-validate the declared `output_ordering`
   against the new file groups. If `ordering_satisfy` passes, preserve
   `output_ordering` instead of stripping it.
   
   ### Outer `FileScanConfig::try_pushdown_sort`
   
   On the `Inexact` branch, inspect `config.output_ordering` after rebuild
   and return `Exact` if it survived. This restores the
   `Unsupported → upgrade Exact` semantics, but on the `Inexact` branch the
   plain-column path now follows.
   
   ### Safety
   
   - When the source has no declared ordering (no `WITH ORDER` and no
     parquet `sorting_columns` metadata), `self.output_ordering` is empty,
     the re-validate produces no orderings, `ordering_satisfy` returns
     `false`, and `SortExec` stays. min/max stats alone never trigger an
     upgrade.
   - When `all_non_overlapping=false` (overlapping files post-sort),
     `output_ordering` is stripped exactly as before.
   
   ### Tests
   
   `sort_pushdown.slt`:
   
   - Tests **4.1** (parquet metadata), **6.1** (`WITH ORDER ASC`),
     **8.1** (`WITH ORDER DESC` with reverse), **G.1**, **G.2** (multi-
     partition SPM + BufferExec) — expectations updated to match the
     restored `SortExec`-eliminated plans (matches the PR #21182 era plans
     and the SLT comments that were already in the file).
   - New **Test 5b** — files written without `ORDER BY` (no
     `sorting_columns` metadata) + external table without `WITH ORDER`:
     asserts that even when min/max stats happen to be non-overlapping,
     `SortExec` is kept.
   
   ## Are these changes tested?
   
   Yes — `sort_pushdown.slt` covers the affected paths comprehensively.
   Full `cargo test -p datafusion-sqllogictest --test sqllogictests` and
   `cargo test -p datafusion-datasource --lib` / `-p
   datafusion-datasource-parquet --lib` / `-p datafusion-physical-optimizer`
   pass. `cargo clippy -D warnings` clean.
   
   ## Are there any user-facing changes?
   
   Yes — for the in-scope scenarios above, `EXPLAIN` plans now show the
   `SortExec` removed and `output_ordering` set on `DataSourceExec`. Query
   results are unaffected.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to