zhuqi-lucas opened a new issue, #22494:
URL: https://github.com/apache/datafusion/issues/22494

   ## Describe the bug
   
   PR #21956 added the `column_in_file_schema` signal in 
`ParquetSource::try_pushdown_sort`, which routes plain-column sort requests 
through the `Inexact` branch when the source's `output_ordering` was stripped 
by `validated_output_ordering()` (files listed in wrong order on disk).
   
   The `Inexact` branch of `FileScanConfig::try_pushdown_sort` calls 
`rebuild_with_source` with `is_exact=false`, which unconditionally strips 
`output_ordering` even when the post-sort file groups are non-overlapping and 
the declared ordering would re-validate. As a result, `SortExec` stays above 
the source for the canonical Phase 2 scenario (#21182) — even when stats-based 
file reorder restores a perfectly valid ordering.
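
A minimal sketch of the decision described above, written as a toy model rather than DataFusion's actual code. The `Pushdown` enum and both function names are hypothetical and exist only for illustration; `revalidates` stands for "the declared ordering would pass validation again after the files are reordered".

```rust
/// Hypothetical stand-in for the pushdown outcome (illustration only;
/// not a DataFusion type).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Pushdown {
    Exact,
    Inexact,
}

/// Behavior as described in this issue: the ordering survives only on the
/// exact path, so the `Inexact` path drops it regardless of `_revalidates`.
fn keeps_ordering_reported(outcome: Pushdown, _revalidates: bool) -> bool {
    outcome == Pushdown::Exact
}

/// One reading of the expected behavior: the ordering also survives on the
/// `Inexact` path when the reordered file groups re-validate.
fn keeps_ordering_expected(outcome: Pushdown, revalidates: bool) -> bool {
    outcome == Pushdown::Exact || revalidates
}
```

The two functions differ only for `Inexact` with `revalidates == true`, which is the case where `SortExec` is kept today although it could be removed.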
   
   Before #21956, these cases returned `Unsupported` and were then upgraded to 
`Exact` by the fallback `try_sort_file_groups_by_statistics`, so `SortExec` 
was eliminated.
   
   ## Repro
   
   See `sort_pushdown.slt` Test 6.1 — `reversed_with_order_parquet` with three 
out-of-order files: the comment says "SortExec eliminated" but the recorded 
plan keeps `SortExec`.
   
   ```sql
   CREATE EXTERNAL TABLE reversed_with_order_parquet(id INT, value INT)
   STORED AS PARQUET
   LOCATION 'test_files/scratch/sort_pushdown/reversed/'   -- [a_high(7-9), b_mid(4-6), c_low(1-3)] alphabetical
   WITH ORDER (id ASC);
   
   EXPLAIN SELECT * FROM reversed_with_order_parquet ORDER BY id ASC;
   -- physical_plan
   -- 01)SortExec: expr=[id@0 ASC NULLS LAST], preserve_partitioning=[false]   ← should be gone
   -- 02)--DataSourceExec: file_groups={1 group: [[c_low, b_mid, a_high]]}, ..., sort_order_for_reorder=[id@0 ASC NULLS LAST]
   ```
   
   ## Expected
   
   `SortExec` should be eliminated after the Phase 2 reorder when all of the following hold:
   - ordering is declared (`WITH ORDER` or parquet `sorting_columns` metadata),
   - post-sort file groups are non-overlapping,
   - no NULLs in the sort columns of non-last files.
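
The last two conditions can be sketched as a check over per-file min/max statistics. This is a simplified illustration, not DataFusion's implementation: `FileStats` and `reorder_if_non_overlapping` are hypothetical names, a single `i64` sort column with ASC NULLS LAST is assumed, and touching boundaries (one file's max equal to the next file's min) are treated as non-overlapping.

```rust
/// Hypothetical per-file statistics for one sort column (illustration only;
/// not DataFusion's statistics types).
#[derive(Debug, Clone, PartialEq)]
struct FileStats {
    name: &'static str,
    min: i64,
    max: i64,
    null_count: usize,
}

/// Sort files by their minimum value and return the reordered list if it
/// forms a valid ascending, NULLS LAST ordering; otherwise return `None`.
fn reorder_if_non_overlapping(mut files: Vec<FileStats>) -> Option<Vec<FileStats>> {
    files.sort_by_key(|f| f.min);
    // Index of the last file; 0 for an empty list, where the loop is a no-op.
    let last = files.len().saturating_sub(1);
    for (i, f) in files.iter().enumerate() {
        if i < last {
            // NULLs sort last, so only the final file may contain them.
            if f.null_count > 0 {
                return None;
            }
            // Adjacent files must not overlap (assumption: equal
            // boundary values are allowed).
            if f.max > files[i + 1].min {
                return None;
            }
        }
    }
    Some(files)
}
```

With the three files from the repro, `a_high(7-9)`, `b_mid(4-6)`, and `c_low(1-3)`, the function returns them in the order `c_low`, `b_mid`, `a_high`, which matches the `file_groups` shown in the plan above.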
   
   ## Related
   
   - #21182 — original Phase 2 sort elimination
   - #21956 — introduced the regression (added `column_in_file_schema`)
   - #22493 — the fix


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

