zhuqi-lucas opened a new issue, #22494:
URL: https://github.com/apache/datafusion/issues/22494
## Describe the bug
PR #21956 added the `column_in_file_schema` signal in
`ParquetSource::try_pushdown_sort`, which routes plain-column sort requests
through the `Inexact` branch when the source's `output_ordering` was stripped
by `validated_output_ordering()` (files listed in wrong order on disk).
The `Inexact` branch of `FileScanConfig::try_pushdown_sort` calls
`rebuild_with_source` with `is_exact=false`, which unconditionally strips
`output_ordering` even when the post-sort file groups are non-overlapping and
the declared ordering would re-validate. As a result, `SortExec` stays above
the source for the canonical Phase 2 scenario (#21182) — even when stats-based
file reorder restores a perfectly valid ordering.
Before #21956, these cases returned `Unsupported` and were upgraded to
`Exact` via `try_sort_file_groups_by_statistics` (the fallback) — `SortExec`
was eliminated.
## Repro
See `sort_pushdown.slt` Test 6.1 — `reversed_with_order_parquet` with three
out-of-order files: the comment says "SortExec eliminated" but the recorded
plan keeps `SortExec`.
```sql
CREATE EXTERNAL TABLE reversed_with_order_parquet(id INT, value INT)
STORED AS PARQUET
LOCATION 'test_files/scratch/sort_pushdown/reversed/' -- [a_high(7-9),
b_mid(4-6), c_low(1-3)] alphabetical
WITH ORDER (id ASC);
EXPLAIN SELECT * FROM reversed_with_order_parquet ORDER BY id ASC;
-- physical_plan
-- 01)SortExec: expr=[id@0 ASC NULLS LAST], preserve_partitioning=[false]
← should be gone
-- 02)--DataSourceExec: file_groups={1 group: [[c_low, b_mid, a_high]]},
..., sort_order_for_reorder=[id@0 ASC NULLS LAST]
```
## Expected
`SortExec` eliminated after Phase 2 reorder when:
- ordering is declared (`WITH ORDER` or parquet `sorting_columns` metadata),
- post-sort file groups are non-overlapping,
- no NULLs in the sort columns of non-last files.
## Related
- #21182 — original Phase 2 sort elimination
- #21956 — introduced the regression (added `column_in_file_schema`)
- #22493 — the fix
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]