zhuqi-lucas opened a new pull request, #21266:
URL: https://github.com/apache/datafusion/pull/21266

   ## Which issue does this PR close?
   
   Related to https://github.com/apache/datafusion/issues/17348
   Precursor to https://github.com/apache/datafusion/pull/21182
   
   ## Rationale for this change
   
   The sort pushdown benchmark (#21213) uses TPC-H data where file names happen to match sort key order, so the optimization shows no difference vs. main ([comment](https://github.com/apache/datafusion/pull/21182#issuecomment-4158740710)).
   
   This PR generates custom benchmark data with **reversed file names** so the optimization is required to achieve sort elimination:
   
   ```
   c_high.parquet: l_orderkey 1-200k     (c sorts last alphabetically, but has lowest keys)
   b_mid.parquet:  l_orderkey 200k-400k
   a_low.parquet:  l_orderkey 400k+      (a sorts first alphabetically, but has highest keys)
   ```
   
   **On main (without optimization)**:
   - Alphabetical file order: `[a_low(400k+), b_mid(200k-400k), c_high(1-200k)]`
   - `validated_output_ordering()` sees files out of order → strips ordering
   - SortExec stays → slower
   
   **With optimization (#21182)**:
   - `sort_files_within_groups_by_statistics()` reorders to `[c_high, b_mid, a_low]`
   - Files non-overlapping → ordering valid → SortExec eliminated → faster
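
The difference between the two orderings can be illustrated with a small standalone sketch. This is not DataFusion code: `FileStats`, `order_by_name`, `order_by_min_key`, and `is_globally_sorted` are hypothetical names, the exact min/max values are assumed from the layout above, and the upper bound for `a_low.parquet` is a placeholder because the description only says `400k+`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    """Per-file min/max statistics for the sort key (illustrative only)."""
    name: str
    min_key: int
    max_key: int


def order_by_name(files):
    # Mimics listing files alphabetically, as happens without the optimization.
    return sorted(files, key=lambda f: f.name)


def order_by_min_key(files):
    # Mimics reordering files by their min statistic on the sort key.
    return sorted(files, key=lambda f: f.min_key)


def is_globally_sorted(files):
    # Reading the files back to back yields sorted output only if every
    # file ends at or before the point where the next file starts.
    return all(a.max_key <= b.min_key for a, b in zip(files, files[1:]))


# Assumed key ranges; 600_000 is a placeholder upper bound for "400k+".
FILES = [
    FileStats("a_low.parquet", 400_001, 600_000),
    FileStats("b_mid.parquet", 200_001, 400_000),
    FileStats("c_high.parquet", 1, 200_000),
]

if __name__ == "__main__":
    for label, ordered in [
        ("alphabetical", order_by_name(FILES)),
        ("by statistics", order_by_min_key(FILES)),
    ]:
        names = [f.name for f in ordered]
        print(label, names, "sorted:", is_globally_sorted(ordered))
```

In this sketch the alphabetical order fails the check, which corresponds to the case where the ordering is stripped and SortExec stays, while the statistics-based order passes it.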
   
   ## What changes are included in this PR?
   
   - New `data_sort_pushdown` function in `bench.sh` that uses `datafusion-cli` to split TPC-H lineitem data into 3 sorted parquet files with reversed naming
   - Updated `run_sort_pushdown` / `run_sort_pushdown_sorted` to use the custom data path
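
As a rough illustration of the split, the sketch below builds one `COPY ... TO ... STORED AS PARQUET` statement per output file. It is not the script from this PR: the `lineitem` table name, the `sort_pushdown_data` output directory, the half-open key ranges, and the helper `copy_statement` are all assumptions made for the example, and the statements would still need to be run through `datafusion-cli` with the source table registered.

```python
# Hypothetical split points: (file name, inclusive lower bound, exclusive upper bound).
# None means "no upper bound", matching the "400k+" range in the rationale.
SPLITS = [
    ("c_high.parquet", 1, 200_000),
    ("b_mid.parquet", 200_000, 400_000),
    ("a_low.parquet", 400_000, None),
]


def copy_statement(file_name, lower, upper,
                   out_dir="sort_pushdown_data", table="lineitem"):
    """Build a SQL statement that writes one key range, sorted, to one file."""
    predicate = f"l_orderkey >= {lower}"
    if upper is not None:
        predicate += f" AND l_orderkey < {upper}"
    return (
        f"COPY (SELECT * FROM {table} WHERE {predicate} ORDER BY l_orderkey) "
        f"TO '{out_dir}/{file_name}' STORED AS PARQUET;"
    )


STATEMENTS = [copy_statement(*split) for split in SPLITS]

if __name__ == "__main__":
    for statement in STATEMENTS:
        print(statement)
```

Each file is internally sorted on `l_orderkey`, and the file names deliberately sort in the opposite order to the key ranges they contain.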
   
   ## Test plan
   
   - [x] `cargo clippy -p datafusion-benchmarks` — 0 warnings
   - [x] Local benchmark shows sort elimination with optimization PR
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

