geoffreyclaude opened a new pull request, #15560:
URL: https://github.com/apache/datafusion/pull/15560

   ## Which issue does this PR close?
   
   - Closes #15559
   
   ## Rationale for this change
   
   Currently, the benchmarks folder in DataFusion does not include dedicated 
benchmarks for TopK queries (i.e., queries of the form `SELECT ... ORDER BY 
column LIMIT n`).
   
   With ongoing work to optimize these queries, having dedicated benchmarks is 
valuable for measuring progress.
   
   ## What changes are included in this PR?
   
   ### Sorted TPCH Support
   
   - A new `--sort` flag has been added to `tpch/convert.rs` to output the TPCH 
tables sorted by their first (key) column. Although the generator outputs CSV 
files already sorted by the first column, the sorted order was not stored in 
the converted files.
   - A new `--sorted` flag has been added to both `sort_tpch.rs` and 
`tpch/run.rs`. When enabled, it injects the file sort order into the 
`ListingOptions`, allowing DataFusion optimizations to take advantage of 
pre-sorted input. This is necessary because DataFusion does not currently read 
the "sortedness" from Parquet files.
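
To illustrate why declaring the sort order can matter for `ORDER BY ... LIMIT n` queries, here is a minimal standalone sketch. It is not DataFusion code: the function names `top_k_unsorted` and `top_k_presorted` are hypothetical, and the rows are modelled as plain integers. On unsorted input every row has to be visited, whereas on input known to be sorted by the same key the answer is simply a prefix.

```rust
use std::collections::BinaryHeap;

/// Hypothetical illustration: the k smallest values of unsorted input,
/// found with a bounded max-heap. Every row is visited once and at most
/// k + 1 values are held in memory at any time.
fn top_k_unsorted(rows: &[i64], k: usize) -> Vec<i64> {
    let mut heap = BinaryHeap::with_capacity(k + 1);
    for &value in rows {
        heap.push(value);
        if heap.len() > k {
            // Evict the largest value currently retained.
            heap.pop();
        }
    }
    // Ascending order, matching `ORDER BY column ASC`.
    heap.into_sorted_vec()
}

/// Hypothetical illustration: when the caller knows the input is already
/// sorted ascending, the k smallest values are just the first k rows.
fn top_k_presorted(rows: &[i64], k: usize) -> Vec<i64> {
    rows.iter().take(k).copied().collect()
}

fn main() {
    let unsorted = [42, 7, 19, 3, 88, 15];
    let mut sorted = unsorted;
    sorted.sort();

    // Both strategies agree; only the amount of work differs.
    assert_eq!(top_k_unsorted(&unsorted, 3), top_k_presorted(&sorted, 3));
    println!("{:?}", top_k_presorted(&sorted, 3));
}
```

The second function is only correct if the sortedness claim is true, which is why the order has to be supplied explicitly when it cannot be read from the files.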
   
   ### TopK Benchmark Extension
   
   - In `sort_tpch.rs`, an optional `--limit n` flag has been introduced. When 
provided, it appends a `LIMIT n` clause to the SQL query, effectively 
converting a standard sort query into a TopK query.
   - By combining the `--limit n` and `--sorted` flags, it is now possible to 
test TopK queries on pre-sorted inputs.
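
The effect of `--limit n` on the query text can be sketched as follows. This is an assumption-laden illustration rather than the code in `sort_tpch.rs`: the helper name `with_limit` is hypothetical, and the example query is written here only for demonstration.

```rust
/// Hypothetical helper: append a `LIMIT n` clause when a limit is given,
/// and leave the query unchanged otherwise.
fn with_limit(sql: &str, limit: Option<usize>) -> String {
    match limit {
        Some(n) => format!("{sql} LIMIT {n}"),
        None => sql.to_string(),
    }
}

fn main() {
    // Example query chosen for illustration only.
    let base = "SELECT l_orderkey FROM lineitem ORDER BY l_orderkey";

    // Without a limit this is a plain sort query.
    assert_eq!(with_limit(base, None), base);

    // With a limit it becomes a TopK query.
    println!("{}", with_limit(base, Some(10)));
}
```

Keeping the limit optional mirrors the opt-in nature of the flag: when it is absent, the generated SQL is identical to the original sort query.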
   
   ## Are these changes tested?
   
   ## Are there any user-facing changes?
   
   No, only developer-facing benchmark changes:
   - New command-line options have been added to the benchmarks: `--limit` to 
append a LIMIT clause, `--sorted` to indicate pre-sorted input data, and 
`--sort` to generate pre-sorted input data.
   - These options are opt-in and do not affect the default behavior of the 
benchmarks unless explicitly specified.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

