geoffreyclaude opened a new pull request, #15560: URL: https://github.com/apache/datafusion/pull/15560
## Which issue does this PR close?

- Closes #15559

## Rationale for this change

Currently, the benchmarks folder in DataFusion does not include dedicated benchmarks for TopK queries (i.e., queries of the form `SELECT ... ORDER BY column LIMIT n`). With ongoing work to optimize these queries, having dedicated benchmarks is valuable for measuring progress.

## What changes are included in this PR?

### Sorted TPCH Support

- A new `--sort` flag has been added to `tpch/convert.rs` to output the TPCH tables sorted by their first (key) column. Although the generator outputs CSV files already sorted by the first column, the sorted order was not stored in the converted files.
- A new `--sorted` flag has been added to both `sort_tpch.rs` and `tpch/run.rs`. When enabled, it injects the file sort order into the `ListingOptions`, allowing DataFusion optimizations to take advantage of pre-sorted input. This is necessary because DataFusion does not currently read the "sortedness" from Parquet files.

### TopK Benchmark Extension

- In `sort_tpch.rs`, an optional `--limit n` flag has been introduced. When provided, it appends a `LIMIT n` clause to the SQL query, effectively converting a standard sort query into a TopK query.
- By combining the `--limit n` and `--sorted` flags, it is now possible to test TopK queries on pre-sorted inputs.

## Are these changes tested?

## Are there any user-facing changes?

No, only developer-facing benchmark changes:

- New command-line options have been added to the benchmarks: `--limit` to append a LIMIT clause, `--sorted` to indicate pre-sorted input data, and `--sort` to generate pre-sorted input data.
- These options are opt-in and do not affect the default behavior of the benchmarks unless explicitly specified.
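For illustration, the effect of `--limit n` on a sort query can be sketched roughly as below. This is a minimal, standalone sketch and not the code in `sort_tpch.rs`: the helper name `with_optional_limit` and the sample query are hypothetical, and the real benchmark may build its SQL differently.

```rust
/// Hypothetical helper (not taken from the PR): append a `LIMIT n` clause
/// to a sort query when a limit is supplied, turning it into a TopK query.
fn with_optional_limit(base_sql: &str, limit: Option<usize>) -> String {
    // Drop surrounding whitespace and any trailing semicolon so the
    // LIMIT clause can be appended cleanly.
    let trimmed = base_sql.trim().trim_end_matches(';').trim_end();
    match limit {
        Some(n) => format!("{} LIMIT {}", trimmed, n),
        None => trimmed.to_string(),
    }
}

fn main() {
    // Sample query chosen for illustration only.
    let sort_query = "SELECT l_orderkey, l_linenumber FROM lineitem ORDER BY l_orderkey";

    // Without a limit the query is a plain full sort.
    println!("{}", with_optional_limit(sort_query, None));

    // With a limit the same query becomes a TopK query.
    println!("{}", with_optional_limit(sort_query, Some(10)));
}
```

Keeping the limit as an `Option` mirrors the opt-in nature of the flag: when it is absent, the query text is left unchanged apart from whitespace trimming.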
