Dandandan opened a new issue, #21696: URL: https://github.com/apache/datafusion/issues/21696
[`datafusion-partitioned/run.sh`](https://github.com/ClickHouse/ClickBench/blob/main/datafusion-partitioned/run.sh) invokes a fresh `datafusion-cli -f create.sql ...` for **each** of the 3 tries, not just try 1. Since `create.sql` does `CREATE EXTERNAL TABLE ... LOCATION 'partitioned'` and the default is `collect_statistics = true`, every try re-scans all Parquet footers in a cold process.

Other engines keep one process across tries; e.g. `duckdb-parquet-partitioned` runs all 3 tries in one `duckdb hits.db` session with `parquet_metadata_cache=true`. Per the [ClickBench rules](https://github.com/ClickHouse/ClickBench#caching), tries 2–3 are meant to be hot; our script makes them effectively cold.

**Fix:** drop OS cache once before try 1, then run all 3 tries in a single `datafusion-cli` session that has already executed `create.sql`.

Part of #18489.
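A minimal sketch of what the proposed loop could look like, not the actual patch. It assumes that `datafusion-cli -f` accepts several files and runs them in order within one process, and that each statement prints a line of the form `Elapsed <n> seconds.`; both should be checked against the installed version. The helper names `build_tries_file`, `extract_times`, and `run_query` are hypothetical and exist only for this illustration.

```shell
#!/bin/sh
# Sketch only: one cache drop per query, then every try inside a single
# datafusion-cli process that has already executed create.sql.
set -eu

TRIES=3

# Hypothetical helper: write the query ($1) TRIES times into the file ($2),
# so one session executes every try back to back.
build_tries_file() {
    : > "$2"
    i=0
    while [ "$i" -lt "$TRIES" ]; do
        printf '%s\n' "$1" >> "$2"
        i=$((i + 1))
    done
}

# Hypothetical helper: read CLI output on stdin and print the last TRIES
# timings, one per line. Assumes "Elapsed <n> seconds." lines (an assumption).
# Taking the last TRIES lines skips timings printed for create.sql itself.
extract_times() {
    grep -o 'Elapsed [0-9.]* seconds' | awk '{print $2}' | tail -n "$TRIES"
}

# Run one query: drop the OS page cache once, then run all tries in one process.
run_query() {
    tries_file="$(mktemp)"
    build_tries_file "$1" "$tries_file"
    sync
    echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null   # Linux only
    # create.sql runs first, so the table it registers is reused by every
    # try that follows instead of being re-created in a fresh process.
    datafusion-cli -f create.sql "$tries_file" < /dev/null 2>&1 | extract_times
    rm -f "$tries_file"
}

# Only do real work when a queries file is passed, e.g. ./run.sh queries.sql
if [ "$#" -ge 1 ]; then
    while IFS= read -r query; do
        run_query "$query"
    done < "$1"
fi
```

With this shape, try 1 still pays for the cold cache and for the `CREATE EXTERNAL TABLE`, while tries 2 and 3 run in the same warm process, which is closer to what the single-session DuckDB script measures.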
