Dandandan opened a new issue, #21696:
URL: https://github.com/apache/datafusion/issues/21696

   
[`datafusion-partitioned/run.sh`](https://github.com/ClickHouse/ClickBench/blob/main/datafusion-partitioned/run.sh)
 invokes a fresh `datafusion-cli -f create.sql ...` for **each** of the 3 
tries, not just try 1. Since `create.sql` does `CREATE EXTERNAL TABLE ... 
LOCATION 'partitioned'` and the default is `collect_statistics = true`, every 
try re-scans all Parquet footers in a cold process.
   
   Other engines keep one process across tries — e.g. 
`duckdb-parquet-partitioned` runs all 3 tries in one `duckdb hits.db` session 
with `parquet_metadata_cache=true`. Per the [ClickBench 
rules](https://github.com/ClickHouse/ClickBench#caching), tries 2–3 are meant 
to be hot; our script makes them effectively cold.
   
   **Fix:** drop the OS page cache once before try 1, then run all 3 tries in a 
single `datafusion-cli` session that has already executed `create.sql`, so 
tries 2–3 do not repeat the footer scan.
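
A minimal sketch of what the single-session approach could look like, not the actual `run.sh` change. It assumes each benchmark query is one `;`-terminated line and that `datafusion-cli` prints its per-statement timing as it does today. The helper names `build_session_sql` and `run_query` are hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: emit one SQL script containing the table DDL once,
# followed by the same query repeated TRIES times (default 3).
# Assumes the query text already ends with ';'.
build_session_sql() {
    local create_file=$1 query=$2 tries=${3:-3}
    local i
    cat "$create_file"
    for ((i = 1; i <= tries; i++)); do
        printf '%s\n' "$query"
    done
}

# Hypothetical helper: run one benchmark query so that a single
# datafusion-cli process executes create.sql and then all 3 tries.
run_query() {
    local query=$1 script
    script=$(mktemp)
    build_session_sql create.sql "$query" 3 > "$script"

    # Drop the OS page cache once, before try 1 (Linux-only, needs root).
    sync
    echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null

    # One process for DDL + all tries: try 1 is cold, tries 2-3 are hot.
    datafusion-cli -f "$script"
    rm -f "$script"
}
```

With this shape, `run_query "SELECT COUNT(*) FROM hits;"` would pay the `CREATE EXTERNAL TABLE` statistics collection once per query instead of once per try. Parsing the timings out of the combined output is left out here.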
   
   Part of #18489.



