jychen7 opened a new issue, #5404:
URL: https://github.com/apache/arrow-datafusion/issues/5404

   **Describe the problem**
   DataFusion v19.rc1 turns on `repartition_file_scans` by default (see https://github.com/apache/arrow-datafusion/pull/5295).
   
   On my local MacBook Pro (2.6 GHz 6-Core Intel Core i7, 32 GB 2667 MHz DDR4), the query below on the 14 GB ClickBench `hits.parquet` gives:
   - v19.rc1 took 12.343 seconds (roughly 7x faster than v18, which took 83.863 seconds)
   - DuckDB v0.6.1 took `real 0.566 user 1.876031 sys 0.357483`
       - clock time 566ms
       - cpu time 1.87s
       - I think the clock time is smaller than the CPU time because DuckDB uses multiple CPU cores in parallel.
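
The ratios implied by these timings can be checked with a few lines of arithmetic. This is only a sketch over the numbers reported above; the variable names are made up for illustration, and "busy cores" is a rough average (CPU seconds per wall-clock second), not a measurement of how DuckDB actually schedules work.

```python
# Timings copied from the report above, in seconds.
datafusion_v18_real = 83.863
datafusion_v19_real = 12.343
duckdb_real, duckdb_user, duckdb_sys = 0.566, 1.876031, 0.357483

# Speedup of v19.rc1 over v18 on the same query.
v19_over_v18 = datafusion_v18_real / datafusion_v19_real

# Remaining wall-clock gap between v19.rc1 and DuckDB.
duckdb_over_v19 = datafusion_v19_real / duckdb_real

# CPU seconds consumed per wall-clock second: a rough estimate of how many
# cores DuckDB kept busy on average during the query.
duckdb_busy_cores = (duckdb_user + duckdb_sys) / duckdb_real

print(f"v19.rc1 vs v18:      {v19_over_v18:.1f}x faster")
print(f"DuckDB vs v19.rc1:   {duckdb_over_v19:.1f}x faster")
print(f"DuckDB busy cores:   {duckdb_busy_cores:.1f}")
```

A busy-core estimate well above 1 is consistent with the guess that DuckDB runs the scan on several cores at once.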
   
   **To Reproduce**
   Setup
   ```
   wget --continue https://datasets.clickhouse.com/hits_compatible/hits.parquet
   ```
   
   Datafusion
   ```
   git clone https://github.com/apache/arrow-datafusion.git
   cd arrow-datafusion
   git checkout 19.0.0-rc1
   cd datafusion-cli
   cargo build --release
   
   target/release/datafusion-cli -f create.sql q23_limit_1.sql
   // output: 1 row in set. Query took 12.343 seconds
   ```
   where
   ```
   -- create.sql
   CREATE EXTERNAL TABLE hits
   STORED AS PARQUET
   LOCATION 'hits.parquet';
   -- q23_limit_1.sql
   SELECT * FROM hits WHERE "URL" LIKE '%google%' LIMIT 1;
   ```
   
   DuckDB
   ```
   brew install duckdb
   duckdb
   > .timer on
   > SELECT * FROM read_parquet('hits.parquet') WHERE URL LIKE '%google%' LIMIT 1;
   // output: Run Time (s): real 0.566 user 1.876031 sys 0.357483
   ```
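
DuckDB's `.timer` reports real, user, and sys time, while datafusion-cli only prints elapsed time. To compare like with like, one option is to wrap either command in a small timing helper. The sketch below is an assumption about how one might do that on a Unix-like system; `time_command` is a hypothetical helper, not part of either tool, and the example command in the comment assumes the files from the steps above are in the working directory.

```python
import resource
import subprocess
import sys
import time


def time_command(argv):
    """Run argv and return (real, user, sys) seconds, similar to a shell `time`."""
    # CPU time already charged to earlier child processes of this interpreter.
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.perf_counter()
    # subprocess.run waits for the child, so its CPU time is accounted for below.
    subprocess.run(argv, check=True, stdout=subprocess.DEVNULL)
    real = time.perf_counter() - start
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    user = after.ru_utime - before.ru_utime
    sys_time = after.ru_stime - before.ru_stime
    return real, user, sys_time


# Example (assumed paths, adjust to your checkout):
# print(time_command(["target/release/datafusion-cli", "-f", "create.sql", "q23_limit_1.sql"]))
```

Note that this measures the whole process, including startup and table registration, so it will read slightly higher than the per-query time each tool prints.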
   
   **Expected behavior**
   1. With a single core, datafusion-cli takes about 2s (like DuckDB's CPU time)
   2. With multiple cores, datafusion-cli takes about 0.6s (like DuckDB's real time)
   
   

