[GitHub] [arrow-datafusion] cube2222 opened a new issue #2109: Almost 100x slowdown on 0.7.0 with CSV file

GitBox Mon, 28 Mar 2022 15:01:15 -0700


cube2222 opened a new issue #2109:
URL: https://github.com/apache/arrow-datafusion/issues/2109



   **Describe the bug**
   I'm running benchmarks for [OctoSQL](https://github.com/cube2222/octosql), and
datafusion-cli is one of the tools I compare against. The previous version I
used (0.6.0, I think) finished the benchmark in 1.5 seconds. The new version
(0.7.0) takes 100 (!!!) seconds. It also prints "0 rows in set", which makes me
think this is a CSV decoder regression.
   
   This is based on the NYC yellow taxi dataset.
   
   **To Reproduce**
   ```bash
   curl https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2021-04.csv -o taxi.csv
   
   echo "CREATE EXTERNAL TABLE taxi
   STORED AS CSV
   WITH HEADER ROW
   LOCATION './taxi.csv';
   
   SELECT passenger_count, COUNT(*), AVG(total_amount) FROM taxi GROUP BY passenger_count" > datafusion_commands.txt
   
   datafusion-cli -f datafusion_commands.txt
   ```
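
The two symptoms above (run time and "0 rows in set") can be checked separately with a small sketch like the one below. It assumes bash-compatible shell, `awk`, and `date +%s`; `time_cmd` and `data_rows` are hypothetical helper names introduced here for illustration, not part of datafusion-cli.

```shell
# Hypothetical helper: print the wall-clock seconds a command takes,
# discarding the command's own output.
time_cmd() {
    local start end
    start=$(date +%s)
    "$@" > /dev/null 2>&1
    end=$(date +%s)
    echo $((end - start))
}

# Hypothetical helper: print the number of data rows in a CSV file,
# i.e. the line count minus the header line (0 for an empty file).
data_rows() {
    awk 'END { print (NR > 0 ? NR - 1 : 0) }' "$1"
}

# Example usage, assuming datafusion-cli is on PATH and the files from
# the reproduction steps exist in the current directory:
# data_rows taxi.csv
# time_cmd datafusion-cli -f datafusion_commands.txt
```

If `data_rows taxi.csv` reports a large number while the query still prints "0 rows in set", the download is probably fine and the problem is more likely on the reading side. One-second resolution is coarse, but it is enough to tell 1.5 seconds from 100.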
   
   **Expected behavior**
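   DataFusion is supposed to be blazingly fast. The query should finish in
roughly the 1.5 seconds the previous version took and return the grouped rows
instead of "0 rows in set".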
   
   


