[GitHub] [arrow-datafusion] alamb opened a new issue, #5505: [Epic] DataFusion Benchmarking

via GitHub Tue, 07 Mar 2023 07:17:36 -0800


alamb opened a new issue, #5505:
URL: https://github.com/apache/arrow-datafusion/issues/5505


   # Call to action:
   
   Let's invest more effort in DataFusion benchmarking, both as a mechanism for 
technical evangelism and as a guide for actual performance improvements. 
   
   # Background
   
   We have several examples of performance “comparisons” showing DataFusion not 
doing well against DuckDB or pola.rs that were really tests of how fast CSV or 
JSON parsing can go ([this blog](https://www.confessionsofadataguy.com/dataframe-showdown-polars-vs-spark-vs-pandas-vs-datafusion-guess-who-wins/) 
is one such example). Recent work should make these comparisons much more 
favorable in the future.
   
   It is in the interest of all projects based on DataFusion to focus on their 
own users and use cases rather than having to explain why they are using 
supposedly "inferior" technology due to misleading benchmark results (for 
example recently on ClickBench – see 
https://github.com/apache/arrow-datafusion/issues/5276).
   
   Of course, improved benchmarking will not only help evangelize DataFusion, 
it will also directly help guide the community’s optimization efforts.
   
   # Task List
   
   - [ ] https://github.com/apache/arrow-datafusion/issues/5276
   - [ ] https://github.com/apache/arrow-datafusion/issues/5502
   - [ ] https://github.com/apache/arrow-datafusion/issues/5504
   


