[GitHub] [arrow-datafusion] tustvold edited a comment on issue #1652: ARROW2: Performance benchmark

GitBox Mon, 24 Jan 2022 01:59:25 -0800


tustvold edited a comment on issue #1652:
URL: https://github.com/apache/arrow-datafusion/issues/1652#issuecomment-1019913842



   Big :+1: to this, getting some concrete numbers would be really nice.
   
   FWIW, some ideas for whoever picks this up that I, at least, would be very interested in:
   
   * Performance against current arrow-rs master. A number of non-trivial performance improvements have landed in the last month, with more currently in progress.
   
   * Performance of floating point aggregates. I seem to remember that all the TPCH queries that exercise these fail to run correctly, but I could be mistaken.
   
   * Performance of dictionary arrays. This has historically been poor, and a substantial amount of work is completed or in flight to improve it (still WIP): https://github.com/apache/arrow-rs/issues/1113, https://github.com/apache/arrow-datafusion/issues/1610, https://github.com/apache/arrow-datafusion/issues/1474, https://github.com/apache/arrow-rs/pull/1180, etc.
   
   * Performance on parquet files with reasonable row group sizes. The OOM would suggest they aren't teeny, but I wanted to clarify, as there is currently a limitation in arrow-rs's parquet writer that makes it produce impractically small row groups: https://github.com/apache/arrow-rs/issues/1211
   
   * Performance of specific operators, e.g. FilterExec, SortPreservingMerge or ParquetExec. I'd basically be interested in where the performance gains are, and where we might gain or lose performance in a potential switch. Perhaps some execution plan metrics, or a perf dump or something :thinking:
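
   To make the per-operator comparison concrete, here is a rough sketch of the kind of harness I have in mind. It uses only the Rust standard library, and everything in it is hypothetical: `median_time`, `speedup` and the filter-plus-sum workload are stand-ins made up for illustration, not DataFusion or arrow-rs APIs. A real run would execute the same plan against each backend and ideally read the execution plan metrics rather than wall-clock time.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Run `f` `iters` times and return the median wall-clock duration
/// (the upper middle sample when `iters` is even).
fn median_time<F: FnMut()>(iters: usize, mut f: F) -> Duration {
    assert!(iters > 0, "need at least one iteration");
    let mut samples: Vec<Duration> = Vec::with_capacity(iters);
    for _ in 0..iters {
        let start = Instant::now();
        f();
        samples.push(start.elapsed());
    }
    samples.sort();
    samples[iters / 2]
}

/// Baseline time divided by candidate time; above 1.0 means the candidate is faster.
fn speedup(baseline: Duration, candidate: Duration) -> f64 {
    baseline.as_secs_f64() / candidate.as_secs_f64()
}

fn main() {
    // Hypothetical workload standing in for one operator (a filter followed by a sum).
    // A real comparison would run the same query plan on each backend instead.
    let data: Vec<f64> = (0..1_000_000).map(|i| i as f64).collect();

    // "Baseline" variant: explicit loop.
    let baseline = median_time(5, || {
        let mut total = 0.0;
        for v in &data {
            if *v > 10.0 {
                total += *v;
            }
        }
        black_box(total); // keep the optimizer from discarding the work
    });

    // "Candidate" variant: iterator chain computing the same result.
    let candidate = median_time(5, || {
        let total: f64 = data.iter().copied().filter(|v| *v > 10.0).sum();
        black_box(total);
    });

    println!(
        "filter+sum: baseline {:?}, candidate {:?}, speedup {:.2}x",
        baseline,
        candidate,
        speedup(baseline, candidate)
    );
}
```

   Reporting a median over a few iterations per operator keeps one slow run from skewing the comparison, and a per-operator speedup table would show where a switch gains or loses rather than giving a single end-to-end number.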
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
