[GitHub] [arrow-datafusion] jorgecarleitao commented on pull request #68: Experimenting with arrow2

GitBox Tue, 25 May 2021 08:26:52 -0700


jorgecarleitao commented on pull request #68:
URL: https://github.com/apache/arrow-datafusion/pull/68#issuecomment-847969377



   A quick update here:
   
   Benchmark results for TPC-H query 1, compared with master (the query output is identical):
   
   * CSV: 20% slower
       * CSV reader is not parallel and does not re-use memory (working on this)
       * group by uses `value(i)`, which does not perform bounds checks in arrow but does in arrow2 (@Dandandan's change in #339 should fix this).
   * in-memory: 10% slower
       * group by uses `value(i)`, which does not perform bounds checks in arrow but does in arrow2 (@Dandandan's change in #339 should fix this).
   * without aggregation and in-memory: 2x faster
       * arrow2's sorting is much faster
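
   To illustrate the bounds-check point above, here is a minimal sketch of checked versus unchecked element access in a hot loop. `Int64Column`, `value_unchecked`, `sum_checked` and `sum_unchecked` are hypothetical stand-ins built on a plain `Vec<i64>`; they are not the arrow or arrow2 array types, and this is not the change in #339.

   ```rust
   /// Hypothetical stand-in for a primitive array (not the arrow/arrow2 type).
   struct Int64Column {
       values: Vec<i64>,
   }

   impl Int64Column {
       fn len(&self) -> usize {
           self.values.len()
       }

       /// Bounds-checked access: panics if `i` is out of range.
       fn value(&self, i: usize) -> i64 {
           self.values[i]
       }

       /// Unchecked access.
       ///
       /// # Safety
       /// The caller must guarantee `i < self.len()`.
       unsafe fn value_unchecked(&self, i: usize) -> i64 {
           unsafe { *self.values.get_unchecked(i) }
       }
   }

   /// Every call to `value` may re-check the index against the length.
   fn sum_checked(col: &Int64Column) -> i64 {
       (0..col.len()).map(|i| col.value(i)).sum()
   }

   /// The loop range already guarantees `i < col.len()`,
   /// so the per-element check can be skipped.
   fn sum_unchecked(col: &Int64Column) -> i64 {
       (0..col.len())
           .map(|i| unsafe { col.value_unchecked(i) })
           .sum()
   }

   fn main() {
       let col = Int64Column { values: vec![10, 20, 30, 40] };
       assert_eq!(sum_checked(&col), sum_unchecked(&col));
       println!("{}", sum_checked(&col)); // prints 100
   }
   ```

   In a loop this simple the compiler can often remove the check by itself; the difference is more likely to matter when indices come from elsewhere, as they can in a group by.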
   
   Arrow2 has a vectorized hash, but I have not used it yet.
   
   With parquet, the first observation is that the conversion from CSV to parquet (i.e. `write`) is 10x faster:
   
   master:
   ```
   Converting './data/orders.tbl' to parquet files in directory './data/tpch-parquet/orders'
   Conversion completed in 334242 ms
   ```
   this PR:
   ```
   Converting './data/orders.tbl' to parquet files in directory './data/tpch-parquet/orders'
   Conversion completed in 33968 ms
   ```
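
   As a quick check on the 10x figure, the ratio of the two timings above can be computed directly. The `speedup` helper is a hypothetical name used only for this illustration.

   ```rust
   /// Speedup of a run taking `new_ms` relative to one taking `old_ms`.
   fn speedup(old_ms: u64, new_ms: u64) -> f64 {
       old_ms as f64 / new_ms as f64
   }

   fn main() {
       let master_ms = 334_242; // "Conversion completed" on master
       let pr_ms = 33_968; // "Conversion completed" on this PR
       println!("{:.2}x", speedup(master_ms, pr_ms)); // prints 9.84x
   }
   ```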
   
   The query itself is still 10% slower (even though reading is faster).
   
   Investigations continue.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
