jorgecarleitao commented on pull request #68:
URL: https://github.com/apache/arrow-datafusion/pull/68#issuecomment-847969377
A quick update here:
The benchmarks for TPCH 1, relative to master (the query result itself is the same), are:
* CSV: 20% slower
  * The CSV reader is not parallel and does not re-use memory (working on this).
  * Group by uses `value(i)`, which does not perform bounds checks in arrow but does in arrow2 (@Dandandan's change in #339 should fix it).
* in-memory: 10% slower
  * Group by uses `value(i)`, which does not perform bounds checks in arrow but does in arrow2 (@Dandandan's change in #339 should fix it).
* without aggregation and in-memory: 2x faster
  * arrow2's sorting is much faster
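
To illustrate the bounds-check point above, here is a minimal sketch on a plain Rust slice. It does not use the arrow or arrow2 APIs; `sum_checked` and `sum_unchecked` are hypothetical names introduced only for this example. The first function pays for a bounds check on every access, while the second validates the indices once and then reads without per-access checks.

```rust
/// Sums `values` at the given `indices` using ordinary indexing.
/// Every `values[i]` is bounds-checked and panics if `i` is out of range.
fn sum_checked(values: &[i64], indices: &[usize]) -> i64 {
    indices.iter().map(|&i| values[i]).sum()
}

/// Sums `values` at the given `indices` without a per-access bounds check.
/// The indices are validated once up front, which keeps the unchecked
/// reads in the loop sound.
fn sum_unchecked(values: &[i64], indices: &[usize]) -> i64 {
    assert!(indices.iter().all(|&i| i < values.len()));
    indices
        .iter()
        // SAFETY: every index was verified to be in range by the assert above.
        .map(|&i| unsafe { *values.get_unchecked(i) })
        .sum()
}

fn main() {
    let values = vec![10, 20, 30, 40];
    let indices = vec![3, 0, 2];
    println!("checked:   {}", sum_checked(&values, &indices));
    println!("unchecked: {}", sum_unchecked(&values, &indices));
}
```

Both functions return the same sum; whether the difference is measurable depends on the workload, so this only shows the shape of the trade-off rather than the group by code itself.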
Arrow2 has a vectorized hash, but I have not used it yet.
With parquet, the first observation is that the conversion from CSV to
parquet (i.e. `write`) is about 10x faster:
master:
```
Converting './data/orders.tbl' to parquet files in directory './data/tpch-parquet/orders'
Conversion completed in 334242 ms
```
this PR:
```
Converting './data/orders.tbl' to parquet files in directory './data/tpch-parquet/orders'
Conversion completed in 33968 ms
```
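
As a quick check of the ratio between the two conversion timings above, the arithmetic can be sketched as follows. The `speedup` helper is a hypothetical name used only for this example; the two durations are the ones reported in the logs.

```rust
/// Returns how many times faster the new run is than the old one.
fn speedup(old_ms: u64, new_ms: u64) -> f64 {
    old_ms as f64 / new_ms as f64
}

fn main() {
    let master_ms = 334_242; // conversion time on master
    let pr_ms = 33_968; // conversion time with this PR
    println!("speedup: {:.2}x", speedup(master_ms, pr_ms));
}
```

The ratio comes out a little under 10.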
The query itself is still 10% slower (even though reading is faster).
Investigations continue.