I'm using dask to do calculations on a large amount of data which I then save 
locally to a partitioned parquet file. I figured a partitioned file would allow 
for better parallelism.
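
For reference, the write path looks roughly like this (a minimal sketch; the frame, path, and partition count are placeholders standing in for my actual pipeline):

```python
import dask.dataframe as dd
import pandas as pd

# Placeholder frame standing in for the output of the real computation.
df = dd.from_pandas(
    pd.DataFrame({"key": range(1_000_000), "value": range(1_000_000)}),
    npartitions=8,
)

# One parquet file is written per dask partition, so multiple
# workers can write in parallel.
df.to_parquet("data.parquet", engine="pyarrow")
```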

Saving the data from multiple workers takes less than 30s:
![image](https://user-images.githubusercontent.com/881019/45934465-f2643380-bfe1-11e8-8d4a-42bb43724113.png)


...but I'm finding that reading the data back takes more than 5 times longer 
than writing it, and that adding more threads doesn't help:

![image](https://user-images.githubusercontent.com/881019/45934402-1e32e980-bfe1-11e8-82e2-480581892bef.png)
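
The read side is the straightforward inverse (again a sketch with placeholder names):

```python
import time

import dask.dataframe as dd

start = time.perf_counter()
# Each part file becomes one dask partition, so the read should
# parallelize across workers the same way the write did.
df = dd.read_parquet("data.parquet", engine="pyarrow")
result = df.compute()  # materialize to force the actual file reads
print(f"read took {time.perf_counter() - start:.1f}s")
```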

The read performance is a surprise: it's faster to extract and transform the 
data from the source all over again than it is to read the already-transformed 
data back from a parquet file.
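
To help narrow it down, reading a single part file with pyarrow directly (bypassing dask) should show whether the cost is in the parquet reader itself or in dask's layer; a sketch of that check, with a guessed part-file name:

```python
import time

import pyarrow.parquet as pq

start = time.perf_counter()
# Read one part file with Arrow directly; if this is fast, the
# slowdown is more likely in the dask layer than in the parquet
# reader itself.
table = pq.read_table("data.parquet/part.0.parquet")
print(f"single-file read: {time.perf_counter() - start:.2f}s, "
      f"{table.num_rows:,} rows")
```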

Is this read/write performance asymmetry expected?
