Jingyuan Wang commented on ARROW-2059:

Here is what I've done. I simply repeated the 1M rows to create 30M-row and 
100M-row test CSV files, then repeated the process of reading from CSV, 
writing to Feather, and reading from Feather, timing each step. I repeated each 
measurement 10 times for the four combinations of (python-2.7, python-3.6) x 
(feather-format-0.3.1, feather-format-0.4.0).
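
For anyone who wants to reproduce this, the procedure can be sketched roughly as below. This is only a minimal sketch, not the exact script I ran: the helper names (repeat_csv, time_stage, summarize) are made up for illustration, and it uses only the standard library so that it runs unchanged on both Python 2.7 and 3.6.

```python
from timeit import default_timer  # wall-clock timer, available on 2.7 and 3.6


def repeat_csv(src_path, dst_path, times):
    """Write the header of src_path once, then its data rows `times` times.

    Hypothetical helper: reads the whole source file into memory, which is
    fine for a 1M-row input but is an assumption about file size.
    """
    with open(src_path) as src:
        header = src.readline()
        body = src.read()
    # Make sure the last row ends with a newline so repeated copies
    # do not run together.
    if body and not body.endswith("\n"):
        body += "\n"
    with open(dst_path, "w") as dst:
        dst.write(header)
        for _ in range(times):
            dst.write(body)


def time_stage(func, repeats=10):
    """Call func() `repeats` times; return the wall-clock durations in seconds."""
    durations = []
    for _ in range(repeats):
        start = default_timer()
        func()
        durations.append(default_timer() - start)
    return durations


def summarize(durations):
    """Reduce a list of durations to min / mean / max."""
    return {
        "min": min(durations),
        "mean": sum(durations) / len(durations),
        "max": max(durations),
    }
```

With pandas and feather-format installed, each stage would be passed in as a callable, for example time_stage(lambda: feather.write_dataframe(df, "out.feather")) for the write step and time_stage(lambda: feather.read_dataframe("out.feather")) for the read step, where df comes from pandas.read_csv. The file names here are placeholders.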

Processing the 100M-row file failed on my laptop (16GB memory) for every 
combination except python-2.7 with feather-format-0.3.1.

The measurements for 1M rows are as follows:
||python version||feather version|| # rows||write feather||read feather||

The measurements for 30M rows are as follows:
||python version||feather version|| # rows||write feather||read feather||

From both tables, the performance of writing to Feather did degrade from 0.3.1 to 
0.4.0, with Python 2 degrading more dramatically. Reading Feather files was 
actually faster with the newer Feather version.

One other thing: I noticed that feather-format-0.3.1 does not even depend on 
Arrow, so the performance difference reflects more than just the Arrow version 
upgrade. I do think we need some thorough benchmarks for Arrow, or do we already have 

> [Python] Possible performance regression in Feather read/write path
> -------------------------------------------------------------------
>                 Key: ARROW-2059
>                 URL: https://issues.apache.org/jira/browse/ARROW-2059
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>            Reporter: Wes McKinney
>            Assignee: Jingyuan Wang
>            Priority: Major
>             Fix For: 0.9.0
> See discussion in https://github.com/wesm/feather/issues/329. Needs to be 
> investigated

