[jira] [Commented] (ARROW-2059) [Python] Possible performance regression in Feather read/write path

Jingyuan Wang (JIRA) Sun, 04 Feb 2018 20:09:07 -0800

    [ 
https://issues.apache.org/jira/browse/ARROW-2059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16352023#comment-16352023
 ]


Jingyuan Wang commented on ARROW-2059:
--------------------------------------

Here is what I've done. I simply repeated the 1M-row data to create 30M-row and 
100M-row test CSV files, then repeated the process of reading from CSV, writing 
to Feather, and reading back from Feather, timing each part. I repeated each 
measurement 10 times for the four combinations of (python-2.7, python-3.6) x 
(feather-format-0.3.1, feather-format-0.4.0).
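
For anyone who wants to repeat a measurement along these lines, here is a 
minimal sketch of such a harness. It is not the script used for the numbers 
below. The helper names (repeat_csv, time_call) are hypothetical, it targets 
Python 3 only, and it uses just the standard library so that the Feather read 
and write calls can be passed in as plain callables.

```python
import csv
import timeit


def repeat_csv(src_path, dst_path, times):
    """Write dst_path as src_path's header plus its rows repeated `times` times."""
    with open(src_path, newline="") as src:
        reader = csv.reader(src)
        header = next(reader)
        rows = list(reader)
    with open(dst_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(header)
        for _ in range(times):
            writer.writerows(rows)


def time_call(fn, repeats=10):
    """Call fn() `repeats` times and return the wall-clock seconds of each call."""
    samples = []
    for _ in range(repeats):
        start = timeit.default_timer()
        fn()
        samples.append(timeit.default_timer() - start)
    return samples


def mean(samples):
    """Arithmetic mean of a non-empty list of timings."""
    return sum(samples) / len(samples)
```

With feather-format installed, the write step could be timed as 
time_call(lambda: feather.write_dataframe(df, path)) and the read step as 
time_call(lambda: feather.read_dataframe(path)), where df comes from 
pandas.read_csv on the repeated file. Keeping the harness independent of the 
library makes it easy to run unchanged against both feather-format versions.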

Processing the 100M-row file failed on my laptop (16GB memory) in every 
configuration except python-2.7 with feather-format-0.3.1.

The measurements for 1M rows are as follows:
||python version||feather version|| # rows||write feather||read feather||
|2.7|0.3.1|1M|0.06216781139|0.05903599262|
|2.7|0.4.0|1M|0.1335380793|0.04576666355|
|3.6|0.3.1|1M|0.07768514156|0.09041910172|
|3.6|0.4.0|1M|0.08690385818|0.05801310539|

The measurements for 30M rows are as follows:
||python version||feather version|| # rows||write feather||read feather||
|2.7|0.3.1|30M|1.747310066|2.35606482|
|2.7|0.4.0|30M|3.5653723|1.934461188|
|3.6|0.3.1|30M|2.407458949|2.811572456|
|3.6|0.4.0|30M|2.925034189|1.852504301|

From both tables, write performance did degrade from 0.3.1 to 0.4.0, with 
Python 2 degrading more dramatically. Reading Feather files was actually 
faster with the newer Feather version.
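
To put numbers on that comparison, the sketch below computes the 0.4.0 / 0.3.1 
ratio for each row of the two tables. The timings are copied from the tables 
as reported (units are not stated there, presumably seconds); the dictionary 
layout and the ratios helper are only illustrative. A ratio above 1 means 
0.4.0 was slower. By this measure, writes come out at roughly 2x on Python 2.7 
and roughly 1.1x to 1.2x on Python 3.6, while every read ratio is below 1.

```python
# (python version, feather version, rows) -> (write time, read time),
# copied from the two tables in this comment.
timings = {
    ("2.7", "0.3.1", "1M"): (0.06216781139, 0.05903599262),
    ("2.7", "0.4.0", "1M"): (0.1335380793, 0.04576666355),
    ("3.6", "0.3.1", "1M"): (0.07768514156, 0.09041910172),
    ("3.6", "0.4.0", "1M"): (0.08690385818, 0.05801310539),
    ("2.7", "0.3.1", "30M"): (1.747310066, 2.35606482),
    ("2.7", "0.4.0", "30M"): (3.5653723, 1.934461188),
    ("3.6", "0.3.1", "30M"): (2.407458949, 2.811572456),
    ("3.6", "0.4.0", "30M"): (2.925034189, 1.852504301),
}


def ratios(python, rows):
    """Return (write ratio, read ratio) of feather 0.4.0 relative to 0.3.1."""
    old_write, old_read = timings[(python, "0.3.1", rows)]
    new_write, new_read = timings[(python, "0.4.0", rows)]
    return new_write / old_write, new_read / old_read


for python in ("2.7", "3.6"):
    for rows in ("1M", "30M"):
        write_ratio, read_ratio = ratios(python, rows)
        print("python %s, %s rows: write x%.2f, read x%.2f"
              % (python, rows, write_ratio, read_ratio))
```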

One other thing: I noticed that feather-format-0.3.1 does not even depend on 
Arrow, so the performance difference reflects more than just an Arrow version 
upgrade. I also think we need some thorough benchmarks for Arrow. Or do we 
already have them?

> [Python] Possible performance regression in Feather read/write path
> -------------------------------------------------------------------
>
>                 Key: ARROW-2059
>                 URL: https://issues.apache.org/jira/browse/ARROW-2059
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>            Reporter: Wes McKinney
>            Assignee: Jingyuan Wang
>            Priority: Major
>             Fix For: 0.9.0
>
>
> See discussion in https://github.com/wesm/feather/issues/329. Needs to be 
> investigated



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
