[ 
https://issues.apache.org/jira/browse/ARROW-2059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16559132#comment-16559132
 ] 

Wes McKinney edited comment on ARROW-2059 at 7/27/18 1:46 AM:
--------------------------------------------------------------

I ran some benchmarks locally (quad-core Xeon E3-1535M, Ubuntu 14.04). There is 
still a slight ~5-10% write performance regression, but reading is faster.

Feather 0.3.1:

{code}
# WRITE
$ python bench.py
Elapsed: 15.497231721878052 seconds
Average: 1.549723172187805

# READ
$ python bench.py 
Elapsed: 9.88158106803894 seconds
Average: 0.988158106803894
{code}

Feather 0.4.0

{code}
# WRITE
$ python bench.py
Elapsed: 16.36524486541748 seconds
Average: 1.636524486541748

# READ
$ python bench.py 
Elapsed: 7.4859395027160645 seconds
Average: 0.7485939502716065
{code}
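Working the per-iteration averages above into relative changes, as plain arithmetic on the reported numbers:

```python
# Per-iteration averages reported above (seconds)
write_old, write_new = 1.549723172187805, 1.636524486541748
read_old, read_new = 0.988158106803894, 0.7485939502716065

# Relative change for each path
write_regression = (write_new - write_old) / write_old
read_speedup = (read_old - read_new) / read_old

print(f"write: {write_regression:.1%} slower")  # ~5.6% regression
print(f"read:  {read_speedup:.1%} faster")      # ~24.2% speedup
```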

Here's the benchmarking script so people can run their own experiments. It 
would be useful to look at the perf output more closely and see what else we 
can do to make things faster:

{code}
import io
import pickle
import time

import feather
import pandas as pd


def generate_example():
    # Inline CSV sample; keep the lines flush-left so no stray whitespace
    # leaks into the parsed fields (the indented triple-quoted form also
    # fed a trailing whitespace-only line to read_csv).
    buf = io.StringIO(
        "07300003030539,42198997,-1,2016-10-03T13:14:22.326Z\n"
        "41130003053286,42224636,-1,2016-09-20T19:31:51.196Z\n"
    )

    table = pd.read_csv(buf, header=None)
    table = pd.concat([table] * 5000, axis=0, ignore_index=True)
    table = pd.concat([table] * 1000, axis=0, ignore_index=True)

    with open('example.pkl', 'wb') as f:
        pickle.dump(table, f)


def _get_time():
    # Monotonic clock: immune to NTP/system clock adjustments mid-benchmark
    return time.clock_gettime(time.CLOCK_MONOTONIC)


class Timer:

    def __init__(self, iterations):
        self.iterations = iterations

    def __enter__(self):
        # Start the clock on entry rather than at construction, so any
        # setup between the two is not counted.
        self.start_time = _get_time()
        return self

    def __exit__(self, exc_type, exc_value, tb):
        elapsed = _get_time() - self.start_time
        print("Elapsed: {0} seconds\nAverage: {1}"
              .format(elapsed, elapsed / self.iterations))


def feather_write_bench(iterations=10):
    with open('example.pkl', 'rb') as f:
        data = pickle.load(f)

    with Timer(iterations):
        for i in range(iterations):
            feather.write_dataframe(data, 'example.fth')


def feather_read_bench(iterations=10):
    import gc
    # Disable the collector so GC pauses don't pollute the timings,
    # and re-enable it even if a read raises.
    gc.disable()
    try:
        with Timer(iterations):
            for i in range(iterations):
                feather.read_dataframe('example.fth')
    finally:
        gc.enable()


# generate_example()
# feather_write_bench()
# feather_read_bench()

{code}
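For reference, generate_example builds its input by repeated concatenation: 2 CSV rows × 5000 × 1000 = 10,000,000 rows. The same scaling with plain lists, to make the arithmetic concrete without needing pandas:

```python
rows = ["row-a", "row-b"]   # stands in for the 2-row DataFrame
rows = rows * 5000          # first concat: 10,000 rows
rows = rows * 1000          # second concat: 10,000,000 rows
print(len(rows))  # 10000000
```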

To use:

* Run generate_example() once to create the data file
* Run write benchmarks with feather_write_bench()
* Run read benchmarks with feather_read_bench()
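On the "look at the perf output" point: perf covers the native side; for a quick Python-side view, cProfile works as a first pass. A sketch with a stand-in workload (swap in the feather call when profiling the benchmark for real):

```python
import cProfile
import pstats

def workload():
    # Stand-in for feather.read_dataframe('example.fth') -- replace with
    # the real call when profiling the actual benchmark.
    return sum(i * i for i in range(200_000))

profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()

# Print the five most expensive entries by cumulative time
pstats.Stats(profiler).sort_stats("cumulative").print_stats(5)
```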



> [Python] Possible performance regression in Feather read/write path
> -------------------------------------------------------------------
>
>                 Key: ARROW-2059
>                 URL: https://issues.apache.org/jira/browse/ARROW-2059
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>            Reporter: Wes McKinney
>            Assignee: Antoine Pitrou
>            Priority: Major
>             Fix For: 0.10.0
>
>
> See discussion in https://github.com/wesm/feather/issues/329. Needs to be 
> investigated



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
