[
https://issues.apache.org/jira/browse/ARROW-2059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16559132#comment-16559132
]
Wes McKinney edited comment on ARROW-2059 at 7/27/18 1:46 AM:
--------------------------------------------------------------
I ran some benchmarks locally (quad-core Xeon E3-1535M, Ubuntu 14.04). There is
still a slight ~5-10% write performance regression, but reading is faster.
Feather 0.3.1:
{code}
# WRITE
$ python bench.py
Elapsed: 15.497231721878052 seconds
Average: 1.549723172187805
# READ
$ python bench.py
Elapsed: 9.88158106803894 seconds
Average: 0.988158106803894
{code}
Feather 0.4.0:
{code}
# WRITE
$ python bench.py
Elapsed: 16.36524486541748 seconds
Average: 1.636524486541748
# READ
$ python bench.py
Elapsed: 7.4859395027160645 seconds
Average: 0.7485939502716065
{code}
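As a quick back-of-the-envelope check on the numbers above, the per-iteration averages imply roughly a 5.6% write regression and a ~32% read speedup:

```python
# Per-iteration averages printed above (seconds)
write_031, write_040 = 1.549723172187805, 1.636524486541748
read_031, read_040 = 0.988158106803894, 0.7485939502716065

write_regression = write_040 / write_031 - 1  # fraction slower in 0.4.0
read_speedup = read_031 / read_040 - 1        # fraction faster in 0.4.0

print(f"write regression: {write_regression:.1%}")  # ~5.6%
print(f"read speedup: {read_speedup:.1%}")          # ~32.0%
```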
Here's the benchmarking script so people can run their own experiments. It
would be useful to look at the perf output more closely and see what else we
can do to make things faster:
{code}
import gc
import io
import pickle
import time

import feather
import pandas as pd


def generate_example():
    # Two sample CSV rows, replicated 5000 * 1000 times into a large frame
    buf = io.StringIO("""07300003030539,42198997,-1,2016-10-03T13:14:22.326Z
41130003053286,42224636,-1,2016-09-20T19:31:51.196Z
""")
    table = pd.read_csv(buf, header=None)
    table = pd.concat([table] * 5000, axis=0, ignore_index=True)
    table = pd.concat([table] * 1000, axis=0, ignore_index=True)
    with open('example.pkl', 'wb') as f:
        pickle.dump(table, f)


def _get_time():
    return time.clock_gettime(time.CLOCK_REALTIME)


class Timer:

    def __init__(self, iterations):
        self.iterations = iterations

    def __enter__(self):
        # Start the clock on entry rather than in __init__ so the timed
        # region is exactly the body of the with block
        self.start_time = _get_time()
        return self

    def __exit__(self, exc_type, exc_value, tb):
        elapsed = _get_time() - self.start_time
        print("Elapsed: {0} seconds\nAverage: {1}"
              .format(elapsed, elapsed / self.iterations))


def feather_write_bench(iterations=10):
    with open('example.pkl', 'rb') as f:
        data = pickle.load(f)
    with Timer(iterations):
        for i in range(iterations):
            feather.write_dataframe(data, 'example.fth')


def feather_read_bench(iterations=10):
    # Disable the garbage collector so GC pauses don't pollute the timings
    gc.disable()
    with Timer(iterations):
        for i in range(iterations):
            feather.read_dataframe('example.fth')
    gc.enable()


# generate_example()
# feather_write_bench()
# feather_read_bench()
{code}
To use:
* Run generate_example() once to create the data file
* Run the write benchmark with feather_write_bench()
* Run the read benchmark with feather_read_bench()
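As an aside on the timing method: clock_gettime(CLOCK_REALTIME) is a wall clock that can jump if the system time is adjusted, while time.perf_counter() is monotonic and portable. A self-contained variant of the Timer pattern using it (a sketch, not part of the original script; PerfTimer and the stand-in workload are illustrative names):

```python
import time


class PerfTimer:
    """Context manager reporting total and per-iteration elapsed time."""

    def __init__(self, iterations):
        self.iterations = iterations

    def __enter__(self):
        # perf_counter() is monotonic, so interval measurements can't go
        # backwards if the system clock is adjusted mid-run
        self.start_time = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc_value, tb):
        self.elapsed = time.perf_counter() - self.start_time
        print("Elapsed: {0} seconds\nAverage: {1}"
              .format(self.elapsed, self.elapsed / self.iterations))


iterations = 10
with PerfTimer(iterations) as t:
    for _ in range(iterations):
        sum(range(100000))  # stand-in workload
```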
> [Python] Possible performance regression in Feather read/write path
> -------------------------------------------------------------------
>
> Key: ARROW-2059
> URL: https://issues.apache.org/jira/browse/ARROW-2059
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Reporter: Wes McKinney
> Assignee: Antoine Pitrou
> Priority: Major
> Fix For: 0.10.0
>
>
> See discussion in https://github.com/wesm/feather/issues/329. Needs to be
> investigated
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)