[
https://issues.apache.org/jira/browse/ARROW-12428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17324041#comment-17324041
]
David Li commented on ARROW-12428:
----------------------------------
Here's a quick comparison between Pandas/S3FS and PyArrow with a pre_buffer
option implemented:
{noformat}
Python: 3.9.2
Pandas: 1.2.3
PyArrow: 5.0.0 master (9c1e5bd19347635ea9f373bcf93f2cea0231d50a)
Pandas/S3FS: 107.31099020410329 seconds
Pandas/S3FS (no readahead): 676.9701101030223 seconds
PyArrow: 213.81073790509254 seconds
PyArrow (pre-buffer): 29.330630503827706 seconds
Pandas/S3FS (pre-buffer): 54.61801828909665 seconds
Pandas/S3FS (pre-buffer, no readahead): 46.7531590978615 seconds {noformat}
{code:python}
import time
import pandas as pd
import pyarrow.parquet as pq
start = time.monotonic()
df = pd.read_parquet("s3://ursa-labs-taxi-data/2012/01/data.parquet")
duration = time.monotonic() - start
print("Pandas/S3FS:", duration, "seconds")
start = time.monotonic()
df = pd.read_parquet("s3://ursa-labs-taxi-data/2012/01/data.parquet",
storage_options={
'default_block_size': 1, # 0 is ignored
'default_fill_cache': False,
})
duration = time.monotonic() - start
print("Pandas/S3FS (no readahead):", duration, "seconds")
start = time.monotonic()
df = pq.read_pandas("s3://ursa-labs-taxi-data/2012/01/data.parquet")
duration = time.monotonic() - start
print("PyArrow:", duration, "seconds")
start = time.monotonic()
df = pq.read_pandas("s3://ursa-labs-taxi-data/2012/01/data.parquet",
pre_buffer=True)
duration = time.monotonic() - start
print("PyArrow (pre-buffer):", duration, "seconds")
start = time.monotonic()
df = pd.read_parquet("s3://ursa-labs-taxi-data/2012/01/data.parquet",
pre_buffer=True)
duration = time.monotonic() - start
print("Pandas/S3FS (pre-buffer):", duration, "seconds")
start = time.monotonic()
df = pd.read_parquet("s3://ursa-labs-taxi-data/2012/01/data.parquet",
storage_options={
'default_block_size': 1, # 0 is ignored
'default_fill_cache': False,
}, pre_buffer=True)
duration = time.monotonic() - start
print("Pandas/S3FS (pre-buffer, no readahead):", duration, "seconds")
{code}
> [Python] pyarrow.parquet.read_* should use pre_buffer=True
> ----------------------------------------------------------
>
> Key: ARROW-12428
> URL: https://issues.apache.org/jira/browse/ARROW-12428
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: David Li
> Assignee: David Li
> Priority: Major
> Fix For: 5.0.0
>
>
> If the user is synchronously reading a single file, we should try to read it
> as fast as possible. The one sticking point might be whether it's beneficial
> to enable this no matter the filesystem or whether we should try to only
> enable it on high-latency filesystems.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)