Volker Lorrmann created ARROW-17961:
---------------------------------------
Summary: Add read/write optimization for pyarrow.fs.S3FileSystem
Key: ARROW-17961
URL: https://issues.apache.org/jira/browse/ARROW-17961
Project: Apache Arrow
Issue Type: Improvement
Components: Python
Reporter: Volker Lorrmann
I found large differences in loading time when reading data from AWS S3 using
{{pyarrow.fs.S3FileSystem}} compared to {{s3fs.S3FileSystem}}. See the example
below.
The difference comes from optimizations in {{s3fs}} that {{pyarrow.fs}} does not
(yet) use.
{code:python}
import pyarrow.dataset as ds
import pyarrow.parquet as pq
import pyarrow.fs as pafs
import s3fs
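# load_credentials is the reporter's own helper (not shown); it is assumed here
# to be a function inside a module of the same name.
from load_credentials import load_credentials
credentials = load_credentials()
path = "path/to/data"  # folder with about 300 small (~10 kB) files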
fs1 = s3fs.S3FileSystem(
    anon=False,
    key=credentials["accessKeyId"],
    secret=credentials["secretAccessKey"],
    token=credentials["sessionToken"],
)
fs2 = pafs.S3FileSystem(
    access_key=credentials["accessKeyId"],
    secret_key=credentials["secretAccessKey"],
    session_token=credentials["sessionToken"],
)
_ = ds.dataset(path, filesystem=fs1).to_table() # takes about 5 seconds
_ = ds.dataset(path, filesystem=fs2).to_table() # takes about 25 seconds
_ = pq.read_table(path, filesystem=fs1) # takes about 5 seconds
_ = pq.read_table(path, filesystem=fs2) # takes about 10 seconds
{code}
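The timings in the comments above can be reproduced more systematically with a small helper. The following is only a minimal sketch: {{time_call}} and its {{repeats}} parameter are hypothetical names introduced here for illustration, and the stand-in workload replaces the S3 reads so that the snippet runs without credentials. Repeated runs may benefit from caching, so the best-of-N value is not necessarily comparable to a cold first read.

```python
import time
from typing import Any, Callable, Tuple


def time_call(func: Callable[[], Any], repeats: int = 3) -> Tuple[Any, float]:
    """Call func `repeats` times; return the last result and the best wall-clock time in seconds.

    Hypothetical helper for illustration, not part of pyarrow or s3fs.
    """
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = func()
        best = min(best, time.perf_counter() - start)
    return result, best


if __name__ == "__main__":
    # Stand-in workload so the sketch runs without S3 access.
    _, seconds = time_call(lambda: sum(range(100_000)))
    print(f"best of 3: {seconds:.4f} s")
    # With ds, path, fs1 and fs2 defined as in the example above, one could write:
    # _, t_s3fs = time_call(lambda: ds.dataset(path, filesystem=fs1).to_table())
    # _, t_pafs = time_call(lambda: ds.dataset(path, filesystem=fs2).to_table())
```

Wrapping each read in a lambda keeps the helper independent of the filesystem under test, so the same function can time both the {{ds.dataset}} and the {{pq.read_table}} calls.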
--
This message was sent by Atlassian Jira
(v8.20.10#820010)