legout opened a new issue, #14336:
URL: https://github.com/apache/arrow/issues/14336
I found large differences in loading time when loading data from AWS S3.
```python
import pyarrow.dataset as ds
import pyarrow.parquet as pq
import pyarrow.fs as pafs
import s3fs
from load_credentials import load_credentials  # local helper returning a dict of AWS credentials
credentials = load_credentials()
path = "path/to/data"  # folder with about 300 small (~10 kB) files
fs1 = s3fs.S3FileSystem(
    anon=False,
    key=credentials["accessKeyId"],
    secret=credentials["secretAccessKey"],
    token=credentials["sessionToken"],
)

fs2 = pafs.S3FileSystem(
    access_key=credentials["accessKeyId"],
    secret_key=credentials["secretAccessKey"],
    session_token=credentials["sessionToken"],
)
_ = ds.dataset(path, filesystem=fs1).to_table()  # takes about 5 seconds
_ = ds.dataset(path, filesystem=fs2).to_table()  # takes about 25 seconds
_ = pq.read_table(path, filesystem=fs1)  # takes about 5 seconds
_ = pq.read_table(path, filesystem=fs2)  # takes about 10 seconds
```
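The timings in the comments are simple wall-clock measurements; a minimal sketch of how they can be reproduced (assuming the same `path`, `fs1`, and `fs2` as above):

```python
import time

def timed(label, fn):
    # run once and report wall-clock time
    start = time.perf_counter()
    fn()
    print(f"{label}: {time.perf_counter() - start:.1f} s")

timed("ds.dataset + s3fs", lambda: ds.dataset(path, filesystem=fs1).to_table())
timed("ds.dataset + pafs", lambda: ds.dataset(path, filesystem=fs2).to_table())
timed("pq.read_table + s3fs", lambda: pq.read_table(path, filesystem=fs1))
timed("pq.read_table + pafs", lambda: pq.read_table(path, filesystem=fs2))
```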
Is there an (easy) explanation for these differences in loading time?
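If it helps narrow things down, the directory listing and the actual reads can also be timed separately; a rough sketch, reusing `path` and `fs2` from above and assuming the folder holds plain Parquet files without partitioning:

```python
import time

# 1) list the ~300 objects with pyarrow's S3 filesystem alone
start = time.perf_counter()
infos = fs2.get_file_info(pafs.FileSelector(path, recursive=True))
files = [info.path for info in infos if info.type == pafs.FileType.File]
print(f"listing {len(files)} files: {time.perf_counter() - start:.1f} s")

# 2) read from an explicit file list, skipping directory discovery in ds.dataset()
start = time.perf_counter()
_ = ds.dataset(files, filesystem=fs2, format="parquet").to_table()
print(f"reading from explicit file list: {time.perf_counter() - start:.1f} s")
```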