shaktishp opened a new issue, #36587:
URL: https://github.com/apache/arrow/issues/36587
### Describe the usage question you have. Please include as many useful details as possible.
We are doing a small PoC comparing the performance of loading Parquet files directly from S3 versus from the local file system.
```python
# Jupyter notebook code snippet; the notebook and the S3 bucket are in the same region.
import pyarrow.dataset as ds
import time

s3_dataset = ds.dataset('location')  # 'location' is either an S3 URI or a local path
scanner = s3_dataset.scanner()
batches = scanner.to_batches()  # lazy iterator; reads happen in the loop below

st = time.time()
for batch in batches:
    batch.num_rows  # touch each batch so it is fully materialized
et = time.time()
elapsed = et - st
```
Total files: 50; total size: 500 MB.
Time taken to read from S3: 970 seconds.
Time taken to read from the local file system: 5 seconds.

Any insight into why S3 is taking so much time? Are there any settings we are missing when reading files from S3 that would give decent performance?

Any help is appreciated.
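For reference, a minimal sketch of the settings we gathered from the docs that seem aimed at object stores (Parquet pre-buffering, an explicit `S3FileSystem`, and threaded scanning). The region and bucket path below are placeholders, and we have not verified that these options are the right fix here:

```python
import pyarrow.dataset as ds
import pyarrow.fs as fs

# Hypothetical region and bucket path; substitute your own.
s3 = fs.S3FileSystem(region="us-east-1")

# pre_buffer coalesces the many small column-chunk reads Parquet issues
# into fewer, larger GET requests, which tends to matter on high-latency stores.
parquet_format = ds.ParquetFileFormat(
    default_fragment_scan_options=ds.ParquetFragmentScanOptions(pre_buffer=True)
)

dataset = ds.dataset("bucket/path", filesystem=s3, format=parquet_format)
scanner = dataset.scanner(use_threads=True)  # scan fragments in parallel

for batch in scanner.to_batches():
    batch.num_rows
```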
### Component(s)
FlightRPC, Parquet, Python, Other