shaktishp opened a new issue, #36587:
URL: https://github.com/apache/arrow/issues/36587
### Describe the usage question you have. Please include as many useful details as possible.
We are doing a small PoC comparing the performance of loading Parquet files directly from S3 versus from the local file system.
```python
# Jupyter notebook code snippet; the notebook and the S3 bucket are in the same region.
import pyarrow.dataset as ds
import time

s3_dataset = ds.dataset('location')  # 'location' is either an S3 URI or a local path
scanner = s3_dataset.scanner()
batches = scanner.to_batches()  # lazy iterator; reads happen in the loop below

st = time.time()
for batch in batches:
    batch.num_rows  # touch each batch so it is fully materialized
et = time.time()
elapsed = et - st
```
Total files: 50; total size: 500 MB.
Time taken to read from S3: 970 seconds.
Time taken to read from the local file system: 5 seconds.

Any insight into why S3 is taking so much time? Are there any settings we are missing when reading files from S3 that would give decent performance?

Any help is appreciated.
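For reference, a minimal sketch of the settings we gathered from the docs that seem aimed at object stores (Parquet pre-buffering, an explicit `S3FileSystem`, and threaded scanning). The region and bucket path below are placeholders, and we have not verified that these options are the right fix here:

```python
import pyarrow.dataset as ds
import pyarrow.fs as fs

# Hypothetical region and bucket path; substitute your own.
s3 = fs.S3FileSystem(region="us-east-1")

# pre_buffer coalesces the many small column-chunk reads Parquet issues
# into fewer, larger GET requests, which tends to matter on high-latency stores.
parquet_format = ds.ParquetFileFormat(
    default_fragment_scan_options=ds.ParquetFragmentScanOptions(pre_buffer=True)
)

dataset = ds.dataset("bucket/path", filesystem=s3, format=parquet_format)
scanner = dataset.scanner(use_threads=True)  # scan fragments in parallel

for batch in scanner.to_batches():
    batch.num_rows
```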
### Component(s)
FlightRPC, Parquet, Python, Other