legout opened a new issue, #14336:
URL: https://github.com/apache/arrow/issues/14336
I found large differences in loading time when loading data from AWS S3.
```python
import pyarrow.dataset as ds
import pyarrow.parquet as pq
import pyarrow.fs as pafs
import s3fs
from load_credentials import load_credentials  # local helper returning a dict of AWS credentials
credentials = load_credentials()
path = "path/to/data"  # folder with about 300 small (~10 kB) files
fs1 = s3fs.S3FileSystem(
    anon=False,
    key=credentials["accessKeyId"],
    secret=credentials["secretAccessKey"],
    token=credentials["sessionToken"],
)

fs2 = pafs.S3FileSystem(
    access_key=credentials["accessKeyId"],
    secret_key=credentials["secretAccessKey"],
    session_token=credentials["sessionToken"],
)
_ = ds.dataset(path, filesystem=fs1).to_table()  # takes about 5 seconds
_ = ds.dataset(path, filesystem=fs2).to_table()  # takes about 25 seconds
_ = pq.read_table(path, filesystem=fs1)  # takes about 5 seconds
_ = pq.read_table(path, filesystem=fs2)  # takes about 10 seconds
```
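The timings in the comments are simple wall-clock measurements; a minimal sketch of how they can be reproduced (assuming the same `path`, `fs1`, and `fs2` as above):

```python
import time

def timed(label, fn):
    # run once and report wall-clock time
    start = time.perf_counter()
    fn()
    print(f"{label}: {time.perf_counter() - start:.1f} s")

timed("ds.dataset + s3fs", lambda: ds.dataset(path, filesystem=fs1).to_table())
timed("ds.dataset + pafs", lambda: ds.dataset(path, filesystem=fs2).to_table())
timed("pq.read_table + s3fs", lambda: pq.read_table(path, filesystem=fs1))
timed("pq.read_table + pafs", lambda: pq.read_table(path, filesystem=fs2))
```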
Is there an (easy) explanation for these differences in loading time?
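If it helps narrow things down, the directory listing and the actual reads can also be timed separately; a rough sketch, reusing `path` and `fs2` from above and assuming the folder holds plain Parquet files without partitioning:

```python
import time

# 1) list the ~300 objects with pyarrow's S3 filesystem alone
start = time.perf_counter()
infos = fs2.get_file_info(pafs.FileSelector(path, recursive=True))
files = [info.path for info in infos if info.type == pafs.FileType.File]
print(f"listing {len(files)} files: {time.perf_counter() - start:.1f} s")

# 2) read from an explicit file list, skipping directory discovery in ds.dataset()
start = time.perf_counter()
_ = ds.dataset(files, filesystem=fs2, format="parquet").to_table()
print(f"reading from explicit file list: {time.perf_counter() - start:.1f} s")
```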