Re: [I] [Python/C++] `S3FileSystem` slow to deserialize due to AWS rule engine JSON parsing [arrow]

via GitHub Sat, 02 Mar 2024 17:11:05 -0800


user293811 commented on issue #40279:
URL: https://github.com/apache/arrow/issues/40279#issuecomment-1974965807


   > With the caching PoC in #40299 I get the following:
   > 
   > ```python
   > >>> %timeit s = S3FileSystem(anonymous=True)
   > 34.2 µs ± 42.8 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
   > >>> %timeit s = S3FileSystem(anonymous=True, region='eu-west-1')
   > 32.8 µs ± 37.4 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
   > >>> %timeit s = S3FileSystem()
   > 52.5 µs ± 262 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
   > ```
   
   Would caching help with the first instantiation? Connecting to S3 currently 
takes 5 seconds on the first attempt, and I can't reuse the connection because 
each ETL job runs in its own separate process and has to open its own 
connection.  
   
   If not, are there other ways I can establish a connection to S3 with pyarrow 
that would be faster?
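
For context, `%timeit` averages over thousands of loops, so the numbers quoted above mostly reflect warm calls and would hide a one-off cost on the first call. A minimal sketch of how the first instantiation could be timed separately is below. `time_first_and_warm` and `s3_factory` are hypothetical helper names used for illustration only (they are not part of pyarrow), and the sketch assumes pyarrow is installed with S3 support.

```python
import time


def time_first_and_warm(factory, warm_runs=5):
    """Time the first call to `factory` separately from later calls.

    Returns (first_call_seconds, mean_warm_call_seconds).
    """
    # The first call pays any one-off initialization cost.
    start = time.perf_counter()
    factory()
    first = time.perf_counter() - start

    # Later calls in the same process show the warm cost.
    warm = []
    for _ in range(warm_runs):
        start = time.perf_counter()
        factory()
        warm.append(time.perf_counter() - start)

    return first, sum(warm) / len(warm)


def s3_factory():
    # Hypothetical factory for illustration: imported lazily so the import
    # is counted as part of the first call, as it would be in a fresh
    # ETL worker process. Arguments are copied from the timings quoted above.
    from pyarrow.fs import S3FileSystem

    return S3FileSystem(anonymous=True, region="eu-west-1")
```

Calling `time_first_and_warm(s3_factory)` in a fresh interpreter should give a first-call figure comparable to what each separate ETL process sees, next to the warm figure that `%timeit` reports.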


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
