OliLay commented on issue #45749:
URL: https://github.com/apache/arrow/issues/45749#issuecomment-2720399318

After further investigation, I think I have found the root cause.
When using a larger thread pool, Arrow [explicitly sets](https://github.com/apache/arrow/blob/c3e399af43c4a2e384a41fad0619589baa9045f0/cpp/src/arrow/filesystem/s3fs.cc#L1178) the maximum number of connections in the AWS SDK connection pool to the number of threads in the Arrow I/O thread pool.
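
For context, the SDK knob involved is `Aws::Client::ClientConfiguration::maxConnections`. The snippet below is a minimal illustration of the mechanism as I understand it, not the actual Arrow code; 32 is a stand-in for the thread pool capacity.

```cpp
#include <aws/core/Aws.h>
#include <aws/core/client/ClientConfiguration.h>
#include <aws/s3/S3Client.h>

int main() {
  Aws::SDKOptions sdk_options;
  Aws::InitAPI(sdk_options);
  {
    Aws::Client::ClientConfiguration config;
    // What the linked Arrow code effectively does: size the HTTP connection
    // pool to the thread pool capacity (32 here as a stand-in). Every pooled
    // connection needs its own socket connect + TLS handshake before it can
    // be reused.
    config.maxConnections = 32;
    Aws::S3::S3Client client(config);
    // ... issue requests; with 32 threads, each slot opens a fresh connection.
  }
  Aws::ShutdownAPI(sdk_options);
  return 0;
}
```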
This means a new socket connection plus TLS handshake has to be established for every single request. For a metadata-only workload (as in this case), opening a fresh connection for each very short-running request (there is barely any payload involved) is expensive: performing 32 TLS handshakes in parallel takes roughly 1 second. Some server-side throttling may also be at play, but I am not sure.
With a smaller connection pool, the same requests complete in under 200 ms, because the AWS SDK reuses curl connection handles; the number of TLS handshakes then equals the number of connections in the pool. A workaround sketch on the current release follows below.
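
Since the connection pool is sized from the I/O thread pool, the only lever I see today is capping that thread pool before creating the filesystem. A minimal sketch, assuming default credentials and a hypothetical bucket/key:

```cpp
#include <arrow/filesystem/s3fs.h>
#include <arrow/io/interfaces.h>
#include <arrow/result.h>
#include <arrow/status.h>

arrow::Status MetadataOnlyExample() {
  // Workaround: the S3 client's connection pool is sized from the I/O
  // thread pool capacity, so capping the thread pool before creating the
  // filesystem also caps the number of connections (and TLS handshakes).
  // 4 is an arbitrary small value.
  ARROW_RETURN_NOT_OK(arrow::io::SetIOThreadPoolCapacity(4));

  arrow::fs::S3GlobalOptions global_options;
  global_options.log_level = arrow::fs::S3LogLevel::Fatal;
  ARROW_RETURN_NOT_OK(arrow::fs::InitializeS3(global_options));

  auto options = arrow::fs::S3Options::Defaults();
  options.region = "us-east-1";  // assumption: use your bucket's region

  ARROW_ASSIGN_OR_RAISE(auto fs, arrow::fs::S3FileSystem::Make(options));

  // A pure metadata request: stat a single (hypothetical) object.
  ARROW_ASSIGN_OR_RAISE(auto info,
                        fs->GetFileInfo("my-bucket/some/key.parquet"));
  // (inspect info.type(), info.size(), ... as needed)

  return arrow::fs::FinalizeS3();
}
```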
   
I wonder whether Arrow should really set the number of connections to the number of threads. That only makes sense when you know you have large data to transfer and the one-time overhead of opening a connection is marginal; for metadata requests it does not seem viable. We could either make the connection pool size configurable via `S3Options`, or let the user give a hint about how the file system will be used (e.g. metadata-heavy vs. large data transfers) and then decide how many connections to open based on that. A sketch of the first option is below.
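
To make the first option concrete, something along these lines; note that the `max_connections` field is hypothetical and does not exist in today's `S3Options`:

```cpp
#include <arrow/filesystem/s3fs.h>
#include <arrow/result.h>

// Hypothetical sketch of the first proposal; `max_connections` is NOT a
// real S3Options field today, the name is purely illustrative.
arrow::Result<std::shared_ptr<arrow::fs::S3FileSystem>> MakeMetadataFs() {
  auto options = arrow::fs::S3Options::Defaults();
  options.region = "us-east-1";  // assumption: bucket region
  // Proposed knob: cap the HTTP connection pool independently of the I/O
  // thread pool size, so metadata-heavy workloads only ever pay for a
  // handful of TLS handshakes.
  options.max_connections = 4;   // hypothetical field
  return arrow::fs::S3FileSystem::Make(options);
}
```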

