OliLay commented on issue #45749: URL: https://github.com/apache/arrow/issues/45749#issuecomment-2720399318
After further investigation, I think I have found the root cause. When a larger thread pool is used, Arrow [explicitly sets](https://github.com/apache/arrow/blob/c3e399af43c4a2e384a41fad0619589baa9045f0/cpp/src/arrow/filesystem/s3fs.cc#L1178) the size of the HTTP connection pool to the number of threads in the Arrow I/O thread pool. As a result, a fresh socket connect plus TLS handshake has to be performed for essentially every request. For a workload consisting only of metadata requests (as in this case), where each request is very short-lived and transfers barely any data, opening a new connection per request is expensive: doing 32 TLS handshakes in parallel takes ~1 second. I suspect some server-side throttling may also be involved, but I'm not sure.

With a smaller connection pool, the same requests complete in <200 ms, because the AWS SDK reuses curl connection handles, so the number of TLS handshakes equals the number of connections in the pool rather than the number of requests.

I wonder whether Arrow should really size the connection pool to match the thread pool. That only pays off when you know you are transferring large amounts of data, so the one-time cost of opening each connection is marginal; for metadata requests it doesn't seem viable. We could either make the connection pool size configurable via `S3Options`, or let the user give a hint about how the filesystem will be used (e.g. metadata-heavy vs. large data transfers) and derive the number of connections from that.
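To make the first option concrete, here is a minimal sketch of what a configurable pool size could look like. The `max_connections` field and the `ConfigureClient` helper are assumptions for illustration, not existing Arrow API; only `Aws::Client::ClientConfiguration::maxConnections` is the real AWS SDK knob that s3fs.cc already sets today:

```cpp
#include <aws/core/client/ClientConfiguration.h>

namespace arrow::fs {

// Hypothetical addition to arrow::fs::S3Options (field name is an
// assumption, not existing Arrow API):
struct S3Options {
  // ... existing options ...

  // If > 0, overrides the default of sizing the connection pool to the
  // I/O thread pool capacity. Small values let the AWS SDK reuse curl
  // handles across requests, avoiding a fresh TCP connect + TLS
  // handshake per request.
  int max_connections = -1;  // -1 means "derive from thread pool size"
};

// Sketch of how the client setup in s3fs.cc could consume the option:
void ConfigureClient(Aws::Client::ClientConfiguration* config,
                     const S3Options& options, int io_thread_pool_size) {
  if (options.max_connections > 0) {
    // Explicit user setting, e.g. a small pool for metadata-heavy work.
    config->maxConnections = options.max_connections;
  } else {
    // Current behavior: one connection per I/O thread, which favors
    // large parallel transfers over many short metadata requests.
    config->maxConnections = io_thread_pool_size;
  }
}

}  // namespace arrow::fs
```

The second option (a usage hint) could then just map the hint to different defaults for this same field, so both approaches converge on the same underlying setting.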
