OliLay commented on issue #45749: URL: https://github.com/apache/arrow/issues/45749#issuecomment-2720399318
After further investigation, I think I have found the root cause. When a larger thread pool is used, Arrow [explicitly sets](https://github.com/apache/arrow/blob/c3e399af43c4a2e384a41fad0619589baa9045f0/cpp/src/arrow/filesystem/s3fs.cc#L1178) the size of the HTTP connection pool to the number of threads in the Arrow I/O thread pool. As a result, a fresh socket connect plus TLS handshake has to be performed for essentially every request. For a workload consisting only of metadata requests (as in this case), where each request is very short-lived and transfers barely any data, opening a new connection per request is expensive: doing 32 TLS handshakes in parallel takes ~1 second. I suspect some server-side throttling may also be involved, but I'm not sure.

With a smaller connection pool, the same requests complete in <200 ms, because the AWS SDK reuses curl connection handles, so the number of TLS handshakes equals the number of connections in the pool rather than the number of requests.

I wonder whether Arrow should really size the connection pool to match the thread pool. That only pays off when you know you are transferring large amounts of data, so the one-time cost of opening each connection is marginal; for metadata requests it doesn't seem viable. We could either make the connection pool size configurable via `S3Options`, or let the user give a hint about how the filesystem will be used (e.g. metadata-heavy vs. large data transfers) and derive the number of connections from that.
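To make the first option concrete, here is a minimal sketch of what a configurable pool size could look like. The `max_connections` field and the `ConfigureClient` helper are assumptions for illustration, not existing Arrow API; only `Aws::Client::ClientConfiguration::maxConnections` is the real AWS SDK knob that s3fs.cc already sets today:

```cpp
#include <aws/core/client/ClientConfiguration.h>

namespace arrow::fs {

// Hypothetical addition to arrow::fs::S3Options (field name is an
// assumption, not existing Arrow API):
struct S3Options {
  // ... existing options ...

  // If > 0, overrides the default of sizing the connection pool to the
  // I/O thread pool capacity. Small values let the AWS SDK reuse curl
  // handles across requests, avoiding a fresh TCP connect + TLS
  // handshake per request.
  int max_connections = -1;  // -1 means "derive from thread pool size"
};

// Sketch of how the client setup in s3fs.cc could consume the option:
void ConfigureClient(Aws::Client::ClientConfiguration* config,
                     const S3Options& options, int io_thread_pool_size) {
  if (options.max_connections > 0) {
    // Explicit user setting, e.g. a small pool for metadata-heavy work.
    config->maxConnections = options.max_connections;
  } else {
    // Current behavior: one connection per I/O thread, which favors
    // large parallel transfers over many short metadata requests.
    config->maxConnections = io_thread_pool_size;
  }
}

}  // namespace arrow::fs
```

The second option (a usage hint) could then just map the hint to different defaults for this same field, so both approaches converge on the same underlying setting.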
