Leonhard Gruenschloss created ARROW-17033:
---------------------------------------------
Summary: [C++] Add GCS connection pool size option
Key: ARROW-17033
URL: https://issues.apache.org/jira/browse/ARROW-17033
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Affects Versions: 8.0.0
Reporter: Leonhard Gruenschloss
Multi-threaded read performance in Arrow's GCS file system implementation is
currently relatively low. Because cloud blob stores such as GCS have high
latency, a common strategy is to use many concurrent readers, e.g. 100 threads,
provided the system has enough memory to support that.
The GCS client library offers a [{{ConnectionPoolSize}}
option|https://googleapis.dev/cpp/google-cloud-storage/latest/structgoogle_1_1cloud_1_1storage_1_1v1_1_1ConnectionPoolSizeOption.html].
If this option is set too low, concurrency is throttled. It is currently not
exposed in
[{{GcsOptions}}|https://github.com/apache/arrow/blob/73cdd6a59b52781cc43e097ccd63ac36f705ee2e/cpp/src/arrow/filesystem/gcsfs.h#L59],
which limits multi-threaded throughput.
Instead of exposing this option, an alternative implementation strategy could
be to use the same value as set by {{arrow::io::SetIOThreadPoolCapacity}}.
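
The two strategies could also be combined: honor an explicit setting when one is
given, and otherwise fall back to the I/O thread pool capacity. Below is a
minimal sketch of that resolution logic, not the actual Arrow implementation.
{{GcsOptionsSketch}}, its {{connection_pool_size}} field and
{{ResolveConnectionPoolSize}} are hypothetical names used only for
illustration; none of them exist in Arrow.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>

// Hypothetical stand-in for arrow::fs::GcsOptions; only the proposed field
// is shown. This is an assumption, not the real struct.
struct GcsOptionsSketch {
  // Hypothetical field: when unset, the pool size is derived from the
  // I/O thread pool capacity instead.
  std::optional<std::size_t> connection_pool_size;
};

// Hypothetical helper: picks the value that would be passed to the GCS
// client's ConnectionPoolSize option.
inline std::size_t ResolveConnectionPoolSize(const GcsOptionsSketch& options,
                                             int io_thread_capacity) {
  // Precondition: the caller supplies a positive thread pool capacity.
  assert(io_thread_capacity > 0);
  if (options.connection_pool_size.has_value()) {
    // An explicit user setting takes precedence.
    return *options.connection_pool_size;
  }
  // Fallback: one pooled connection per I/O thread.
  return static_cast<std::size_t>(io_thread_capacity);
}
```

In a real implementation, the capacity argument would presumably come from
{{arrow::io::GetIOThreadPoolCapacity()}}, and the resolved value would be set
on the client options through the {{ConnectionPoolSize}} option linked above.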