[ 
https://issues.apache.org/jira/browse/ARROW-17033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17565027#comment-17565027
 ] 

Weston Pace commented on ARROW-17033:
-------------------------------------

This probably deserves some testing and profiling.  At first glance, however, 
the linked doc for {{ConnectionPoolSizeOption}} says:

{quote}The library may create more connections than this option configures, for 
example if your application requests many simultaneous downloads. {quote}

It seems like this option shouldn't prevent concurrency.  Also, we should see 
if we can find some concrete guidance on the number of threads.  For example, 
S3 
[recommends|https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimizing-performance-design-patterns.html]:
"Make one concurrent request for each 85–90 MB/s of desired network throughput."
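
To make that rule of thumb concrete, here is a rough sketch of the arithmetic. {{RequestsForThroughput}} is a hypothetical helper, not part of Arrow or any cloud SDK, and the 85 MB/s default is simply the low end of the S3 figure quoted above. Whether a similar per-request number holds for GCS is an assumption that would need measuring.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper (not an Arrow or cloud SDK API): estimate how many
// concurrent requests are needed to reach a target aggregate throughput,
// assuming each request sustains roughly `per_request_mb_per_s`.
// The 85 MB/s default is the low end of the S3 guidance; it is an assumed
// value for illustration, not a measured GCS number.
inline int RequestsForThroughput(double target_mb_per_s,
                                 double per_request_mb_per_s = 85.0) {
  assert(per_request_mb_per_s > 0.0);
  if (target_mb_per_s <= 0.0) {
    return 0;  // Nothing to transfer, so no requests are needed.
  }
  // Round up: a fractional request still occupies a whole connection/thread.
  return static_cast<int>(std::ceil(target_mb_per_s / per_request_mb_per_s));
}
```

For example, a 10 Gbit/s link (about 1250 MB/s) works out to 14 or 15 concurrent requests under this rule, while 100 concurrent requests would correspond to roughly 8.5 GB/s of desired throughput.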

If the ideal concurrency really is 100 threads we should, for now, document 
this somewhere visible to users so they know to bump the I/O thread pool 
capacity.  In the future we should find a way to adjust the I/O thread pool 
capacity automatically, but that is a more substantial task.

> [C++] Add GCS connection pool size option
> -----------------------------------------
>
>                 Key: ARROW-17033
>                 URL: https://issues.apache.org/jira/browse/ARROW-17033
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>    Affects Versions: 8.0.0
>            Reporter: Leonhard Gruenschloss
>            Priority: Minor
>              Labels: GCP, good-first-issue, performance
>
> Multi-threaded read performance in Arrow's GCS file system implementation 
> currently is relatively low. Given the high latency of cloud blob systems 
> like GCS, a common strategy is to use many concurrent readers (if the system 
> has enough memory to support that), e.g. using 100 threads.
> The GCS client library offers a [{{ConnectionPoolSize}} 
> option|https://googleapis.dev/cpp/google-cloud-storage/latest/structgoogle_1_1cloud_1_1storage_1_1v1_1_1ConnectionPoolSizeOption.html].
>  If this option is set to a value that's too low, concurrency is throttled. 
> At the moment, this is not exposed in 
> [{{GcsOptions}}|https://github.com/apache/arrow/blob/73cdd6a59b52781cc43e097ccd63ac36f705ee2e/cpp/src/arrow/filesystem/gcsfs.h#L59],
>  consequently limiting multi-threaded throughput.
> Instead of exposing this option, an alternative implementation strategy could 
> be to use the same value as set by {{arrow::io::SetIOThreadPoolCapacity}}.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
