agrawal-siddharth commented on code in PR #31721: URL: https://github.com/apache/beam/pull/31721#discussion_r1672915640
########## sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryOptions.java: ########## @@ -109,6 +109,29 @@ public interface BigQueryOptions void setNumStorageWriteApiStreamAppendClients(Integer value); + @Description( + "When using the STORAGE_API_AT_LEAST_ONCE write method with multiplexing (ie. useStorageApiConnectionPool=true), " + + "this option sets the minimum number of connections each pool creates. This is on a per worker, per region basis. " Review Comment: "this option sets the minimum number of connections each pool creates before any connections are shared." When you say "per worker" I'm not sure what a worker corresponds to within the writeAPI storage library. Does each worker create its own StreamWriter? Under the hood, the connection pool is a static map whose KEY is the region, and the VALUE is the pool for that region. So all StreamWriter objects within the same process will share the same map. This could span many workers if they run within the same process. ########## sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryOptions.java: ########## @@ -109,6 +109,26 @@ public interface BigQueryOptions void setNumStorageWriteApiStreamAppendClients(Integer value); + @Description( + "When using the STORAGE_API_AT_LEAST_ONCE write method with multiplexing (ie. useStorageApiConnectionPool=true), " + + "this option sets the minimum number of connections each pool creates. This is on a per worker, per region basis. " + + "Note that in practice, the minimum number of connections created is the minimum of this value and " + + "(numStorageWriteApiStreamAppendClients x num destinations).") + @Default.Integer(2) + Integer getMinConnectionPoolConnections(); + + void setMinConnectionPoolConnections(Integer value); + + @Description( + "When using the STORAGE_API_AT_LEAST_ONCE write method with multiplexing (ie. useStorageApiConnectionPool=true), " + + "this option sets the maximum number of connections each pool creates. This is on a per worker, per region basis. " + + "This value should be greater than or equal to the total number of dynamic destinations, otherwise a " + + "race condition occurs where append operations compete over streams.") + @Default.Integer(20) Review Comment: We have 2 general concerns with too many connections: (a) the user will hit the concurrent connection limit, and (b) it is not fruitful to further increase the number of connections because a single machine can only send so much traffic over the network. From experiments going beyond about 20 connections does not help with increasing throughput. Due to the above, we don't expect it will help much to raise this limit. So I think a customer is less likely to want to tune this parameter but we have it for completeness. ########## sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryOptions.java: ########## @@ -109,6 +109,29 @@ public interface BigQueryOptions void setNumStorageWriteApiStreamAppendClients(Integer value); + @Description( + "When using the STORAGE_API_AT_LEAST_ONCE write method with multiplexing (ie. useStorageApiConnectionPool=true), " + + "this option sets the minimum number of connections each pool creates. This is on a per worker, per region basis. " + + "Note that in practice, the minimum number of connections created is the minimum of this value and " + + "(numStorageWriteApiStreamAppendClients x num destinations). BigQuery will create this many connections at first " Review Comment: This is the minimum number of connections per region. Note that BigQuery creates the connections on demand. Up to the minimum, there is no connection sharing. Once the minimum has been reached, existing connections will be shared until they are saturated. Once all existing connections are saturated then new connections, up to the maximum, will be created. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
