damccorm commented on code in PR #31721:
URL: https://github.com/apache/beam/pull/31721#discussion_r1663730726
##########
sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryOptions.java:
##########
@@ -109,6 +109,26 @@ public interface BigQueryOptions
 
   void setNumStorageWriteApiStreamAppendClients(Integer value);
 
+  @Description(
+      "When using the STORAGE_API_AT_LEAST_ONCE write method with multiplexing (ie. useStorageApiConnectionPool=true), "
+          + "this option sets the minimum number of connections each pool creates. This is on a per worker, per region basis. "
+          + "Note that in practice, the minimum number of connections created is the minimum of this value and "
+          + "(numStorageWriteApiStreamAppendClients x num destinations).")
+  @Default.Integer(2)
+  Integer getMinConnectionPoolConnections();
+
+  void setMinConnectionPoolConnections(Integer value);
+
+  @Description(
+      "When using the STORAGE_API_AT_LEAST_ONCE write method with multiplexing (ie. useStorageApiConnectionPool=true), "
+          + "this option sets the maximum number of connections each pool creates. This is on a per worker, per region basis. "
+          + "This value should be greater than or equal to the total number of dynamic destinations, otherwise a "
+          + "race condition occurs where append operations compete over streams.")
+  @Default.Integer(20)

Review Comment:
A few questions here:
1) Where does 20 come from? I note that, per https://cloud.google.com/bigquery/docs/write-api-best-practices#connection_pool_management, 20 connections seems more like a floor than a ceiling. This has the potential to degrade existing jobs that write to >20 destinations, so we should be careful here.
2) "otherwise a race condition occurs where append operations compete over streams." - could we be more explicit about the implications of the race condition? It's degraded performance, not correctness, right?
3) Can we detect when this race condition might occur and warn? I know we can't ahead of time, but do we get any warnings or anything like that back? Alternatively, can we track the number of dynamic destinations we're writing to? (Not sure if there's a clean path to doing this.)

##########
sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryOptions.java:
##########
@@ -109,6 +109,26 @@ public interface BigQueryOptions
 
   void setNumStorageWriteApiStreamAppendClients(Integer value);
 
+  @Description(
+      "When using the STORAGE_API_AT_LEAST_ONCE write method with multiplexing (ie. useStorageApiConnectionPool=true), "
+          + "this option sets the minimum number of connections each pool creates. This is on a per worker, per region basis. "
+          + "Note that in practice, the minimum number of connections created is the minimum of this value and "
+          + "(numStorageWriteApiStreamAppendClients x num destinations).")
+  @Default.Integer(2)
+  Integer getMinConnectionPoolConnections();
+
+  void setMinConnectionPoolConnections(Integer value);

Review Comment:
Is there value in exposing this to a customer? When would they want to tune this?
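To ground the tuning question, here is a minimal sketch of how a pipeline author might set the proposed knobs alongside the existing multiplexing flags. It is an illustration only: the setter for the option annotated with `@Default.Integer(20)` is not visible in the quoted hunks, so `setMaxConnectionPoolConnections` is an assumed name, and the values are placeholders.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class ConnectionPoolOptionsExample {
  public static void main(String[] args) {
    BigQueryOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(BigQueryOptions.class);

    // Multiplexing only applies to the at-least-once Storage Write API path.
    options.setUseStorageWriteApiAtLeastOnce(true);
    options.setUseStorageApiConnectionPool(true);

    // Proposed option under review (name taken from the diff above).
    options.setMinConnectionPoolConnections(2);
    // Assumed setter name for the option annotated with @Default.Integer(20);
    // a job writing to many dynamic destinations would presumably raise this.
    options.setMaxConnectionPoolConnections(100);

    Pipeline pipeline = Pipeline.create(options);
    // ... BigQueryIO.writeTableRows().withMethod(Write.Method.STORAGE_API_AT_LEAST_ONCE) ...
    pipeline.run().waitUntilFinish();
  }
}
```

Seeing it written out this way, the customer-facing story for the min option is not obvious to me either, which is part of why I'm asking.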
##########
sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryOptions.java:
##########
@@ -122,6 +142,9 @@ public interface BigQueryOptions
 
   void setStorageWriteMaxInflightBytes(Long value);
 
+  @Description(
+      "Enables multiplexing mode, where multiple tables can share the same connection. Only available when writing with STORAGE_API_AT_LEAST_ONCE"
+          + " mode. For more information, see https://cloud.google.com/bigquery/docs/write-api-best-practices#connection_pool_management")

Review Comment:
Could we add that:
1) this is recommended when writing to >20 tables? I think this is true based on the linked doc; if not, maybe we can add other guidance on when to enable it.
2) Can we add something like "when using multiplexing mode, consider limiting `MaxConnectionPoolConnections` as well"?

##########
sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryOptions.java:
##########
@@ -109,6 +109,26 @@ public interface BigQueryOptions
 
   void setNumStorageWriteApiStreamAppendClients(Integer value);
 
+  @Description(
+      "When using the STORAGE_API_AT_LEAST_ONCE write method with multiplexing (ie. useStorageApiConnectionPool=true), "
+          + "this option sets the minimum number of connections each pool creates. This is on a per worker, per region basis. "
+          + "Note that in practice, the minimum number of connections created is the minimum of this value and "
+          + "(numStorageWriteApiStreamAppendClients x num destinations).")
+  @Default.Integer(2)
+  Integer getMinConnectionPoolConnections();
+
+  void setMinConnectionPoolConnections(Integer value);
+
+  @Description(
+      "When using the STORAGE_API_AT_LEAST_ONCE write method with multiplexing (ie. useStorageApiConnectionPool=true), "
+          + "this option sets the maximum number of connections each pool creates. This is on a per worker, per region basis. "
+          + "This value should be greater than or equal to the total number of dynamic destinations, otherwise a "
+          + "race condition occurs where append operations compete over streams.")
+  @Default.Integer(20)

Review Comment:
> Where does 20 come from?

More specifically - can we make this significantly higher without running into quota issues? How many connections are the jobs that are running into problems using? IIRC it is orders of magnitude higher than this.
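On the earlier "can we detect and warn" question: purely as a hypothetical sketch (none of these names exist in the PR or in Beam), one way the sink could surface the contention discussed above would be to count distinct dynamic destinations seen on a worker and log a one-time warning when that count exceeds the configured pool ceiling.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicBoolean;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Hypothetical helper, not part of the PR: tracks distinct destinations per worker and
// warns once if they outnumber the connection pool's maximum size.
class DestinationCountMonitor {
  private static final Logger LOG = LoggerFactory.getLogger(DestinationCountMonitor.class);

  private final Set<String> seenDestinations = ConcurrentHashMap.newKeySet();
  private final AtomicBoolean warned = new AtomicBoolean(false);
  private final int maxPoolConnections;

  DestinationCountMonitor(int maxPoolConnections) {
    this.maxPoolConnections = maxPoolConnections;
  }

  // Called once per append with the fully qualified table name.
  void recordDestination(String tableUrn) {
    seenDestinations.add(tableUrn);
    if (seenDestinations.size() > maxPoolConnections && warned.compareAndSet(false, true)) {
      LOG.warn(
          "Observed {} dynamic destinations but the connection pool is capped at {} connections; "
              + "append operations may contend for streams and throughput may degrade.",
          seenDestinations.size(),
          maxPoolConnections);
    }
  }
}
```

This only covers the "track the number of dynamic destinations" half of the question; whether the backend reports anything usable when contention actually happens is still open.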
