[
https://issues.apache.org/jira/browse/BEAM-11330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17330931#comment-17330931
]
Beam JIRA Bot commented on BEAM-11330:
--------------------------------------
This issue is assigned but has not received an update in 30 days so it has been
labeled "stale-assigned". If you are still working on the issue, please give an
update and remove the label. If you are no longer working on the issue, please
unassign so someone else may work on it. In 7 days the issue will be
automatically unassigned.
> BigQueryServicesImpl.insertAll evaluates maxRowBatchSize after a row is added
> to the batch
> ------------------------------------------------------------------------------------------
>
> Key: BEAM-11330
> URL: https://issues.apache.org/jira/browse/BEAM-11330
> Project: Beam
> Issue Type: Bug
> Components: io-java-gcp
> Affects Versions: 2.22.0, 2.23.0, 2.24.0, 2.25.0
> Reporter: Liam Haworth
> Assignee: Pablo Estrada
> Priority: P3
> Labels: stale-assigned
>
> When using the {{BigQueryIO.Write}} transformation, a set of pipeline options
> defined in {{BigQueryOptions}} becomes available to the pipeline.
> Two of these options are (see the sketch after this list):
> * {{maxStreamingRowsToBatch}} - "The maximum number of rows to batch in a
> single streaming insert to BigQuery."
> * {{maxStreamingBatchSize}} - "The maximum byte size of a single streaming
> insert to BigQuery"
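> As a rough illustration of how these options could be supplied, here is a minimal
> sketch. The option names come from this issue; the flag values and the rest of the
> pipeline setup are illustrative assumptions only:
> {code:java}
> import org.apache.beam.sdk.Pipeline;
> import org.apache.beam.sdk.io.gcp.bigquery.BigQueryOptions;
> import org.apache.beam.sdk.options.PipelineOptionsFactory;
>
> public class StreamingInsertOptionsSketch {
>   public static void main(String[] args) {
>     // Illustrative values: cap batches at 500 rows and (intendedly) 64 KiB.
>     BigQueryOptions options =
>         PipelineOptionsFactory.fromArgs(
>                 "--maxStreamingRowsToBatch=500",
>                 "--maxStreamingBatchSize=65536")
>             .as(BigQueryOptions.class);
>     Pipeline pipeline = Pipeline.create(options);
>     // ... apply BigQueryIO.Write with streaming inserts and run the pipeline ...
>   }
> }
> {code}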
> Reading the description of {{maxStreamingBatchSize}}, I am given the impression
> that the BigQuery sink will ensure that each batch is at or under the configured
> maximum byte size.
> But after [reviewing the code of the internal sink
> transformation|https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryServicesImpl.java#L826],
> I can see that the batching code first adds a row to the batch and only then
> compares the new batch size against the configured maximum.
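> To make that behaviour concrete, here is a simplified sketch of the
> check-after-add pattern. This is not the actual {{BigQueryServicesImpl}} code;
> the limits, the {{flush}} helper, and the use of plain strings as rows are all
> stand-ins for illustration:
> {code:java}
> import java.util.ArrayList;
> import java.util.List;
>
> public class CheckAfterAddSketch {
>   // Hypothetical stand-ins for maxStreamingBatchSize / maxStreamingRowsToBatch.
>   static final long MAX_BATCH_BYTES = 64 * 1024;
>   static final int MAX_BATCH_ROWS = 500;
>
>   static void insertAll(List<String> rows) {
>     List<String> batch = new ArrayList<>();
>     long batchBytes = 0;
>     for (String row : rows) {
>       batch.add(row);               // the row is added to the batch first ...
>       batchBytes += row.length();   // ... its size is counted ...
>       // ... and only afterwards are the limits evaluated, so a single large row
>       // can push the batch well past MAX_BATCH_BYTES before it is flushed.
>       if (batchBytes >= MAX_BATCH_BYTES || batch.size() >= MAX_BATCH_ROWS) {
>         flush(batch);
>         batch = new ArrayList<>();
>         batchBytes = 0;
>       }
>     }
>     if (!batch.isEmpty()) {
>       flush(batch);
>     }
>   }
>
>   static void flush(List<String> batch) {
>     // Placeholder for the streaming-insert request.
>     System.out.println("flushing " + batch.size() + " rows");
>   }
> }
> {code}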
> The description of {{maxStreamingBatchSize}} gives the end user the impression
> that it will protect them from batches that exceed the size limit of the BigQuery
> streaming inserts API. In reality, it can produce a batch that massively exceeds
> the limit, leaving the transformation stuck in a loop of constantly retrying the
> request.
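> For contrast, a check-before-add variant of the same sketch (again only an
> illustration reusing the hypothetical helpers above, not the actual or proposed
> implementation) would evaluate the limits before adding a row and flush the
> current batch first if the row would not fit, so only a single row that is itself
> larger than the limit could produce an oversized request:
> {code:java}
> // Uses the same MAX_BATCH_BYTES, MAX_BATCH_ROWS and flush(...) as the sketch above.
> static void insertAllCheckFirst(List<String> rows) {
>   List<String> batch = new ArrayList<>();
>   long batchBytes = 0;
>   for (String row : rows) {
>     long rowBytes = row.length();
>     // Check the limits before adding the row; flush first if it would not fit.
>     if (!batch.isEmpty()
>         && (batchBytes + rowBytes > MAX_BATCH_BYTES || batch.size() >= MAX_BATCH_ROWS)) {
>       flush(batch);
>       batch = new ArrayList<>();
>       batchBytes = 0;
>     }
>     batch.add(row);
>     batchBytes += rowBytes;
>   }
>   if (!batch.isEmpty()) {
>     flush(batch);
>   }
> }
> {code}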
--
This message was sent by Atlassian Jira
(v8.3.4#803005)