[
https://issues.apache.org/jira/browse/BEAM-11330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17307350#comment-17307350
]
Pablo Estrada commented on BEAM-11330:
--------------------------------------
Thanks [~liamhaworth01] for looking into this and finding it! I think that's a
reasonable observation - I've only noticed this issue once before, so I don't
think it's very common, but it is possible. Do you have time to apply a fix for
it? : )
If not, I can take a look and fix it later.
> BigQueryServicesImpl.insertAll evaluates maxRowBatchSize after a row is added
> to the batch
> ------------------------------------------------------------------------------------------
>
> Key: BEAM-11330
> URL: https://issues.apache.org/jira/browse/BEAM-11330
> Project: Beam
> Issue Type: Bug
> Components: io-java-gcp
> Affects Versions: 2.22.0, 2.23.0, 2.24.0, 2.25.0
> Reporter: Liam Haworth
> Assignee: Pablo Estrada
> Priority: P3
>
> When using the {{BigQueryIO.Write}} transformation, a set of pipeline options
> defined in {{BigQueryOptions}} becomes available to the pipeline.
> Two of these options are:
> * {{maxStreamingRowsToBatch}} - "The maximum number of rows to batch in a
> single streaming insert to BigQuery."
> * {{maxStreamingBatchSize}} - "The maximum byte size of a single streaming
> insert to BigQuery"
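> For reference, both are ordinary pipeline options, so (assuming the usual
> generated setters on {{BigQueryOptions}}; the values below are arbitrary
> examples) they can be tuned when the pipeline is constructed:
> {code:java}
> import org.apache.beam.sdk.Pipeline;
> import org.apache.beam.sdk.io.gcp.bigquery.BigQueryOptions;
> import org.apache.beam.sdk.options.PipelineOptionsFactory;
>
> public class StreamingInsertLimits {
>   public static void main(String[] args) {
>     BigQueryOptions options =
>         PipelineOptionsFactory.fromArgs(args).withValidation().as(BigQueryOptions.class);
>     options.setMaxStreamingRowsToBatch(500L);      // max rows per streaming insert request
>     options.setMaxStreamingBatchSize(64L * 1024L); // max bytes per streaming insert request
>     Pipeline pipeline = Pipeline.create(options);
>     // ... BigQueryIO.Write would be applied here ...
>     pipeline.run().waitUntilFinish();
>   }
> }
> {code}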
> Reading the description of {{maxStreamingBatchSize}}, I am given the
> impression that the BigQuery sink will ensure that each batch is at or under
> the configured maximum byte size.
> But after [reviewing the code of the internal sink
> transformation|https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryServicesImpl.java#L826],
> I can see that the batching code first adds a row to the batch and only then
> compares the new batch size against the configured maximum.
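> In rough form, the pattern is the following. This is only a sketch with
> made-up names ({{flushBatch}}, plain {{String}} rows instead of
> {{TableRow}}), not the actual {{BigQueryServicesImpl}} source:
> {code:java}
> import java.util.ArrayList;
> import java.util.List;
>
> class CheckAfterAddSketch {
>   // Hypothetical stand-in for sending one streaming insertAll request.
>   static void flushBatch(List<String> batch) {
>     System.out.println("flushing " + batch.size() + " rows");
>   }
>
>   static void insertAllSketch(List<String> rows, long maxRowBatchSize, long maxRowsPerBatch) {
>     long dataSize = 0;
>     List<String> batch = new ArrayList<>();
>     for (String row : rows) {
>       batch.add(row);           // the row is appended first
>       dataSize += row.length(); // then the running byte count is updated
>       // the limits are only evaluated after the row is already in the batch,
>       // so a single oversized row pushes dataSize well past maxRowBatchSize
>       if (dataSize >= maxRowBatchSize || batch.size() >= maxRowsPerBatch) {
>         flushBatch(batch);
>         batch = new ArrayList<>();
>         dataSize = 0;
>       }
>     }
>     if (!batch.isEmpty()) {
>       flushBatch(batch);
>     }
>   }
> }
> {code}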
> The description of {{maxStreamingBatchSize}} gives the end user the
> impression that it will protect them from batches that exceed the size limit
> of the BigQuery streaming inserts API.
> In reality, a batch can be produced that massively exceeds the limit, and the
> transformation then gets stuck in a loop of constantly retrying the request.
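> A possible shape of a fix (again only a sketch, reusing the hypothetical
> {{flushBatch}} helper above, not a patch) would be to evaluate the projected
> size before the row is appended, flushing the current batch first whenever
> the new row would push it over either limit:
> {code:java}
> static void insertAllWithPreCheck(List<String> rows, long maxRowBatchSize, long maxRowsPerBatch) {
>   long dataSize = 0;
>   List<String> batch = new ArrayList<>();
>   for (String row : rows) {
>     long rowSize = row.length();
>     // flush before adding if this row would push the batch over either limit
>     if (!batch.isEmpty()
>         && (dataSize + rowSize > maxRowBatchSize || batch.size() + 1 > maxRowsPerBatch)) {
>       flushBatch(batch);
>       batch = new ArrayList<>();
>       dataSize = 0;
>     }
>     batch.add(row);
>     dataSize += rowSize;
>   }
>   if (!batch.isEmpty()) {
>     flushBatch(batch);
>   }
> }
> {code}
> A single row larger than the byte limit would still have to be sent on its
> own (or rejected up front), but a batch that is already close to the limit
> would no longer be pushed past it.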
--
This message was sent by Atlassian Jira
(v8.3.4#803005)