Liam Haworth created BEAM-11330:
-----------------------------------
Summary: BigQueryServicesImpl.insertAll evaluates maxRowBatchSize
after a row is added to the batch
Key: BEAM-11330
URL: https://issues.apache.org/jira/browse/BEAM-11330
Project: Beam
Issue Type: Bug
Components: io-java-gcp
Affects Versions: 2.25.0, 2.24.0, 2.23.0, 2.22.0
Reporter: Liam Haworth
When using the {{BigQueryIO.Write}} transformation, a set of pipeline options
defined in {{BigQueryOptions}} becomes available to the pipeline.
Two of these options are (a short configuration sketch follows this list):
* {{maxStreamingRowsToBatch}} - "The maximum number of rows to batch in a
single streaming insert to BigQuery."
* {{maxStreamingBatchSize}} - "The maximum byte size of a single streaming
insert to BigQuery"
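
For reference, a minimal sketch of setting these two options when building the
pipeline options, assuming the command-line flag names that Beam derives from
the {{BigQueryOptions}} getters (the values shown are arbitrary examples, not
recommendations):

{code:java}
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class StreamingBatchOptionsSketch {
  public static void main(String[] args) {
    // Arbitrary example values: cap streaming insert batches at 500 rows and 64 KiB.
    BigQueryOptions options =
        PipelineOptionsFactory.fromArgs(
                "--maxStreamingRowsToBatch=500",
                "--maxStreamingBatchSize=65536")
            .as(BigQueryOptions.class);

    System.out.println("max rows per insert:  " + options.getMaxStreamingRowsToBatch());
    System.out.println("max bytes per insert: " + options.getMaxStreamingBatchSize());
  }
}
{code}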
Reading the description of {{maxStreamingBatchSize}}, I am given the
impression that the BigQuery sink will ensure that each batch is at, or
under, the configured maximum byte size.
But after [reviewing the code of the internal sink
transformation|https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryServicesImpl.java#L826],
I can see that the batching code first adds a row to the batch and only then
compares the new batch size against the configured maximum.
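
In simplified form, the ordering described above looks roughly like the
following self-contained sketch (a paraphrase for illustration, not the actual
{{BigQueryServicesImpl}} code; the string rows, size accounting and {{flush}}
helper are stand-ins):

{code:java}
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Simplified add-then-check batching: the row joins the batch before the
// size limit is consulted, so a flushed batch can exceed the byte limit by
// up to one row's size.
public class AddThenCheckBatching {
  static final long MAX_BATCH_BYTES = 100;  // stand-in for the byte limit
  static final int MAX_BATCH_ROWS = 3;      // stand-in for the row limit

  public static void main(String[] args) {
    List<String> batch = new ArrayList<>();
    long batchBytes = 0;

    for (String row : new String[] {"small", "x".repeat(90), "y".repeat(95)}) {
      batch.add(row);                                           // row is added first
      batchBytes += row.getBytes(StandardCharsets.UTF_8).length;

      // The limit is only checked after the add, so this flush can carry a
      // batch that is already well past MAX_BATCH_BYTES (190 bytes here).
      if (batchBytes >= MAX_BATCH_BYTES || batch.size() >= MAX_BATCH_ROWS) {
        flush(batch, batchBytes);
        batch = new ArrayList<>();
        batchBytes = 0;
      }
    }
    if (!batch.isEmpty()) {
      flush(batch, batchBytes);
    }
  }

  static void flush(List<String> batch, long bytes) {
    System.out.println("flushing " + batch.size() + " rows, " + bytes + " bytes");
  }
}
{code}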
The description of the option, {{maxStreamingBatchSize}}, gives the end user the
impression that it will protect them from batches that exceed the size
limit of the BigQuery streaming inserts API.
In reality, because the size check only runs after a row has been added, a
single large row can produce a batch that massively exceeds the limit, and the
transformation then gets stuck in a loop of constantly retrying the oversized
request.
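
For comparison, an ordering that checks the size before adding the row would
keep every flushed batch at or under the configured maximum (except when a
single row alone exceeds it). The following is only an illustrative sketch,
not a proposed patch to Beam:

{code:java}
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Check-before-add batching: the current batch is flushed *before* a row that
// would push it past the limit is added, so no flushed batch exceeds
// MAX_BATCH_BYTES unless a single row is itself larger than the limit.
public class CheckBeforeAddBatching {
  static final long MAX_BATCH_BYTES = 100;

  public static void main(String[] args) {
    List<String> batch = new ArrayList<>();
    long batchBytes = 0;

    for (String row : new String[] {"small", "x".repeat(90), "y".repeat(95)}) {
      long rowBytes = row.getBytes(StandardCharsets.UTF_8).length;

      // Flush first if adding this row would exceed the byte limit.
      if (!batch.isEmpty() && batchBytes + rowBytes > MAX_BATCH_BYTES) {
        flush(batch, batchBytes);
        batch = new ArrayList<>();
        batchBytes = 0;
      }
      batch.add(row);
      batchBytes += rowBytes;
    }
    if (!batch.isEmpty()) {
      flush(batch, batchBytes);
    }
  }

  static void flush(List<String> batch, long bytes) {
    System.out.println("flushing " + batch.size() + " rows, " + bytes + " bytes");
  }
}
{code}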
--
This message was sent by Atlassian Jira
(v8.3.4#803005)