[ https://issues.apache.org/jira/browse/BEAM-11277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chun Yang updated BEAM-11277:
-----------------------------
Description:
When multiple load jobs are needed to write data to a destination table, e.g., when the data is spread over more than [10,000|https://cloud.google.com/bigquery/quotas#load_jobs] URIs, WriteToBigQuery in FILE_LOADS mode writes the data into temporary tables and then copies the temporary tables into the destination table.

When WriteToBigQuery is used with {{write_disposition=BigQueryDisposition.WRITE_APPEND}} and {{additional_bq_parameters=\{"schemaUpdateOptions": ["ALLOW_FIELD_ADDITION"]\}}}, the schema update options are not respected by the jobs that copy data from the temporary tables into the destination table. The effect is that schema field addition succeeds for small jobs (<10K source URIs), but once a job scales beyond 10K source URIs, schema field addition fails with an error such as:

{code:none}Provided Schema does not match Table project:dataset.table. Cannot add fields (field: field_name){code}

I've been able to reproduce this issue with Python 3.7 and DataflowRunner on Beam 2.21.0 and Beam 2.25.0. The issue does not manifest when using DirectRunner.

was:
When multiple load jobs are needed to write data to a destination table, e.g., when the data is spread over more than [10,000|https://cloud.google.com/bigquery/quotas#load_jobs] URIs, WriteToBigQuery in FILE_LOADS mode will write data into temporary tables and then copy the temporary tables into the destination table.

When WriteToBigQuery is used with {{write_disposition=BigQueryDisposition.WRITE_APPEND}} and {{additional_bq_parameters=\{"schemaUpdateOptions": ["ALLOW_FIELD_ADDITION"]\}}}, the schema update options are not respected by the jobs that copy data from temporary tables into the destination table. The effect is that for small jobs (<10K source URIs), schema field addition is allowed, however, if the job is scaled to >10K source URIs, then schema field addition will fail with an error such as:

{code:none}Provided Schema does not match Table project:dataset.table. Cannot add fields (field: field_name){code}

I've been able to reproduce this issue with Python 2.7 and DataflowRunner on Beam 2.21.0 and Beam 2.25.0. The issue does not manifest when using DirectRunner.


> WriteToBigQuery with batch file loads does not respect schema update options when there are multiple load jobs
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: BEAM-11277
>                 URL: https://issues.apache.org/jira/browse/BEAM-11277
>             Project: Beam
>          Issue Type: Bug
>          Components: io-py-gcp, runner-dataflow
>    Affects Versions: 2.21.0, 2.25.0
>            Reporter: Chun Yang
>            Priority: P2
>
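For reference, a minimal sketch of the configuration described in the issue (an editor's illustration, not code from the report: the table spec, schema, and input rows are hypothetical placeholders):

{code:python}
# Minimal sketch of the failing configuration (Beam Python SDK).
# "project:dataset.table", the schema, and the input rows are placeholders.
import apache_beam as beam
from apache_beam.io.gcp.bigquery import BigQueryDisposition, WriteToBigQuery

with beam.Pipeline() as p:
    _ = (
        p
        | beam.Create([{"field_name": "value"}])  # rows including a newly added field
        | WriteToBigQuery(
            "project:dataset.table",
            schema="field_name:STRING",
            method=WriteToBigQuery.Method.FILE_LOADS,
            write_disposition=BigQueryDisposition.WRITE_APPEND,
            # Per the report: honored by the load jobs into temporary tables,
            # but not by the copy jobs that move the temporary tables into
            # the destination table.
            additional_bq_parameters={"schemaUpdateOptions": ["ALLOW_FIELD_ADDITION"]},
        )
    )
{code}

Under 10K source URIs, a single load job writes directly to the destination table and the field addition succeeds; beyond that quota, the copy-job stage drops the schemaUpdateOptions and the append fails with the error quoted above.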
--
This message was sent by Atlassian Jira
(v8.3.4#803005)