[ https://issues.apache.org/jira/browse/BEAM-11277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chun Yang updated BEAM-11277:
-----------------------------
Description:
When multiple load jobs are needed to write data to a destination table, e.g., when the data is spread over more than [10,000|https://cloud.google.com/bigquery/quotas#load_jobs] URIs, WriteToBigQuery in FILE_LOADS mode writes the data into temporary tables and then copies the temporary tables into the destination table.

When WriteToBigQuery is used with {{write_disposition=BigQueryDisposition.WRITE_APPEND}} and {{additional_bq_parameters=\{"schemaUpdateOptions": ["ALLOW_FIELD_ADDITION"]\}}}, the schema update options are not respected by the jobs that copy data from the temporary tables into the destination table. The effect is that schema field addition succeeds for small jobs (<10K source URIs), but once a job scales beyond 10K source URIs, schema field addition fails with an error such as:

{code:none}Provided Schema does not match Table project:dataset.table. Cannot add fields (field: field_name){code}

I've been able to reproduce this issue with Python 3.7 and DataflowRunner on Beam 2.21.0 and Beam 2.25.0. The issue does not manifest when using DirectRunner.

was:
When multiple load jobs are needed to write data to a destination table, e.g., when the data is spread over more than [10,000|https://cloud.google.com/bigquery/quotas#load_jobs] URIs, WriteToBigQuery in FILE_LOADS mode will write data into temporary tables and then copy the temporary tables into the destination table.

When WriteToBigQuery is used with {{write_disposition=BigQueryDisposition.WRITE_APPEND}} and {{additional_bq_parameters=\{"schemaUpdateOptions": ["ALLOW_FIELD_ADDITION"]\}}}, the schema update options are not respected by the jobs that copy data from temporary tables into the destination table. The effect is that for small jobs (<10K source URIs), schema field addition is allowed, however, if the job is scaled to >10K source URIs, then schema field addition will fail with an error such as:

{code:none}Provided Schema does not match Table project:dataset.table. Cannot add fields (field: field_name){code}

I've been able to reproduce this issue with Python 2.7 and DataflowRunner on Beam 2.21.0 and Beam 2.25.0. The issue does not manifest when using DirectRunner.


> WriteToBigQuery with batch file loads does not respect schema update options when there are multiple load jobs
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: BEAM-11277
>                 URL: https://issues.apache.org/jira/browse/BEAM-11277
>             Project: Beam
>          Issue Type: Bug
>          Components: io-py-gcp, runner-dataflow
>    Affects Versions: 2.21.0, 2.25.0
>            Reporter: Chun Yang
>            Priority: P2
>
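For reference, a minimal sketch of the configuration described in the issue (an editor's illustration, not code from the report: the table spec, schema, and input rows are hypothetical placeholders):

{code:python}
# Minimal sketch of the failing configuration (Beam Python SDK).
# "project:dataset.table", the schema, and the input rows are placeholders.
import apache_beam as beam
from apache_beam.io.gcp.bigquery import BigQueryDisposition, WriteToBigQuery

with beam.Pipeline() as p:
    _ = (
        p
        | beam.Create([{"field_name": "value"}])  # rows including a newly added field
        | WriteToBigQuery(
            "project:dataset.table",
            schema="field_name:STRING",
            method=WriteToBigQuery.Method.FILE_LOADS,
            write_disposition=BigQueryDisposition.WRITE_APPEND,
            # Per the report: honored by the load jobs into temporary tables,
            # but not by the copy jobs that move the temporary tables into
            # the destination table.
            additional_bq_parameters={"schemaUpdateOptions": ["ALLOW_FIELD_ADDITION"]},
        )
    )
{code}

Under 10K source URIs, a single load job writes directly to the destination table and the field addition succeeds; beyond that quota, the copy-job stage drops the schemaUpdateOptions and the append fails with the error quoted above.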
--
This message was sent by Atlassian Jira
(v8.3.4#803005)