[ 
https://issues.apache.org/jira/browse/BEAM-10826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17548905#comment-17548905
 ] 

Danny McCormick commented on BEAM-10826:
----------------------------------------

This issue has been migrated to https://github.com/apache/beam/issues/20515

> Expose BigQuery schema autodetect in Java SDK
> ---------------------------------------------
>
>                 Key: BEAM-10826
>                 URL: https://issues.apache.org/jira/browse/BEAM-10826
>             Project: Beam
>          Issue Type: Improvement
>          Components: io-java-gcp
>            Reporter: Marcus Truscello
>            Priority: P3
>              Labels: Clarified, bigquery, schema
>
> The Beam Java SDK's BigQueryIO transform currently doesn't expose the [schema 
> autodetect job 
> configuration|https://developers.google.com/resources/api-libraries/documentation/bigquery/v2/java/latest/com/google/api/services/bigquery/model/JobConfigurationLoad.html#setAutodetect-java.lang.Boolean-].
>   The feature is exposed by the current [Python 
> SDK|https://github.com/apache/beam/blob/a99c6826a067f49ebb60e625c8652900c7d0e810/sdks/python/apache_beam/io/gcp/bigquery.py#L1593],
>  but not the Java SDK.
> Although Java is stricter about types and schemas, the BigQueryIO 
> transform supports writing TableRows, which don't inherently have a schema. 
> This provides a convenient path for loading JSON data into BigQuery, but it 
> is severely hampered by the fact that a schema is required to make use of the 
> SchemaUpdateOption values ALLOW_FIELD_ADDITION and ALLOW_FIELD_RELAXATION.
> The BigQuery schema autodetection feature must be enabled at the 
> JobConfigurationLoad level. BigQueryIO creates the JobConfigurationLoad 
> in only one place: 
> [WriteTables.java|https://github.com/apache/beam/blob/752bdfd09bc4175dd9f51a096f81c9e5b0805913/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/WriteTables.java#L370].
>  Exposing the autodetection option would mean adding it here, then 
> propagating the change upwards until it's exposed at the BigQueryIO.Write 
> level.
> A bit of context on this issue:
>  * Google Cloud's blog [has an 
> article|https://cloud.google.com/blog/products/gcp/how-to-handle-mutating-json-schemas-in-a-streaming-pipeline-with-square-enix]
>  on handling mutating JSON schemas in Dataflow using a black-box "Validate 
> and Mutate BQ Schema" step.
>  * Suggested workarounds include creating a stateful DoFn to dynamically 
> generate a schema, loading it as a side input to create a PCollectionView, 
> then passing it to BigQueryIO using withSchemaFromView: 
> https://stackoverflow.com/a/58809875/477563
>  * [Entire projects|https://github.com/the-dagger/dataflow-dynamic-schema] 
> have been created to try and work around this issue.
> All of the above would be rendered moot (and many headaches spared!) if only 
> the schema autodetection were exposed in the Java SDK _like it already is_ in 
> the Python SDK.
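
As a rough illustration of what the issue asks for, the sketch below models the "load" section of a BigQuery job configuration as a plain Java map, using the REST field names autodetect, schemaUpdateOptions, sourceFormat and writeDisposition. The class LoadConfigSketch and its build method are hypothetical and are not part of Beam or the BigQuery client library. In Beam the real object is the JobConfigurationLoad built in WriteTables.java, where the equivalent step would presumably be a call to the setAutodetect setter linked in the issue.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Beam or the
// BigQuery client library.
final class LoadConfigSketch {

  // Builds a map shaped like the "load" section of a BigQuery job
  // configuration. No explicit schema is set: with autodetect enabled,
  // BigQuery infers the schema from the JSON being loaded.
  static Map<String, Object> build(boolean autodetect) {
    Map<String, Object> load = new LinkedHashMap<>();
    load.put("sourceFormat", "NEWLINE_DELIMITED_JSON");
    // Appending to an existing table is the case in which the schema
    // update options from the issue are relevant.
    load.put("writeDisposition", "WRITE_APPEND");
    load.put(
        "schemaUpdateOptions",
        Arrays.asList("ALLOW_FIELD_ADDITION", "ALLOW_FIELD_RELAXATION"));
    if (autodetect) {
      // The flag the issue asks BigQueryIO.Write to expose.
      load.put("autodetect", Boolean.TRUE);
    }
    return load;
  }

  public static void main(String[] args) {
    System.out.println(build(true));
  }
}
```

The flag is only added when requested, so pipelines that never opt in would produce the same load configuration as before. How the option would be named and threaded up to BigQueryIO.Write is left open here, as it is in the issue.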



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
