[ https://issues.apache.org/jira/browse/BEAM-10826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17548905#comment-17548905 ]
Danny McCormick commented on BEAM-10826:
----------------------------------------

This issue has been migrated to https://github.com/apache/beam/issues/20515

> Expose BigQuery schema autodetect in Java SDK
> ---------------------------------------------
>
>                 Key: BEAM-10826
>                 URL: https://issues.apache.org/jira/browse/BEAM-10826
>             Project: Beam
>          Issue Type: Improvement
>          Components: io-java-gcp
>            Reporter: Marcus Truscello
>            Priority: P3
>              Labels: Clarified, bigquery, schema
>
> The Beam Java SDK's BigQueryIO transform currently doesn't expose the
> [schema autodetect job configuration|https://developers.google.com/resources/api-libraries/documentation/bigquery/v2/java/latest/com/google/api/services/bigquery/model/JobConfigurationLoad.html#setAutodetect-java.lang.Boolean-].
> The feature is exposed by the current
> [Python SDK|https://github.com/apache/beam/blob/a99c6826a067f49ebb60e625c8652900c7d0e810/sdks/python/apache_beam/io/gcp/bigquery.py#L1593],
> but not the Java SDK.
>
> Although Java is more strict about types and schemas, the BigQueryIO
> transform supports writing TableRows which don't inherently have a schema.
> This provides a convenient path for loading JSON data into BigQuery but is
> massively thwarted by the fact that a schema is required to make use of the
> SchemaUpdateOption values ALLOW_FIELD_ADDITION and ALLOW_FIELD_RELAXATION.
>
> The BigQuery schema autodetection feature must be enabled at the
> JobConfigurationLoad level. The BigQueryIO creates the JobConfigurationLoad
> in only one place:
> [WriteTables.java|https://github.com/apache/beam/blob/752bdfd09bc4175dd9f51a096f81c9e5b0805913/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/WriteTables.java#L370].
> Exposing the autodetection option would mean adding it here, then
> propagating the change upwards until it's exposed at the BigQueryIO.Write
> level.
>
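
The plumbing described above (set the flag where the load configuration is built, then surface it on the user-facing write transform) could look roughly like the sketch below. It is only an illustration under stated assumptions: `Write`, `LoadConfig`, and `withAutodetect` are hypothetical stand-ins defined inside the sketch, not Beam or BigQuery client classes, and the real change would touch more layers between BigQueryIO.Write and WriteTables.

```java
import java.util.Objects;

/** Self-contained sketch; none of these types are Beam or BigQuery classes. */
public class AutodetectPlumbingSketch {

    /** Hypothetical stand-in for the load-job configuration built in WriteTables. */
    static final class LoadConfig {
        private final String table;
        private final Boolean autodetect; // null means the option was never set

        LoadConfig(String table, Boolean autodetect) {
            this.table = Objects.requireNonNull(table);
            this.autodetect = autodetect;
        }

        String getTable() { return table; }

        Boolean getAutodetect() { return autodetect; }
    }

    /** Hypothetical stand-in for the user-facing write transform. */
    static final class Write {
        private final String table;
        private final Boolean autodetect;

        private Write(String table, Boolean autodetect) {
            this.table = table;
            this.autodetect = autodetect;
        }

        static Write to(String table) {
            return new Write(table, null);
        }

        /** Hypothetical option: returns a modified copy, builder-style. */
        Write withAutodetect(boolean enabled) {
            return new Write(table, enabled);
        }

        /** The single place where the load configuration is created. */
        LoadConfig buildLoadConfig() {
            return new LoadConfig(table, autodetect);
        }
    }

    public static void main(String[] args) {
        LoadConfig byDefault = Write.to("project:dataset.table").buildLoadConfig();
        LoadConfig enabled =
            Write.to("project:dataset.table").withAutodetect(true).buildLoadConfig();

        System.out.println("default autodetect = " + byDefault.getAutodetect());
        System.out.println("enabled autodetect = " + enabled.getAutodetect());
    }
}
```

Leaving the flag as a nullable `Boolean` in the sketch keeps "not set" distinct from an explicit `false`, so existing pipelines that never call the option would build the same configuration as before.
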
> A bit of context on this issue:
> * Google Cloud's blog
> [has an article|https://cloud.google.com/blog/products/gcp/how-to-handle-mutating-json-schemas-in-a-streaming-pipeline-with-square-enix]
> on handling mutating JSON schemas in Dataflow using a black-box "Validate
> and Mutate BQ Schema" step.
> * Suggested workarounds include creating a stateful DoFn to dynamically
> generate a schema, loading it as a side input to create a PCollectionView,
> then passing it to BigQueryIO using withSchemaFromView:
> https://stackoverflow.com/a/58809875/477563
> * [Entire projects|https://github.com/the-dagger/dataflow-dynamic-schema]
> have been created to try and work around this issue.
>
> All of the above would be rendered moot (and many headaches spared!) if only
> the schema autodetection were exposed in the Java SDK _like it already is_ in
> the Python SDK.

--
This message was sent by Atlassian Jira
(v8.20.7#820007)