[
https://issues.apache.org/jira/browse/BEAM-10826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Beam JIRA Bot updated BEAM-10826:
---------------------------------
Labels: Clarified bigquery schema stale-P2 (was: Clarified bigquery schema)
> Expose BigQuery schema autodetect in Java SDK
> ---------------------------------------------
>
> Key: BEAM-10826
> URL: https://issues.apache.org/jira/browse/BEAM-10826
> Project: Beam
> Issue Type: Improvement
> Components: io-java-gcp
> Reporter: Marcus Truscello
> Priority: P2
> Labels: Clarified, bigquery, schema, stale-P2
>
> The Beam Java SDK's BigQueryIO transform currently doesn't expose the [schema
> autodetect job
> configuration|https://developers.google.com/resources/api-libraries/documentation/bigquery/v2/java/latest/com/google/api/services/bigquery/model/JobConfigurationLoad.html#setAutodetect-java.lang.Boolean-].
> The feature is exposed by the current [Python
> SDK|https://github.com/apache/beam/blob/a99c6826a067f49ebb60e625c8652900c7d0e810/sdks/python/apache_beam/io/gcp/bigquery.py#L1593],
> but not the Java SDK.
> Although Java is stricter about types and schemas, the BigQueryIO transform
> supports writing TableRows, which don't inherently have a schema. This
> provides a convenient path for loading JSON data into BigQuery, but it is
> severely limited because a schema is required to use the SchemaUpdateOption
> values ALLOW_FIELD_ADDITION and ALLOW_FIELD_RELAXATION.
> The BigQuery schema autodetection feature must be enabled at the
> JobConfigurationLoad level. BigQueryIO creates the JobConfigurationLoad
> in only one place:
> [WriteTables.java|https://github.com/apache/beam/blob/752bdfd09bc4175dd9f51a096f81c9e5b0805913/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/WriteTables.java#L370].
> Exposing the autodetection option would mean adding it here, then
> propagating the change upwards until it's exposed at the BigQueryIO.Write
> level.
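The propagation described above could look roughly like the following. This is a minimal sketch under stated assumptions, not Beam code: LoadJobConfigSketch, WriteTablesSketch, WriteSketch and the withAutodetect option name are all hypothetical stand-ins. The only real call it mirrors is JobConfigurationLoad.setAutodetect(Boolean), linked in the description; how Beam would actually name or thread the option is not known from this issue.

```java
/** Hypothetical stand-in for JobConfigurationLoad; only the autodetect flag is modeled. */
final class LoadJobConfigSketch {
    // Nullable, like the Boolean parameter of the real setAutodetect: null means "not set".
    private Boolean autodetect;

    LoadJobConfigSketch setAutodetect(Boolean autodetect) {
        this.autodetect = autodetect;
        return this;
    }

    Boolean getAutodetect() {
        return autodetect;
    }
}

/** Hypothetical stand-in for the internal step that builds the load job (WriteTables). */
final class WriteTablesSketch {
    private final boolean useAutodetect;

    WriteTablesSketch(boolean useAutodetect) {
        this.useAutodetect = useAutodetect;
    }

    LoadJobConfigSketch buildLoadConfig() {
        LoadJobConfigSketch config = new LoadJobConfigSketch();
        if (useAutodetect) {
            // The one place where the flag reaches the job configuration.
            config.setAutodetect(true);
        }
        return config;
    }
}

/** Hypothetical stand-in for the user-facing BigQueryIO.Write builder. */
final class WriteSketch {
    private final boolean autodetect;

    private WriteSketch(boolean autodetect) {
        this.autodetect = autodetect;
    }

    static WriteSketch create() {
        return new WriteSketch(false);
    }

    /** Hypothetical option name; returns a new immutable builder with the flag on. */
    WriteSketch withAutodetect() {
        return new WriteSketch(true);
    }

    /** Passes the user's choice down to the step that creates the load job. */
    WriteTablesSketch expand() {
        return new WriteTablesSketch(autodetect);
    }
}
```

In this sketch the flag stays unset unless the user opts in, so existing pipelines would keep their current behavior: WriteSketch.create().withAutodetect().expand().buildLoadConfig() yields a configuration with autodetect set to true, while omitting withAutodetect() leaves it null.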
> A bit of context on this issue:
> * Google Cloud's blog [has an
> article|https://cloud.google.com/blog/products/gcp/how-to-handle-mutating-json-schemas-in-a-streaming-pipeline-with-square-enix]
> on handling mutating JSON schemas in Dataflow using a black-box "Validate
> and Mutate BQ Schema" step.
> * Suggested workarounds include creating a stateful DoFn to dynamically
> generate a schema, loading it as a side input to create a PCollectionView, then
> passing it to BigQueryIO using withSchemaFromView:
> https://stackoverflow.com/a/58809875/477563
> * [Entire projects|https://github.com/the-dagger/dataflow-dynamic-schema]
> have been created to try and work around this issue.
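To give a sense of what the "dynamically generate a schema" workaround involves, here is a rough, Beam-free sketch of just the schema-merging step. Everything in it is an assumption for illustration: SchemaInferenceSketch and its methods are hypothetical, the type mapping is deliberately crude, and the stateful DoFn and side-input wiring are omitted entirely. The JSON it produces follows the TableSchema layout (a "fields" list of name/type/mode entries), which is the kind of string a schema view passed to withSchemaFromView would carry.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: grows a field-name-to-type map as new rows are seen. */
final class SchemaInferenceSketch {

    /** Crude mapping from a Java value to a BigQuery type name; anything else becomes STRING. */
    static String typeOf(Object value) {
        if (value instanceof Boolean) {
            return "BOOLEAN";
        }
        if (value instanceof Integer || value instanceof Long) {
            return "INTEGER";
        }
        if (value instanceof Number) {
            return "FLOAT";
        }
        return "STRING";
    }

    /** Returns a copy of the schema with any fields from the row that were not yet known. */
    static Map<String, String> mergeRow(Map<String, String> schema, Map<String, ?> row) {
        Map<String, String> merged = new LinkedHashMap<>(schema);
        for (Map.Entry<String, ?> entry : row.entrySet()) {
            // Existing fields keep their first-seen type; only new fields are added.
            merged.putIfAbsent(entry.getKey(), typeOf(entry.getValue()));
        }
        return merged;
    }

    /** Serializes the schema as TableSchema-style JSON; assumes field names need no escaping. */
    static String toJson(Map<String, String> schema) {
        StringBuilder json = new StringBuilder("{\"fields\":[");
        String separator = "";
        for (Map.Entry<String, String> field : schema.entrySet()) {
            json.append(separator)
                .append("{\"name\":\"").append(field.getKey())
                .append("\",\"type\":\"").append(field.getValue())
                .append("\",\"mode\":\"NULLABLE\"}");
            separator = ",";
        }
        return json.append("]}").toString();
    }
}
```

Even this simplified version has to make decisions (type conflicts, nested records, field ordering) that BigQuery's own autodetection would otherwise handle, which is the overhead the issue is asking to avoid.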
> All of the above would be rendered moot (and many headaches spared!) if only
> the schema autodetection were exposed in the Java SDK _like it already is_ in
> the Python SDK.
--
This message was sent by Atlassian Jira
(v8.20.1#820001)