Marcus Truscello created BEAM-10826:
---------------------------------------

             Summary: Expose BigQuery schema autodetect in Java SDK
                 Key: BEAM-10826
                 URL: https://issues.apache.org/jira/browse/BEAM-10826
             Project: Beam
          Issue Type: Improvement
          Components: sdk-java-core
            Reporter: Marcus Truscello


The Beam Java SDK's BigQueryIO transform currently doesn't expose the [schema 
autodetect job 
configuration|https://developers.google.com/resources/api-libraries/documentation/bigquery/v2/java/latest/com/google/api/services/bigquery/model/JobConfigurationLoad.html#setAutodetect-java.lang.Boolean-].
  The feature is exposed by the current [Python 
SDK|https://github.com/apache/beam/blob/a99c6826a067f49ebb60e625c8652900c7d0e810/sdks/python/apache_beam/io/gcp/bigquery.py#L1593],
 but not the Java SDK.

Although Java is stricter about types and schemas, the BigQueryIO transform 
supports writing TableRows, which don't inherently have a schema. This provides 
a convenient path for loading JSON data into BigQuery, but it is undermined 
by the fact that a schema is required to make use of the SchemaUpdateOption 
values ALLOW_FIELD_ADDITION and ALLOW_FIELD_RELAXATION.

The BigQuery schema autodetection feature must be enabled at the 
JobConfigurationLoad level. BigQueryIO creates the JobConfigurationLoad in 
only one place: 
[WriteTables.java|https://github.com/apache/beam/blob/752bdfd09bc4175dd9f51a096f81c9e5b0805913/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/WriteTables.java#L370].
 Exposing the autodetection option would mean adding it here, then propagating 
the change upwards until it's exposed at the BigQueryIO.Write level.
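
To illustrate the plumbing described above, here is a minimal, self-contained 
sketch. It does not use Beam or the BigQuery client library: SketchWrite, 
SketchWriteTables and SketchLoadConfig are hypothetical stand-ins for 
BigQueryIO.Write, WriteTables and JobConfigurationLoad, and withAutodetect is 
an assumed method name, not an existing Beam API. Only setAutodetect mirrors a 
real call (the JobConfigurationLoad setter linked above).

```java
/** Demo entry point. Every class in this file is a hypothetical stand-in, not Beam code. */
final class AutodetectSketch {
  public static void main(String[] args) {
    // Default: the flag is never set on the load configuration.
    SketchLoadConfig off = SketchWrite.create().toWriteTables().buildLoadConfig();
    // Opt-in: the flag travels from the user-facing builder down to the load configuration.
    SketchLoadConfig on =
        SketchWrite.create().withAutodetect(true).toWriteTables().buildLoadConfig();
    System.out.println("default autodetect: " + off.getAutodetect());
    System.out.println("enabled autodetect: " + on.getAutodetect());
  }
}

/** Stand-in for JobConfigurationLoad; only the autodetect field is modeled. */
final class SketchLoadConfig {
  private Boolean autodetect; // null means "not set"

  SketchLoadConfig setAutodetect(Boolean autodetect) {
    this.autodetect = autodetect;
    return this;
  }

  Boolean getAutodetect() {
    return autodetect;
  }
}

/** Stand-in for WriteTables, the single place where the load configuration is built. */
final class SketchWriteTables {
  private final boolean autodetect;

  SketchWriteTables(boolean autodetect) {
    this.autodetect = autodetect;
  }

  SketchLoadConfig buildLoadConfig() {
    SketchLoadConfig config = new SketchLoadConfig();
    if (autodetect) {
      // Only touch the field when the user opted in, so the default stays unchanged.
      config.setAutodetect(true);
    }
    return config;
  }
}

/** Stand-in for BigQueryIO.Write: an immutable builder that carries the flag downwards. */
final class SketchWrite {
  private final boolean autodetect;

  private SketchWrite(boolean autodetect) {
    this.autodetect = autodetect;
  }

  static SketchWrite create() {
    return new SketchWrite(false);
  }

  /** Hypothetical user-facing option; returns a new builder rather than mutating this one. */
  SketchWrite withAutodetect(boolean autodetect) {
    return new SketchWrite(autodetect);
  }

  SketchWriteTables toWriteTables() {
    return new SketchWriteTables(autodetect);
  }
}
```

Leaving the field unset unless the user opts in keeps existing pipelines 
unaffected; the real change would presumably also need to thread the flag 
through whatever intermediate transforms sit between BigQueryIO.Write and 
WriteTables.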

A bit of context on this issue:
 * Google Cloud's blog [has an 
article|https://cloud.google.com/blog/products/gcp/how-to-handle-mutating-json-schemas-in-a-streaming-pipeline-with-square-enix]
 on handling mutating JSON schemas in Dataflow using a black-box "Validate and 
Mutate BQ Schema" step.
 * Suggested workarounds include creating a stateful DoFn to dynamically 
generate a schema, loading it as a side input to create a PCollectionView, then 
passing it to BigQueryIO using withSchemaFromView: 
https://stackoverflow.com/a/58809875/477563
 * [Entire projects|https://github.com/the-dagger/dataflow-dynamic-schema] have 
been created to try and work around this issue.

All of the above would be rendered moot (and many headaches spared!) if only 
the schema autodetection were exposed in the Java SDK _like it already is_ in 
the Python SDK.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
