[GitHub] [beam] damccorm opened a new issue, #20515: Expose BigQuery schema autodetect in Java SDK


damccorm opened a new issue, #20515:
URL: https://github.com/apache/beam/issues/20515

The Beam Java SDK's BigQueryIO transform currently doesn't expose the
[schema autodetect job
configuration](https://developers.google.com/resources/api-libraries/documentation/bigquery/v2/java/latest/com/google/api/services/bigquery/model/JobConfigurationLoad.html#setAutodetect-java.lang.Boolean-).
The feature is exposed by the current [Python
SDK](https://github.com/apache/beam/blob/a99c6826a067f49ebb60e625c8652900c7d0e810/sdks/python/apache_beam/io/gcp/bigquery.py#L1593),
but not the Java SDK.

Although Java is stricter about types and schemas, the BigQueryIO
transform supports writing TableRows, which don't inherently have a schema. This
provides a convenient path for loading JSON data into BigQuery, but it is
severely hampered by the fact that a schema is required to make use of the
SchemaUpdateOption values ALLOW_FIELD_ADDITION and ALLOW_FIELD_RELAXATION.

The BigQuery schema autodetection feature must be enabled at the
JobConfigurationLoad level. BigQueryIO creates the JobConfigurationLoad in
only one place:
[WriteTables.java](https://github.com/apache/beam/blob/752bdfd09bc4175dd9f51a096f81c9e5b0805913/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/WriteTables.java#L370).
Exposing the autodetection option would mean adding it here, then propagating
the change upwards until it's exposed at the BigQueryIO.Write level.
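
To make the shape of that change concrete, here is a minimal, dependency-free sketch of the propagation. Everything in it is a hypothetical stand-in: `LoadConfig` mimics only the `setAutodetect(Boolean)` setter of JobConfigurationLoad, `WriteTablesStep` stands in for WriteTables, and `Write.withAutodetect()` is an assumed method name that does not exist in Beam today. It only illustrates threading one flag from the user-facing builder down to the place where the load configuration is built.

```java
public class AutodetectPropagationSketch {

    /** Hypothetical stand-in for the JobConfigurationLoad model class. */
    static final class LoadConfig {
        // null means "not set", matching the Boolean-typed setter on the real class.
        private Boolean autodetect;

        LoadConfig setAutodetect(Boolean autodetect) {
            this.autodetect = autodetect;
            return this;
        }

        Boolean getAutodetect() {
            return autodetect;
        }
    }

    /** Hypothetical stand-in for the internal step that builds the load job config. */
    static final class WriteTablesStep {
        private final boolean useAutodetect;

        WriteTablesStep(boolean useAutodetect) {
            this.useAutodetect = useAutodetect;
        }

        LoadConfig buildLoadConfig() {
            LoadConfig config = new LoadConfig();
            // Only touch the field when the user opted in, so the default is unchanged.
            if (useAutodetect) {
                config.setAutodetect(true);
            }
            return config;
        }
    }

    /** Hypothetical stand-in for the user-facing, immutable write builder. */
    static final class Write {
        private final boolean autodetect;

        private Write(boolean autodetect) {
            this.autodetect = autodetect;
        }

        static Write create() {
            return new Write(false);
        }

        // Assumed method name; returns a new builder instead of mutating this one.
        Write withAutodetect() {
            return new Write(true);
        }

        // Passes the flag down to the internal step, as the issue proposes.
        WriteTablesStep expand() {
            return new WriteTablesStep(autodetect);
        }
    }

    public static void main(String[] args) {
        LoadConfig defaults = Write.create().expand().buildLoadConfig();
        LoadConfig detected = Write.create().withAutodetect().expand().buildLoadConfig();
        System.out.println("default autodetect: " + defaults.getAutodetect());
        System.out.println("enabled autodetect: " + detected.getAutodetect());
    }
}
```

Leaving the field unset by default keeps existing pipelines on their current behavior; only callers that opt in would see autodetect set on the load job.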

A bit of context on this issue:
* Google Cloud's blog [has an
article](https://cloud.google.com/blog/products/gcp/how-to-handle-mutating-json-schemas-in-a-streaming-pipeline-with-square-enix)
on handling mutating JSON schemas in Dataflow using a black-box "Validate and
Mutate BQ Schema" step.
* Suggested workarounds include creating a stateful DoFn to dynamically
generate a schema, loading it as a side input to create a PCollectionView, and
then passing it to BigQueryIO using withSchemaFromView:
https://stackoverflow.com/a/58809875/477563
* [Entire projects](https://github.com/the-dagger/dataflow-dynamic-schema)
have been created to try and work around this issue.

All of the above would be rendered moot (and many headaches spared!) if only
the schema autodetection were exposed in the Java SDK _like it already is_ in
the Python SDK.

Imported from Jira
[BEAM-10826](https://issues.apache.org/jira/browse/BEAM-10826). Original Jira
may contain additional context.
Reported by: mtruscello.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.