Andrea Pierleoni created BEAM-2595:
--------------------------------------
Summary: WriteToBigQuery does not work with nested json schema
Key: BEAM-2595
URL: https://issues.apache.org/jira/browse/BEAM-2595
Project: Beam
Issue Type: Bug
Components: runner-dataflow
Affects Versions: 2.1.0
Environment: mac os local runner, Python
Reporter: Andrea Pierleoni
Assignee: Thomas Groh
Priority: Minor
I am trying to use the new `WriteToBigQuery` PTransform added to
`apache_beam.io.gcp.bigquery` in version 2.1.0-RC1.
I need to write to a BigQuery table with nested fields.
The only way to specify a nested schema in BigQuery is with the JSON schema.
None of the classes in `apache_beam.io.gcp.bigquery` can parse the JSON
schema; they only accept a schema as an instance of the class
`apache_beam.io.gcp.internal.clients.bigquery.TableFieldSchema`.
I am composing the `TableFieldSchema` as suggested
[here](https://stackoverflow.com/questions/36127537/json-table-schema-to-bigquery-tableschema-for-bigquerysink/45039436#45039436),
and it looks fine when passed to the PTransform `WriteToBigQuery`.
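The conversion suggested in the linked answer can be sketched without the Beam dependency; the `TableFieldSchema` class below is a simplified stand-in for `apache_beam.io.gcp.internal.clients.bigquery.TableFieldSchema`, and the field names in the sample schema are made up for illustration:

```python
import json


class TableFieldSchema:
    """Simplified stand-in for the Beam-internal TableFieldSchema message."""

    def __init__(self):
        self.name = None
        self.type = None
        self.mode = None
        self.fields = []


def parse_field(spec):
    """Recursively convert one JSON field spec into a TableFieldSchema."""
    field = TableFieldSchema()
    field.name = spec['name']
    field.type = spec['type']
    field.mode = spec.get('mode', 'NULLABLE')
    if field.type == 'RECORD':
        # Nested fields only appear on RECORD columns; recurse into them.
        field.fields = [parse_field(f) for f in spec.get('fields', [])]
    return field


json_schema = '''
[
  {"name": "user", "type": "RECORD", "mode": "REQUIRED", "fields": [
    {"name": "id", "type": "INTEGER", "mode": "REQUIRED"},
    {"name": "email", "type": "STRING"}
  ]}
]
'''

fields = [parse_field(f) for f in json.loads(json_schema)]
print(fields[0].fields[1].name)  # email
```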
The problem is that the base class `PTransformWithSideInputs` tries to [pickle
and unpickle the
function](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/ptransform.py#L515)
(which includes the TableFieldSchema instance), and for some reason, when the
class is unpickled, some `FieldList` instances are converted to plain lists and
the pickling validation fails.
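A direct pickle round trip is a quick way to check whether the container type survives the serialization the transform performs; the `FieldList` subclass here is a hypothetical stand-in for the protorpc field container used on `TableFieldSchema.fields`, so it does not reproduce the failure itself, only the shape of the check:

```python
import pickle


class FieldList(list):
    """Hypothetical stand-in for the protorpc FieldList container."""


original = FieldList(['a', 'b'])
# Mimic what the transform does: serialize, then deserialize.
restored = pickle.loads(pickle.dumps(original))

# If the container type were lost in the round trip (as reported for the
# real FieldList), the type name printed here would be 'list' instead.
print(type(restored).__name__)
```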
Would it be possible to extend the test coverage to nested JSON schemas for
BigQuery?
They are also relatively easy to parse into a `TableFieldSchema`.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)