[jira] [Commented] (BEAM-2595) WriteToBigQuery does not work with nested json schema
[ https://issues.apache.org/jira/browse/BEAM-2595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16088222#comment-16088222 ]

ASF GitHub Bot commented on BEAM-2595:
--------------------------------------

GitHub user sb2nov opened a pull request:

    https://github.com/apache/beam/pull/3563

    [BEAM-2595] Allow table schema objects in BQ DoFn

    Be sure to do all of the following to help us incorporate your contribution quickly and easily:
     - [ ] Make sure the PR title is formatted like: `[BEAM-<Jira issue #>] Description of pull request`
     - [ ] Make sure tests pass via `mvn clean verify`.
     - [ ] Replace `<Jira issue #>` in the title with the actual Jira issue number, if there is one.
     - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).

    ---

    Cherry pick from master for BEAM-2535

    R: @aaltay
    cc @jbonofre

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sb2nov/beam BEAM-2595-cp

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/beam/pull/3563.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #3563

----
commit ada4733b02bc38b1ef619fb991c068822a917595
Author: Sourabh Bajaj
Date:   2017-07-13T19:02:31Z

    [BEAM-2595] Allow table schema objects in BQ DoFn


> WriteToBigQuery does not work with nested json schema
> -----------------------------------------------------
>
>                 Key: BEAM-2595
>                 URL: https://issues.apache.org/jira/browse/BEAM-2595
>             Project: Beam
>          Issue Type: Bug
>          Components: sdk-py
>    Affects Versions: 2.1.0
>         Environment: mac os local runner, Python
>            Reporter: Andrea Pierleoni
>            Assignee: Sourabh Bajaj
>            Priority: Minor
>              Labels: gcp
>             Fix For: 2.1.0
>
> I am trying to use the new `WriteToBigQuery` PTransform added to
> `apache_beam.io.gcp.bigquery` in version 2.1.0-RC1.
> I need to write to a BigQuery table with nested fields, and the only way to
> specify nested schemas in BigQuery is with the JSON schema.
> None of the classes in `apache_beam.io.gcp.bigquery` are able to parse the
> JSON schema, but they accept a schema as an instance of the class
> `apache_beam.io.gcp.internal.clients.bigquery.TableFieldSchema`.
> I am composing the `TableFieldSchema` as suggested here
> [https://stackoverflow.com/questions/36127537/json-table-schema-to-bigquery-tableschema-for-bigquerysink/45039436#45039436],
> and it looks fine when passed to the PTransform `WriteToBigQuery`.
> The problem is that the base class `PTransformWithSideInputs` tries to pickle
> and unpickle the function
> [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/ptransform.py#L515]
> (which includes the TableFieldSchema instance), and for some reason when the
> class is unpickled some `FieldList` instances are converted to plain lists,
> so the pickling validation fails.
> Would it be possible to extend the test coverage to nested JSON objects for
> BigQuery? They are also relatively easy to parse into a TableFieldSchema.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
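The "relatively easy to parse" claim in the report can be illustrated with a short sketch. A hedged, self-contained example (the `FieldSchema` dataclass below is a stand-in for `apache_beam.io.gcp.internal.clients.bigquery.TableFieldSchema`, used here only so the snippet runs without the Beam SDK installed):

```python
import json
from dataclasses import dataclass, field
from typing import List

# Stand-in for bigquery.TableFieldSchema (hypothetical, for illustration only).
@dataclass
class FieldSchema:
    name: str
    type: str
    mode: str = 'NULLABLE'
    fields: List['FieldSchema'] = field(default_factory=list)

def parse_fields(json_fields):
    """Recursively build schema objects, descending into nested RECORD fields."""
    return [
        FieldSchema(
            name=f['name'],
            type=f['type'],
            mode=f.get('mode', 'NULLABLE'),
            fields=parse_fields(f.get('fields', [])),
        )
        for f in json_fields
    ]

# A nested BigQuery JSON schema: a repeated RECORD with one leaf field.
schema = parse_fields(json.loads("""
[
  {"name": "id", "type": "STRING", "mode": "REQUIRED"},
  {"name": "authors", "type": "RECORD", "mode": "REPEATED",
   "fields": [{"name": "full_name", "type": "STRING"}]}
]
"""))
```

The same recursion, with `TableFieldSchema(...)` in place of the dataclass, is what the linked Stack Overflow answer describes.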
[jira] [Commented] (BEAM-2595) WriteToBigQuery does not work with nested json schema
[ https://issues.apache.org/jira/browse/BEAM-2595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16087712#comment-16087712 ]

Sourabh Bajaj commented on BEAM-2595:
-------------------------------------

[~andrea.pierleoni] Can you verify that your pipeline works with the latest master?
[jira] [Commented] (BEAM-2595) WriteToBigQuery does not work with nested json schema
[ https://issues.apache.org/jira/browse/BEAM-2595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16086384#comment-16086384 ]

ASF GitHub Bot commented on BEAM-2595:
--------------------------------------

Github user asfgit closed the pull request at:

    https://github.com/apache/beam/pull/3556
[jira] [Commented] (BEAM-2595) WriteToBigQuery does not work with nested json schema
[ https://issues.apache.org/jira/browse/BEAM-2595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16085278#comment-16085278 ]

ASF GitHub Bot commented on BEAM-2595:
--------------------------------------

GitHub user sb2nov opened a pull request:

    https://github.com/apache/beam/pull/3556

    [BEAM-2595] Allow table schema objects in BQ DoFn

    Be sure to do all of the following to help us incorporate your contribution quickly and easily:
     - [ ] Make sure the PR title is formatted like: `[BEAM-<Jira issue #>] Description of pull request`
     - [ ] Make sure tests pass via `mvn clean verify`.
     - [ ] Replace `<Jira issue #>` in the title with the actual Jira issue number, if there is one.
     - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).

    ---

    R: @chamikaramj PTAL

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sb2nov/beam BEAM-2595-allow-table-schema-bq-dogn

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/beam/pull/3556.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #3556

----
commit f89d59db369037c0bdb8fcb5d26fb1fd4b4599f7
Author: Sourabh Bajaj
Date:   2017-07-13T06:50:24Z

    [BEAM-2595] Allow table schema objects in BQ DoFn
[jira] [Commented] (BEAM-2595) WriteToBigQuery does not work with nested json schema
[ https://issues.apache.org/jira/browse/BEAM-2595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16083496#comment-16083496 ]

Andrea Pierleoni commented on BEAM-2595:
----------------------------------------

Yes, sorry, I forgot the stack trace:

{code}
Traceback (most recent call last):
  File "/Users/andreap/work/code/library_dataflow/main.py", line 347, in <module>
    run()
  File "/Users/andreap/work/code/library_dataflow/main.py", line 343, in run
    write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE)
  File "/Users/andreap/library_dataflow/venv/lib/python2.7/site-packages/apache_beam/pvalue.py", line 100, in __or__
    return self.pipeline.apply(ptransform, self)
  File "/Users/andreap/library_dataflow/venv/lib/python2.7/site-packages/apache_beam/pipeline.py", line 265, in apply
    pvalueish_result = self.runner.apply(transform, pvalueish)
  File "/Users/andreap/library_dataflow/venv/lib/python2.7/site-packages/apache_beam/runners/runner.py", line 150, in apply
    return m(transform, input)
  File "/Users/andreap/library_dataflow/venv/lib/python2.7/site-packages/apache_beam/runners/runner.py", line 156, in apply_PTransform
    return transform.expand(input)
  File "/Users/andreap/work/code/library_dataflow/beam2_1.py", line 213, in expand
    return pcoll | 'WriteToBigQuery' >> ParDo(bigquery_write_fn)
  File "/Users/andreap/library_dataflow/venv/lib/python2.7/site-packages/apache_beam/transforms/core.py", line 620, in __init__
    super(ParDo, self).__init__(fn, *args, **kwargs)
  File "/Users/andreap/library_dataflow/venv/lib/python2.7/site-packages/apache_beam/transforms/ptransform.py", line 515, in __init__
    self.fn = pickler.loads(pickler.dumps(self.fn))
  File "/Users/andreap/library_dataflow/venv/lib/python2.7/site-packages/apache_beam/internal/pickler.py", line 225, in loads
    return dill.loads(s)
  File "/Users/andreap/library_dataflow/venv/lib/python2.7/site-packages/dill/dill.py", line 277, in loads
    return load(file)
  File "/Users/andreap/library_dataflow/venv/lib/python2.7/site-packages/dill/dill.py", line 266, in load
    obj = pik.load()
  File "/usr/local/Cellar/python/2.7.13/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 864, in load
    dispatch[key](self)
  File "/usr/local/Cellar/python/2.7.13/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 1195, in load_appends
    list.extend(stack[mark + 1:])
  File "/Users/andreap/library_dataflow/venv/lib/python2.7/site-packages/apitools/base/protorpclite/messages.py", line 1147, in extend
    self.__field.validate(sequence)
AttributeError: 'FieldList' object has no attribute '_FieldList__field'
{code}

For some reason, once deserialized, the sequence does not have the '_FieldList__field' attribute.

I don't believe the problem is introduced in 2.1.0 per se, but a (very useful) class that triggers it is coming with 2.1.0, so it may surface as a real problem in production. It definitely is for us.
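The failure mode in the traceback above can be reproduced without Beam or apitools. A minimal, hedged sketch (`GuardedList` is a hypothetical stand-in, not the real `FieldList`): a list subclass whose overridden `extend()` depends on an attribute set in `__init__` breaks under pickling, because the pickle protocol appends list items to a bare `__new__` instance before the instance state (`__dict__`) is restored.

```python
import pickle

class GuardedList(list):
    """Stand-in for apitools' FieldList: a list subclass whose extend()
    needs an attribute that only __init__ sets."""

    def __init__(self, validator, items=()):
        self._validator = validator
        super().__init__(items)

    def extend(self, items):
        # During unpickling this runs on a bare __new__ instance, before
        # the instance __dict__ (and thus _validator) has been restored.
        self._validator(items)
        super().extend(items)

def accept_all(items):
    pass

original = GuardedList(accept_all, [1, 2, 3])
try:
    pickle.loads(pickle.dumps(original))
    failed = False
except AttributeError:
    # Same shape of error as the FieldList traceback above.
    failed = True
```

Here `failed` ends up `True`: unpickling calls the overridden `extend()` on an instance that has no `_validator` yet, which is the same ordering problem the `'FieldList' object has no attribute '_FieldList__field'` error points at.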
[jira] [Commented] (BEAM-2595) WriteToBigQuery does not work with nested json schema
[ https://issues.apache.org/jira/browse/BEAM-2595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16083300#comment-16083300 ]

Ahmet Altay commented on BEAM-2595:
-----------------------------------

[~andrea.pierleoni] Thank you for reporting this. Could you share the error you are getting?

[~sb2nov] Could you verify whether this is a regression or not? If it is, can we mitigate it (add a comment/document to use the old way) before the release goes out?

In addition to the fix, I agree that we need a test if we don't have one, and we should also update the examples (e.g. https://github.com/apache/beam/blob/91c7d3d1f7d72e84e773c1adbffed063aefdff3b/sdks/python/apache_beam/examples/cookbook/bigquery_schema.py#L116)

cc: [~chamikara]
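For context on why users hit this at all: `WriteToBigQuery` also accepts a compact schema string of comma-separated `name:TYPE` pairs, but that form is inherently flat, which is what pushes nested schemas onto `TableSchema` objects (or, ideally, parsed JSON). A rough illustration with a hypothetical parser (not Beam code) of what the compact form can express:

```python
# Hypothetical helper (not part of the Beam SDK) illustrating the compact
# 'name:TYPE,name:TYPE' schema string accepted by WriteToBigQuery.
# Every entry is a single name/type pair, so there is no way to spell a
# nested RECORD field in this form.

def parse_compact_schema(spec):
    """Split a flat 'name:TYPE' schema string into field dicts."""
    fields = []
    for part in spec.split(','):
        name, field_type = part.strip().split(':')
        fields.append({'name': name, 'type': field_type, 'mode': 'NULLABLE'})
    return fields

fields = parse_compact_schema('kind:STRING, fullName:STRING')
```

Since the string form cannot describe a RECORD's children, nested-schema users are exactly the ones forced through the `TableFieldSchema` path where the pickling bug lives.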