[jira] [Commented] (BEAM-2595) WriteToBigQuery does not work with nested json schema

2017-07-14 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16088222#comment-16088222
 ] 

ASF GitHub Bot commented on BEAM-2595:
--

GitHub user sb2nov opened a pull request:

https://github.com/apache/beam/pull/3563

[BEAM-2595] Allow table schema objects in BQ DoFn

Be sure to do all of the following to help us incorporate your contribution
quickly and easily:

 - [ ] Make sure the PR title is formatted like:
   `[BEAM-] Description of pull request`
 - [ ] Make sure tests pass via `mvn clean verify`.
 - [ ] Replace `` in the title with the actual Jira issue
   number, if there is one.
 - [ ] If this contribution is large, please file an Apache
   [Individual Contributor License 
Agreement](https://www.apache.org/licenses/icla.pdf).

---

Cherry pick from master for BEAM-2535

R: @aaltay 
cc @jbonofre 


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/sb2nov/beam BEAM-2595-cp

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/beam/pull/3563.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3563


commit ada4733b02bc38b1ef619fb991c068822a917595
Author: Sourabh Bajaj 
Date:   2017-07-13T19:02:31Z

[BEAM-2595] Allow table schema objects in BQ DoFn




> WriteToBigQuery does not work with nested json schema
> -
>
> Key: BEAM-2595
> URL: https://issues.apache.org/jira/browse/BEAM-2595
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py
>Affects Versions: 2.1.0
> Environment: mac os local runner, Python
>Reporter: Andrea Pierleoni
>Assignee: Sourabh Bajaj
>Priority: Minor
>  Labels: gcp
> Fix For: 2.1.0
>
>
> I am trying to use the new `WriteToBigQuery` PTransform added to 
> `apache_beam.io.gcp.bigquery` in version 2.1.0-RC1
> I need to write to a bigquery table with nested fields.
> The only way to specify nested schemas in bigquery is with teh json schema.
> None of the classes in `apache_beam.io.gcp.bigquery` are able to parse the 
> json schema, but they accept a schema as an instance of the class 
> `apache_beam.io.gcp.internal.clients.bigquery.TableFieldSchema`
> I am composing the `TableFieldSchema` as suggested here 
> [https://stackoverflow.com/questions/36127537/json-table-schema-to-bigquery-tableschema-for-bigquerysink/45039436#45039436],
>  and it looks fine when passed to the PTransform `WriteToBigQuery`. 
> The problem is that the base class `PTransformWithSideInputs` try to pickle 
> and unpickle the function 
> [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/ptransform.py#L515]
>   (that includes the TableFieldSchema instance) and for some reason when the 
> class is unpickled some `FieldList` instance are converted to simple lists, 
> and the pickling validation fails.
> Would it be possible to extend the test coverage to nested json objects for 
> bigquery?
> They are also relatively easy to parse into a TableFieldSchema.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (BEAM-2595) WriteToBigQuery does not work with nested json schema

2017-07-14 Thread Sourabh Bajaj (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16087712#comment-16087712
 ] 

Sourabh Bajaj commented on BEAM-2595:
-

[~andrea.pierleoni] can you verify that your pipeline works with the latest 
master ?

> WriteToBigQuery does not work with nested json schema
> -
>
> Key: BEAM-2595
> URL: https://issues.apache.org/jira/browse/BEAM-2595
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py
>Affects Versions: 2.1.0
> Environment: mac os local runner, Python
>Reporter: Andrea Pierleoni
>Assignee: Sourabh Bajaj
>Priority: Minor
>  Labels: gcp
> Fix For: 2.1.0
>
>
> I am trying to use the new `WriteToBigQuery` PTransform added to 
> `apache_beam.io.gcp.bigquery` in version 2.1.0-RC1
> I need to write to a bigquery table with nested fields.
> The only way to specify nested schemas in bigquery is with teh json schema.
> None of the classes in `apache_beam.io.gcp.bigquery` are able to parse the 
> json schema, but they accept a schema as an instance of the class 
> `apache_beam.io.gcp.internal.clients.bigquery.TableFieldSchema`
> I am composing the `TableFieldSchema` as suggested here 
> [https://stackoverflow.com/questions/36127537/json-table-schema-to-bigquery-tableschema-for-bigquerysink/45039436#45039436],
>  and it looks fine when passed to the PTransform `WriteToBigQuery`. 
> The problem is that the base class `PTransformWithSideInputs` try to pickle 
> and unpickle the function 
> [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/ptransform.py#L515]
>   (that includes the TableFieldSchema instance) and for some reason when the 
> class is unpickled some `FieldList` instance are converted to simple lists, 
> and the pickling validation fails.
> Would it be possible to extend the test coverage to nested json objects for 
> bigquery?
> They are also relatively easy to parse into a TableFieldSchema.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (BEAM-2595) WriteToBigQuery does not work with nested json schema

2017-07-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16086384#comment-16086384
 ] 

ASF GitHub Bot commented on BEAM-2595:
--

Github user asfgit closed the pull request at:

https://github.com/apache/beam/pull/3556


> WriteToBigQuery does not work with nested json schema
> -
>
> Key: BEAM-2595
> URL: https://issues.apache.org/jira/browse/BEAM-2595
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py
>Affects Versions: 2.1.0
> Environment: mac os local runner, Python
>Reporter: Andrea Pierleoni
>Assignee: Sourabh Bajaj
>Priority: Minor
>  Labels: gcp
> Fix For: 2.1.0
>
>
> I am trying to use the new `WriteToBigQuery` PTransform added to 
> `apache_beam.io.gcp.bigquery` in version 2.1.0-RC1
> I need to write to a bigquery table with nested fields.
> The only way to specify nested schemas in bigquery is with teh json schema.
> None of the classes in `apache_beam.io.gcp.bigquery` are able to parse the 
> json schema, but they accept a schema as an instance of the class 
> `apache_beam.io.gcp.internal.clients.bigquery.TableFieldSchema`
> I am composing the `TableFieldSchema` as suggested here 
> [https://stackoverflow.com/questions/36127537/json-table-schema-to-bigquery-tableschema-for-bigquerysink/45039436#45039436],
>  and it looks fine when passed to the PTransform `WriteToBigQuery`. 
> The problem is that the base class `PTransformWithSideInputs` try to pickle 
> and unpickle the function 
> [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/ptransform.py#L515]
>   (that includes the TableFieldSchema instance) and for some reason when the 
> class is unpickled some `FieldList` instance are converted to simple lists, 
> and the pickling validation fails.
> Would it be possible to extend the test coverage to nested json objects for 
> bigquery?
> They are also relatively easy to parse into a TableFieldSchema.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (BEAM-2595) WriteToBigQuery does not work with nested json schema

2017-07-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16085278#comment-16085278
 ] 

ASF GitHub Bot commented on BEAM-2595:
--

GitHub user sb2nov opened a pull request:

https://github.com/apache/beam/pull/3556

[BEAM-2595] Allow table schema objects in BQ DoFn

Be sure to do all of the following to help us incorporate your contribution
quickly and easily:

 - [ ] Make sure the PR title is formatted like:
   `[BEAM-] Description of pull request`
 - [ ] Make sure tests pass via `mvn clean verify`.
 - [ ] Replace `` in the title with the actual Jira issue
   number, if there is one.
 - [ ] If this contribution is large, please file an Apache
   [Individual Contributor License 
Agreement](https://www.apache.org/licenses/icla.pdf).

---

R: @chamikaramj PTAL

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/sb2nov/beam 
BEAM-2595-allow-table-schema-bq-dogn

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/beam/pull/3556.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3556


commit f89d59db369037c0bdb8fcb5d26fb1fd4b4599f7
Author: Sourabh Bajaj 
Date:   2017-07-13T06:50:24Z

[BEAM-2595] Allow table schema objects in BQ DoFn




> WriteToBigQuery does not work with nested json schema
> -
>
> Key: BEAM-2595
> URL: https://issues.apache.org/jira/browse/BEAM-2595
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py
>Affects Versions: 2.1.0
> Environment: mac os local runner, Python
>Reporter: Andrea Pierleoni
>Assignee: Sourabh Bajaj
>Priority: Minor
>  Labels: gcp
> Fix For: 2.1.0
>
>
> I am trying to use the new `WriteToBigQuery` PTransform added to 
> `apache_beam.io.gcp.bigquery` in version 2.1.0-RC1
> I need to write to a bigquery table with nested fields.
> The only way to specify nested schemas in bigquery is with teh json schema.
> None of the classes in `apache_beam.io.gcp.bigquery` are able to parse the 
> json schema, but they accept a schema as an instance of the class 
> `apache_beam.io.gcp.internal.clients.bigquery.TableFieldSchema`
> I am composing the `TableFieldSchema` as suggested here 
> [https://stackoverflow.com/questions/36127537/json-table-schema-to-bigquery-tableschema-for-bigquerysink/45039436#45039436],
>  and it looks fine when passed to the PTransform `WriteToBigQuery`. 
> The problem is that the base class `PTransformWithSideInputs` try to pickle 
> and unpickle the function 
> [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/ptransform.py#L515]
>   (that includes the TableFieldSchema instance) and for some reason when the 
> class is unpickled some `FieldList` instance are converted to simple lists, 
> and the pickling validation fails.
> Would it be possible to extend the test coverage to nested json objects for 
> bigquery?
> They are also relatively easy to parse into a TableFieldSchema.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (BEAM-2595) WriteToBigQuery does not work with nested json schema

2017-07-12 Thread Andrea Pierleoni (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16083496#comment-16083496
 ] 

Andrea Pierleoni commented on BEAM-2595:


yes, sorry forgot the stack trace

{code:}
Traceback (most recent call last):
  File "/Users/andreap/work/code/library_dataflow/main.py", line 347, in 

run()
  File "/Users/andreap/work/code/library_dataflow/main.py", line 343, in run
write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE)
  File 
"/Users/andreap/library_dataflow/venv/lib/python2.7/site-packages/apache_beam/pvalue.py",
 line 100, in __or__
return self.pipeline.apply(ptransform, self)
  File 
"/Users/andreap/library_dataflow/venv/lib/python2.7/site-packages/apache_beam/pipeline.py",
 line 265, in apply
pvalueish_result = self.runner.apply(transform, pvalueish)
  File 
"/Users/andreap/library_dataflow/venv/lib/python2.7/site-packages/apache_beam/runners/runner.py",
 line 150, in apply
return m(transform, input)
  File 
"/Users/andreap/library_dataflow/venv/lib/python2.7/site-packages/apache_beam/runners/runner.py",
 line 156, in apply_PTransform
return transform.expand(input)
  File "/Users/andreap/work/code/library_dataflow/beam2_1.py", line 213, in 
expand
return pcoll | 'WriteToBigQuery' >> ParDo(bigquery_write_fn)
  File 
"/Users/andreap/library_dataflow/venv/lib/python2.7/site-packages/apache_beam/transforms/core.py",
 line 620, in __init__
super(ParDo, self).__init__(fn, *args, **kwargs)
  File 
"/Users/andreap/library_dataflow/venv/lib/python2.7/site-packages/apache_beam/transforms/ptransform.py",
 line 515, in __init__
self.fn = pickler.loads(pickler.dumps(self.fn))
  File 
"/Users/andreap/library_dataflow/venv/lib/python2.7/site-packages/apache_beam/internal/pickler.py",
 line 225, in loads
return dill.loads(s)
  File 
"/Users/andreap/library_dataflow/venv/lib/python2.7/site-packages/dill/dill.py",
 line 277, in loads
return load(file)
  File 
"/Users/andreap/library_dataflow/venv/lib/python2.7/site-packages/dill/dill.py",
 line 266, in load
obj = pik.load()
  File 
"/usr/local/Cellar/python/2.7.13/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py",
 line 864, in load
dispatch[key](self)
  File 
"/usr/local/Cellar/python/2.7.13/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py",
 line 1195, in load_appends
list.extend(stack[mark + 1:])
  File 
"/Users/andreap/library_dataflow/venv/lib/python2.7/site-packages/apitools/base/protorpclite/messages.py",
 line 1147, in extend
self.__field.validate(sequence)
AttributeError: 'FieldList' object has no attribute '_FieldList__field'

{code}

for some reasons once deserialized the seuquence does not have the 
'_FieldList__field' attribute.
I don't believe it is introduced in 2.1.0 per se, but a (very useful) class 
using it is coming with 2.1.0 and it may flag as a real problem in production. 
Definitely for us.

> WriteToBigQuery does not work with nested json schema
> -
>
> Key: BEAM-2595
> URL: https://issues.apache.org/jira/browse/BEAM-2595
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py
>Affects Versions: 2.1.0
> Environment: mac os local runner, Python
>Reporter: Andrea Pierleoni
>Assignee: Sourabh Bajaj
>Priority: Minor
>  Labels: gcp
> Fix For: 2.1.0
>
>
> I am trying to use the new `WriteToBigQuery` PTransform added to 
> `apache_beam.io.gcp.bigquery` in version 2.1.0-RC1
> I need to write to a bigquery table with nested fields.
> The only way to specify nested schemas in bigquery is with teh json schema.
> None of the classes in `apache_beam.io.gcp.bigquery` are able to parse the 
> json schema, but they accept a schema as an instance of the class 
> `apache_beam.io.gcp.internal.clients.bigquery.TableFieldSchema`
> I am composing the `TableFieldSchema` as suggested here 
> [https://stackoverflow.com/questions/36127537/json-table-schema-to-bigquery-tableschema-for-bigquerysink/45039436#45039436],
>  and it looks fine when passed to the PTransform `WriteToBigQuery`. 
> The problem is that the base class `PTransformWithSideInputs` try to pickle 
> and unpickle the function 
> [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/ptransform.py#L515]
>   (that includes the TableFieldSchema instance) and for some reason when the 
> class is unpickled some `FieldList` instance are converted to simple lists, 
> and the pickling validation fails.
> Would it be possible to extend the test coverage to nested json objects for 
> bigquery?
> They are also relatively easy to parse into a TableFieldSchema.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (BEAM-2595) WriteToBigQuery does not work with nested json schema

2017-07-11 Thread Ahmet Altay (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16083300#comment-16083300
 ] 

Ahmet Altay commented on BEAM-2595:
---

[~andrea.pierleoni] Thank you for reporting this. Could you share the error you 
are getting?

[~sb2nov] Could you verify whether this is a regression or not? If this is a 
regression, can we mitigate before (add a comment/document to use the old way) 
before the release goes out?

In addition to fix, I agree that we need a test if we don't have one. And also 
update examples (e.g. 
https://github.com/apache/beam/blob/91c7d3d1f7d72e84e773c1adbffed063aefdff3b/sdks/python/apache_beam/examples/cookbook/bigquery_schema.py#L116)

cc: [~chamikara]


> WriteToBigQuery does not work with nested json schema
> -
>
> Key: BEAM-2595
> URL: https://issues.apache.org/jira/browse/BEAM-2595
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py
>Affects Versions: 2.1.0
> Environment: mac os local runner, Python
>Reporter: Andrea Pierleoni
>Assignee: Sourabh Bajaj
>Priority: Minor
>  Labels: gcp
> Fix For: 2.1.0
>
>
> I am trying to use the new `WriteToBigQuery` PTransform added to 
> `apache_beam.io.gcp.bigquery` in version 2.1.0-RC1
> I need to write to a bigquery table with nested fields.
> The only way to specify nested schemas in bigquery is with teh json schema.
> None of the classes in `apache_beam.io.gcp.bigquery` are able to parse the 
> json schema, but they accept a schema as an instance of the class 
> `apache_beam.io.gcp.internal.clients.bigquery.TableFieldSchema`
> I am composing the `TableFieldSchema` as suggested here 
> [https://stackoverflow.com/questions/36127537/json-table-schema-to-bigquery-tableschema-for-bigquerysink/45039436#45039436],
>  and it looks fine when passed to the PTransform `WriteToBigQuery`. 
> The problem is that the base class `PTransformWithSideInputs` try to pickle 
> and unpickle the function 
> [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/ptransform.py#L515]
>   (that includes the TableFieldSchema instance) and for some reason when the 
> class is unpickled some `FieldList` instance are converted to simple lists, 
> and the pickling validation fails.
> Would it be possible to extend the test coverage to nested json objects for 
> bigquery?
> They are also relatively easy to parse into a TableFieldSchema.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)