[ https://issues.apache.org/jira/browse/BEAM-1788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mike Lambert updated BEAM-1788:
-------------------------------
    Description: 
When I run a pipeline, I believe everything (params, lexically scoped 
variables) must be pickleable for the individual processing stages.

I have to load a dependent datastore record in one of my processing pipelines. 
(Horribly inefficient, I know, but it's my DB design for now...)

A {{google.cloud.datastore.Client()}} is not serializable because the 
{{google.cloud.datastore._http.Connection}} it contains uses gRPC:

{noformat}
  File "lib/apache_beam/transforms/ptransform.py", line 474, in __init__
    self.args = pickler.loads(pickler.dumps(self.args))
  File "lib/apache_beam/internal/pickler.py", line 212, in loads
    return dill.loads(s)
  File "/Users/me/Library/Python/2.7/lib/python/site-packages/dill/dill.py", line 277, in loads
    return load(file)
  File "/Users/me/Library/Python/2.7/lib/python/site-packages/dill/dill.py", line 266, in load
    obj = pik.load()
  File "/usr/local/Cellar/python/2.7.12_1/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 864, in load
    dispatch[key](self)
  File "/usr/local/Cellar/python/2.7.12_1/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 1089, in load_newobj
    obj = cls.__new__(cls, *args)
  File "src/python/grpcio/grpc/_cython/_cygrpc/channel.pyx.pxi", line 35, in grpc._cython.cygrpc.Channel.__cinit__ (src/python/grpcio/grpc/_cython/cygrpc.c:4022)
TypeError: __cinit__() takes at least 2 positional arguments (0 given)
{noformat}
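
The last two frames hint at the mechanism: pickle's {{load_newobj}} calls {{cls.__new__(cls)}} with no arguments, and the Cython {{Channel}} type cannot be constructed that way. Below is a minimal sketch that reproduces the same shape of failure using only the standard-library {{pickle}} module. {{FakeChannel}} and {{FakeClient}} are hypothetical stand-ins I made up for illustration, not the real grpc or google-cloud-datastore classes, and the real objects may fail for additional reasons.

```python
import pickle


class FakeChannel(object):
    """Hypothetical stand-in for a native channel type.

    Like the Cython Channel in the traceback, it cannot be constructed
    without arguments.
    """

    def __new__(cls, target, credentials):
        return super().__new__(cls)

    def __init__(self, target, credentials):
        self.target = target


class FakeClient(object):
    """Hypothetical stand-in for a client object that owns a channel."""

    def __init__(self, project):
        self.project = project
        self.channel = FakeChannel("datastore.example.invalid:443", None)


def try_round_trip(obj):
    """Mimic the dumps/loads round trip applied to transform arguments."""
    try:
        pickle.loads(pickle.dumps(obj))
    except TypeError as exc:
        # dumps() succeeds; loads() fails when it calls
        # FakeChannel.__new__(FakeChannel) with no further arguments.
        return "TypeError: %s" % exc
    return "ok"


if __name__ == "__main__":
    print(try_round_trip({"project": "demo-project"}))
    print(try_round_trip(FakeClient("demo-project")))
```

Note that the failure surfaces on load, not on dump, which matches the traceback above: serializing works, and rebuilding the channel on the other side is what breaks.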

So instead I construct a Client inside my pipeline. That appears to jump 
through hoops to recreate the Client, because *each* execution of my pipeline 
prints:
{{DEBUG:google_auth_httplib2:Making request: POST https://accounts.google.com/o/oauth2/token}}

I'm sure Google SRE would be very unhappy if I scaled up this mapreduce. :)
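
One common way to limit this kind of repeated setup is to build the client lazily, at most once per deserialized function object, and to leave it out of the pickled state. The sketch below illustrates that idea only; it is not something the issue confirms as a fix. {{FakeClient}} is a hypothetical stand-in that counts how often it is constructed, and {{LookupFn}} is a plain class shaped like a Beam DoFn (in a real pipeline it would subclass {{apache_beam.DoFn}}).

```python
import pickle


class FakeClient(object):
    """Hypothetical stand-in for google.cloud.datastore.Client.

    It only counts constructions; the real client would authenticate here.
    """

    created = 0

    def __init__(self):
        FakeClient.created += 1

    def get(self, key):
        return {"key": key}


class LookupFn(object):
    """Shaped like a Beam DoFn; holds a client that is never pickled."""

    def __init__(self, client_factory=FakeClient):
        self._client_factory = client_factory
        self._client = None  # not built on the machine that submits the job

    def __getstate__(self):
        # Drop the live client so only picklable state is shipped.
        state = self.__dict__.copy()
        state["_client"] = None
        return state

    def _get_client(self):
        # Build the client on first use, then reuse it for later elements.
        if self._client is None:
            self._client = self._client_factory()
        return self._client

    def process(self, element):
        yield self._get_client().get(element)


# Simulate shipping the function to a worker, then processing three elements.
fn = pickle.loads(pickle.dumps(LookupFn()))
results = [record for key in ["a", "b", "c"] for record in fn.process(key)]
```

In Beam itself the same initialization could live in {{DoFn.start_bundle}}. Whether this actually removes the repeated token request depends on how often the runner re-creates the function object and on how the client library caches credentials, neither of which this report establishes.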

This is a tricky cross-team interaction issue (it only occurs for those using 
google-cloud-datastore *and* apache-beam google-dataflow), so I'm not sure of 
the proper place to file it. I've cross-posted it at 
https://github.com/GoogleCloudPlatform/google-cloud-python/issues/3191



> Using google-cloud-datastore in Beam requires re-authentication
> ---------------------------------------------------------------
>
>                 Key: BEAM-1788
>                 URL: https://issues.apache.org/jira/browse/BEAM-1788
>             Project: Beam
>          Issue Type: Improvement
>          Components: sdk-py
>            Reporter: Mike Lambert
>            Assignee: Ahmet Altay
>            Priority: Minor
>              Labels: datastore, python



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
