[ 
https://issues.apache.org/jira/browse/BEAM-1788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15937891#comment-15937891
 ] 

Mike Lambert commented on BEAM-1788:
------------------------------------

I understand it may not be your issue directly....

But would it be worth setting the above environment variable in Beam (as part 
of the goal of having the Google-developed suite of SDKs play well together)?

Or is that something you think each developer should do themselves? (In which 
case, should it be documented somewhere beyond this "issue"?)

> Using google-cloud-datastore in Beam requires re-authentication
> ---------------------------------------------------------------
>
>                 Key: BEAM-1788
>                 URL: https://issues.apache.org/jira/browse/BEAM-1788
>             Project: Beam
>          Issue Type: Improvement
>          Components: sdk-py
>            Reporter: Mike Lambert
>            Assignee: Ahmet Altay
>            Priority: Minor
>              Labels: datastore, python
>
> When I run a pipeline, I believe everything (params, lexically scoped 
> variables) must be picklable for the individual processing stages.
> I have to load a dependent datastore record in one of my processing 
> pipelines. (Horribly inefficient, I know, but it's my DB design for now...)
> A {{google.cloud.datastore.Client()}} is not serializable because the 
> {{google.cloud.datastore._http.Connection}} it contains uses gRPC:
> {noformat}
>   File "lib/apache_beam/transforms/ptransform.py", line 474, in __init__
>     self.args = pickler.loads(pickler.dumps(self.args))
>   File "lib/apache_beam/internal/pickler.py", line 212, in loads
>     return dill.loads(s)
>   File "/Users/me/Library/Python/2.7/lib/python/site-packages/dill/dill.py", line 277, in loads
>     return load(file)
>   File "/Users/me/Library/Python/2.7/lib/python/site-packages/dill/dill.py", line 266, in load
>     obj = pik.load()
>   File "/usr/local/Cellar/python/2.7.12_1/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 864, in load
>     dispatch[key](self)
>   File "/usr/local/Cellar/python/2.7.12_1/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 1089, in load_newobj
>     obj = cls.__new__(cls, *args)
>   File "src/python/grpcio/grpc/_cython/_cygrpc/channel.pyx.pxi", line 35, in grpc._cython.cygrpc.Channel.__cinit__ (src/python/grpcio/grpc/_cython/cygrpc.c:4022)
> TypeError: __cinit__() takes at least 2 positional arguments (0 given)
> {noformat}
> So instead I am constructing a Client inside my pipeline, but it appears to 
> be jumping through hoops to recreate the Client: *each* execution of my 
> pipeline prints:
> {{DEBUG:google_auth_httplib2:Making request: POST https://accounts.google.com/o/oauth2/token}}
> I'm sure Google SRE would be very unhappy if I scaled up this mapreduce. :)
> This is a tricky cross-team interaction issue (it only occurs for those 
> using google-cloud-datastore *and* apache-beam google-dataflow), so I am not 
> sure of the proper place to file this. I've cross-posted it at 
> https://github.com/GoogleCloudPlatform/google-cloud-python/issues/3191
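
The serialization problem in the quoted description, and one common way around it, can be sketched in plain Python. This is only an illustration under stated assumptions: FakeClient is a hypothetical stand-in for an unpicklable client such as google.cloud.datastore.Client, and LookupFn is a hypothetical name for the user's processing function. Neither comes from Beam or google-cloud-datastore. The idea is to leave the client out of the pickled state and build it lazily on first use, so that it is created once per deserialized copy instead of once per element.

```python
import pickle
import threading


class FakeClient:
    """Hypothetical stand-in for an unpicklable client.

    It holds a lock, which pickle refuses to serialize, much as the real
    client in the quoted issue holds a gRPC channel.
    """

    instances_created = 0  # counts constructions (simulated setup cost)

    def __init__(self):
        FakeClient.instances_created += 1
        self._lock = threading.Lock()

    def get(self, key):
        with self._lock:
            return {"key": key}


class LookupFn:
    """Hypothetical processing callable that never pickles its client."""

    def __init__(self):
        self._client = None

    def __getstate__(self):
        # Drop the live client so that pickling this callable succeeds.
        state = self.__dict__.copy()
        state["_client"] = None
        return state

    def _get_client(self):
        # Build the client on first use, then reuse it for later elements.
        if self._client is None:
            self._client = FakeClient()
        return self._client

    def __call__(self, key):
        return self._get_client().get(key)


if __name__ == "__main__":
    fn = pickle.loads(pickle.dumps(LookupFn()))
    for key in ("a", "b", "c"):
        fn(key)
    print(FakeClient.instances_created)
```

In a Beam pipeline the same lazy construction could plausibly live in a DoFn, for example in its start_bundle method. Whether that also avoids the repeated token request depends on how credentials are cached, which this sketch does not model.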



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
