Mike Lambert created BEAM-1788:
----------------------------------
Summary: Using google-cloud-datastore in Beam requires
re-authentication
Key: BEAM-1788
URL: https://issues.apache.org/jira/browse/BEAM-1788
Project: Beam
Issue Type: Improvement
Components: beam-model-runner-api
Reporter: Mike Lambert
Assignee: Kenneth Knowles
Priority: Minor
When I run a pipeline, I believe everything (parameters, lexically scoped
variables) must be picklable for the individual processing stages.
In one of my processing stages I have to load a dependent Datastore record.
(Horribly inefficient, I know, but that's my DB design for now...)
A {{google.cloud.datastore.Client()}} is not serializable because of the
{{google.cloud.datastore._http.Connection}} it contains, which uses gRPC:
{code}
File "lib/apache_beam/transforms/ptransform.py", line 474, in __init__
  self.args = pickler.loads(pickler.dumps(self.args))
File "lib/apache_beam/internal/pickler.py", line 212, in loads
  return dill.loads(s)
File "/Users/me/Library/Python/2.7/lib/python/site-packages/dill/dill.py", line 277, in loads
  return load(file)
File "/Users/me/Library/Python/2.7/lib/python/site-packages/dill/dill.py", line 266, in load
  obj = pik.load()
File "/usr/local/Cellar/python/2.7.12_1/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 864, in load
  dispatch[key](self)
File "/usr/local/Cellar/python/2.7.12_1/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 1089, in load_newobj
  obj = cls.__new__(cls, *args)
File "src/python/grpcio/grpc/_cython/_cygrpc/channel.pyx.pxi", line 35, in grpc._cython.cygrpc.Channel.__cinit__ (src/python/grpcio/grpc/_cython/cygrpc.c:4022)
TypeError: __cinit__() takes at least 2 positional arguments (0 given)
{code}
So instead I construct a Client inside my pipeline stage. But that appears to
jump through hoops to recreate the Client, in that *each* execution of my
processing stage prints:
{code}
DEBUG:google_auth_httplib2:Making request: POST https://accounts.google.com/o/oauth2/token
{code}
I'm sure Google SRE would be very unhappy if I scaled up this mapreduce. :)
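For reference, the workaround pattern I'd expect here is to create the client
lazily in {{DoFn.start_bundle}} instead of in {{__init__}}, so the client is
never pickled and is constructed once per bundle rather than once per element.
Here is a minimal sketch of that pattern using a hypothetical
{{ExpensiveClient}} stand-in (the real {{google.cloud.datastore.Client}}
needs credentials and a network, so it isn't used directly):

```python
import pickle


class ExpensiveClient:
    """Stand-in for google.cloud.datastore.Client (hypothetical; the real
    client authenticates and opens a gRPC channel, and is not picklable)."""
    instances_created = 0

    def __init__(self):
        # Count constructions to show the client is built only once.
        ExpensiveClient.instances_created += 1

    def get(self, key):
        return {"key": key}


class LookupDoFn:
    """Mimics a beam.DoFn: the client is created lazily per bundle, so the
    DoFn itself pickles cleanly and every element reuses the cached client."""

    def __init__(self):
        self._client = None  # no live connection at pickling time

    def start_bundle(self):
        if self._client is None:
            self._client = ExpensiveClient()

    def process(self, element):
        yield self._client.get(element)


# The DoFn round-trips through pickle because the client does not exist yet:
fn = pickle.loads(pickle.dumps(LookupDoFn()))
fn.start_bundle()
results = [r for e in ["a", "b", "c"] for r in fn.process(e)]
```

With the real libraries this would avoid both the pickling error above and
the per-element re-authentication, though it still re-authenticates once per
bundle; a module-level cache would be needed to share one client per worker
process.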
This is a tricky cross-team interaction issue (it only occurs for those using
google-cloud-datastore *and* apache-beam google-dataflow), so I'm not sure of
the proper place to file it. I've cross-posted it at
https://github.com/GoogleCloudPlatform/google-cloud-python/issues/3191
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)