Udi, thank you for the proposal, and for sharing it in plain email. My comments are below.
Overall, this is a good plan to get us out of a tough situation with an old dependency.

On Tue, Oct 16, 2018 at 6:59 PM, Udi Meiri <[email protected]> wrote:

> Hi,
> Sadly upgrading googledatastore -> google-cloud-datastore is non-trivial
> (https://issues.apache.org/jira/browse/BEAM-4543). I wrote a doc to
> summarize the plan:
> https://docs.google.com/document/d/1sL9p7NE5Z0p-5SB5uwpxWrddj_UCESKSrsvDTWNKqb4/edit?usp=sharing
>
> Contents pasted below:
> Beam Python SDK: Datastore Client Upgrade
>
> [email protected]
>
> public, draft, 2018-10-16
> Objective
>
> Upgrade Beam's Python SDK dependency to use google-cloud-datastore v1.70
> (or later), replacing googledatastore v7.0.1, providing Beam users a
> migration path to a new Datastore PTransform API.
> Background
>
> Beam currently uses the googledatastore package to provide access to
> Google Cloud Datastore, however that package doesn't seem to be getting
> regular releases (last release in 2017-04
> <https://pypi.org/project/googledatastore/>) and it doesn't officially
> support Python 3 <https://issues.apache.org/jira/browse/BEAM-4543>.
>
> The current Beam API for Datastore queries exposes googledatastore types,
> such as using a protobuf to define a query (wordcount example
> <https://github.com/apache/beam/blob/79049b02949affe5aa2390dec9b890a04e1fde89/sdks/python/apache_beam/examples/cookbook/datastore_wordcount.py#L159>).
> Conversely, google-cloud-datastore hides this implementation detail (query
> API
> <https://googleapis.github.io/google-cloud-python/latest/datastore/queries.html>).
> Since Beam API has to change the data types it accepts, it forces users to
> change their code. This makes the migration to google-cloud-datastore
> non-trivial.
> Proposal
>
> This proposal includes a period in which two Beam APIs are available to
> access Datastore.
>
>    - Add a new PTransforms that use google-cloud-datastore and mark as
>      deprecated the existing API (ReadFromDatastore, WriteToDatastore,
>      DeleteFromDatastore).
>    - Implement apache_beam/io/datastore.py using google-cloud-datastore,
>      taking care to not expose Datastore client internals.
>    - (optional) Remove googledatastore from GCP_REQUIREMENTS
>      <https://github.com/apache/beam/blob/79049b02949affe5aa2390dec9b890a04e1fde89/sdks/python/setup.py#L139>
>      package list, and add it to a separate list, e.g., pip install
>      apache-beam[gcp,googledatastore].

I would like to avoid defining new sets of extra packages. Assuming these two packages are compatible with each other, we could keep them both in [gcp].

>    - Remove googledatastore-based API from Beam after 2 releases.

By default, the removal needs to wait until the next major version, unless we have a way of asking our users and confirming that nobody is actually using the existing API. Removing a current API within 2 releases (roughly a 3-month period) will hurt some users.
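
On marking the existing API as deprecated while it stays available: here is a rough sketch of one way the old transforms could warn users during the overlap period. It uses only the standard library. The decorator, the `since` value, and the stand-in function are all hypothetical names for illustration, not existing Beam code.

```python
import functools
import warnings


def deprecated(since, replacement):
  """Hypothetical decorator: emit a DeprecationWarning on each call."""
  def decorator(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
      warnings.warn(
          '%s is deprecated since %s; use %s instead.'
          % (fn.__name__, since, replacement),
          DeprecationWarning,
          stacklevel=2)  # Point the warning at the caller, not the wrapper.
      return fn(*args, **kwargs)
    return wrapper
  return decorator


# Stand-in for the existing googledatastore-based read path (assumption:
# the real ReadFromDatastore is a PTransform, not a plain function).
@deprecated(since='2.x', replacement='the google-cloud-datastore based API')
def read_from_datastore_v1(project):
  return 'reading from %s' % project
```

Existing pipelines would keep working and only see a warning, which gives us a signal to point users at before any removal is scheduled.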
