Thanks Udi. Added some comments. On Wed, Oct 17, 2018 at 10:50 AM Ahmet Altay <[email protected]> wrote:
> Udi thank you for the proposal and thank you for sharing it in plain > email. My comments are below. > > Overall, this is a good plan to get us out of a tough situation with an > old dependency. > > On Tue, Oct 16, 2018 at 6:59 PM, Udi Meiri <[email protected]> wrote: > >> Hi, >> Sadly upgrading googledatastore -> google-cloud-datastore is non-trivial ( >> https://issues.apache.org/jira/browse/BEAM-4543). I wrote a doc to >> summarize the plan: >> >> https://docs.google.com/document/d/1sL9p7NE5Z0p-5SB5uwpxWrddj_UCESKSrsvDTWNKqb4/edit?usp=sharing >> >> Contents pasted below: >> Beam Python SDK: Datastore Client Upgrade >> >> [email protected] >> >> public, draft, 2018-10-16 >> Objective >> >> Upgrade Beam's Python SDK dependency to use google-cloud-datastore v1.70 >> (or later), replacing googledatastore v7.0.1, providing Beam users a >> migration path to a new Datastore PTransform API. >> Background >> >> Beam currently uses the googledatastore package to provide access to >> Google Cloud Datastore, however that package doesn't seem to be getting >> regular releases (last release in 2017-04 >> <https://pypi.org/project/googledatastore/>) and it doesn't officially >> support Python 3 <https://issues.apache.org/jira/browse/BEAM-4543>. >> >> The current Beam API for Datastore queries exposes googledatastore types, >> such as using a protobuf to define a query (wordcount example >> <https://github.com/apache/beam/blob/79049b02949affe5aa2390dec9b890a04e1fde89/sdks/python/apache_beam/examples/cookbook/datastore_wordcount.py#L159>). >> Conversely, google-cloud-datastore hides this implementation detail (query >> API >> <https://googleapis.github.io/google-cloud-python/latest/datastore/queries.html>). >> Since Beam API has to change the data types it accepts, it forces users to >> change their code. This makes the migration to google-cloud-datastore >> non-trivial. >> Proposal >> >> This proposal includes a period in which two Beam APIs are available to >> access Datastore. >> >> >> - >> >> Add a new PTransforms that use google-cloud-datastore and mark as >> deprecated the existing API (ReadFromDatastore, WriteToDatastore, >> DeleteFromDatastore). >> - >> >> Implement apache_beam/io/datastore.py using google-cloud-datastore, >> taking care to not expose Datastore client internals. >> - >> >> (optional) Remove googledatastore from GCP_REQUIREMENTS >> >> <https://github.com/apache/beam/blob/79049b02949affe5aa2390dec9b890a04e1fde89/sdks/python/setup.py#L139> >> package list, and add it to a separate list, e.g., pip install >> apache-beam[gcp,googledatastore]. >> >> > I would like to avoid defining new sets of extra packages. Assuming that > these two packages are not incompatible together, we could keep them both > in [gcp]. > I think we might need this since googleclouddatastore package (1) does not seems to be getting upgraded (2) depends on older versions of packages (for example, httplib2). This conflicts with more recent releases of other tools (for example, gsutil). > > >> >> - >> >> Remove googledatastore-based API from Beam after 2 releases. >> >> > The removal needs to wait until next major version by default. Unless, we > have a way of asking our users and ensuring that nobody is really using the > existing API. Removing a current API in 2 releases (~3 months period) will > hurt some users. > +1 > > >
