On Wed, Oct 17, 2018 at 11:49 AM, Chamikara Jayalath <[email protected]> wrote:
> Thanks Udi. Added some comments. > > On Wed, Oct 17, 2018 at 10:50 AM Ahmet Altay <[email protected]> wrote: > >> Udi thank you for the proposal and thank you for sharing it in plain >> email. My comments are below. >> >> Overall, this is a good plan to get us out of a tough situation with an >> old dependency. >> >> On Tue, Oct 16, 2018 at 6:59 PM, Udi Meiri <[email protected]> wrote: >> >>> Hi, >>> Sadly upgrading googledatastore -> google-cloud-datastore is non-trivial >>> (https://issues.apache.org/jira/browse/BEAM-4543). I wrote a doc to >>> summarize the plan: >>> https://docs.google.com/document/d/1sL9p7NE5Z0p-5SB5uwpxWrddj_ >>> UCESKSrsvDTWNKqb4/edit?usp=sharing >>> >>> Contents pasted below: >>> Beam Python SDK: Datastore Client Upgrade >>> >>> [email protected] >>> >>> public, draft, 2018-10-16 >>> Objective >>> >>> Upgrade Beam's Python SDK dependency to use google-cloud-datastore v1.70 >>> (or later), replacing googledatastore v7.0.1, providing Beam users a >>> migration path to a new Datastore PTransform API. >>> Background >>> >>> Beam currently uses the googledatastore package to provide access to >>> Google Cloud Datastore, however that package doesn't seem to be getting >>> regular releases (last release in 2017-04 >>> <https://pypi.org/project/googledatastore/>) and it doesn't officially >>> support Python 3 <https://issues.apache.org/jira/browse/BEAM-4543>. >>> >>> The current Beam API for Datastore queries exposes googledatastore >>> types, such as using a protobuf to define a query (wordcount example >>> <https://github.com/apache/beam/blob/79049b02949affe5aa2390dec9b890a04e1fde89/sdks/python/apache_beam/examples/cookbook/datastore_wordcount.py#L159>). >>> Conversely, google-cloud-datastore hides this implementation detail (query >>> API >>> <https://googleapis.github.io/google-cloud-python/latest/datastore/queries.html>). >>> Since Beam API has to change the data types it accepts, it forces users to >>> change their code. This makes the migration to google-cloud-datastore >>> non-trivial. >>> Proposal >>> >>> This proposal includes a period in which two Beam APIs are available to >>> access Datastore. >>> >>> >>> - >>> >>> Add a new PTransforms that use google-cloud-datastore and mark as >>> deprecated the existing API (ReadFromDatastore, WriteToDatastore, >>> DeleteFromDatastore). >>> - >>> >>> Implement apache_beam/io/datastore.py using google-cloud-datastore, >>> taking care to not expose Datastore client internals. >>> - >>> >>> (optional) Remove googledatastore from GCP_REQUIREMENTS >>> >>> <https://github.com/apache/beam/blob/79049b02949affe5aa2390dec9b890a04e1fde89/sdks/python/setup.py#L139> >>> package list, and add it to a separate list, e.g., pip install >>> apache-beam[gcp,googledatastore]. >>> >>> >> I would like to avoid defining new sets of extra packages. Assuming that >> these two packages are not incompatible together, we could keep them both >> in [gcp]. >> > > I think we might need this since googleclouddatastore package (1) does not > seems to be getting upgraded (2) depends on older versions of packages (for > example, httplib2). > > This conflicts with more recent releases of other tools (for example, > gsutil). > This is fine, if it is the only viable option. But note that it is also a breaking change in the way people install beam in order to use old datastore APIs. > > >> >> >>> >>> - >>> >>> Remove googledatastore-based API from Beam after 2 releases. >>> >>> >> The removal needs to wait until next major version by default. Unless, we >> have a way of asking our users and ensuring that nobody is really using the >> existing API. Removing a current API in 2 releases (~3 months period) will >> hurt some users. >> > +1 > >> >> >>
