[ 
https://issues.apache.org/jira/browse/BEAM-4543?focusedWorklogId=230333&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-230333
 ]

ASF GitHub Bot logged work on BEAM-4543:
----------------------------------------

                Author: ASF GitHub Bot
            Created on: 20/Apr/19 00:00
            Start Date: 20/Apr/19 00:00
    Worklog Time Spent: 10m 
      Work Description: chamikaramj commented on pull request #8262: 
[BEAM-4543] Python Datastore IO using google-cloud-datastore
URL: https://github.com/apache/beam/pull/8262#discussion_r276858285
 
 

 ##########
 File path: sdks/python/apache_beam/io/gcp/datastore/v1new/datastoreio.py
 ##########
 @@ -0,0 +1,477 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""
+A connector for reading from and writing to Google Cloud Datastore.
+
+Uses the newer google-cloud-datastore pip dependency.
+
+For Datastore entities, does not support the property value "meaning" field.
+
+This module is experimental, no backwards compatibility guarantees.
+"""
+from __future__ import absolute_import
+from __future__ import division
+
+import logging
+import time
+from builtins import object
+from builtins import round
+
+from apache_beam import typehints
+from apache_beam.io.gcp.datastore.v1 import util
+from apache_beam.io.gcp.datastore.v1.adaptive_throttler import AdaptiveThrottler
+from apache_beam.io.gcp.datastore.v1new import helper
+from apache_beam.io.gcp.datastore.v1new import query_splitter
+from apache_beam.io.gcp.datastore.v1new import types
+from apache_beam.metrics.metric import Metrics
+from apache_beam.transforms import Create
+from apache_beam.transforms import DoFn
+from apache_beam.transforms import ParDo
+from apache_beam.transforms import PTransform
+from apache_beam.transforms import Reshuffle
+
+__all__ = ['QueryDatastore', 'WriteToDatastore', 'DeleteFromDatastore']
+
+
+@typehints.with_output_types(types.Entity)
+class QueryDatastore(PTransform):
+  """A ``PTransform`` for querying Google Cloud Datastore.
+
+  To read a ``PCollection[Entity]`` from a Cloud Datastore ``Query``, use
+  ``QueryDatastore`` transform by providing a `query` to
+  read from. The project and optional namespace are set in the query.
+  The query will be split into multiple queries to allow for parallelism. The
+  degree of parallelism is automatically determined, but can be overridden by
+  setting `num_splits` to a value of 1 or greater.
+
+  Note: Normally, a runner will read from Cloud Datastore in parallel across
+  many workers. However, when the `query` is configured with a `limit` or if the
+  query contains inequality filters like `GREATER_THAN, LESS_THAN` etc., then
+  all the returned results will be read by a single worker in order to ensure
+  correct data. Since data is read from a single worker, this could have
+  significant impact on the performance of the job.
 
 Review comment:
   Please also mention that adding a Reshuffle to prevent fusion with
   succeeding steps might be beneficial for parallelizing the computation
   (with a reference to the Reshuffle transform).
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


Issue Time Tracking
-------------------

    Worklog Id:     (was: 230333)

> Remove dependency on googledatastore in favor of google-cloud-datastore.
> ------------------------------------------------------------------------
>
>                 Key: BEAM-4543
>                 URL: https://issues.apache.org/jira/browse/BEAM-4543
>             Project: Beam
>          Issue Type: Improvement
>          Components: sdk-py-core
>            Reporter: Valentyn Tymofieiev
>            Assignee: Udi Meiri
>            Priority: Minor
>              Labels: triaged
>          Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> apache-beam[gcp] package depends [1] on googledatastore package [2]. We 
> should replace this dependency with google-cloud-datastore [3] which is 
> officially supported, has better release cadence and also has Python 3 
> support.
> [1] 
> https://github.com/apache/beam/blob/fad655462f8fadfdfaab0b7a09cab538f076f94e/sdks/python/setup.py#L126
> [2] [https://pypi.org/project/googledatastore/]
> [3] [https://pypi.org/project/google-cloud-datastore/]
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
