[jira] [Work logged] (BEAM-10917) Implement a BigQuery bounded source using the BigQuery storage API

ASF GitHub Bot (Jira) Thu, 29 Jul 2021 12:42:04 -0700


     [ 
https://issues.apache.org/jira/browse/BEAM-10917?focusedWorklogId=631303&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-631303
 ]


ASF GitHub Bot logged work on BEAM-10917:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 29/Jul/21 19:41
            Start Date: 29/Jul/21 19:41
    Worklog Time Spent: 10m 
      Work Description: kmjung commented on a change in pull request #15185:
URL: https://github.com/apache/beam/pull/15185#discussion_r679386882



##########
File path: sdks/python/apache_beam/io/gcp/bigquery.py
##########
@@ -883,6 +890,206 @@ def _export_files(self, bq):
     return table.schema, metadata_list
 
 
+class _CustomBigQueryStorageSourceBase(BoundedSource):
+  """A base class for BoundedSource implementations which read from BigQuery
+  using the BigQuery Storage API."""
+  def __init__(
+      self,
+      project=None,
+      dataset=None,
+      table=None,
+      selected_fields=None,
+      row_restriction=None,
+      pipeline_options=None):
+
+    if dataset is None:
+      raise ValueError('A BigQuery dataset must be specified.')
+    elif table is None:
+      raise ValueError('A BigQuery table must be specified.')
+    else:
+      self.table_reference = bigquery_tools.parse_table_reference(
+          table, dataset, project)
+
+    self.project = self.table_reference.projectId
+    self.dataset = self.table_reference.datasetId
+    self.table = self.table_reference.tableId
+    self.selected_fields = selected_fields
+    self.row_restriction = row_restriction
+    self.pipeline_options = pipeline_options
+    self.split_result = None
+    # The maximum number of streams which will be requested when creating a 
read
+    # session, regardless of the desired bundle size.
+    self.MAX_SPLIT_COUNT = 10000
+    # The minimum number of streams which will be requested when creating a 
read
+    # session, regardless of the desired bundle size. Note that the server may
+    # still choose to return fewer than ten streams based on the layout of the
+    # table.
+    self.MIN_SPLIT_COUNT = 10
+
+  def _get_project(self):
+    """Returns the project that will be billed."""
+    project = self.pipeline_options.view_as(GoogleCloudOptions).project
+    if isinstance(project, vp.ValueProvider):
+      project = project.get()
+    if not project:
+      project = self.project
+    return project
+
+  def _get_table_size(self, table, dataset, project):
+    if project is None:
+      project = self._get_project()
+
+    bq = bigquery_tools.BigQueryWrapper()
+    table = bq.get_table(project, dataset, table)
+    return table.numBytes
+
+  def display_data(self):
+    return {
+        'project': str(self.project),
+        'dataset': str(self.dataset),
+        'table': str(self.table)
+    }
+
+  def estimate_size(self):
+    # The size of stream source cannot be estimate due to server-side liquid
+    # sharding
+    return None
+
+  def split(self, desired_bundle_size, start_position=None, 
stop_position=None):
+    requested_session = bq_storage.types.ReadSession()
+    requested_session.table = 'projects/{}/datasets/{}/tables/{}'.format(
+        self.project, self.dataset, self.table)
+    requested_session.data_format = bq_storage.types.DataFormat.AVRO
+
+    if (self.selected_fields is not None or self.row_restriction is not None):
+      table_read_options = requested_session.TableReadOptions()
+      if self.selected_fields is not None:
+        table_read_options.selected_fields = self.selected_fields
+      if self.row_restriction is not None:
+        table_read_options.row_restriction = self.row_restriction
+      requested_session.read_options = table_read_options
+
+    storage_client = bq_storage.BigQueryReadClient()

Review comment:
       Nit: is there any cost to creating multiple client objects? Is it 
possible to have a transient object with respect to 
serialization/deserialization?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Issue Time Tracking
-------------------

    Worklog Id:     (was: 631303)
    Time Spent: 1h 40m  (was: 1.5h)

> Implement a BigQuery bounded source using the BigQuery storage API
> ------------------------------------------------------------------
>
>                 Key: BEAM-10917
>                 URL: https://issues.apache.org/jira/browse/BEAM-10917
>             Project: Beam
>          Issue Type: New Feature
>          Components: io-py-gcp
>            Reporter: Kenneth Jung
>            Priority: P3
>          Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> The Java SDK contains a bounded source implementation which uses the BigQuery 
> storage API to read from BigQuery. We should implement the same for Python.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Work logged] (BEAM-10917) Implement a BigQuery bounded source using the BigQuery storage API

Reply via email to