vachan-shetty commented on a change in pull request #15185:
URL: https://github.com/apache/beam/pull/15185#discussion_r685572461
##########
File path: sdks/python/apache_beam/io/gcp/bigquery.py
##########
@@ -883,6 +893,274 @@ def _export_files(self, bq):
return table.schema, metadata_list
+class _CustomBigQueryStorageSourceBase(BoundedSource):
+ """A base class for BoundedSource implementations which read from BigQuery
+ using the BigQuery Storage API.
+
+ Args:
+ table (str, TableReference): The ID of the table. The ID must contain only
    letters ``a-z``, ``A-Z``, numbers ``0-9``, or underscores ``_``. If
+ **dataset** argument is :data:`None` then the table argument must
+ contain the entire table reference specified as:
+ ``'PROJECT:DATASET.TABLE'`` or must specify a TableReference.
+ dataset (str): Optional ID of the dataset containing this table or
+ :data:`None` if the table argument specifies a TableReference.
+ project (str): Optional ID of the project containing this table or
+ :data:`None` if the table argument specifies a TableReference.
+ selected_fields (List[str]): Optional List of names of the fields in the
+ table that should be read. If empty, all fields will be read. If the
+      specified field is a nested field, all the sub-fields in the field
+      will be selected. The output field order is unrelated to the order of
+      fields in selected_fields.
+ row_restriction (str): Optional SQL text filtering statement, similar to a
+ WHERE clause in a query. Aggregates are not supported. Restricted to a
+      maximum length of 1 MB.
+ """
+
+ # The maximum number of streams which will be requested when creating a read
+ # session, regardless of the desired bundle size.
+ MAX_SPLIT_COUNT = 10000
+ # The minimum number of streams which will be requested when creating a read
+ # session, regardless of the desired bundle size. Note that the server may
+ # still choose to return fewer than ten streams based on the layout of the
+ # table.
+ MIN_SPLIT_COUNT = 10
+
+ def __init__(
+ self,
+ table: Union[str, TableReference],
+ dataset: Optional[str] = None,
+ project: Optional[str] = None,
+ selected_fields: Optional[List[str]] = None,
+ row_restriction: Optional[str] = None,
+ use_fastavro: Optional[bool] = None,
Review comment:
Updated `estimate_size()`, which should improve autoscaling in [theory].
There should be further improvements in autoscaling once we add support for
dynamic work rebalancing. Additionally, I intend to remove the non-fastavro
parsing path in a follow-up PR where the Storage API source and the core
`avroio` source are combined.
[theory]:
https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline#batch-autoscaling
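
To make the autoscaling point concrete, here is a minimal sketch of the shape of that change. This is illustrative only, not the code in this PR: the class name, the use of the public `google-cloud-bigquery` client (the real source goes through Beam's internal BigQuery wrapper), and the project/dataset/table handling are all stand-ins. The idea is that `estimate_size()` reports the table's byte size so the runner can pick an initial degree of parallelism, and the split-count constants above bound how many streams we ask the Storage API for.

```python
from google.cloud import bigquery  # illustration only; not the client used inside Beam


class StorageSourceSketch(object):
  """Hypothetical stand-in for the Storage API source, for discussion only."""

  MIN_SPLIT_COUNT = 10
  MAX_SPLIT_COUNT = 10000

  def __init__(self, project, dataset, table):
    self._project = project
    self._dataset = dataset
    self._table = table

  def estimate_size(self):
    # Report the table's on-disk size; the runner uses this estimate when
    # choosing the initial number of workers for a batch pipeline.
    client = bigquery.Client(project=self._project)
    table = client.get_table(
        '%s.%s.%s' % (self._project, self._dataset, self._table))
    return table.num_bytes

  def desired_stream_count(self, desired_bundle_size):
    # Ask for roughly one stream per desired bundle, clamped to the bounds
    # above; the service may still return fewer streams than requested.
    streams = self.estimate_size() // max(desired_bundle_size, 1)
    return min(max(streams, self.MIN_SPLIT_COUNT), self.MAX_SPLIT_COUNT)
```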
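
For reviewers less familiar with the Storage API, this is roughly where the `selected_fields` / `row_restriction` constructor arguments end up: they become the `TableReadOptions` on the read session the source creates. Again just a sketch against the public `google-cloud-bigquery-storage` client; the project, dataset, table, and field names below are made up.

```python
from google.cloud.bigquery_storage_v1 import BigQueryReadClient, types

client = BigQueryReadClient()

requested_session = types.ReadSession(
    # Fully-qualified table path; values here are placeholders.
    table='projects/my-project/datasets/my_dataset/tables/my_table',
    data_format=types.DataFormat.AVRO,
    read_options=types.ReadSession.TableReadOptions(
        # Only these columns are returned; output order is not guaranteed.
        selected_fields=['user_id', 'event_timestamp'],
        # Pushed-down filter, like a WHERE clause (no aggregates, max 1 MB).
        row_restriction='user_id IS NOT NULL',
    ),
)

session = client.create_read_session(
    parent='projects/my-project',
    read_session=requested_session,
    # The service may return fewer streams than requested.
    max_stream_count=10,
)

for stream in session.streams:
  # Each stream can be handed to a separate worker/bundle.
  # Decoding AVRO rows requires fastavro to be installed.
  for row in client.read_rows(stream.name).rows(session):
    print(row)
```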