[
https://issues.apache.org/jira/browse/BEAM-11408?focusedWorklogId=556653&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-556653
]
ASF GitHub Bot logged work on BEAM-11408:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 24/Feb/21 00:42
Start Date: 24/Feb/21 00:42
Worklog Time Spent: 10m
Work Description: pabloem commented on a change in pull request #14033:
URL: https://github.com/apache/beam/pull/14033#discussion_r581499041
##########
File path: sdks/python/apache_beam/io/gcp/bigquery.py
##########
@@ -1467,7 +1511,8 @@ def __init__(
triggering_frequency=None,
validate=True,
temp_file_format=None,
- ignore_insert_ids=False):
+ ignore_insert_ids=False,
+ with_auto_sharding=False):
Review comment:
Let's have a JIRA issue and a `TODO(BEAM-XXXX)` comment to track making this
the default? :)
##########
File path: sdks/python/apache_beam/io/gcp/bigquery.py
##########
@@ -1403,27 +1417,57 @@ def expand(self, input):
retry_strategy=self.retry_strategy,
test_client=self.test_client,
additional_bq_parameters=self.additional_bq_parameters,
- ignore_insert_ids=self.ignore_insert_ids)
+ ignore_insert_ids=self.ignore_insert_ids,
+ with_batched_input=(
+ not self.ignore_insert_ids and self.with_auto_sharding))
Review comment:
How come `ignore_insert_ids` affects this parameter?
##########
File path: sdks/python/apache_beam/io/gcp/bigquery.py
##########
@@ -1759,6 +1814,8 @@ def serialize(side_inputs):
'triggering_frequency': self.triggering_frequency,
'validate': self._validate,
'temp_file_format': self._temp_file_format,
+ 'ignore_insert_ids': self._ignore_insert_ids,
Review comment:
ah thanks Siyuan!
##########
File path: sdks/python/apache_beam/io/gcp/bigquery.py
##########
@@ -1254,13 +1257,20 @@ def process(self, element, *schema_side_inputs):
destination = bigquery_tools.get_hashable_destination(destination)
- row_and_insert_id = element[1]
- self._rows_buffer[destination].append(row_and_insert_id)
- self._total_buffered_rows += 1
- if len(self._rows_buffer[destination]) >= self._max_batch_size:
+ if not self.with_batched_input:
+ row_and_insert_id = element[1]
+ self._rows_buffer[destination].append(row_and_insert_id)
+ self._total_buffered_rows += 1
+ if len(self._rows_buffer[destination]) >= self._max_batch_size:
+ return self._flush_batch(destination)
+ elif self._total_buffered_rows >= self._max_buffered_rows:
+ return self._flush_all_batches()
+ else:
+        # The input is already batched per destination; flush the rows now.
+ batched_rows = element[1]
+ for row in batched_rows:
+ self._rows_buffer[destination].append(row)
Review comment:
```suggestion
self._rows_buffer[destination].extend(batched_rows)
```
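As an aside, the suggested `extend` is behaviorally equivalent to the per-row
`append` loop in the diff. A minimal stdlib sketch (the buffer and destination
names here are illustrative stand-ins, not Beam's actual internals):

```python
from collections import defaultdict

# Hypothetical stand-in for the DoFn's per-destination row buffer.
rows_buffer = defaultdict(list)

batched_rows = [{"id": 1}, {"id": 2}, {"id": 3}]

# Per-row loop, as in the original diff:
for row in batched_rows:
    rows_buffer["dest_a"].append(row)

# Single call, as in the suggestion -- same resulting buffer contents:
rows_buffer["dest_b"].extend(batched_rows)

assert rows_buffer["dest_a"] == rows_buffer["dest_b"]
```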
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
Issue Time Tracking
-------------------
Worklog Id: (was: 556653)
Time Spent: 11h 50m (was: 11h 40m)
> GCP BigQuery sink (streaming inserts) uses runner determined sharding
> ---------------------------------------------------------------------
>
> Key: BEAM-11408
> URL: https://issues.apache.org/jira/browse/BEAM-11408
> Project: Beam
> Issue Type: Improvement
> Components: io-java-gcp
> Reporter: Siyuan Chen
> Assignee: Siyuan Chen
> Priority: P1
> Fix For: 2.28.0
>
> Time Spent: 11h 50m
> Remaining Estimate: 0h
>
> Integrate the BigQuery sink with shardable `GroupIntoBatches` (BEAM-10475) to
> allow runner-determined dynamic sharding.
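The batching semantics the description refers to can be illustrated with a
plain-Python sketch: buffer values per key and emit a batch when a key's buffer
reaches the maximum size, flushing remainders at the end. This is only an
assumption-labeled illustration of the idea; Beam's actual `GroupIntoBatches`
additionally lets the runner choose the sharding dynamically.

```python
from collections import defaultdict

def group_into_batches(elements, max_batch_size):
    """Illustrative stdlib sketch of GroupIntoBatches-style batching:
    buffer (key, value) pairs per key, emit a full batch as soon as a
    key's buffer reaches max_batch_size, and flush leftovers at the end.
    (Not Beam's implementation.)"""
    buffers = defaultdict(list)
    for key, value in elements:
        buffers[key].append(value)
        if len(buffers[key]) >= max_batch_size:
            yield key, buffers.pop(key)
    # Flush any partially filled buffers.
    for key, values in buffers.items():
        yield key, values

rows = [("table_a", 1), ("table_a", 2), ("table_b", 3), ("table_a", 4)]
batches = list(group_into_batches(rows, max_batch_size=2))
# table_a flushes [1, 2] once it hits the batch size; the leftover
# [4] and [3] buffers flush when the input is exhausted.
```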
--
This message was sent by Atlassian Jira
(v8.3.4#803005)