[
https://issues.apache.org/jira/browse/BEAM-12865?focusedWorklogId=679851&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-679851
]
ASF GitHub Bot logged work on BEAM-12865:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 10/Nov/21 19:18
Start Date: 10/Nov/21 19:18
Worklog Time Spent: 10m
Work Description: quentin-sommer commented on a change in pull request
#15489:
URL: https://github.com/apache/beam/pull/15489#discussion_r746911914
##########
File path: sdks/python/apache_beam/io/gcp/bigquery.py
##########
@@ -2158,7 +2169,7 @@ def expand(self, pcoll):
schema=self.schema,
create_disposition=self.create_disposition,
write_disposition=self.write_disposition,
- triggering_frequency=self.triggering_frequency,
+ triggering_frequency=int(self.triggering_frequency),
Review comment:
This code is in the `BigQueryBatchFileLoads` class. The default value is
`None`, and the docs advise using at least 2 minutes to avoid reaching the
per-project quota on BigQuery load jobs. The code errors when
`triggering_frequency` is `None` in a streaming pipeline.
I think it should be an integer. `BigQueryBatchFileLoads` uses it like this:
https://github.com/apache/beam/blob/8da177f64d314cf72e89a51e51fb0915f706a784/sdks/python/apache_beam/io/gcp/bigquery_file_loads.py#L873-L874
The [Beam
reference](https://beam.apache.org/releases/pydoc/2.33.0/apache_beam.transforms.trigger.html?highlight=trigger#apache_beam.transforms.trigger.AfterProcessingTime)
states it is a delay in seconds, and I'm not sure what the implementation
does, so I'd rather stay on the safe side and keep integers.
I added logic to cast to int only when the value is not `None`.
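The None-guarded cast described above can be sketched as a standalone helper
(the helper name is hypothetical, for illustration only; the actual change in
the PR is inline in `bigquery.py`):

```python
def normalized_triggering_frequency(triggering_frequency):
    """Cast triggering_frequency to int only when it is set.

    `None` means no periodic triggering for the batch file-loads path;
    int(None) would raise a TypeError, so the guard skips the cast in
    that case. (Hypothetical helper, not a Beam API.)
    """
    if triggering_frequency is None:
        return None
    return int(triggering_frequency)
```

With this guard, streaming pipelines that never set a triggering frequency
keep passing `None` through unchanged, while float values are truncated to
whole seconds.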
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Issue Time Tracking
-------------------
Worklog Id: (was: 679851)
Time Spent: 7.5h (was: 7h 20m)
> Allow customising batch duration when streaming with WriteToBigQuery
> --------------------------------------------------------------------
>
> Key: BEAM-12865
> URL: https://issues.apache.org/jira/browse/BEAM-12865
> Project: Beam
> Issue Type: New Feature
> Components: io-py-gcp
> Affects Versions: Not applicable
> Reporter: Quentin Sommer
> Priority: P2
> Labels: stale-P2
> Fix For: Not applicable
>
> Time Spent: 7.5h
> Remaining Estimate: 0h
>
> Hi,
> We allow customising the {{batch_size}} when streaming to BigQuery, but the
> batch buffering duration (used by {{GroupIntoBatches}}) is fixed at
> {{DEFAULT_BATCH_BUFFERING_DURATION_LIMIT_SEC}} (0.2 seconds).
> I'd like to add the option to specify the {{batch_duration}} to allow better
> batching in scenarios with little data throughput.
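The size-or-duration trade-off described above can be sketched in plain
Python (illustrative only: `batch_with_duration` is a hypothetical name, not
a Beam API, and arrival times are passed in explicitly rather than measured;
in Beam, `GroupIntoBatches` applies this logic inside the runner):

```python
def batch_with_duration(timed_elements, batch_size, max_buffering_sec):
    """Emit a batch when it reaches batch_size elements, or when the
    oldest buffered element has waited max_buffering_sec seconds.

    timed_elements: iterable of (arrival_time_sec, value) pairs,
    assumed sorted by arrival time.
    """
    batches = []
    current = []
    first_ts = None  # arrival time of the oldest buffered element
    for ts, value in timed_elements:
        # Duration trigger: flush if the buffer has waited long enough.
        if current and ts - first_ts >= max_buffering_sec:
            batches.append(current)
            current, first_ts = [], None
        if first_ts is None:
            first_ts = ts
        current.append(value)
        # Size trigger: flush a full batch immediately.
        if len(current) >= batch_size:
            batches.append(current)
            current, first_ts = [], None
    if current:
        batches.append(current)
    return batches
```

A longer buffering duration lets low-throughput streams accumulate fuller
batches instead of flushing tiny ones every 0.2 seconds, which is the point
of making it configurable.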
--
This message was sent by Atlassian Jira
(v8.20.1#820001)