[ https://issues.apache.org/jira/browse/BEAM-12206?focusedWorklogId=588095&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-588095 ]

ASF GitHub Bot logged work on BEAM-12206:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 23/Apr/21 19:26
            Start Date: 23/Apr/21 19:26
    Worklog Time Spent: 10m 
      Work Description: nehsyc commented on a change in pull request #14615:
URL: https://github.com/apache/beam/pull/14615#discussion_r619449668



##########
File path: website/www/site/content/en/documentation/io/built-in/google-bigquery.md
##########
@@ -602,59 +602,108 @@ as the previous example.
 BigQueryIO supports two methods of inserting data into BigQuery: load jobs and
 streaming inserts. Each insertion method provides different tradeoffs of cost,
 quota, and data consistency. See the BigQuery documentation for
-[load jobs](https://cloud.google.com/bigquery/loading-data) and
-[streaming inserts](https://cloud.google.com/bigquery/streaming-data-into-bigquery)
+[different data ingestion options](https://cloud.google.com/bigquery/loading-data)
+(specifically, [load jobs](https://cloud.google.com/bigquery/docs/batch-loading-data)
+and [streaming inserts](https://cloud.google.com/bigquery/streaming-data-into-bigquery))
 for more information about these tradeoffs.
 
+{{< paragraph class="language-java" >}}
 BigQueryIO chooses a default insertion method based on the input `PCollection`.
+You can use `withMethod` to specify the desired insertion method. See
+[`Write.Method`](https://beam.apache.org/releases/javadoc/{{< param release_latest >}}/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.Write.Method.html)
+for the list of the available methods and their restrictions.
+{{< /paragraph >}}
 
 {{< paragraph class="language-py" >}}
-BigQueryIO uses load jobs when you apply a BigQueryIO write transform to a
-bounded `PCollection`.
+BigQueryIO chooses a default insertion method based on the input `PCollection`.
+You can use `method` to specify the desired insertion method. See
+[`WriteToBigQuery`](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.io.gcp.bigquery.html#apache_beam.io.gcp.bigquery.WriteToBigQuery)
+for the list of the available methods and their restrictions.
 {{< /paragraph >}}
 
-{{< paragraph class="language-java" >}}
 BigQueryIO uses load jobs in the following situations:
-{{< /paragraph >}}
 
 {{< paragraph class="language-java" wrap="span" >}}
 * When you apply a BigQueryIO write transform to a bounded `PCollection`.
-* When you apply a BigQueryIO write transform to an unbounded `PCollection` and
-  use `BigQueryIO.write().withTriggeringFrequency()` to set the triggering
-  frequency.
 * When you specify load jobs as the insertion method using
   `BigQueryIO.write().withMethod(FILE_LOADS)`.
 {{< /paragraph >}}
 
-{{< paragraph class="language-py" >}}
-BigQueryIO uses streaming inserts when you apply a BigQueryIO write transform to
-an unbounded `PCollection`.
+{{< paragraph class="language-py" wrap="span" >}}
+* When you apply a BigQueryIO write transform to a bounded `PCollection`.
+* When you specify load jobs as the insertion method using
+  `WriteToBigQuery(method='FILE_LOADS')`.
 {{< /paragraph >}}
 
+***Note:*** If you use batch loads in a streaming pipeline:
+
 {{< paragraph class="language-java" >}}
-BigQueryIO uses streaming inserts in the following situations:
+You must use `withTriggeringFrequency` to specify a triggering frequency for
+initiating load jobs. Be careful about setting the frequency such that your
+pipeline doesn't exceed the BigQuery load job [quota limit](https://cloud.google.com/bigquery/quotas#load_jobs).
+{{< /paragraph >}}
+
+{{< paragraph class="language-java" >}}
+You can either use `withNumFileShards` to explicitly set the number of file
+shards written, or use `withAutoSharding` to enable dynamic sharding (starting
+2.29.0 release) and the number of shards may be determined and changed at

Review comment:
       Does it make sense to put the release version here?

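For reference, the options discussed in this hunk compose as in the sketch below. This is a hedged illustration, not text from the PR: `weatherData` is a hypothetical unbounded `PCollection<TableRow>`, and the table name and frequency are placeholders.

```java
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.joda.time.Duration;

// Batch loads from a streaming pipeline: a triggering frequency is
// required, and sharding can be fixed or runner-determined.
weatherData.apply(
    BigQueryIO.writeTableRows()
        .to("my-project:my_dataset.my_table")
        .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
        // Keep this infrequent enough to stay under the
        // BigQuery load-job quota.
        .withTriggeringFrequency(Duration.standardMinutes(5))
        // Alternative to withNumFileShards(n): let the runner
        // determine and adjust the shard count (per the hunk,
        // available starting with the 2.29.0 release).
        .withAutoSharding());
```

Only one of `withNumFileShards` and `withAutoSharding` would be used in practice; the sketch shows the auto-sharding variant the reviewer's question is about.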
##########
File path: website/www/site/content/en/documentation/io/built-in/google-bigquery.md
##########
@@ -602,59 +602,108 @@ as the previous example.
 BigQueryIO supports two methods of inserting data into BigQuery: load jobs and
 streaming inserts. Each insertion method provides different tradeoffs of cost,
 quota, and data consistency. See the BigQuery documentation for
-[load jobs](https://cloud.google.com/bigquery/loading-data) and
-[streaming inserts](https://cloud.google.com/bigquery/streaming-data-into-bigquery)
+[different data ingestion options](https://cloud.google.com/bigquery/loading-data)
+(specifically, [load jobs](https://cloud.google.com/bigquery/docs/batch-loading-data)
+and [streaming inserts](https://cloud.google.com/bigquery/streaming-data-into-bigquery))
 for more information about these tradeoffs.
 
+{{< paragraph class="language-java" >}}
 BigQueryIO chooses a default insertion method based on the input `PCollection`.
+You can use `withMethod` to specify the desired insertion method. See
+[`Write.Method`](https://beam.apache.org/releases/javadoc/{{< param release_latest >}}/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.Write.Method.html)
+for the list of the available methods and their restrictions.
+{{< /paragraph >}}
 
 {{< paragraph class="language-py" >}}
-BigQueryIO uses load jobs when you apply a BigQueryIO write transform to a
-bounded `PCollection`.
+BigQueryIO chooses a default insertion method based on the input `PCollection`.
+You can use `method` to specify the desired insertion method. See
+[`WriteToBigQuery`](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.io.gcp.bigquery.html#apache_beam.io.gcp.bigquery.WriteToBigQuery)
+for the list of the available methods and their restrictions.
 {{< /paragraph >}}
 
-{{< paragraph class="language-java" >}}
 BigQueryIO uses load jobs in the following situations:
-{{< /paragraph >}}
 
 {{< paragraph class="language-java" wrap="span" >}}
 * When you apply a BigQueryIO write transform to a bounded `PCollection`.
-* When you apply a BigQueryIO write transform to an unbounded `PCollection` and

Review comment:
       This doesn't seem to be the case based on my reading of the code. Could you help double check?
   
   
https://github.com/apache/beam/blob/47cfbcb63f4d0642d26106485bc6fdb894da3086/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java#L2474
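As context for this comment, the bullet being deleted claimed that setting a triggering frequency on an unbounded `PCollection` by itself selects load jobs. The sketch below illustrates the behavior the reviewer points at, as a hedged assumption based on the comment rather than a verified reading of the linked code: `unboundedRows` is a hypothetical unbounded `PCollection<TableRow>`, and load jobs are only used when opted into explicitly.

```java
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.joda.time.Duration;

// With an unbounded input and no withMethod(...) call, the default
// insertion method is not load jobs; a triggering frequency alone
// does not change that. Load jobs require an explicit opt-in:
unboundedRows.apply(
    BigQueryIO.writeTableRows()
        .to("my-project:my_dataset.my_table")
        .withMethod(BigQueryIO.Write.Method.FILE_LOADS)  // explicit opt-in
        .withTriggeringFrequency(Duration.standardMinutes(10)));
```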




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Issue Time Tracking
-------------------

    Worklog Id:     (was: 588095)
    Time Spent: 40m  (was: 0.5h)

> Update Beam BigQuery sink documentation with quota info and runner determined 
> sharding
> --------------------------------------------------------------------------------------
>
>                 Key: BEAM-12206
>                 URL: https://issues.apache.org/jira/browse/BEAM-12206
>             Project: Beam
>          Issue Type: Improvement
>          Components: website
>            Reporter: Siyuan Chen
>            Assignee: Siyuan Chen
>            Priority: P2
>          Time Spent: 40m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)
