[ 
https://issues.apache.org/jira/browse/BEAM-8960?focusedWorklogId=362790&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-362790
 ]

ASF GitHub Bot logged work on BEAM-8960:
----------------------------------------

                Author: ASF GitHub Bot
            Created on: 23/Dec/19 23:35
            Start Date: 23/Dec/19 23:35
    Worklog Time Spent: 10m 
      Work Description: chamikaramj commented on pull request #10427: 
[BEAM-8960]: Add an option for user to opt out of using insert id for BigQuery 
streaming insert.
URL: https://github.com/apache/beam/pull/10427#discussion_r361026332
 
 

 ##########
 File path: 
sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java
 ##########
 @@ -2241,6 +2246,14 @@ static String getExtractDestinationUri(String extractDestinationDir) {
       return toBuilder().setIgnoreUnknownValues(true).build();
     }
 
+    /**
+     * Performs streaming insert without insert id. Insert id is used to offer best effort insert
+     * deduplication. Default is false, which always inserts with insert id.
+     */
+    public Write<T> ignoreInsertIds() {
 
 Review comment:
   What kind of guarantees can we provide regarding duplicate data inserts when 
this option is set? My understanding is that this will prevent BQ from 
deduplicating data at all, so users may start observing duplicate data in the 
output table when there are work item failures and retries. Even though insert 
ID based data deduplication is best-effort, I think, in practice, that is 
enough for most users. 
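
   To make the retry concern concrete, here is a minimal, self-contained 
sketch. It is a toy model, not the BigQuery or Beam API: `FakeTable`, 
`rowsAfterRetry`, and the `id-N` insert ids are hypothetical names made up for 
illustration, and the toy deduplicates exactly, whereas the real service only 
does so on a best-effort basis.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model of a streaming sink; not the BigQuery or Beam API. */
class InsertIdRetryDemo {

  /** In-memory stand-in for a table that optionally deduplicates on insert id. */
  static final class FakeTable {
    private final boolean dedupOnInsertId;
    private final Set<String> seenInsertIds = new HashSet<>();
    final List<String> rows = new ArrayList<>();

    FakeTable(boolean dedupOnInsertId) {
      this.dedupOnInsertId = dedupOnInsertId;
    }

    void insert(String insertId, String row) {
      // Set.add returns false when the id was already present.
      if (dedupOnInsertId && !seenInsertIds.add(insertId)) {
        return; // replay of an earlier attempt, dropped
      }
      rows.add(row);
    }
  }

  /** Writes a bundle of three rows, then replays it once to mimic a work item retry. */
  static int rowsAfterRetry(boolean useInsertIds) {
    FakeTable table = new FakeTable(useInsertIds);
    String[] bundle = {"a", "b", "c"};
    for (int attempt = 0; attempt < 2; attempt++) {
      for (int i = 0; i < bundle.length; i++) {
        // The same row gets the same (hypothetical) insert id on every attempt.
        table.insert("id-" + i, bundle[i]);
      }
    }
    return table.rows.size();
  }

  public static void main(String[] args) {
    System.out.println("with insert ids:    " + rowsAfterRetry(true));  // 3
    System.out.println("without insert ids: " + rowsAfterRetry(false)); // 6
  }
}
```

   In this model the replayed bundle is absorbed when insert ids are used and 
doubles the row count when they are not, which is the behavior users would 
need to be warned about.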
   
   I think this option should at least come with a warning saying that 
setting it completely disables data deduplication when inserting 
records into BigQuery, and that users should use custom post-insertion data 
deduplication mechanisms if needed. 
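
   As a rough illustration of what a post-insertion deduplication step could 
look like, the sketch below keeps the first row seen for each key. It assumes 
every row carries a user-defined unique key (here a hypothetical `eventId` 
prefix in a `"eventId:payload"` string); `dedupByKey` is an illustrative helper, 
not part of Beam, and in practice this would more likely be a query run 
against the table.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Illustrative post-insertion deduplication; not part of Beam or BigQuery. */
class PostInsertDedupDemo {

  /** Keeps the first row for each key, preserving the original order. */
  static <R, K> List<R> dedupByKey(List<R> rows, Function<R, K> keyFn) {
    Map<K, R> firstByKey = new LinkedHashMap<>();
    for (R row : rows) {
      // putIfAbsent leaves the earlier row in place when the key repeats.
      firstByKey.putIfAbsent(keyFn.apply(row), row);
    }
    return new ArrayList<>(firstByKey.values());
  }

  public static void main(String[] args) {
    // "e1:click" appears twice, as it might after a retried insert.
    List<String> rows = Arrays.asList("e1:click", "e2:view", "e1:click");
    List<String> unique = dedupByKey(rows, r -> r.split(":")[0]);
    System.out.println(unique); // [e1:click, e2:view]
  }
}
```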
   
   cc: @reuvenlax 
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


Issue Time Tracking
-------------------

            Worklog Id:     (was: 362790)
    Remaining Estimate: 23h  (was: 23h 10m)
            Time Spent: 1h  (was: 50m)

> Add an option for user to be able to opt out of using insert id for BigQuery 
> streaming insert.
> ----------------------------------------------------------------------------------------------
>
>                 Key: BEAM-8960
>                 URL: https://issues.apache.org/jira/browse/BEAM-8960
>             Project: Beam
>          Issue Type: New Feature
>          Components: io-java-gcp
>            Reporter: Yiru Tang
>            Priority: Minor
>   Original Estimate: 24h
>          Time Spent: 1h
>  Remaining Estimate: 23h
>
> BigQuery streaming insert ids offer best-effort insert deduplication. If users 
> choose to opt out of using insert ids, they could potentially be opted into 
> our new streaming backend, which gives higher speed and more 
> quota. Insert id deduplication is best effort and does not provide an 
> exactly-once guarantee.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
