[ https://issues.apache.org/jira/browse/BEAM-8960?focusedWorklogId=362790&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-362790 ]
ASF GitHub Bot logged work on BEAM-8960:
----------------------------------------
Author: ASF GitHub Bot
Created on: 23/Dec/19 23:35
Start Date: 23/Dec/19 23:35
Worklog Time Spent: 10m
Work Description: chamikaramj commented on pull request #10427:
[BEAM-8960]: Add an option for user to opt out of using insert id for BigQuery
streaming insert.
URL: https://github.com/apache/beam/pull/10427#discussion_r361026332
##########
File path: sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java
##########
@@ -2241,6 +2246,14 @@ static String getExtractDestinationUri(String extractDestinationDir) {
return toBuilder().setIgnoreUnknownValues(true).build();
}
+ /**
+ * Performs streaming insert without insert id. Insert id is used to offer best effort insert
+ * deduplication. Default is false, which always inserts with insert id.
+ */
+ public Write<T> ignoreInsertIds() {
Review comment:
What guarantees can we provide regarding duplicate inserts when this option
is set? My understanding is that it prevents BQ from deduplicating data at
all, so users may start observing duplicate rows in the output table when
there are work item failures and retries. Even though insert-ID-based
deduplication is best-effort, I think that, in practice, it is enough for
most users.
I think this option should at least come with a warning that setting it
completely disables data deduplication when inserting records into BigQuery,
and that users should apply their own post-insertion deduplication
mechanisms if needed.
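To make the concern concrete, here is a minimal, idealized sketch of what a
retry does with and without insert ids. It is not Beam or BigQuery code: the
class InsertIdDedupSketch, the Row type, and all method names are
hypothetical, the in-memory "table" deduplicates perfectly (real insert id
deduplication is only best-effort), and the cleanup step assumes the payload
is itself a unique key.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical, idealized model; none of these names come from Beam or BigQuery.
class InsertIdDedupSketch {

    /** A row plus the insert id sent with it; insertId is null when ids are ignored. */
    static final class Row {
        final String insertId;
        final String payload;

        Row(String insertId, String payload) {
            this.insertId = insertId;
            this.payload = payload;
        }
    }

    /** Builds a two-row bundle that is sent twice, as after a work item retry. */
    static List<Row> retriedBundle(boolean withInsertIds) {
        List<Row> sent = new ArrayList<>();
        for (int attempt = 0; attempt < 2; attempt++) {
            for (String payload : new String[] {"a", "b"}) {
                // The same row keeps the same insert id across attempts.
                sent.add(new Row(withInsertIds ? "id-" + payload : null, payload));
            }
        }
        return sent;
    }

    /** Simulated table: skips a row whose non-null insert id was already seen. */
    static List<String> insertAll(List<Row> rows) {
        Set<String> seenIds = new HashSet<>();
        List<String> table = new ArrayList<>();
        for (Row row : rows) {
            boolean duplicate = row.insertId != null && !seenIds.add(row.insertId);
            if (!duplicate) {
                table.add(row.payload);
            }
        }
        return table;
    }

    /** Post-insertion cleanup, assuming the payload itself is a unique business key. */
    static List<String> dedupeAfterInsert(List<String> table) {
        return new ArrayList<>(new LinkedHashSet<>(table));
    }

    public static void main(String[] args) {
        List<String> withIds = insertAll(retriedBundle(true));
        List<String> withoutIds = insertAll(retriedBundle(false));
        System.out.println("with insert ids:    " + withIds);
        System.out.println("without insert ids: " + withoutIds);
        System.out.println("after cleanup:      " + dedupeAfterInsert(withoutIds));
    }
}
```

In this model the retried bundle lands once when ids are attached and twice
when they are ignored, which is the case a user-side cleanup keyed on some
unique column would have to handle.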
cc: @reuvenlax
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
Issue Time Tracking
-------------------
Worklog Id: (was: 362790)
Remaining Estimate: 23h (was: 23h 10m)
Time Spent: 1h (was: 50m)
> Add an option for user to be able to opt out of using insert id for BigQuery
> streaming insert.
> ----------------------------------------------------------------------------------------------
>
> Key: BEAM-8960
> URL: https://issues.apache.org/jira/browse/BEAM-8960
> Project: Beam
> Issue Type: New Feature
> Components: io-java-gcp
> Reporter: Yiru Tang
> Priority: Minor
> Original Estimate: 24h
> Time Spent: 1h
> Remaining Estimate: 23h
>
> BigQuery streaming insert ids offer best-effort insert deduplication. If users
> choose to opt out of using insert ids, they could potentially be opted into
> our new streaming backend, which offers higher speed and more quota. Insert id
> deduplication is best effort and does not provide exactly-once guarantees.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)