[
https://issues.apache.org/jira/browse/BEAM-8367?focusedWorklogId=328864&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-328864
]
ASF GitHub Bot logged work on BEAM-8367:
----------------------------------------
Author: ASF GitHub Bot
Created on: 16/Oct/19 00:01
Start Date: 16/Oct/19 00:01
Worklog Time Spent: 10m
Work Description: chamikaramj commented on pull request #9797:
[BEAM-8367] Using insertId for BQ streaming inserts
URL: https://github.com/apache/beam/pull/9797#discussion_r335228241
##########
File path: sdks/python/apache_beam/io/gcp/bigquery_test.py
##########
@@ -425,8 +425,8 @@ def test_dofn_client_process_flush_called(self):
test_client=client)
fn.start_bundle()
- fn.process(('project_id:dataset_id.table_id', {'month': 1}))
- fn.process(('project_id:dataset_id.table_id', {'month': 2}))
+ fn.process(('project_id:dataset_id.table_id', ({'month': 1}, 'insertid1')))
Review comment:
Can we add a test that explicitly checks that insert IDs get added as
expected? (These tests also check for other things.)
Also, optionally, it would be good if we can somehow add a unit test that
induces a failure in a direct-runner-based job (in a step before the BQ sink)
to confirm that we do not regenerate the insert ID for the same element.
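A minimal sketch of the first requested test, assuming a mocked client and
hypothetical call details (the BigQueryWriteFn constructor arguments, the
mocked insert_rows method, and its row_ids keyword are assumptions for
illustration, not verified signatures from the Beam codebase):
{code:python}
import unittest

import mock

from apache_beam.io.gcp import bigquery as beam_bq


class InsertIdPassThroughTest(unittest.TestCase):

  def test_process_passes_supplied_insert_ids(self):
    # Hypothetical sketch: the constructor arguments and the mocked
    # insert_rows call below are assumptions for illustration.
    client = mock.Mock()
    fn = beam_bq.BigQueryWriteFn(batch_size=2, test_client=client)

    fn.start_bundle()
    fn.process(
        ('project_id:dataset_id.table_id', ({'month': 1}, 'insertid1')))
    fn.process(
        ('project_id:dataset_id.table_id', ({'month': 2}, 'insertid2')))
    fn.finish_bundle()

    # The IDs supplied with each element should reach the client
    # unchanged, never regenerated inside the DoFn.
    _, kwargs = client.insert_rows.call_args
    self.assertEqual(['insertid1', 'insertid2'], kwargs.get('row_ids'))
{code}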
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
Issue Time Tracking
-------------------
Worklog Id: (was: 328864)
Time Spent: 1h (was: 50m)
> Python BigQuery sink should use unique IDs for mode STREAMING_INSERTS
> ---------------------------------------------------------------------
>
> Key: BEAM-8367
> URL: https://issues.apache.org/jira/browse/BEAM-8367
> Project: Beam
> Issue Type: Bug
> Components: sdk-py-core
> Reporter: Chamikara Madhusanka Jayalath
> Assignee: Pablo Estrada
> Priority: Blocker
> Fix For: 2.17.0
>
> Time Spent: 1h
> Remaining Estimate: 0h
>
> Unique IDs ensure (on a best-effort basis) that writes to BigQuery are
> idempotent; for example, we don't write the same record twice after a VM
> failure.
>
> Currently, the Python BQ sink inserts BQ IDs here, but they'll be
> regenerated after a VM failure, resulting in data duplication.
> [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py#L766]
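> As a hedged illustration of the failure mode (the DoFn below is
> illustrative, not the actual sink code), any ID minted inside process() is
> minted again when the bundle is retried, so BigQuery's insertId-based
> deduplication never sees a repeated ID for the duplicated row:
> {code:python}
> import uuid
>
> import apache_beam as beam
>
>
> class GenerateIdAtWriteTime(beam.DoFn):
>   # Illustrative only: because the ID is minted when the bundle runs,
>   # a retried bundle yields a different ID for the same row, and the
>   # duplicate write slips past BigQuery's deduplication.
>   def process(self, row):
>     yield (row, str(uuid.uuid4()))
> {code}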
>
> The correct fix is to do a Reshuffle to checkpoint the unique IDs once they
> are generated, similar to how the Java BQ sink operates; see the sketch
> after the link below.
> [https://github.com/apache/beam/blob/dcf6ad301069e4d2cfaec5db6b178acb7bb67f49/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/StreamingWriteTables.java#L225]
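> A minimal sketch of that shape in the Python SDK (the transform labels and
> the Create input are illustrative; in the real sink the streaming-insert
> write step would consume the checkpointed pairs):
> {code:python}
> import uuid
>
> import apache_beam as beam
>
>
> def attach_insert_id(row):
>   # Mint the insert ID exactly once, before the checkpoint.
>   return (row, str(uuid.uuid4()))
>
>
> with beam.Pipeline() as p:
>   _ = (
>       p
>       | 'Rows' >> beam.Create([{'month': 1}, {'month': 2}])
>       | 'AttachInsertIds' >> beam.Map(attach_insert_id)
>       # Reshuffle checkpoints the (row, insert_id) pairs, so a retry
>       # of the downstream write replays the same IDs instead of
>       # minting new ones, mirroring the Java StreamingWriteTables.
>       | 'CheckpointIds' >> beam.Reshuffle())
>   # The streaming-insert write step would consume the (row, insert_id)
>   # pairs after the Reshuffle.
> {code}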
>
> Pablo, can you do an initial assessment here?
> I think this is a relatively small fix, but I might be wrong.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)