[ 
https://issues.apache.org/jira/browse/BEAM-8367?focusedWorklogId=328329&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-328329
 ]

ASF GitHub Bot logged work on BEAM-8367:
----------------------------------------

                Author: ASF GitHub Bot
            Created on: 15/Oct/19 04:41
            Start Date: 15/Oct/19 04:41
    Worklog Time Spent: 10m 
      Work Description: pabloem commented on pull request #9797: [BEAM-8367] 
Using insertId for BQ streaming inserts
URL: https://github.com/apache/beam/pull/9797#discussion_r334750033
 
 

 ##########
 File path: sdks/python/apache_beam/runners/portability/fn_api_runner_test.py
 ##########
 @@ -1609,7 +1609,7 @@ def test_lull_logging(self):
              | beam.Create([1])
              | beam.Map(time.sleep))
 
-    self.assertRegexpMatches(
+    self.assertRegexp(
 
 Review comment:
   nvm it broke precommits
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
-------------------

    Worklog Id:     (was: 328329)
    Time Spent: 50m  (was: 40m)

> Python BigQuery sink should use unique IDs for mode STREAMING_INSERTS
> ---------------------------------------------------------------------
>
>                 Key: BEAM-8367
>                 URL: https://issues.apache.org/jira/browse/BEAM-8367
>             Project: Beam
>          Issue Type: Bug
>          Components: sdk-py-core
>            Reporter: Chamikara Madhusanka Jayalath
>            Assignee: Pablo Estrada
>            Priority: Major
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> Unique IDs ensure (on a best-effort basis) that writes to BigQuery are idempotent; for 
> example, the same record is not written twice after a VM failure.
>  
> Currently the Python BQ sink inserts BQ IDs here, but they are regenerated after a 
> VM failure, resulting in data duplication.
> [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py#L766]
>  
> The correct fix is to do a Reshuffle to checkpoint the unique IDs once they are 
> generated, similar to how the Java BQ sink operates.
> [https://github.com/apache/beam/blob/dcf6ad301069e4d2cfaec5db6b178acb7bb67f49/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/StreamingWriteTables.java#L225]
>  
> Pablo, can you do an initial assessment here?
> I think this is a relatively small fix but I might be wrong.
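
The duplication problem described in the issue can be illustrated with a toy simulation. This is only a sketch and does not use Beam or the BigQuery client: FakeTable, write_with_fresh_ids and write_with_checkpointed_ids are hypothetical names, the table is assumed to drop rows whose insert ID it has already seen, and a retry after a VM failure is modelled as simply running the write step again. In a real pipeline, the Reshuffle mentioned in the description would play the role of the "checkpoint" list below.

```python
import uuid


class FakeTable:
    """Hypothetical stand-in for a sink that deduplicates by insert ID."""

    def __init__(self):
        self.rows = []
        self._seen_ids = set()

    def insert(self, insert_id, row):
        # Rows whose insert ID was already seen are dropped.
        if insert_id in self._seen_ids:
            return
        self._seen_ids.add(insert_id)
        self.rows.append(row)


def write_with_fresh_ids(table, records, attempts):
    """IDs are generated inside the retried step, so every attempt gets new IDs."""
    for _ in range(attempts):
        for record in records:
            table.insert(str(uuid.uuid4()), record)


def write_with_checkpointed_ids(table, records, attempts):
    """IDs are generated once and kept (the checkpoint) before the retried step."""
    checkpoint = [(str(uuid.uuid4()), record) for record in records]
    for _ in range(attempts):
        for insert_id, record in checkpoint:
            table.insert(insert_id, record)


records = [{"n": 1}, {"n": 2}]

duplicated = FakeTable()
write_with_fresh_ids(duplicated, records, attempts=2)

deduplicated = FakeTable()
write_with_checkpointed_ids(deduplicated, records, attempts=2)

print(len(duplicated.rows), len(deduplicated.rows))
```

With regenerated IDs the table cannot recognise the retried rows, so each record is stored once per attempt; with checkpointed IDs the retry reuses the same IDs and the repeated rows are dropped.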



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
