[ 
https://issues.apache.org/jira/browse/BEAM-8367?focusedWorklogId=329268&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-329268
 ]

ASF GitHub Bot logged work on BEAM-8367:
----------------------------------------

                Author: ASF GitHub Bot
            Created on: 16/Oct/19 17:11
            Start Date: 16/Oct/19 17:11
    Worklog Time Spent: 10m 
      Work Description: chamikaramj commented on issue #9797: [BEAM-8367] Using 
insertId for BQ streaming inserts
URL: https://github.com/apache/beam/pull/9797#issuecomment-542801037
 
 
   LGTM. Thanks!
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
-------------------

    Worklog Id:     (was: 329268)
    Time Spent: 1h 40m  (was: 1.5h)

> Python BigQuery sink should use unique IDs for mode STREAMING_INSERTS
> ---------------------------------------------------------------------
>
>                 Key: BEAM-8367
>                 URL: https://issues.apache.org/jira/browse/BEAM-8367
>             Project: Beam
>          Issue Type: Bug
>          Components: sdk-py-core
>            Reporter: Chamikara Madhusanka Jayalath
>            Assignee: Pablo Estrada
>            Priority: Blocker
>             Fix For: 2.17.0
>
>          Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Unique IDs ensure (best effort) that writes to BigQuery are idempotent; for 
> example, we don't write the same record twice after a VM failure.
>  
> Currently the Python BQ sink inserts BQ IDs here, but they'll be regenerated 
> after a VM failure, resulting in data duplication.
> [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py#L766]
>  
> The correct fix is to do a Reshuffle to checkpoint the unique IDs once they 
> are generated, similar to how the Java BQ sink operates.
> [https://github.com/apache/beam/blob/dcf6ad301069e4d2cfaec5db6b178acb7bb67f49/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/StreamingWriteTables.java#L225]
>  
> Pablo, can you do an initial assessment here?
> I think this is a relatively small fix but I might be wrong.
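
The failure mode described in the issue can be illustrated with a toy, standard-library-only simulation. This is not Beam code and not the change made in the pull request: FakeTable, attach_ids, write_without_checkpoint and write_with_checkpoint are hypothetical names invented for this sketch, and the "checkpoint" is just a Python list standing in for what a Reshuffle would persist. The only behaviour assumed of the sink is the one the issue relies on: rows arriving with an already-seen insert ID are dropped.

```python
import uuid


class FakeTable:
    """Hypothetical stand-in for a table that de-duplicates rows by insert ID."""

    def __init__(self):
        self.rows = {}

    def insert_all(self, rows_with_ids):
        for insert_id, row in rows_with_ids:
            # A row whose insert ID was already seen is dropped.
            self.rows.setdefault(insert_id, row)


def attach_ids(records):
    """Pair each record with a freshly generated unique ID."""
    return [(str(uuid.uuid4()), record) for record in records]


def write_without_checkpoint(table, records, attempts):
    # IDs are generated inside the retried step, so each attempt gets new IDs
    # and the table cannot recognise the retry as a duplicate.
    for _ in range(attempts):
        table.insert_all(attach_ids(records))


def write_with_checkpoint(table, records, attempts):
    # IDs are generated once and kept (the role a Reshuffle plays in the
    # proposed fix); every retry replays the same (id, record) pairs.
    checkpointed = attach_ids(records)
    for _ in range(attempts):
        table.insert_all(checkpointed)


records = [{"name": "a"}, {"name": "b"}]

duplicated = FakeTable()
write_without_checkpoint(duplicated, records, attempts=2)

deduplicated = FakeTable()
write_with_checkpoint(deduplicated, records, attempts=2)

print(len(duplicated.rows))    # 4: the retry wrote both records again
print(len(deduplicated.rows))  # 2: the retry was absorbed by the ID check
```

In both cases the second attempt models a retry after a VM failure. Only the variant that fixes the IDs before the retried step keeps the write idempotent.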



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
