[ https://issues.apache.org/jira/browse/BEAM-13132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kenneth Knowles updated BEAM-13132:
-----------------------------------
    Status: Open  (was: Triage Needed)

> WriteToBigQuery submits a duplicate BQ load job if a 503 error code is 
> returned from googleapi
> ----------------------------------------------------------------------------------------------
>
>                 Key: BEAM-13132
>                 URL: https://issues.apache.org/jira/browse/BEAM-13132
>             Project: Beam
>          Issue Type: Bug
>          Components: io-py-gcp
>    Affects Versions: 2.24.0
>         Environment: Apache Beam Python 3.7 SDK 2.24.0 
>            Reporter: James Prillaman
>            Priority: P1
>              Labels: GCP
>
> When running a WriteToBigQuery step, a 503 error code is returned from 
> `https://www.googleapis.com/resumable/upload/storage/v1/b/<our_tmp_dataflow_location>`.
> The BQ load job is still submitted successfully despite the error, but the 
> workitem returns "Finished processing workitem with errors", so Dataflow 
> re-runs the workitem, submits an identical load job, and inserts duplicate 
> data into our BigQuery tables.
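> For reference, the write step is configured roughly like the sketch below (table, schema, and bucket names are placeholders rather than our real values; FILE_LOADS is also the default write method for batch pipelines):
> ```python
> import apache_beam as beam
>
> with beam.Pipeline() as p:
>     (p
>      | 'CreateRows' >> beam.Create([{'id': 1, 'value': 'a'}])  # placeholder input
>      | 'WriteToBQ' >> beam.io.WriteToBigQuery(
>          table='project:dataset.table',               # placeholder destination
>          schema='id:INTEGER,value:STRING',            # placeholder schema
>          custom_gcs_temp_location='gs://bucket/tmp',  # temporary load files are staged here
>          method=beam.io.WriteToBigQuery.Method.FILE_LOADS,
>          create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
>          write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))
> ```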
> Problem you have encountered:
> 1.) The WriteToBigQuery step starts and triggers a BQ load job.
> ```
> "Triggering job 
> beam_bq_job_LOAD_AUTOMATIC_JOB_NAME_LOAD_NAME_STEP_650_f2f7eb5ec442aa057357302eb9cb0263_9704d08e74d74e2b9cc743ef8a40c524"
> ```
> 2.) An error occurs in the step, but apparently after the load job was 
> already submitted.
> ```
> "Error in _start_upload while inserting file 
> gs://<censored_bucket_location>.avro: Traceback (most recent call last):
>  File "/usr/local/lib/python3.7/site-packages/apache_beam/io/gcp/gcsio.py", 
> line 644, in _start_upload
>  self._client.objects.Insert(self._insert_request, upload=self._upload)
>  File 
> "/usr/local/lib/python3.7/site-packages/apache_beam/io/gcp/internal/clients/storage/storage_v1_client.py",
>  line 1156, in Insert
>  upload=upload, upload_config=upload_config)
>  File "/usr/local/lib/python3.7/site-packages/apitools/base/py/base_api.py", 
> line 731, in _RunMethod
>  return self.ProcessHttpResponse(method_config, http_response, request)
>  File "/usr/local/lib/python3.7/site-packages/apitools/base/py/base_api.py", 
> line 737, in ProcessHttpResponse
>  self.__ProcessHttpResponse(method_config, http_response, request))
>  File "/usr/local/lib/python3.7/site-packages/apitools/base/py/base_api.py", 
> line 604, in __ProcessHttpResponse
>  http_response, method_config=method_config, request=request)
> apitools.base.py.exceptions.HttpError: HttpError accessing 
> <https://www.googleapis.com/resumable/upload/storage/v1/b/bqflow_dataflow_tmp/o?alt=json&name=tmp%2F<censored_bucket_location>.avro&uploadType=resumable&upload_id=ADPycdtKO3HR5PjM_lE6lBin-QqIRuTBeiaCe3dPx9gUKAIPI5fzpfuTs4J5XEF9XiayNvMrhGsGe0XP1CJv90xsuBUrZy6mpw>:
>  response: <\{'content-type': 'text/plain; charset=utf-8', 
> 'x-guploader-uploadid': 
> 'ADPycdtKO3HR5PjM_lE6lBin-QqIRuTBeiaCe3dPx9gUKAIPI5fzpfuTs4J5XEF9XiayNvMrhGsGe0XP1CJv90xsuBUrZy6mpw',
>  'content-length': '0', 'date': 'Tue, 05 Oct 2021 18:01:51 GMT', 'server': 
> 'UploadServer', 'status': '503'}>, content <>
> "
> ```
> 3.) The workitem finishes with errors.
> ```
> Finished processing workitem X with errors. Reporting status to Dataflow service.
> ```
> 4.) Beam re-runs the workitem, which spawns another identical BQ load job.
> ```
> Triggering job beam_bq_job_LOAD_AUTOMATIC_JOB_NAME_LOAD_NAME_STEP_650_f2f7eb5ec442aa057357302eb9cb0263_1247e55bd00041d8b8bd4de491cd7063
> ```
> As a result, a single WriteToBigQuery step spawns two identical BQ load jobs 
> and writes duplicate data to our tables.
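> For what it's worth, BigQuery itself deduplicates load jobs that are submitted with an explicit job_id: a second submission with the same ID is rejected with 409 Already Exists rather than run again. A rough sketch of that behaviour with the plain google-cloud-bigquery client (placeholder names throughout; this is not the code path Beam currently uses):
> ```python
> from google.api_core.exceptions import Conflict
> from google.cloud import bigquery
>
> client = bigquery.Client()
> # Placeholder: a job ID derived deterministically from the step and its input,
> # so a retried workitem reuses the same ID instead of generating a new one.
> job_id = 'deterministic-load-job-id'
> try:
>     client.load_table_from_uri(
>         'gs://bucket/tmp/file.avro',   # placeholder temp file
>         'project.dataset.table',       # placeholder destination
>         job_id=job_id,
>         job_config=bigquery.LoadJobConfig(
>             source_format=bigquery.SourceFormat.AVRO))
> except Conflict:
>     # 409 Already Exists: the first attempt's load job is still running or
>     # done, so the retry does not start a second load.
>     pass
> ```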
> What you expected to happen:
> I would expect the HTTP call to be retried before an error is returned (a 
> rough sketch of this follows below). Failing that, I would expect the same BQ 
> load job not to be submitted a second time unless the first job is cancelled, 
> e.g. by reusing a deterministic job_id as sketched above. A third option would 
> be something similar to "insert_retry_strategy", but for file loads, so that a 
> failure does not spawn another BQ load job.
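> To illustrate the first option, here is a minimal sketch of retrying transient 5xx responses from the upload before failing the bundle; `do_upload` is a hypothetical stand-in for the gcsio insert call, not the actual Beam code path:
> ```python
> import time
>
> from apitools.base.py.exceptions import HttpError
>
> def upload_with_retry(do_upload, max_attempts=5, base_delay_s=1.0):
>     """Calls do_upload(), retrying transient 5xx HttpErrors with exponential backoff."""
>     for attempt in range(1, max_attempts + 1):
>         try:
>             return do_upload()
>         except HttpError as e:
>             transient = e.status_code in (500, 502, 503, 504)
>             if not transient or attempt == max_attempts:
>                 raise
>             # Back off before retrying the resumable upload.
>             time.sleep(base_delay_s * 2 ** (attempt - 1))
> ```
> (Beam's apache_beam.utils.retry.with_exponential_backoff decorator looks like it could serve the same purpose if applied in _start_upload.)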
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
