rizenfrmtheashes commented on issue #20824:
URL: https://github.com/apache/beam/issues/20824#issuecomment-1281116718
Sure. We ended up dumping a large amount of data with a specified schema through a Reshuffle and then into a BigQuery file-loads write with dynamic table destinations:
```
bq_file_loads_output = (
    input_data
    | "Fusion Break Pre BQ" >> beam.transforms.util.Reshuffle()
    | "Write All Rows BigQuery"
    >> WriteToBigQuery(
        # we include the table name in the row for easy dynamic table destinations
        table=lambda row: row["table"],
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        schema={"fields": embeddings_frame_schema_list},
        insert_retry_strategy=RetryStrategy.RETRY_ON_TRANSIENT_ERROR,
        method=beam.io.WriteToBigQuery.Method.FILE_LOADS,
        triggering_frequency=120,  # low in testing, closer to 600 in prod
    )
)
```
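For context, here is a minimal sketch of the element shape the `table` lambda above assumes: each row carries its own fully qualified destination table, so `WriteToBigQuery` can route elements dynamically. The field names and table specs below are illustrative, not taken from our real job.

```python
# Hypothetical row construction: every element embeds its own destination
# table so that table=lambda row: row["table"] can resolve it per element.
def make_row(project, dataset, table_name, payload):
    row = dict(payload)
    # "project:dataset.table" spec, the form accepted by WriteToBigQuery's
    # table callable.
    row["table"] = f"{project}:{dataset}.{table_name}"
    return row

rows = [
    make_row("my-project", "my_dataset", "events_a", {"id": 1, "value": 0.5}),
    make_row("my-project", "my_dataset", "events_b", {"id": 2, "value": 1.5}),
]

# The table lambda then yields one destination per element:
destinations = [row["table"] for row in rows]
```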
We used an input similar to the one described in [this bug report doc here.](https://docs.google.com/document/d/1uIM5JVq0dAh2uDB0HfzQN7PS60U8TsJnalvAvfkfcnM/edit?usp=sharing)
(The bug described in that doc was reported in #23104 and mostly fixed in #23012.)
When we set this job to draining after writing tens of thousands of rows, this is the stack trace we get:
```
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/apache_beam/runners/worker/sdk_worker.py", line 284, in _execute
    response = task()
  File "/usr/local/lib/python3.7/site-packages/apache_beam/runners/worker/sdk_worker.py", line 357, in <lambda>
    lambda: self.create_worker().do_instruction(request), request)
  File "/usr/local/lib/python3.7/site-packages/apache_beam/runners/worker/sdk_worker.py", line 598, in do_instruction
    getattr(request, request_type), request.instruction_id)
  File "/usr/local/lib/python3.7/site-packages/apache_beam/runners/worker/sdk_worker.py", line 635, in process_bundle
    bundle_processor.process_bundle(instruction_id))
  File "/usr/local/lib/python3.7/site-packages/apache_beam/runners/worker/bundle_processor.py", line 1004, in process_bundle
    element.data)
  File "/usr/local/lib/python3.7/site-packages/apache_beam/runners/worker/bundle_processor.py", line 227, in process_encoded
    self.output(decoded_value)
  File "apache_beam/runners/worker/operations.py", line 526, in apache_beam.runners.worker.operations.Operation.output
  File "apache_beam/runners/worker/operations.py", line 528, in apache_beam.runners.worker.operations.Operation.output
  File "apache_beam/runners/worker/operations.py", line 237, in apache_beam.runners.worker.operations.SingletonElementConsumerSet.receive
  File "apache_beam/runners/worker/operations.py", line 240, in apache_beam.runners.worker.operations.SingletonElementConsumerSet.receive
  File "apache_beam/runners/worker/operations.py", line 907, in apache_beam.runners.worker.operations.DoOperation.process
  File "apache_beam/runners/worker/operations.py", line 908, in apache_beam.runners.worker.operations.DoOperation.process
  File "apache_beam/runners/common.py", line 1419, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 1507, in apache_beam.runners.common.DoFnRunner._reraise_augmented
  File "apache_beam/runners/common.py", line 1417, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 837, in apache_beam.runners.common.PerWindowInvoker.invoke_process
  File "apache_beam/runners/common.py", line 981, in apache_beam.runners.common.PerWindowInvoker._invoke_process_per_window
  File "apache_beam/runners/common.py", line 1571, in apache_beam.runners.common._OutputHandler.handle_process_outputs
  File "/usr/local/lib/python3.7/site-packages/steps/bigquery_file_loads_patch_40.py", line 724, in process
    load_job_project_id=self.load_job_project_id,
  File "/usr/local/lib/python3.7/site-packages/apache_beam/io/gcp/bigquery_tools.py", line 1019, in perform_load_job
    job_labels=job_labels)
  File "/usr/local/lib/python3.7/site-packages/apache_beam/utils/retry.py", line 275, in wrapper
    return fun(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/apache_beam/io/gcp/bigquery_tools.py", line 538, in _insert_load_job
    return self._start_job(request, stream=source_stream).jobReference
  File "/usr/local/lib/python3.7/site-packages/apache_beam/io/gcp/bigquery_tools.py", line 557, in _start_job
    response = self.client.jobs.Insert(request, upload=upload)
  File "/usr/local/lib/python3.7/site-packages/apache_beam/io/gcp/internal/clients/bigquery/bigquery_v2_client.py", line 345, in Insert
    upload=upload, upload_config=upload_config)
  File "/usr/local/lib/python3.7/site-packages/apitools/base/py/base_api.py", line 731, in _RunMethod
    return self.ProcessHttpResponse(method_config, http_response, request)
  File "/usr/local/lib/python3.7/site-packages/apitools/base/py/base_api.py", line 737, in ProcessHttpResponse
    self.__ProcessHttpResponse(method_config, http_response, request))
  File "/usr/local/lib/python3.7/site-packages/apitools/base/py/base_api.py", line 604, in __ProcessHttpResponse
    http_response, method_config=method_config, request=request)
RuntimeError: apitools.base.py.exceptions.HttpBadRequestError: HttpError accessing <https://bigquery.googleapis.com/bigquery/v2/projects/REDACTED_PROJECT_NAME/jobs?alt=json>: response: <{'vary': 'Origin, X-Origin, Referer', 'content-type': 'application/json; charset=UTF-8', 'date': 'Mon, 17 Oct 2022 16:02:07 GMT', 'server': 'ESF', 'cache-control': 'private', 'x-xss-protection': '0', 'x-frame-options': 'SAMEORIGIN', 'x-content-type-options': 'nosniff', 'transfer-encoding': 'chunked', 'status': '400', 'content-length': '318', '-content-encoding': 'gzip'}>, content <{
  "error": {
    "code": 400,
    "message": "Load configuration must specify at least one source URI",
    "errors": [
      {
        "message": "Load configuration must specify at least one source URI",
        "domain": "global",
        "reason": "invalid"
      }
    ],
    "status": "INVALID_ARGUMENT"
  }
}
```
As a note, `bigquery_file_loads_patch_40.py` is just a copy/pasted version of the SDK's `bigquery_file_loads.py` source file (from the io/gcp section) that we used to backport fixes from newer versions of Beam (like #23012). We did dependency checking to make sure the backported fixes were okay.
(We also redacted org names in the stack trace.)
We were using Beam version 2.40 and the Dataflow v2 runner when this happened.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]