rizenfrmtheashes commented on issue #20824:
URL: https://github.com/apache/beam/issues/20824#issuecomment-1281116718
Sure. We ended up dumping a large amount of data with a specified schema through a Reshuffle and then into a BigQuery file-loads write with dynamic table destinations:
```
bq_file_loads_output = (
    input_data
    | "Fusion Break Pre BQ" >> beam.transforms.util.Reshuffle()
    | "Write All Rows BigQuery"
    >> WriteToBigQuery(
        # we include the table name in the row for easy dynamic table destinations
        table=lambda row: row["table"],
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        schema={"fields": embeddings_frame_schema_list},
        insert_retry_strategy=RetryStrategy.RETRY_ON_TRANSIENT_ERROR,
        method=beam.io.WriteToBigQuery.Method.FILE_LOADS,
        triggering_frequency=120,  # low in testing, closer to 600 in prod
    )
)
```
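For context, here is a minimal sketch of the element shape the `table` lambda above assumes: each row carries its own fully qualified destination table, so `WriteToBigQuery` can route elements dynamically. The field names and table specs below are illustrative, not taken from our real job.

```python
# Hypothetical row construction: every element embeds its own destination
# table so that table=lambda row: row["table"] can resolve it per element.
def make_row(project, dataset, table_name, payload):
    row = dict(payload)
    # "project:dataset.table" spec, the form accepted by WriteToBigQuery's
    # table callable.
    row["table"] = f"{project}:{dataset}.{table_name}"
    return row

rows = [
    make_row("my-project", "my_dataset", "events_a", {"id": 1, "value": 0.5}),
    make_row("my-project", "my_dataset", "events_b", {"id": 2, "value": 1.5}),
]

# The table lambda then yields one destination per element:
destinations = [row["table"] for row in rows]
```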
We used an input similar to the one described in [this bug report doc here.](https://docs.google.com/document/d/1uIM5JVq0dAh2uDB0HfzQN7PS60U8TsJnalvAvfkfcnM/edit?usp=sharing)
(The bug described in that doc was reported in #23104 and mostly fixed in #23012.)
When we set this job to draining after writing tens of thousands of rows, this is the stack trace we get:
```
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/apache_beam/runners/worker/sdk_worker.py", line 284, in _execute
    response = task()
  File "/usr/local/lib/python3.7/site-packages/apache_beam/runners/worker/sdk_worker.py", line 357, in <lambda>
    lambda: self.create_worker().do_instruction(request), request)
  File "/usr/local/lib/python3.7/site-packages/apache_beam/runners/worker/sdk_worker.py", line 598, in do_instruction
    getattr(request, request_type), request.instruction_id)
  File "/usr/local/lib/python3.7/site-packages/apache_beam/runners/worker/sdk_worker.py", line 635, in process_bundle
    bundle_processor.process_bundle(instruction_id))
  File "/usr/local/lib/python3.7/site-packages/apache_beam/runners/worker/bundle_processor.py", line 1004, in process_bundle
    element.data)
  File "/usr/local/lib/python3.7/site-packages/apache_beam/runners/worker/bundle_processor.py", line 227, in process_encoded
    self.output(decoded_value)
  File "apache_beam/runners/worker/operations.py", line 526, in apache_beam.runners.worker.operations.Operation.output
  File "apache_beam/runners/worker/operations.py", line 528, in apache_beam.runners.worker.operations.Operation.output
  File "apache_beam/runners/worker/operations.py", line 237, in apache_beam.runners.worker.operations.SingletonElementConsumerSet.receive
  File "apache_beam/runners/worker/operations.py", line 240, in apache_beam.runners.worker.operations.SingletonElementConsumerSet.receive
  File "apache_beam/runners/worker/operations.py", line 907, in apache_beam.runners.worker.operations.DoOperation.process
  File "apache_beam/runners/worker/operations.py", line 908, in apache_beam.runners.worker.operations.DoOperation.process
  File "apache_beam/runners/common.py", line 1419, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 1507, in apache_beam.runners.common.DoFnRunner._reraise_augmented
  File "apache_beam/runners/common.py", line 1417, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 837, in apache_beam.runners.common.PerWindowInvoker.invoke_process
  File "apache_beam/runners/common.py", line 981, in apache_beam.runners.common.PerWindowInvoker._invoke_process_per_window
  File "apache_beam/runners/common.py", line 1571, in apache_beam.runners.common._OutputHandler.handle_process_outputs
  File "/usr/local/lib/python3.7/site-packages/steps/bigquery_file_loads_patch_40.py", line 724, in process
    load_job_project_id=self.load_job_project_id,
  File "/usr/local/lib/python3.7/site-packages/apache_beam/io/gcp/bigquery_tools.py", line 1019, in perform_load_job
    job_labels=job_labels)
  File "/usr/local/lib/python3.7/site-packages/apache_beam/utils/retry.py", line 275, in wrapper
    return fun(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/apache_beam/io/gcp/bigquery_tools.py", line 538, in _insert_load_job
    return self._start_job(request, stream=source_stream).jobReference
  File "/usr/local/lib/python3.7/site-packages/apache_beam/io/gcp/bigquery_tools.py", line 557, in _start_job
    response = self.client.jobs.Insert(request, upload=upload)
  File "/usr/local/lib/python3.7/site-packages/apache_beam/io/gcp/internal/clients/bigquery/bigquery_v2_client.py", line 345, in Insert
    upload=upload, upload_config=upload_config)
  File "/usr/local/lib/python3.7/site-packages/apitools/base/py/base_api.py", line 731, in _RunMethod
    return self.ProcessHttpResponse(method_config, http_response, request)
  File "/usr/local/lib/python3.7/site-packages/apitools/base/py/base_api.py", line 737, in ProcessHttpResponse
    self.__ProcessHttpResponse(method_config, http_response, request))
  File "/usr/local/lib/python3.7/site-packages/apitools/base/py/base_api.py", line 604, in __ProcessHttpResponse
    http_response, method_config=method_config, request=request)
RuntimeError: apitools.base.py.exceptions.HttpBadRequestError: HttpError accessing <https://bigquery.googleapis.com/bigquery/v2/projects/REDACTED_PROJECT_NAME/jobs?alt=json>: response: <{'vary': 'Origin, X-Origin, Referer', 'content-type': 'application/json; charset=UTF-8', 'date': 'Mon, 17 Oct 2022 16:02:07 GMT', 'server': 'ESF', 'cache-control': 'private', 'x-xss-protection': '0', 'x-frame-options': 'SAMEORIGIN', 'x-content-type-options': 'nosniff', 'transfer-encoding': 'chunked', 'status': '400', 'content-length': '318', '-content-encoding': 'gzip'}>, content <{
  "error": {
    "code": 400,
    "message": "Load configuration must specify at least one source URI",
    "errors": [
      {
        "message": "Load configuration must specify at least one source URI",
        "domain": "global",
        "reason": "invalid"
      }
    ],
    "status": "INVALID_ARGUMENT"
  }
}
```
As a note, `bigquery_file_loads_patch_40.py` is just a copy/pasted version of the SDK's `bigquery_file_loads.py` source file (from the io/gcp section) that we used to backport fixes from newer versions of Beam (like #23012). We did dependency checking to make sure the backported fixes were okay.
(We also redacted org names in the stack trace.)
We were using Beam version 2.40 and the Dataflow v2 runner when this happened.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]