Hi everyone,

Load jobs into BigQuery are subject to various quotas and limitations.
In the Python SDK, the BigQuery sink that uses file loads does not
handle these limits well.
Improvements are needed in the following areas:

   1. Handle the per-load-job limit on total input size.
   2. Decide, at pipeline execution time, when to use temp tables so that
   data is loaded atomically.

I have documented the proposed changes in a design doc[1] as well as in a
draft pull request[2].

*TL;DR:*
Partition the written files (see the sketch after this list) based on:

   1. Total size of files
   2. Number of files
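
For illustration, here is a minimal sketch of the partitioning step in
plain Python. The names (partition_files, MAX_PARTITION_BYTES,
MAX_FILES_PER_PARTITION) and the limit values are placeholders of mine,
not the actual constants; the real logic lives in the PR[2]:

    # Illustrative per-load-job caps; see the design doc[1] for the
    # actual BigQuery limits the proposal targets.
    MAX_PARTITION_BYTES = 15 * (1 << 40)  # total size cap per load job
    MAX_FILES_PER_PARTITION = 10_000      # file count cap per load job

    def partition_files(files):
        """Group (path, size_in_bytes) pairs so each group fits one load job."""
        partitions, current, current_bytes = [], [], 0
        for path, size in files:
            too_big = current_bytes + size > MAX_PARTITION_BYTES
            too_many = len(current) >= MAX_FILES_PER_PARTITION
            if current and (too_big or too_many):
                # Close the current partition and start a new one.
                partitions.append(current)
                current, current_bytes = [], 0
            current.append(path)
            current_bytes += size
        if current:
            partitions.append(current)
        return partitions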

If multiple load jobs are needed to write to a single destination, the data
will first be loaded into temp tables. Once all of the load jobs have
finished, the contents of the temp tables will be copied into the
destination table, so the write to BigQuery remains atomic.
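
Outside of Beam, the same load-then-copy idea can be sketched with the
google-cloud-bigquery client. load_atomically and the temp-table naming
scheme below are hypothetical, and job configuration (schema, source
format, etc.) is omitted for brevity:

    from google.cloud import bigquery

    client = bigquery.Client()

    def load_atomically(partitions, destination):
        """Load each partition, then commit via a single copy job."""
        if len(partitions) == 1:
            # One load job suffices, and a load job is already atomic.
            client.load_table_from_uri(partitions[0], destination).result()
            return
        temp_tables = []
        for i, uris in enumerate(partitions):
            temp = '{}_temp_{}'.format(destination, i)  # hypothetical naming
            client.load_table_from_uri(uris, temp).result()
            temp_tables.append(temp)
        # A single copy job commits all temp tables to the destination,
        # so readers never observe a partially written table.
        copy_config = bigquery.CopyJobConfig(
            write_disposition=bigquery.WriteDisposition.WRITE_APPEND)
        client.copy_table(temp_tables, destination,
                          job_config=copy_config).result()
        for temp in temp_tables:
            client.delete_table(temp)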

Would love to get feedback on the proposed changes.

Regards,
- Tanay

[1] https://s.apache.org/beam-bqfl-hardening
[2] https://github.com/apache/beam/pull/9242
