mohamedawnallah commented on code in PR #37108:
URL: https://github.com/apache/beam/pull/37108#discussion_r2618969396

##########
website/www/site/content/en/documentation/io/built-in/google-bigquery.md:
##########
@@ -659,6 +659,11 @@ runtime. The sharding behavior depends on the runners.
 You must use `triggering_frequency` to specify a triggering frequency for
 initiating load jobs. Be careful about setting the frequency such that your
 pipeline doesn't exceed the BigQuery load job [quota limit](https://cloud.google.com/bigquery/quotas#load_jobs).
+
+> **Note:** When using file load-based BigQuery writes with dynamic destinations and a non-zero
+> `triggering_frequency`, temporary tables may be created repeatedly and loads
+> are not finalized into destination tables. This is a known limitation (see BEAM-9917).

Review Comment:
   If we can reproduce this issue locally, we are halfway to a resolution. A reproduction could go along these lines:

   - Find where `BigQueryBatchFileLoads` is located in the codebase (using a keyword search in the IDE).
   - Once we know where it lives, check how it has been tested, e.g. with a single table or with multiple tables.
   - If there are tests for multiple tables, check whether temporary tables are left behind (as mentioned in the issue).
   - If there are no tests for multiple tables, or integration testing is not feasible, test it against a real GCP project on the free tier.

   Once we can reproduce the issue, we can trace the relevant code paths and change them deliberately, with tests that keep the issue from recurring.


--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
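
The first two reproduction steps in the review comment (locating `BigQueryBatchFileLoads` and finding the tests that reference it) could also be done outside an IDE. The following is only a minimal sketch, using nothing but the Python standard library. The helper names `find_references` and `split_tests` are hypothetical, and the `_test.py` suffix used to recognise test files is an assumption about the repository's naming convention, not something stated in the comment.

```python
import os
from typing import List, Tuple

Hit = Tuple[str, int, str]  # (path relative to root, line number, stripped line)


def find_references(root: str, keyword: str,
                    suffixes: Tuple[str, ...] = (".py",)) -> List[Hit]:
    """Walk `root` and collect every line in matching files that contains `keyword`."""
    hits: List[Hit] = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # sort in place so traversal order is deterministic
        for name in sorted(filenames):
            if not name.endswith(suffixes):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as handle:
                for line_no, line in enumerate(handle, start=1):
                    if keyword in line:
                        rel = os.path.relpath(path, root)
                        hits.append((rel, line_no, line.strip()))
    return hits


def split_tests(hits: List[Hit]) -> Tuple[List[Hit], List[Hit]]:
    """Split hits into (implementation files, test files).

    Assumption: test files end in `_test.py`.
    """
    tests = [h for h in hits if os.path.basename(h[0]).endswith("_test.py")]
    sources = [h for h in hits if h not in tests]
    return sources, tests


if __name__ == "__main__":
    # Hypothetical usage: run from the root of a local checkout.
    sources, tests = split_tests(find_references(".", "BigQueryBatchFileLoads"))
    for rel, line_no, text in sources + tests:
        print(f"{rel}:{line_no}: {text}")
```

The test-file hits are the natural starting point for checking whether a multiple-table (dynamic destination) case is already covered, which is the third step in the list.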
