Hello all,

I recently worked on a transform for the Python SDK[1] that loads data into BigQuery by writing files to GCS and issuing load jobs to BQ.
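For context, here is a rough sketch of how a user might hand the transform a staging bucket. The entry point (WriteToBigQuery with method=FILE_LOADS) and the custom_gcs_temp_location parameter name are my assumptions for illustration, not necessarily the exact API in [1]:

    # Hypothetical usage sketch -- names are illustrative assumptions.
    import apache_beam as beam

    with beam.Pipeline() as p:
        (p
         | 'CreateRows' >> beam.Create([{'name': 'foo', 'value': 1}])
         | 'WriteToBQ' >> beam.io.WriteToBigQuery(
               'my-project:my_dataset.my_table',
               schema='name:STRING,value:INTEGER',
               method=beam.io.WriteToBigQuery.Method.FILE_LOADS,
               # Files are staged under this path before the load job runs.
               # If omitted, the transform would fall back to the pipeline's
               # temp_location when running on Dataflow.
               custom_gcs_temp_location='gs://my-bucket/bq_load_tmp'))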
This option requires the user to provide a GCS bucket to write the files:

- If the user provides a bucket to the transform, the SDK will use that bucket.
- If the user does not provide a bucket:
  - When running on Dataflow, the SDK will borrow the temp_location of the pipeline.
  - When running on other runners, the pipeline will fail.

The Java SDK has had functionality for file loads into BQ for a long time; in particular, when users do not provide a bucket, it attempts to create a default bucket[2], which is then used as the temp_location (and, in turn, by the BQ file loads transform). I do not really like creating GCS buckets on behalf of users. In Java, the outcome is that users do not have to pass a --tempLocation parameter when submitting jobs to Dataflow - which is a nice convenience, but I'm not sure it is in line with users' expectations.

Currently, the options are:

1. Adding support for bucket autocreation to the Python SDK.
2. Deprecating support for bucket autocreation in the Java SDK, and printing a warning.

I am personally inclined toward #1. But what do others think?

Best
-P.

[1] https://github.com/apache/beam/pull/7892
[2] https://github.com/apache/beam/blob/5b3807be717277e3e6880a760b036fecec3bc95d/sdks/java/extensions/google-cloud-platform-core/src/main/java/org/apache/beam/sdk/extensions/gcp/options/GcpOptions.java#L294-L343