Hello all,
I recently worked on a transform for the Python SDK[1] that loads data into
BigQuery by writing files to GCS and issuing load jobs to BQ.
This option requires the user to provide a GCS bucket to write the files to
(see the usage sketch after the list below):
- If the user provides a bucket to the transform, the SDK will use that
bucket.
- If the user does not provide a bucket:
- When running in Dataflow, the SDK will borrow the temp_location of
the pipeline.
- When running in other runners, the pipeline will fail.
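To illustrate, here is roughly how I picture usage. This is a sketch, not a
final API: the parameter name custom_gcs_temp_location, the FILE_LOADS method
value, and the project/bucket/table names are placeholders on my side:

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions()  # e.g. with --temp_location set on Dataflow
    with beam.Pipeline(options=options) as p:
        (p
         | beam.Create([{'name': 'wordcount', 'count': 1}])
         | beam.io.WriteToBigQuery(
               'my-project:my_dataset.my_table',
               schema='name:STRING,count:INTEGER',
               method='FILE_LOADS',
               # If omitted: on Dataflow the transform borrows the
               # pipeline's temp_location; on other runners it fails.
               custom_gcs_temp_location='gs://my-bucket/bq-loads'))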
The Java SDK has had file loads into BQ for a long time. Notably, when users
do not provide a bucket, it attempts to create a default bucket[2]; that
bucket is then used as the temp_location, which in turn is used by the BQ
file loads transform.
I do not really like creating GCS buckets on behalf of users. In Java, the
outcome is that users do not have to pass a --tempLocation parameter when
submitting jobs to Dataflow, which is a nice convenience, but I'm not sure
that this is in line with users' expectations.
Currently, the options are:
1. Adding support for bucket autocreation in the Python SDK (a rough sketch
follows below).
2. Deprecating support for bucket autocreation in the Java SDK, and printing
a warning.
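To make #1 concrete, the autocreation logic could look roughly like this.
This is only a sketch using the google-cloud-storage client; the
get_or_create_default_bucket helper and the bucket naming scheme are
hypothetical and up for discussion:

    from google.cloud import storage
    from google.api_core.exceptions import Conflict

    def get_or_create_default_bucket(project, region):
        # Hypothetical naming scheme, loosely in the spirit of the Java
        # SDK's default bucket; the exact scheme is up for discussion.
        bucket_name = 'beam-temp-%s-%s' % (project, region)
        client = storage.Client(project=project)
        try:
            return client.create_bucket(bucket_name, location=region)
        except Conflict:
            # The bucket already exists, possibly from a previous run.
            return client.bucket(bucket_name)

One open question is whether to mirror the Java SDK's naming scheme or pick
a Python-specific one.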
I am personally inclined towards #1. But what do others think?
Best
-P.
[1] https://github.com/apache/beam/pull/7892
[2] https://github.com/apache/beam/blob/5b3807be717277e3e6880a760b036fecec3bc95d/sdks/java/extensions/google-cloud-platform-core/src/main/java/org/apache/beam/sdk/extensions/gcp/options/GcpOptions.java#L294-L343