Hello all,
I recently worked on a transform to load data into BigQuery by writing
files to GCS and issuing load jobs to BQ. I did this for the Python
SDK[1].

This option requires the user to provide a GCS bucket to write the files
to (see the sketch after this list):

   - If the user provides a bucket to the transform, the SDK will use that
   bucket.
   - If the user does not provide a bucket:
      - When running on Dataflow, the SDK will borrow the pipeline's
      temp_location.
      - When running on other runners, the pipeline will fail.
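
To make this concrete, usage might look roughly like the snippet below.
This is only a sketch: the table, schema, and bucket names are
placeholders, and I am assuming the transform takes the bucket via a
custom_gcs_temp_location parameter, as in the PR:

import apache_beam as beam

with beam.Pipeline() as p:
    (p
     | 'Create' >> beam.Create([{'name': 'a', 'value': 1}])
     | 'WriteToBQ' >> beam.io.WriteToBigQuery(
           'my-project:my_dataset.my_table',
           schema='name:STRING,value:INTEGER',
           method=beam.io.WriteToBigQuery.Method.FILE_LOADS,
           # Bucket for the intermediate load files. If omitted, the
           # pipeline's temp_location is borrowed on Dataflow; other
           # runners fail.
           custom_gcs_temp_location='gs://my-bucket/bq-load-tmp'))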

The Java SDK has had File Loads into BQ for a long time; in particular,
when users do not provide a bucket, it attempts to create a default
bucket[2], which is then used as the temp_location (and, in turn, by the
BQ File Loads transform).
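
For reference, the Java behavior is roughly the following, transcribed
here as a Python sketch (the helper name and the bucket naming scheme are
mine, not actual SDK code; it assumes the google-cloud-storage client):

from google.cloud import storage

def get_or_create_default_bucket(project, region):
    # Derive a deterministic bucket name from project and region,
    # similar in spirit to what the Java SDK does.
    bucket_name = 'dataflow-staging-%s-%s' % (region, project)
    client = storage.Client(project=project)
    # lookup_bucket returns None if the bucket does not exist.
    bucket = client.lookup_bucket(bucket_name)
    if bucket is None:
        bucket = client.create_bucket(bucket_name, location=region)
    return 'gs://%s' % bucket.name

The result would then be used as the pipeline's temp_location, and hence
picked up by the BQ File Loads transform.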

I do not really like creating GCS buckets on behalf of users. In Java, the
outcome is that users do not have to pass a --tempLocation parameter when
submitting jobs to Dataflow - which is a nice convenience, but I'm not sure
that this is in line with users' expectations.

Currently, the options are:

   1. Adding support for bucket autocreation in the Python SDK.
   2. Deprecating support for bucket autocreation in the Java SDK, and
   printing a warning.

I am personally inclined toward option #1. But what do others think?

Best
-P.

[1] https://github.com/apache/beam/pull/7892
[2]
https://github.com/apache/beam/blob/5b3807be717277e3e6880a760b036fecec3bc95d/sdks/java/extensions/google-cloud-platform-core/src/main/java/org/apache/beam/sdk/extensions/gcp/options/GcpOptions.java#L294-L343
