Hello all,
I recently worked on a transform that loads data into BigQuery by writing
files to GCS and then issuing load jobs to BQ. I did this for the Python
SDK [1].

This approach needs a GCS bucket to write the files to:

   - If the user provides a bucket to the transform, the SDK will use that
   bucket.
   - If the user does not provide a bucket:
      - When running on Dataflow, the SDK will borrow the temp_location of
      the pipeline.
      - When running on any other runner, the pipeline will fail.
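
To make the fallback rules above concrete, here is a minimal Python sketch.
The function name, its parameters, and the runner-name comparison are all
hypothetical and are not the transform's actual API. I am also assuming that
a missing temp_location on Dataflow should fail the same way as the
other-runner case.

```python
from typing import Optional


def resolve_load_files_location(
    custom_gcs_location: Optional[str],
    runner: str,
    temp_location: Optional[str],
) -> str:
    """Pick the GCS location where files for BQ load jobs get written.

    All names here are illustrative, not the SDK's real interface.
    """
    # 1. A bucket passed to the transform always wins.
    if custom_gcs_location:
        return custom_gcs_location

    # 2. On Dataflow, borrow the pipeline's temp_location.
    if runner == "DataflowRunner" and temp_location:
        return temp_location

    # 3. Otherwise there is nowhere to write, so fail early.
    #    (Assumption: Dataflow without a temp_location also lands here.)
    raise ValueError(
        "No GCS location available for BigQuery file loads; "
        "pass a bucket to the transform."
    )
```

Note that nothing in this sketch creates a bucket; it only picks between
locations the user has already supplied.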

The Java SDK has supported File Loads into BQ for a long time. In
particular, when users do not provide a bucket, it attempts to create a
default bucket[2], which is then used as temp_location (and, in turn, by
the BQ File Loads transform).

I do not really like creating GCS buckets on behalf of users. In Java, the
outcome is that users will not have to pass a --tempLocation parameter when
submitting jobs to Dataflow - which is a nice convenience, but I'm not sure
that this is in line with users' expectations.

Currently, the options are:

   - Adding support for bucket autocreation for Python SDK
   - Deprecating support for bucket autocreation in Java SDK, and printing
   a warning.

I am personally inclined toward the first option. But what do others think?


[1] https://github.com/apache/beam/pull/7892
