Do we clean up auto-created GCS buckets?

If there's no good way to clean them up, I think it might be better to make
bucket auto-creation opt-in.

Thanks,
Cham

On Tue, Jul 23, 2019 at 3:25 AM Robert Bradshaw <rober...@google.com> wrote:

> I think having a single, default, auto-created temporary bucket per
> project for use in GCP (when running on Dataflow, or running elsewhere
> but using GCS such as for this BQ load files example), though not
> ideal, is the best user experience. If we don't want to be
> automatically creating such things for users by default, another
> option would be a single flag that opts in to such auto-creation
> (which could include other resources in the future).
>
> On Tue, Jul 23, 2019 at 1:08 AM Pablo Estrada <pabl...@google.com> wrote:
> >
> > Hello all,
> > I recently worked on a transform to load data into BigQuery by writing
> > files to GCS and issuing Load File jobs to BQ. I did this for the Python
> > SDK[1].
> >
> > This option requires the user to provide a GCS bucket to write the files:
> >
> > - If the user provides a bucket to the transform, the SDK will use that
> >   bucket.
> > - If the user does not provide a bucket:
> >   - When running in Dataflow, the SDK will borrow the temp_location of
> >     the pipeline.
> >   - When running in other runners, the pipeline will fail.
> >
> > The Java SDK has had functionality for File Loads into BQ for a long
> > time. In particular, when users do not provide a bucket, it attempts to
> > create a default bucket[2]. That bucket is used as temp_location, which
> > is then used by the BQ File Loads transform.
> >
> > I do not really like creating GCS buckets on behalf of users. In Java,
> > the outcome is that users do not have to pass a --tempLocation parameter
> > when submitting jobs to Dataflow, which is a nice convenience, but I'm
> > not sure that this is in line with users' expectations.
> >
> > Currently, the options are:
> >
> > 1. Adding support for bucket autocreation for Python SDK
> > 2. Deprecating support for bucket autocreation in Java SDK, and printing
> >    a warning.
> >
> > I am personally inclined toward #1. But what do others think?
> >
> > Best
> > -P.
> >
> > [1] https://github.com/apache/beam/pull/7892
> > [2] https://github.com/apache/beam/blob/5b3807be717277e3e6880a760b036fecec3bc95d/sdks/java/extensions/google-cloud-platform-core/src/main/java/org/apache/beam/sdk/extensions/gcp/options/GcpOptions.java#L294-L343
>
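
For reference, the Python SDK's current fallback behavior described in Pablo's message could be sketched roughly as follows. This is only an illustration of the decision order, not the code from the pull request: the function name, its parameters, and the "DataflowRunner" string comparison are hypothetical, and treating a missing temp_location on Dataflow as an error is an assumption.

```python
from typing import Optional


def resolve_load_files_location(
    user_bucket: Optional[str],
    runner: str,
    temp_location: Optional[str],
) -> str:
    """Pick the GCS location used for BigQuery load files.

    Hypothetical helper: names and the runner check are illustrative
    assumptions, not the Beam SDK's actual API.
    """
    # 1. A bucket passed directly to the transform always wins.
    if user_bucket:
        return user_bucket

    # 2. On Dataflow, borrow the pipeline's temp_location.
    if runner == "DataflowRunner" and temp_location:
        return temp_location

    # 3. Otherwise there is nowhere to write the files, so fail.
    #    (Assumption: Dataflow without a temp_location also fails.)
    raise ValueError(
        "No GCS location available for BigQuery load files; "
        "pass a bucket to the transform."
    )
```

Option #1 from the thread would replace the final failure with bucket auto-creation, and the opt-in variant discussed above would do so only when a flag is set.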
