I agree with David that, at a minimum, clearer log statements should be added. Udi, that's an interesting idea, but I imagine the sheer number of existing flags (including many SDK-specific flags) would make it difficult to implement. In addition, uniform argument names wouldn't necessarily ensure uniform implementations.
Kyle Weaver | Software Engineer | github.com/ibzib | kcwea...@google.com

On Tue, Jul 23, 2019 at 11:56 AM Udi Meiri <eh...@google.com> wrote:

> The Java SDK creates one regional bucket per project and region combination
> <https://github.com/apache/beam/blob/c2f0d282337f3ae0196a7717712396a5a41fdde1/sdks/java/extensions/google-cloud-platform-core/src/main/java/org/apache/beam/sdk/extensions/gcp/options/GcpOptions.java#L316-L318>,
> so it's not a lot of buckets - no need to auto-clean.
>
> I agree with Robert that having fewer flags is better.
> Perhaps what we need is a unifying interface for SDKs that simplifies
> launching?
>
> So instead of:
> mvn compile exec:java -Dexec.mainClass=<class>
> -Dexec.args="--runner=DataflowRunner --project=<project>
> --gcpTempLocation=gs://<bucket>/tmp <user flags>" -Pdataflow-runner
> or
> python -m <module> --runner DataflowRunner --project <project>
> --temp_location gs://<bucket>/tmp/ <user flags>
>
> we could have:
> ./beam java run <class> --runner=DataflowRunner <user flags>
> ./beam python run <module> --runner=DataflowRunner <user flags>
>
> where the GCP project and temp_location are optional.
>
> On Tue, Jul 23, 2019 at 10:31 AM David Cavazos <dcava...@google.com> wrote:
>
>> I would go for #1 since it's a better user experience, especially for new
>> users who don't understand every step involved in staging/deploying. It's
>> just another (unnecessary) mental concept they shouldn't have to be aware
>> of. Anything that brings us closer to only requiring the `--runner` flag,
>> without any additional flags (by default, but configurable if necessary),
>> is a good thing in my opinion.
>>
>> AutoML already auto-creates a GCS bucket (not configurable, with a global
>> name, which has its own drawbacks). Other products are already doing this
>> to simplify the user experience. I think as long as there's an explicit
>> logging statement it should be fine:
>>
>> If the bucket was not specified and was created: "No --temp_location
>> specified, created gs://..."
>>
>> If the bucket was not specified and was found: "No --temp_location
>> specified, found gs://..."
>>
>> If the bucket was specified, the logging can be omitted, since it's
>> already explicit from the command line arguments.
>>
>> On Tue, Jul 23, 2019 at 10:25 AM Chamikara Jayalath <chamik...@google.com>
>> wrote:
>>
>>> Do we clean up auto-created GCS buckets?
>>>
>>> If there's no good way to clean up, I think it might be better to make
>>> this opt-in.
>>>
>>> Thanks,
>>> Cham
>>>
>>> On Tue, Jul 23, 2019 at 3:25 AM Robert Bradshaw <rober...@google.com>
>>> wrote:
>>>
>>>> I think having a single, default, auto-created temporary bucket per
>>>> project for use in GCP (when running on Dataflow, or running elsewhere
>>>> but using GCS, such as for this BQ load files example), though not
>>>> ideal, is the best user experience. If we don't want to be
>>>> automatically creating such things for users by default, another
>>>> option would be a single flag that opts in to such auto-creation
>>>> (which could include other resources in the future).
>>>>
>>>> On Tue, Jul 23, 2019 at 1:08 AM Pablo Estrada <pabl...@google.com>
>>>> wrote:
>>>> >
>>>> > Hello all,
>>>> > I recently worked on a transform to load data into BigQuery by
>>>> > writing files to GCS and issuing Load File jobs to BQ. I did this
>>>> > for the Python SDK[1].
>>>> >
>>>> > This transform requires the user to provide a GCS bucket to write
>>>> > the files:
>>>> >
>>>> > If the user provides a bucket to the transform, the SDK will use
>>>> > that bucket.
>>>> > If the user does not provide a bucket:
>>>> >
>>>> > When running on Dataflow, the SDK will borrow the temp_location of
>>>> > the pipeline.
>>>> > When running on other runners, the pipeline will fail.
>>>> >
>>>> > The Java SDK has had functionality for File Loads into BQ for a long
>>>> > time; in particular, when users do not provide a bucket, it attempts
>>>> > to create a default bucket[2], and this bucket is used as the
>>>> > temp_location (which is then used by the BQ File Loads transform).
>>>> >
>>>> > I do not really like creating GCS buckets on behalf of users. In
>>>> > Java, the outcome is that users do not have to pass a --tempLocation
>>>> > parameter when submitting jobs to Dataflow - which is a nice
>>>> > convenience, but I'm not sure that this is in line with users'
>>>> > expectations.
>>>> >
>>>> > Currently, the options are:
>>>> >
>>>> > 1. Adding support for bucket autocreation to the Python SDK.
>>>> > 2. Deprecating support for bucket autocreation in the Java SDK, and
>>>> > printing a warning.
>>>> >
>>>> > I am personally inclined toward #1. But what do others think?
>>>> >
>>>> > Best
>>>> > -P.
>>>> >
>>>> > [1] https://github.com/apache/beam/pull/7892
>>>> > [2] https://github.com/apache/beam/blob/5b3807be717277e3e6880a760b036fecec3bc95d/sdks/java/extensions/google-cloud-platform-core/src/main/java/org/apache/beam/sdk/extensions/gcp/options/GcpOptions.java#L294-L343
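To make Udi's proposal above concrete, here is a minimal sketch of what such a `./beam <lang> run` wrapper could look like. This is purely illustrative: no `beam` launcher exists today, and the defaulting of `--project` and `--temp_location` is only indicated by a comment.

```python
#!/usr/bin/env python
"""Hypothetical './beam <lang> run <target>' wrapper (not an existing tool)."""
import subprocess
import sys


def main(argv):
    # e.g. argv == ['python', 'run', 'my.module', '--runner=DataflowRunner', ...]
    lang, verb, target, *user_flags = argv
    if verb != 'run':
        raise SystemExit('usage: beam {java,python} run <target> [flags...]')

    # A real wrapper could fill in --project and --temp_location defaults
    # here, before delegating to the SDK-specific entry point.
    if lang == 'python':
        cmd = [sys.executable, '-m', target] + user_flags
    elif lang == 'java':
        cmd = ['mvn', 'compile', 'exec:java',
               '-Dexec.mainClass=' + target,
               '-Dexec.args=' + ' '.join(user_flags),
               '-Pdataflow-runner']
    else:
        raise SystemExit('unsupported SDK: ' + lang)
    subprocess.check_call(cmd)


if __name__ == '__main__':
    main(sys.argv[1:])
```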
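David's suggested log messages could be wired up roughly as follows. The helper name and the `find_default_bucket`/`create_default_bucket` callables are assumptions for illustration, not part of the Beam API; only the two log strings come from the thread.

```python
import logging


def resolve_temp_location(temp_location, find_default_bucket, create_default_bucket):
    """Sketch of David's proposed logging around temp_location defaulting.

    find_default_bucket: assumed callable returning a gs:// path or None.
    create_default_bucket: assumed callable returning a newly created gs:// path.
    """
    if temp_location:
        # Explicitly specified on the command line, so no logging is needed.
        return temp_location
    found = find_default_bucket()
    if found:
        logging.info('No --temp_location specified, found %s', found)
        return found
    created = create_default_bucket()
    logging.info('No --temp_location specified, created %s', created)
    return created
```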
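Finally, the bucket-resolution behavior Pablo describes for the Python file-loads transform amounts to a small decision tree; the sketch below restates it in code. The function and parameter names are illustrative, not the PR's actual API.

```python
def choose_staging_location(gcs_bucket, runner, pipeline_temp_location):
    """Sketch of the behavior described in the thread (names are hypothetical)."""
    if gcs_bucket:
        # A user-provided bucket always wins.
        return gcs_bucket
    if runner == 'DataflowRunner':
        # On Dataflow, borrow the pipeline's temp_location.
        return pipeline_temp_location
    # On other runners, with no bucket and no borrowed location, fail.
    raise ValueError(
        'A GCS bucket is required for BigQuery file loads on this runner.')
```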