I think creating the bucket makes sense, since it improves the user experience and simplifies setup for first-time users. We should be clear in telling users that we are doing this on their behalf.
On Mon, Jun 22, 2020 at 1:26 PM Pablo Estrada <pabl...@google.com> wrote:

> Hi everyone,
> I've gotten around to making this change, and Udi has been gracious to review it [1].
>
> I figured we have not fully answered the larger question of whether we would truly like to make this change. Here are some thoughts giving me pause:
>
> 1. Appropriate defaults - We are not sure we can select appropriate defaults on behalf of users. (We are erroring out in the case of KMS keys, but what about other properties?)
> 2. Users have been using Beam's Python SDK the way it is for a long time now: supplying temp_location when running on Dataflow, without a problem.
> 3. This has billing implications that users may not be fully aware of.
>
> The behavior in [1] matches the behavior of the Java SDK (create a bucket when none is supplied AND running on Dataflow); but it still doesn't solve the problem of ReadFromBQ/WriteToBQ from non-Dataflow runners (this can be done in a follow-up change using the default bucket functionality).
>
> My bias in this case is: if it isn't broken, why fix it? I do not know of anyone complaining about the required temp_location flag on Dataflow.
>
> I think we can create a default bucket when dealing with BQ outside of Dataflow, but for Dataflow, I think we don't need to fix what's not broken.
> What do others think?
>
> Best
> -P.
>
> [1] https://github.com/apache/beam/pull/11982
>
> On Tue, Jul 23, 2019 at 5:02 PM Ahmet Altay <al...@google.com> wrote:
>
>> I agree with the benefits of auto-creating buckets from an ease-of-use perspective. My counter-argument is that the auto-created buckets may not have the right settings for the users. A bucket has multiple settings, some required (name, storage class) and some optional (ACL policy, encryption, retention policy, labels). As the number of options increases, our chances of having a good enough default go down. For example, if a user wants to enable CMEK mode for encryption, they will enable it for their sources and sinks, and will instruct the Dataflow runner to encrypt its in-flight data. Creating a default (non-encrypted) temp bucket for this user would be against the user's intentions. We would not be able to create a bucket either, because we would not know what encryption keys to use for such a bucket. Our options would be to either not create a bucket at all, or fail if a temporary bucket was not specified and a CMEK mode is enabled.
>>
>> There is a similar issue with the region flag. If unspecified, it defaults to us-central1. This is convenient for new users, but not making that flag required will expose a larger proportion of Dataflow users to events in that specific region.
>>
>> Robert's suggestion of having a flag to opt in to a default set of GCP convenience flags sounds reasonable. At least users will explicitly acknowledge that certain things are auto-managed for them.
>>
>> On Tue, Jul 23, 2019 at 4:28 PM Udi Meiri <eh...@google.com> wrote:
>>
>>> Another idea would be to put default bucket preferences in a .beamrc file so you don't have to remember to pass it every time (this could also contain other default flag values).
>>
>> IMO, the first question is whether auto-creation based on some unconfigurable defaults would happen or not. Once we agree on that, having an rc file vs. flags vs. supporting both would be a UX question.
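To make the encryption concern above concrete, here is a minimal sketch, assuming the Python SDK's dataflow_kms_key pipeline option (all project, bucket, and key names are placeholders), of a pipeline whose owner has already chosen a CMEK key; an auto-created, non-encrypted default temp bucket would work against that choice:

    # Sketch only: shows why a default temp bucket is hard to pick for a CMEK
    # user. Assumes the Python SDK's dataflow_kms_key pipeline option; all
    # resource names below are placeholders.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(
        runner='DataflowRunner',
        project='my-project',
        region='us-central1',
        # The user explicitly picked a CMEK-protected bucket and key...
        temp_location='gs://my-cmek-bucket/tmp',
        dataflow_kms_key=('projects/my-project/locations/us-central1/'
                          'keyRings/my-ring/cryptoKeys/my-key'),
    )

    # ...so an SDK-created default bucket would either need to know this key
    # or refuse to be created, which is the trade-off discussed above.
    with beam.Pipeline(options=options) as p:
        _ = p | beam.Create([1, 2, 3]) | beam.Map(print)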
>>> On Tue, Jul 23, 2019 at 1:43 PM Robert Bradshaw <rober...@google.com> wrote:
>>>
>>>> On Tue, Jul 23, 2019 at 10:26 PM Chamikara Jayalath <chamik...@google.com> wrote:
>>>> >
>>>> > On Tue, Jul 23, 2019 at 1:10 PM Kyle Weaver <kcwea...@google.com> wrote:
>>>> >>
>>>> >> I agree with David that at least clearer log statements should be added.
>>>> >>
>>>> >> Udi, that's an interesting idea, but I imagine the sheer number of existing flags (including many SDK-specific flags) would make it difficult to implement. In addition, uniform argument names wouldn't necessarily ensure uniform implementation.
>>>> >>
>>>> >> Kyle Weaver | Software Engineer | github.com/ibzib | kcwea...@google.com
>>>> >>
>>>> >> On Tue, Jul 23, 2019 at 11:56 AM Udi Meiri <eh...@google.com> wrote:
>>>> >>>
>>>> >>> The Java SDK creates one regional bucket per project and region combination.
>>>> >>> So it's not a lot of buckets - no need to auto-clean.
>>>> >
>>>> > Agree that cleanup is not a big issue if we are only creating a single bucket per project and region. I assume we are creating temporary folders for each pipeline with the same region and project so that they don't conflict (which we clean up).
>>>> > As others mentioned, we should clearly document this (including the naming of the bucket) and produce a log during pipeline creation.
>>>> >
>>>> >>> I agree with Robert that having fewer flags is better.
>>>> >>> Perhaps what we need is a unifying interface for SDKs that simplifies launching?
>>>> >>>
>>>> >>> So instead of:
>>>> >>> mvn compile exec:java -Dexec.mainClass=<class> -Dexec.args="--runner=DataflowRunner --project=<project> --gcpTempLocation=gs://<bucket>/tmp <user flags>" -Pdataflow-runner
>>>> >>> or
>>>> >>> python -m <module> --runner DataflowRunner --project <project> --temp_location gs://<bucket>/tmp/ <user flags>
>>>> >
>>>> > Interesting, probably this should be extended to a generalized CLI for Beam that can be easily installed to execute Beam pipelines?
>>>>
>>>> This is starting to get somewhat off-topic from the original question, but I'm not sure the benefits of providing a wrapper to the end user would outweigh the costs of having to learn the wrapper. For Python developers, python -m <module>, or even python path/to/script.py, is pretty standard. Java is a bit harder, because one needs to coordinate a build as well, but I don't know how a "./beam java ..." script would gloss over whether one is using Maven, Gradle, Ant, or just has a pile of pre-compiled jars (and would probably have to know a bit about the project layout as well to invoke the right commands).
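For readers following the thread, here is a rough sketch of the kind of default-bucket logic being discussed. It is not the code in [1]; the bucket-naming scheme, the helper name, and the use of the google-cloud-storage client are illustrative assumptions, loosely modeled on the Java SDK's one-bucket-per-project-and-region behavior described above:

    # Illustrative sketch only, not the implementation in the PR above.
    # Assumes the google-cloud-storage client library; the bucket-naming
    # scheme is a made-up stand-in for whatever the SDK would actually use.
    from google.cloud import storage

    def get_or_create_default_temp_bucket(project, region):
        """Returns a gs:// path, creating the regional bucket if needed."""
        client = storage.Client(project=project)
        bucket_name = 'beam-temp-%s-%s' % (project, region)  # hypothetical name
        bucket = client.lookup_bucket(bucket_name)
        if bucket is None:
            # Creating on the user's behalf: log this loudly, and skip (or
            # fail) if settings such as a CMEK key were requested.
            bucket = client.create_bucket(bucket_name, location=region)
        return 'gs://%s' % bucket.name

    # Each pipeline would still get its own temp folder inside the shared
    # bucket, so concurrent runs don't conflict, e.g.
    # get_or_create_default_temp_bucket(project, region) + '/tmp/' + job_name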