I do not have a strong opinion about this either way. I think this is fundamentally a UX tradeoff between making it easier to get started and potentially creating unwanted/misconfigured items. I do not have data on which option most users would prefer. I believe either option would be fine as long as we are clear in our messaging, logs, and errors.
On Mon, Jun 22, 2020 at 1:48 PM Luke Cwik <[email protected]> wrote:

> I think creating the bucket makes sense since it is an improvement in the
> user experience and simplifies first-time users' setup needs. We should be
> clear in telling users that we are doing this on their behalf.
>
> On Mon, Jun 22, 2020 at 1:26 PM Pablo Estrada <[email protected]> wrote:
>
>> Hi everyone,
>> I've gotten around to making this change, and Udi has been gracious
>> enough to review it [1].
>>
>> I figured we have not fully answered the larger question of whether we
>> would truly like to make this change. Here are some thoughts giving me
>> pause:
>>
>> 1. Appropriate defaults - We are not sure we can select appropriate
>> defaults on behalf of users. (We are erroring out in the case of KMS keys,
>> but what about other properties?)
>> 2. Users have been using Beam's Python SDK the way it is for a long time
>> now: supplying temp_location when running on Dataflow, without a problem.
>> 3. This has billing implications that users may not be fully aware of.
>>
>> The behavior in [1] matches the behavior of the Java SDK (create a bucket
>> when none is supplied AND running on Dataflow), but it still doesn't solve
>> the problem of ReadFromBQ/WriteToBQ from non-Dataflow runners (this can be
>> done in a follow-up change using the default-bucket functionality).
>>
>> My bias in this case is: if it isn't broken, why fix it? I do not know of
>> anyone complaining about the required temp_location flag on Dataflow.
>>
>> I think we can create a default bucket when dealing with BQ outside of
>> Dataflow, but for Dataflow, I think we don't need to fix what's not broken.
>> What do others think?
>>
>> Best,
>> -P.
>>
>> [1] https://github.com/apache/beam/pull/11982
>>
>> On Tue, Jul 23, 2019 at 5:02 PM Ahmet Altay <[email protected]> wrote:
>>
>>> I agree with the benefits of auto-creating buckets from an ease-of-use
>>> perspective.
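To make the proposal above concrete, here is a minimal sketch of the decision logic the thread describes (create a default bucket only when none is supplied AND the pipeline runs on Dataflow, with one bucket per project-and-region pair, as in the Java SDK). The function names and the exact bucket naming scheme are illustrative assumptions, not the actual Beam implementation:

```python
from typing import Optional

def needs_default_bucket(runner: str, temp_location: Optional[str]) -> bool:
    # Matches the behavior described in the thread: only auto-create when
    # running on Dataflow and no temp_location was supplied.
    return runner == "DataflowRunner" and not temp_location

def default_bucket_name(project_number: int, region: str) -> str:
    # Illustrative naming: one bucket per (project, region) pair, so repeated
    # runs reuse the same bucket instead of accumulating new ones.
    return f"dataflow-staging-{region}-{project_number}"
```

A real implementation would then call the GCS API to create the bucket if it does not already exist, and log clearly that it did so on the user's behalf.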
>>> My counterargument is that the auto-created buckets may not
>>> have the right settings for the users. A bucket has multiple settings,
>>> some required (name, storage class) and some optional (ACL policy,
>>> encryption, retention policy, labels). As the number of options increases,
>>> our chances of having a good-enough default go down. For example, if a
>>> user wants to enable CMEK mode for encryption, they will enable it for
>>> their sources and sinks, and will instruct the Dataflow runner to encrypt
>>> its in-flight data. Creating a default (non-encrypted) temp bucket for
>>> this user would be against the user's intentions. We would not be able to
>>> create a bucket either, because we would not know what encryption keys to
>>> use for such a bucket. Our options would be to either not create a bucket
>>> at all, or fail if a temporary bucket was not specified and CMEK mode is
>>> enabled.
>>>
>>> There is a similar issue with the region flag. If unspecified, it
>>> defaults to us-central1. This is convenient for new users, but not making
>>> that flag required will expose a larger proportion of Dataflow users to
>>> events in that specific region.
>>>
>>> Robert's suggestion of having a flag to opt in to a default set of GCP
>>> convenience flags sounds reasonable. At least users will explicitly
>>> acknowledge that certain things are auto-managed for them.
>>>
>>> On Tue, Jul 23, 2019 at 4:28 PM Udi Meiri <[email protected]> wrote:
>>>
>>>> Another idea would be to put default bucket preferences in a .beamrc
>>>> file so you don't have to remember to pass it every time (this could
>>>> also contain other default flag values).
>>>
>>> IMO, the first question is whether auto-creation based on some
>>> unconfigurable defaults would happen or not. Once we agree on that,
>>> having an rc file vs. flags vs. supporting both would be a UX question.
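The "fail fast" option described above (refuse to auto-create a bucket when CMEK is enabled, rather than silently create an unencrypted one) could be sketched as a simple validation step. The function and parameter names here are hypothetical, chosen only to illustrate the check:

```python
from typing import Optional

def validate_temp_location(temp_location: Optional[str],
                           dataflow_kms_key: Optional[str]) -> None:
    # If the user supplied a bucket, trust their configuration as-is.
    if temp_location:
        return
    # A CMEK key with no explicit bucket is the dangerous case: we cannot
    # know which key an auto-created bucket should use, so fail loudly
    # instead of creating a non-encrypted default.
    if dataflow_kms_key:
        raise ValueError(
            "Cannot create a default temp bucket when a CMEK key is set; "
            "please supply --temp_location explicitly.")
    # Otherwise, fall through to default-bucket creation.
```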
>>>> On Tue, Jul 23, 2019 at 1:43 PM Robert Bradshaw <[email protected]> wrote:
>>>>
>>>>> On Tue, Jul 23, 2019 at 10:26 PM Chamikara Jayalath
>>>>> <[email protected]> wrote:
>>>>> >
>>>>> > On Tue, Jul 23, 2019 at 1:10 PM Kyle Weaver <[email protected]>
>>>>> > wrote:
>>>>> >>
>>>>> >> I agree with David that at least clearer log statements should be
>>>>> >> added.
>>>>> >>
>>>>> >> Udi, that's an interesting idea, but I imagine the sheer number of
>>>>> >> existing flags (including many SDK-specific flags) would make it
>>>>> >> difficult to implement. In addition, uniform argument names wouldn't
>>>>> >> necessarily ensure uniform implementation.
>>>>> >>
>>>>> >> Kyle Weaver | Software Engineer | github.com/ibzib | [email protected]
>>>>> >>
>>>>> >> On Tue, Jul 23, 2019 at 11:56 AM Udi Meiri <[email protected]> wrote:
>>>>> >>>
>>>>> >>> The Java SDK creates one regional bucket per project and region
>>>>> >>> combination. So it's not a lot of buckets - no need to auto-clean.
>>>>> >
>>>>> > Agree that cleanup is not a big issue if we are only creating a
>>>>> > single bucket per project and region. I assume we are creating
>>>>> > temporary folders for each pipeline with the same region and project
>>>>> > so that they don't conflict (which we clean up).
>>>>> > As others mentioned, we should clearly document this (including the
>>>>> > naming of the bucket) and produce a log during pipeline creation.
>>>>> >
>>>>> >>> I agree with Robert that having fewer flags is better.
>>>>> >>> Perhaps what we need is a unifying interface for SDKs that
>>>>> >>> simplifies launching?
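The per-pipeline temporary folders Chamikara assumes (so that concurrent jobs sharing the single per-project-and-region bucket don't conflict, and each job's prefix can be cleaned up on its own) could look roughly like this. The path layout is an assumption for illustration, not Beam's actual scheme:

```python
import uuid

def pipeline_temp_dir(bucket: str, job_name: str) -> str:
    # A unique subdirectory per pipeline run: concurrent jobs in the same
    # shared bucket get disjoint prefixes, and cleanup after the job only
    # needs to delete objects under this one prefix.
    return f"gs://{bucket}/tmp/{job_name}-{uuid.uuid4().hex}/"
```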
>>>>> >>> So instead of:
>>>>> >>>
>>>>> >>> mvn compile exec:java -Dexec.mainClass=<class> \
>>>>> >>>   -Dexec.args="--runner=DataflowRunner --project=<project> \
>>>>> >>>   --gcpTempLocation=gs://<bucket>/tmp <user flags>" -Pdataflow-runner
>>>>> >>>
>>>>> >>> or
>>>>> >>>
>>>>> >>> python -m <module> --runner DataflowRunner --project <project> \
>>>>> >>>   --temp_location gs://<bucket>/tmp/ <user flags>
>>>>> >
>>>>> > Interesting, probably this should be extended to a generalized CLI
>>>>> > for Beam that can be easily installed to execute Beam pipelines?
>>>>>
>>>>> This is starting to get somewhat off-topic from the original question,
>>>>> but I'm not sure the benefits of providing a wrapper to the end user
>>>>> would outweigh the costs of having to learn the wrapper. For Python
>>>>> developers, python -m <module>, or even python path/to/script.py, is
>>>>> pretty standard. Java is a bit harder, because one needs to coordinate
>>>>> a build as well, but I don't know how a "./beam java ..." script would
>>>>> gloss over whether one is using Maven, Gradle, Ant, or just has a pile
>>>>> of pre-compiled jars (and would probably have to know a bit about the
>>>>> project layout as well to invoke the right commands).
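Robert's point about the hypothetical "./beam java ..." wrapper can be illustrated with a small sketch: the wrapper's very first job would be guessing the build tool from marker files, and each guess implies an entirely different command line. The wrapper itself is purely hypothetical; only the marker filenames are standard:

```python
import os
from typing import Optional

# Marker file -> the (elided) command a wrapper would have to construct.
# Note the gap Robert identifies: a directory of pre-compiled jars has no
# marker at all, so the wrapper simply cannot tell what to do.
BUILD_MARKERS = [
    ("pom.xml", "mvn compile exec:java ..."),
    ("build.gradle", "gradle run ..."),
    ("build.xml", "ant run ..."),
]

def guess_build_command(project_dir: str) -> Optional[str]:
    for marker, command in BUILD_MARKERS:
        if os.path.exists(os.path.join(project_dir, marker)):
            return command
    return None  # pre-compiled jars or an unknown layout: no way to tell
```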
